Benedict Evans recently wrote two sharp critiques of AI research tools: "The Deep Research Problem" and "Are Better Models Better?" His argument is compelling and straightforward: AI tools like OpenAI's Deep Research confidently present facts that are sometimes wrong, creating an "unknown unknowns" problem. If you're not already familiar with a topic, how can you verify something that might be wrong? And if you have to double-check everything, why use AI at all?
I want to start by saying Evans is absolutely right about the fundamental issue. Today's AI models can and do hallucinate facts. They generate plausible-sounding but incorrect information. No argument there—anyone using models as a sole source of truth is asking for trouble. His examples highlighting these problems are legitimate and important for anyone thinking about AI in research contexts.
Where I'd like to offer a different perspective isn't about whether Evans identified a real problem—he did—but about how we should respond to that problem. I believe AI research tools aren't meant to be one-shot answer machines; they're iterative partners in the research process. The issue isn't that AI research can't be trusted—it's that we need to rethink how we interact with it.
Evans provides two particularly compelling examples of AI's factual limitations. In his latest piece, he tests OpenAI's Deep Research by examining a sample report on smartphone adoption. The tool claimed Japan's market was split 69% iOS to 31% Android. When Evans checked, he found reality was nearly the opposite—according to reliable sources, Android had the majority share. The AI had used questionable sources (Statcounter and Statista) and gotten the numbers wrong despite presenting them with confidence.
In his previous post, Evans asks various AI models how many elevator operators worked in the USA in 1980. Despite being a concrete fact with a correct answer (21,982, according to the U.S. Census), the models consistently failed. Even when Evans told them exactly which source to use and where to look, they still produced incorrect figures. His point isn't just that they were wrong—it's that a user would have no way to know they were wrong without already knowing the answer or doing all the research manually.
These examples aren't cherry-picked gotchas. They represent a genuine challenge with today's AI systems that Evans correctly identifies. As he puts it: "The problem here is not so much that the number is wrong, as that I have no way to know without doing all the work myself anyway." This is a valid concern that can't be dismissed with handwaving about "better prompting" or "future improvements."
While acknowledging Evans' core point, I'd suggest that this challenge isn't unique to AI. It's true for all research. Journalists misquote sources. Analysts misinterpret data. Academic papers contain errors that might only be caught years later. We don't throw out human research because mistakes happen—we develop processes to verify, cross-check, and refine our findings.
This is exactly how AI research should be used. If an AI gives you output, you don't just accept it as fact—you engage with it. Ask: Where did this number come from? Are there conflicting sources? Even if you're not an expert, prompting the AI to explain or find additional perspectives dramatically reduces the risk of blindly accepting bad information.
Evans offers a helpful metaphor that I completely agree with. He compares AI to having "infinite interns"—they can produce a lot of work quickly, but you still have to check their output. That's a perfect way to think about it. No one expects an intern's first draft to be perfect, but that doesn't mean interns aren't valuable.
Where Evans and I might differ is in how much value we think this brings. He suggests that for certain fact-finding tasks, if the AI makes enough mistakes that require careful checking, you might as well do it yourself. My experience has been different—even when verification is needed, having that first draft can dramatically speed up the process.
Evans' elevator operator example makes me wonder: is this level of precision always necessary? He's technically correct that the AI couldn't retrieve the exact census figure (21,982 elevator operators in 1980). But how often do we actually need that level of precision?
This reminds me of what's called a Fermi problem, where the precise answer isn't important. What matters is the order of magnitude.
In a Fermi estimation, we decompose the problem and make a handful of small assumptions. The whole principle is that the errors in those assumptions tend to cancel each other out: some guesses run high, others run low, and the result is a pretty good answer that's useful for making decisions.
The process works like this: break the big question into smaller quantities you can reason about, put a rough number on each one, and multiply them together to get an estimate.
Even though we don't have the exact number, this method gives us a workable estimate that's good enough for practical purposes. Most research isn't about nailing down a single figure; it's about gathering supporting evidence for a decision that depends on many factors. When deciding whether to make an investment or launch a product, there are numerous risks to weigh, and that's where this iterative approach becomes particularly valuable.
So while Evans' critique focuses on scenarios requiring absolute factual precision, many real-world research tasks don't need that level of exactitude. If the AI tells me there were "around 20,000 elevator operators in 1980," that's often sufficient for understanding historical trends or making business decisions.
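To make that concrete, here is a minimal sketch of a Fermi estimate for the elevator operator question in Python. Every input below is an assumption I've invented for illustration, not a researched figure; what matters is the structure of the estimate, not the specific numbers.

```python
# Fermi estimate: roughly how many elevator operators worked in the US in 1980?
# All inputs are illustrative assumptions, not researched figures.

manual_elevators_in_us = 15_000        # guess: cars still running with an attendant in 1980
operators_per_car_per_shift = 1        # one attendant per car
shifts_per_day = 2                     # day and evening coverage
staffing_rate = 0.7                    # not every car is staffed on every shift

estimate = manual_elevators_in_us * operators_per_car_per_shift * shifts_per_day * staffing_rate

print(f"Rough estimate: {estimate:,.0f} operators")
```

With the census count (21,982) as the benchmark, any estimate in the low tens of thousands lands in the right order of magnitude, and for most business decisions that is all the precision required.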
Let me give you a concrete example of how iteration improves AI research. Recently, someone at my company posted a Deep Research report to support an investment argument. I copied the report, pasted it back into Deep Research, and asked it to steelman the opposing case. Within twenty minutes, I had a collection of counterarguments and facts to consider, a perspective I could never have assembled that quickly on my own. I'm going to start doing this proactively on my own work, not because I'm trying to get my way, but because I'm trying to make better decisions.
This process offers several benefits: it surfaces counterarguments I wouldn't have found on my own, it costs minutes rather than days of effort, and it forces me to engage with the evidence before committing to a conclusion.
This approach recognizes the AI's limitations (which Evans correctly identifies) but turns them into a strength through process. By engaging with the AI's output critically rather than accepting it at face value, we create a more robust research workflow.
Evans correctly demonstrates that today's AI models can't be blindly trusted for precise factual information. His smartphone market example shows how the AI pulled incorrect data from dubious sources. This illustrates a critical gap: AI can sift through tons of information, but it doesn't yet understand which sources are authoritative.
Rather than abandoning AI research entirely, the solution is to inject human judgment into the loop. We can ask the model where its numbers came from, cross-check the figures that actually matter against authoritative sources, request a steelman of the opposite conclusion, and give the model better tools or context when precision is critical.
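Here is a rough sketch of what that loop can look like in code. The `ask_model` function is a placeholder for whatever model or API you're actually using; the shape of the workflow (draft, sources, steelman, human decision) is the point, not the specific calls.

```python
# A sketch of an iterative, human-in-the-loop research workflow.
# ask_model() is a placeholder; swap in your model client of choice.

def ask_model(prompt: str) -> str:
    """Placeholder for a call to a research model."""
    raise NotImplementedError("Wire this up to your model of choice.")

def research_with_checks(question: str) -> dict:
    # 1. Get a first draft, treated as a hypothesis rather than an answer.
    draft = ask_model(f"Research this question and cite your sources: {question}")

    # 2. Ask where the load-bearing numbers came from, so a human can spot-check them.
    sources = ask_model(f"List every source and key figure used in this report:\n{draft}")

    # 3. Ask the model to argue against its own conclusion (the steelman pass).
    counter = ask_model(f"Steelman the opposite conclusion to this report:\n{draft}")

    # 4. Hand everything to a human: the decision stays with the person, not the model.
    return {"draft": draft, "sources": sources, "counterarguments": counter}
```

Nothing in this loop makes the model more accurate; it just makes the places where it might be wrong much easier for a human to find.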
Evans' test of the elevator operator statistic was intentionally pure—he wanted to see what the base model would do on its own. In a real research scenario, we'd likely use the model with additional tools or context to improve reliability. Evans himself would probably do this if he were trying to get useful information rather than testing the model's limitations.
Another area worth exploring is technology's "good enough" factor. I can't tell you how many spreadsheets I've built or used that have errors, yet they've transformed finance anyway. Machine translation was flawed for years, yet millions used Google Translate because it was good enough. Remember that episode of The Office where Michael drives straight into a lake because "the machine knows where it's going"? Even with dramatic failures like that, we still adopted GPS because it was better than fumbling with a paper map.
The reality is, products don't have to be flawless to be transformative—they just need to be better than the alternative. And right now, the alternative is slow, manual, tedious research. Even if I have to verify AI outputs, the fact that I can generate structured reports in seconds is a massive advantage. The cost of research is plummeting, which means we're about to see an explosion of low-stakes, high-volume research that simply wouldn't have been feasible before.
Evans frames research as a painful, exhaustive process where AI just adds more work. That's not my experience. In fact, AI has made research effortless in ways I never expected.
A few days ago, I was out walking my dog when I got a push notification: "Your Deep Research report is ready." I popped in my earbuds, listened to the AI's summary of market trends in an industry I was exploring, and started thinking. One point stood out, so I spoke out loud: "Tell me more about that unmet need. Find customer complaints or reviews that mention it." The AI ran the query, and a few minutes later, I had a synthesized summary of customer frustrations from sources like Reddit read back to me. I conducted a multi-step research session without stopping my walk.
That's the shift people don't realize yet. Research isn't just faster—it's becoming ambient. I don't need to block out hours at my desk to wade through PDFs and reports. AI is turning research into an ongoing, iterative process that integrates seamlessly into daily life.
I want to thank Evans for raising essential questions that anyone using AI for research should consider. His critiques are valuable and spotlight important limitations in today's AI systems. We all benefit from this kind of thoughtful analysis.
At the end of his post, Evans asks a provocative question: "If you flip that expectation [of computers always being 'right'], what do you get in return?" It's a perfect question, and here's my answer: we get a research partner that can generate hypotheses, surface patterns, and suggest insights in minutes rather than days. We get a collaborative system that allows us to explore more angles and ask more questions than we could alone. Yes, we also get occasional wrong turns, but as long as we remain in the driver's seat, we can fix those on the fly.
I'm not suggesting a rigid framework for using AI research tools. Instead, researchers should adapt approaches that have worked with human teams and apply them to AI collaboration. The key is treating AI outputs as starting points—asking for opposing viewpoints, requesting alternative interpretations, and challenging initial conclusions. By engaging with AI outputs critically rather than passively accepting them, we naturally mitigate the hallucination problem Evans highlights, turning a perceived weakness into an opportunity for more robust research.
Ultimately, as Evans alludes, embracing AI in research may require us to change our expectations and workflows. We might have to accept that an AI's first answer is just a draft, not the final word—much like an initial hypothesis in science is not assumed true until tested. This shift from expecting answers to engaging in dialogue with the AI is precisely what will unlock its value.
The real Deep Research Problem isn't that AI gets things wrong—it's that people are still expecting it to work like a magic answer machine instead of an interactive research assistant. Once you shift your mindset, you realize the problem isn't with AI. The problem is how we think about research itself.