If you’ve used Google lately, you’ve seen it: that big AI-generated blob at the top of the search results, courtesy of Gemini. Google calls it AI Overviews. Users have called it a lot of other things since it launched in 2024, mostly because it has a habit of being confidently wrong.
To be fair, it has gotten better. But “better” is a low bar when your starting point was telling people to eat glue. A new analysis from The New York Times, done in collaboration with AI startup Oumi, tried to put a number on just how accurate AI Overviews actually is. The headline number: roughly 90 percent accuracy. Sounds decent, right? Until you realize that means about one in ten answers is wrong.
Let that sink in. Google processes billions of searches every day. If even a fraction of those trigger an AI Overview, and one in ten of those is garbage, you’re looking at hundreds of thousands of incorrect answers per hour. Millions per day. That’s not a rounding error.
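To make that back-of-the-envelope math concrete, here is a minimal sketch. The inputs are illustrative assumptions, not figures from the Times analysis: 5 billion searches per day stands in for "billions," and the 2 percent share of searches that trigger an AI Overview is a deliberately small hypothetical "fraction." Only the one-in-ten error rate comes from the benchmark result.

```python
def wrong_answers_per_day(searches_per_day, overview_share, error_rate):
    """Estimate how many incorrect AI Overviews are served per day.

    searches_per_day: total daily searches (assumed value, not a reported one)
    overview_share:   fraction of searches that show an AI Overview (assumed)
    error_rate:       fraction of AI Overviews that are wrong
    """
    return searches_per_day * overview_share * error_rate


if __name__ == "__main__":
    # Hypothetical inputs chosen only to illustrate the scale.
    per_day = wrong_answers_per_day(
        searches_per_day=5_000_000_000,  # assumption: "billions" of searches
        overview_share=0.02,             # assumption: a small fraction
        error_rate=0.10,                 # roughly one in ten, per the benchmark
    )
    per_hour = per_day / 24
    print(f"Wrong answers per day:  {per_day:,.0f}")
    print(f"Wrong answers per hour: {per_hour:,.0f}")
```

With those assumed inputs, the estimate comes out at around ten million wrong answers per day, or roughly four hundred thousand per hour. Change the assumed share and the totals scale linearly, but it takes a very small share to get below "a lot."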
The test used a benchmark called SimpleQA, which OpenAI released back in 2024. It’s basically a list of over 4,000 questions with verifiable answers—no subjective stuff, no gray areas. Oumi fed those questions into AI Overviews and checked whether the answers held up.
When they first ran the test last year, with Gemini 2.5 under the hood, the accuracy was 85 percent. After the Gemini 3 update, it climbed to 91 percent. So yes, it’s improving. But the miss rate is still high enough that if you extrapolate across all Google searches, the absolute number of wrong answers is staggering.
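One way to read those two numbers is to flip them into miss rates. The sketch below uses only the 85 and 91 percent figures quoted above; the function names are my own, and the "relative reduction" framing is my arithmetic, not something the analysis reports.

```python
def miss_rate_pct(accuracy_pct):
    """Convert an accuracy percentage into a miss-rate percentage."""
    return 100 - accuracy_pct


def relative_error_reduction(accuracy_before_pct, accuracy_after_pct):
    """Fraction by which the miss rate shrank between two runs."""
    before = miss_rate_pct(accuracy_before_pct)
    after = miss_rate_pct(accuracy_after_pct)
    return (before - after) / before


if __name__ == "__main__":
    before, after = 85, 91  # accuracy with Gemini 2.5 vs. Gemini 3
    print(f"Miss rate before: {miss_rate_pct(before)}%")
    print(f"Miss rate after:  {miss_rate_pct(after)}%")
    print(f"Relative reduction in errors: "
          f"{relative_error_reduction(before, after):.0%}")
```

So the miss rate fell from 15 percent to 9 percent, which is a 40 percent cut in errors. That is real progress, and it still leaves nearly one wrong answer in every eleven.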
I’ve been watching this rollout since day one, and the pattern is familiar: launch half-baked AI feature, get roasted, quietly fix some stuff, claim progress. The issue is that Google’s core product is supposed to be information retrieval. If I can’t trust the first thing I see on the results page, what’s the point?
To be clear, the roughly 90 percent figure comes from a specific benchmark, not real-world usage. Real-world queries are messier, more ambiguous, and harder to verify, so the actual error rate could be higher or lower depending on the topic. But SimpleQA is a reasonable proxy, and an error rate of around 10 percent on straightforward factual questions is not something to celebrate.
The Times piece notes that Oumi is itself an AI company, so take the methodology with a grain of salt. But the numbers align with what I’ve seen anecdotally. I still catch AI Overviews hallucinating citations or misattributing quotes on a regular basis. It’s better than it was, but it’s not trustworthy yet.
Google’s bet is that users will tolerate a certain level of inaccuracy in exchange for speed and convenience. Maybe that’s true for casual queries. But for anything important—medical advice, financial data, news events—a 10 percent failure rate is unacceptable. And Google knows it, which is why they keep tweaking the system behind the scenes.
For now, I’d recommend treating AI Overviews like a helpful but unreliable intern. Double-check everything. And if you see something that sounds wrong, it probably is.