Gemini 3.5 Flash Didn’t Lie. It Edited
The scores were real. The comparison was the trick.
On May 19, 2026, Google walked onto the I/O stage and said its new Gemini 3.5 Flash beats Gemini 3.1 Pro on almost every benchmark that matters. Five days later, three independent labs re-ran those benchmarks. The numbers matched to the decimal. And the model still came out looking worse than the keynote suggested.
Both things are true at once. That gap, between an accurate number and an honest one, is the whole story.
What actually shipped
First, a correction worth making out loud. It is the exact kind of error this piece is about. Gemini 3.5 Flash did not launch in early June. It shipped on May 19 at Google I/O 2026 and was generally available from day one, with no preview window. If you read that it launched this month, you read an aggregator that copied another aggregator. The original record is one tap away, and it disagrees.
Flash is the first model in Google’s new 3.5 family. The bigger sibling, Gemini 3.5 Pro, was still in internal testing on launch day, with a public release promised for the following month. So, the model everyone is arguing about is the small, fast one, pushed out ahead of the flagship.
What “agentic” even means
Google’s pitch for Flash isn’t that it writes a nicer paragraph. It’s that the model can do a job. Google calls it its “strongest agentic and coding model yet.” An agentic workflow is the difference between an assistant that answers a question and one that carries out a task: it plans the steps, writes and runs code, checks its own output, hits an error, tries again, and keeps going, sometimes for hours, with little supervision. Google’s demo had two of these agents cooperating to build a working game in an afternoon.
That’s why Google leads with speed. When a model only answers you, a half-second delay is invisible. When a model runs an agent loop that calls itself four or five times for every request, the delays stack, speed stops being a nicety, and becomes the bill. So Google’s headline claim is that Flash produces text four times faster than other frontier models, though it never names which ones.
There’s a measurement under that word “faster.” Models read and write in tokens, chunks of text roughly the size of a short word or a piece of one. Output tokens per second is just how fast the model types. Flash runs at around 280 of them per second, according to independent measurements, which is quick for its intelligence class. On that count, Google isn’t exaggerating.
The numbers were real.
Here’s the part that separates this from a fraud story. Google DeepMind handed pre-release access to Artificial Analysis, an independent evaluation outfit, before launch. Artificial Analysis ran its own suite and confirmed Google’s headline scores. Terminal-Bench, MCP Atlas, the multimodal tests: the figures held. On MCP Atlas, a measure of how well a model handles tool calls, Flash genuinely leads the field. Nobody caught Google inventing a number.
That matters. The easy version of this article would be “Google lied,” and it would be wrong.
What got left off the slide
What Google did was quieter and far more common. It chose the comparison and the metrics.
The comparison first. Google benchmarked its shiny new Flash against its own previous-generation Gemini 3.1 Pro. Not against the models it competes with in June 2026, which are OpenAI’s GPT-5.5 and Anthropic’s Claude Opus 4.7. Beating last year’s house model is a real achievement. It isn’t the same achievement as beating the current field, and the keynote let the room hear the second thing, having proved only the first. One independent roundup put it cleanly: the agentic story is real, and the frontier-intelligence framing is conditional.
Then the metrics. The same independent labs published three numbers Google didn’t put on a slide.
Flash lands somewhere around eighth place on Artificial Analysis’s overall intelligence index, behind both GPT-5.5 and Claude Opus 4.7. Strong. Not the top of the chart. The exact rank wobbles between fifth and eighth depending on the day and the configuration, which is itself a lesson in how soft these rankings are.
Its hallucination rate, how often it states something false with confidence instead of admitting it doesn’t know, is 61 percent. That’s a large improvement over the previous Flash, which sat above 90. It’s still a coin flip you wouldn’t want to build a knowledge product on without guardrails.
And the cost. This is the one that stung the developers. Flash is cheaper per token than the old Pro. It’s more expensive per job. Google tripled the token price compared with the previous Flash, and the new model is chatty, so it burns more tokens thinking out loud. Running Artificial Analysis’s full test suite on Flash costs about $1,552, roughly 75 percent more than running the very Gemini 3.1 Pro it claims to beat. A lower sticker price on a thirstier engine isn’t a lower fuel bill.
The independent developer Simon Willison flagged the same price jump on launch day and framed it as part of a wider pattern. All three major labs have started probing how much their API customers will tolerate paying. Flash is one data point in a trend, not a Google quirk.
Why does this keep happening
None of this is unique to Google, and Google’s version is mild. The instructive comparison is the one that actually was a scandal.
In April 2025, Meta launched Llama 4 and posted a glittering leaderboard score. The number came from a special chat-tuned variant submitted to the rankings, not the model the public could download. When testers ran the real weights, the model fell dozens of places. For months, Meta waved this off. Then, in a Financial Times interview reported at the start of January 2026, Meta’s departing chief AI scientist, Yann LeCun, said the results had been “fudged a little bit” and that the team used different versions of the model across different benchmarks to achieve better scores.
That’s the far end of the spectrum: a different model on the scoreboard than in your hands. Google didn’t do that. Google ran the real model and reported real numbers. It picked the flattering opponent and skipped the unflattering stats. The two sit on the same spectrum, which is the point. The whole industry has an incentive to choose its own framing. Stock prices and developer mindshare align with launch-day benchmarks. The measure became the target, and a measure that doubles as a target stops measuring much.
What to do with all this
If you’re picking a model, the lesson is small and durable. Read the comparison, not the number alone. Ask who the new model was tested against, and whether that’s who you’d actually be choosing between. Ask which benchmarks were shown, and guess at which ones weren’t. And if cost matters, price your own job, not the per-token rate. A fast, cheap-looking model that talks too much can quietly outspend the expensive one.
If Flash is the model you’re building on, it earns its place on agentic, tool-heavy, multimodal work, where it genuinely leads. For the hardest reasoning and for long-document recall, the current frontier models still beat it, and Gemini 3.5 Pro isn’t out yet. Test it on your own workload before you believe anyone’s slide, including this one.
The honest headline was available to Google the whole time. Fastest model in its class, real agentic gains, priced higher than it looks, behind the frontier on raw reasoning. That version would have survived contact with the independent labs intact. Google reached for a better story than the truth, and the truth showed up five days later anyway. It always does.
Author Note. Grace Ann Hansen is an independent researcher and writer, and an MBA & PhD graduate student in health informatics and artificial intelligence. She is also a published author, a professional musician, a gymnastics coach, and a queer transgender woman living in Sioux Falls, South Dakota. All interpretation, argument, and prose are her own. Correspondence concerning this article should be addressed to Grace Ann Hansen at grace@graceannhansen.com.
Sources
Koray Kavukcuoglu, Jeff Dean, Oriol Vinyals, and Noam Shazeer, “Gemini 3.5: frontier intelligence with action,” Google, May 19, 2026.
“With Gemini 3.5 Flash, Google bets its next AI wave on agents, not chatbots,” TechCrunch, May 19, 2026.
“Gemini 3.5 Flash: everything you need to know,” Artificial Analysis, May 2026.
“Gemini 3.5 Flash, 5 Days In: Independent Eval Roundup,” Digital Applied, May 2026.
Simon Willison, “Gemini 3.5 Flash: more expensive, but Google plans to use it for everything,” May 19, 2026.
“Meta accused of Llama 4 bait-n-switch to juice LMArena rank,” The Register, April 8, 2025.
“‘Results Were Fudged’: Departing Meta AI Chief Confirms Llama 4 Benchmark Manipulation,” Slashdot, reporting a Financial Times interview, January 2, 2026.



