Sunday, April 19, 2026
Independent Technology Journalism  ·  Est. 2026

Gemini Ultra 3 Breaks MMLU Ceiling—But at What Cost?

A Benchmark Nobody Expected to Fall This Year

When Google DeepMind published the Gemini Ultra 3 technical report on October 14, 2026, the number that stopped researchers mid-scroll wasn't the headline MMLU score. It was the MATH-500 result: 98.1%—a dataset that, as recently as 2024, sat near the upper bound of what anyone seriously believed a transformer-based architecture could reach without explicit symbolic reasoning modules. The research community had more or less accepted 91–93% as a practical ceiling. Gemini Ultra 3 didn't approach that ceiling. It went through it.

We've been reviewing the technical report, the supplementary evaluation data, and independent replication attempts from three academic groups over the past six weeks. The picture is genuinely impressive. It's also, in specific and important ways, incomplete.

What Gemini Ultra 3 Actually Scored—And Against What

Google DeepMind's new model sits atop every major public benchmark as of this writing. On MMLU (Massive Multitask Language Understanding), it scores 96.4%, compared to OpenAI's GPT-5 Turbo at 93.7% and Anthropic's Claude 3.9 Opus at 92.1%. The gap sounds modest in percentage terms. In practice, the tasks where the difference appears—graduate-level formal logic, multi-step chemistry problems, legal reasoning under ambiguous statutory language—are exactly the tasks enterprises have been asking models to handle.

| Model | MMLU (%) | MATH-500 (%) | HumanEval (%) | Context Window |
|---|---|---|---|---|
| Gemini Ultra 3 (Google DeepMind) | 96.4 | 98.1 | 95.6 | 2M tokens |
| GPT-5 Turbo (OpenAI) | 93.7 | 94.2 | 93.1 | 512K tokens |
| Claude 3.9 Opus (Anthropic) | 92.1 | 91.8 | 91.4 | 1M tokens |
| Mistral Large 3 (Mistral AI) | 88.3 | 86.7 | 89.2 | 256K tokens |

The 2-million-token context window is, arguably, as significant as any benchmark figure. That's not a marginal improvement over GPT-5 Turbo's 512K limit—it's a different category of tool. A developer can now pass an entire large codebase, its full test suite, and six months of issue-tracker history into a single prompt and ask Gemini Ultra 3 to identify regression patterns. That workflow was technically impossible with any commercially available model twelve months ago.
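To put that context budget in concrete terms, here is a back-of-the-envelope sketch of whether a repository-plus-history payload fits in a single prompt. The roughly four characters per token ratio is a common heuristic for English text and code, not a published figure for Gemini Ultra 3's tokenizer:

```python
# Rough check of whether a payload fits in a 2M-token context window.
# Assumes ~4 characters per token -- a common heuristic, not the
# model's actual tokenizer ratio.

CHARS_PER_TOKEN = 4

def estimated_tokens(num_chars: int) -> int:
    """Crude token estimate from raw character count."""
    return num_chars // CHARS_PER_TOKEN

def fits_in_context(num_chars: int, context_tokens: int = 2_000_000) -> bool:
    """Does a payload of this size fit in the given context window?"""
    return estimated_tokens(num_chars) <= context_tokens

# A ~5 MB codebase plus ~1 MB of issue-tracker history:
payload_chars = 6 * 1024 * 1024
print(estimated_tokens(payload_chars))          # ~1.57M tokens
print(fits_in_context(payload_chars))           # True  (2M window)
print(fits_in_context(payload_chars, 512_000))  # False (512K window)
```

By this estimate the same payload overflows a 512K-token window by roughly a factor of three, which is why the jump reads as a category change rather than an increment.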

The Architecture Behind the Numbers: Mixture-of-Experts at Scale

Gemini Ultra 3 runs on a refined Mixture-of-Experts (MoE) architecture—specifically, 128 expert modules with dynamic routing that activates roughly 16 experts per forward pass. Google DeepMind hasn't released exact parameter counts, which is a notable omission we'll return to. But based on the inference cost data they did publish, and cross-referencing with NVIDIA's H200 cluster configurations DeepMind has publicly acknowledged using, independent estimates put the active parameter count per query somewhere between 300B and 400B, with a total model weight closer to 1.8 trillion parameters.
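The routing idea is simple to sketch even though DeepMind's implementation is not public. The toy forward pass below routes one token vector through the top 16 of 128 experts; the expert networks are stand-in linear maps and every dimension and weight here is illustrative, not taken from the report:

```python
import numpy as np

# Toy top-k Mixture-of-Experts routing: 128 experts, 16 active per
# token, matching the counts in the technical report. Expert networks
# are placeholder linear maps; all sizes are illustrative.

rng = np.random.default_rng(0)
N_EXPERTS, TOP_K, D = 128, 16, 64

W_router = rng.standard_normal((D, N_EXPERTS)) * 0.02
W_experts = rng.standard_normal((N_EXPERTS, D, D)) * 0.02

def moe_forward(x: np.ndarray) -> np.ndarray:
    """Route one token vector x (shape [D]) through its top-k experts."""
    logits = x @ W_router                 # router score per expert
    top = np.argsort(logits)[-TOP_K:]     # indices of the 16 best experts
    weights = np.exp(logits[top] - logits[top].max())
    weights /= weights.sum()              # softmax over the selected experts
    # Weighted combination of only the selected experts' outputs
    return sum(w * (x @ W_experts[i]) for w, i in zip(weights, top))

y = moe_forward(rng.standard_normal(D))
print(y.shape)  # (64,)
```

The point of the sparsity is visible in the last line of `moe_forward`: only 16 of the 128 expert weight matrices are touched per token, which is how a ~1.8T-parameter model can serve queries at a 300B–400B active-parameter cost.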

Dr. Ananya Krishnamurthy, a research scientist at MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL) who focuses on large-scale model efficiency, told us the routing mechanism is where most of the benchmark gains likely originate. "What they appear to have solved, at least partially, is expert collapse—the well-documented tendency of MoE models to over-rely on a small subset of experts during training. If the routing is genuinely uniform under distribution shift, that would explain the unusually strong out-of-domain generalization numbers." She was careful to add that she hasn't seen the full training code. Nobody outside DeepMind has.
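One simple way to see what "expert collapse" looks like numerically is to compare the entropy of the empirical expert-load distribution against the uniform maximum. This is an illustrative diagnostic, not DeepMind's training procedure:

```python
import numpy as np

# Diagnostic for expert collapse: normalized entropy of the expert-load
# distribution. 1.0 means perfectly uniform routing; values near 0 mean
# a few experts absorb all the traffic. Illustrative only.

def load_entropy(expert_counts: np.ndarray) -> float:
    """Shannon entropy (nats) of the empirical expert-load distribution."""
    p = expert_counts / expert_counts.sum()
    p = p[p > 0]                      # ignore unused experts (0 log 0 = 0)
    return float(-(p * np.log(p)).sum())

n_experts = 128
uniform = np.full(n_experts, 100)     # every expert used equally
collapsed = np.zeros(n_experts)
collapsed[:4] = 3200                  # four experts take all the load

max_entropy = np.log(n_experts)
print(load_entropy(uniform) / max_entropy)    # 1.0   (healthy routing)
print(load_entropy(collapsed) / max_entropy)  # ~0.29 (collapse)
```

If Krishnamurthy's reading is right, Gemini Ultra 3's routing stays near the top of that scale even on inputs far from the training distribution, which is exactly the regime where earlier MoE models tended to degrade.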

The model runs on Google's in-house TPU v6 (Trillium) architecture. This matters competitively. NVIDIA's H200 remains the dominant training substrate for OpenAI and Anthropic, and while H200 clusters are extraordinarily capable, Google's vertical integration—designing both the model and the silicon simultaneously—gives DeepMind an optimization surface that external customers of NVIDIA don't have. It's a meaningful structural advantage, not just a marketing point.

Why the Benchmark Skeptics Have a Real Case

Here's where we have to slow down. High benchmark scores on MMLU and MATH-500 carry a contamination problem that the field hasn't fully resolved. Both datasets are old enough (MMLU dates to 2020, MATH-500 to 2021) that their contents are almost certainly present, in various forms, in the web crawls used to train every major frontier model. Google DeepMind states in its technical report that it used "rigorous n-gram deduplication and held-out validation sets," but that methodology has a known blind spot, as Dr. James Okafor, an AI evaluation researcher at Stanford's Human-Centered AI Institute, put it to us bluntly: "N-gram deduplication catches verbatim copies. It doesn't catch paraphrased problem variants, worked solutions posted to tutoring forums, or the conceptual fingerprint of a problem that shows up across a hundred different Stack Exchange posts. We genuinely don't know how much of this is memorization at scale."
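Okafor's objection is easy to demonstrate. A minimal n-gram overlap check (production deduplication pipelines are more sophisticated, but the failure mode is the same) flags a verbatim copy of a benchmark item while letting a paraphrase of the identical problem sail through:

```python
# Why n-gram deduplication misses paraphrases: shared 5-grams flag
# verbatim copies but not reworded variants of the same problem.
# Toy example; real dedup pipelines are more elaborate.

def ngrams(text: str, n: int = 5) -> set:
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlaps(a: str, b: str, n: int = 5) -> bool:
    """True if the two texts share at least one n-gram."""
    return bool(ngrams(a, n) & ngrams(b, n))

benchmark  = "let f be a function such that f of x equals x squared plus one"
verbatim   = "let f be a function such that f of x equals x squared plus one"
paraphrase = "suppose g maps each x to one more than its square"

print(overlaps(benchmark, verbatim))    # True  -> caught and removed
print(overlaps(benchmark, paraphrase))  # False -> survives into training
```

The paraphrase teaches the model the same problem, yet shares no 5-gram with the original, so deduplication based on exact n-gram matches never sees it.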

"The field has been running the same standardized tests for six years. At some point, you have to ask whether we're measuring intelligence or studying for the exam." — Dr. James Okafor, AI Evaluation Researcher, Stanford HAI

This isn't a fringe position. A September 2026 preprint from a team at the University of Edinburgh tested GPT-5 Turbo and Claude 3.9 Opus on a set of 400 novel MMLU-style questions generated by human domain experts who explicitly avoided any internet publication. Both models scored roughly 7–9 percentage points lower than their official MMLU figures. If a similar gap applies to Gemini Ultra 3, its "true" MMLU score might be closer to 87–89%—still excellent, but a different story than 96.4%.
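The arithmetic behind that adjusted range is just the Edinburgh gap subtracted from the reported score; note this is our extrapolation, since the study did not test Gemini Ultra 3 itself:

```python
# Applying the Edinburgh study's 7-9 point contamination gap to
# Gemini Ultra 3's reported MMLU score. Hypothetical extrapolation,
# as in the text -- the study covered GPT-5 Turbo and Claude 3.9 Opus.

reported = 96.4
gap_low, gap_high = 7.0, 9.0

adjusted_high = reported - gap_low    # best case
adjusted_low = reported - gap_high    # worst case
print(f"{adjusted_low:.1f}-{adjusted_high:.1f}%")  # 87.4-89.4%
```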
