Debates over AI performance benchmarks, and how companies present them, have sparked fresh controversy in the tech world.
Disappointing to see the incentives for the grok team to cheat and deceive in evals.

Tl;dr o3-mini is better in every eval compared to grok 3. Grok 3 is genuinely a decent model, but no need to over sell. https://t.co/sJj5ByVikp

— Boris Power (@BorisMPower) February 20, 2025
xAI accused of misrepresenting Grok 3’s performance
An OpenAI employee has accused Elon Musk’s AI company, xAI, of publishing misleading benchmark results for its latest AI model, Grok 3. Igor Babuschkin, a co-founder of xAI, has strongly defended the company’s claims, insisting that the data was reported accurately. However, as is often the case with AI benchmarks, the truth appears more complex.
Completely wrong. We just used the same method you guys used 🤷‍♂️ pic.twitter.com/exLcS0z2xI
— Igor Babuschkin (@ibab) February 20, 2025
The controversy centres on a post on xAI’s blog, where the company shared a graph displaying Grok 3’s performance on AIME 2025, a set of difficult maths questions from the recent American Invitational Mathematics Examination. While some experts question AIME’s suitability as a true measure of AI ability, it is still widely used to assess AI models’ maths skills.
In the graph, xAI claimed that two versions of Grok 3—Grok 3 Reasoning Beta and Grok 3 mini Reasoning—outperformed OpenAI’s best available model, o3-mini-high, on AIME 2025. However, OpenAI employees quickly pointed out that xAI failed to include an important detail: o3-mini-high’s score at “cons@64.”
AIME I 2025: A Cautionary Tale About Math Benchmarks and Data Contamination

AIME 2025 part I was conducted yesterday, and the scores of some language models are available here: https://t.co/WqTCjhXZze thanks to @mbalunovic, @ni_jovanovic et al.

I have to say I was impressed,…

— Dimitris Papailiopoulos (@DimitrisPapail) February 8, 2025
The missing metric: What is cons@64?
The term “cons@64” is short for “consensus@64”: the model is given 64 attempts at each problem, and the answer it produces most often is taken as its final answer. Majority voting of this kind can significantly boost a model’s benchmark score, making it a crucial detail when evaluating AI performance. Omitting it from a comparison graph can create the misleading impression that one model is superior when that may not be the case.
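To make the distinction concrete, here is a minimal Python sketch of how consensus@64 scoring might work, contrasted with single-attempt “@1” scoring. The `ask_model` callable is a hypothetical stand-in for whatever sampling interface a lab uses; this illustrates the voting scheme only and is not xAI’s or OpenAI’s actual evaluation code.

```python
from collections import Counter

def consensus_answer(ask_model, problem, k=64):
    """Sample k answers and return the most frequent one (majority vote)."""
    answers = [ask_model(problem) for _ in range(k)]
    return Counter(answers).most_common(1)[0][0]

def score_cons(ask_model, problems, solutions, k=64):
    """Fraction of problems where the consensus answer is correct (cons@k)."""
    hits = sum(consensus_answer(ask_model, p, k) == s
               for p, s in zip(problems, solutions))
    return hits / len(problems)

def score_at_1(ask_model, problems, solutions):
    """Fraction of problems where a single attempt is correct ("@1")."""
    hits = sum(ask_model(p) == s for p, s in zip(problems, solutions))
    return hits / len(problems)
```

Because majority voting filters out a model’s occasional mistakes, a cons@64 score can sit well above the same model’s @1 score, which is why mixing the two metrics in one chart is misleading.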
When AIME 2025 results are measured at “@1” (the model’s single first attempt at each question), both Grok 3 Reasoning Beta and Grok 3 mini Reasoning scored lower than OpenAI’s o3-mini-high. Grok 3 Reasoning Beta also slightly underperformed OpenAI’s o1 model at its “medium” reasoning setting. Yet xAI has marketed Grok 3 as the “world’s smartest AI.”
The broader issue of AI benchmarks
In response to the criticism, Babuschkin argued on X (formerly Twitter) that OpenAI has also presented benchmark results in ways that could be considered misleading. However, OpenAI’s past comparisons mainly pitted its own models against one another rather than against rivals from other companies.
Hilarious how some people see my plot as attack on OpenAI and others as attack on Grok while in reality it’s DeepSeek propaganda

(I actually believe Grok looks good there, and openAI’s TTC chicanery behind o3-mini-*high*-pass@"""1""" deserves more scrutiny.) https://t.co/dJqlJpcJh8 pic.twitter.com/3WH8FOUfic

— Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞) (@teortaxesTex) February 20, 2025
To provide a clearer picture, an independent AI researcher compiled a more comprehensive chart showing nearly every model’s performance at cons@64. The chart made for a fairer comparison, but it also highlighted another issue: the computational cost behind these results.
As AI researcher Nathan Lambert pointed out, one of the most critical factors remains unknown: how much computing power, and money, each model required to achieve its best score. This points to a larger concern with AI benchmarks in general: while they can provide useful insights, they often fail to capture a model’s true strengths and weaknesses, leaving room for misinterpretation.
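The scale of that cost is easy to see: a cons@64 run samples every problem 64 times, so its inference bill is roughly 64 times that of an @1 run. A back-of-the-envelope sketch, in which every number is an assumption chosen purely for illustration:

```python
# Back-of-the-envelope inference cost for an AIME-style eval.
# Every number here is an illustrative assumption, not real pricing.
problems = 15                 # AIME has 15 questions
tokens_per_attempt = 10_000   # assumed length of one reasoning trace
usd_per_million_tokens = 5.0  # assumed output-token price

cost_at_1 = problems * tokens_per_attempt * usd_per_million_tokens / 1e6
cost_cons64 = cost_at_1 * 64  # cons@64 samples each problem 64 times

print(f"@1 run:      ${cost_at_1:.2f}")    # $0.75 under these assumptions
print(f"cons@64 run: ${cost_cons64:.2f}")  # $48.00 under these assumptions
```

Real trace lengths and prices differ, but the 64-fold multiplier is inherent to the method, which is why compute cost belongs in any honest benchmark comparison.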