Tuesday, 1 April 2025

Did xAI mislead the public about Grok 3’s benchmarks?

xAI is under scrutiny for allegedly misleading AI benchmark results, with OpenAI employees questioning its claims about Grok 3’s performance.

Debates over AI performance benchmarks—and how they are presented—have sparked controversy in the tech world.

xAI accused of misrepresenting Grok 3’s performance

An OpenAI employee has accused Elon Musk’s AI company, xAI, of publishing misleading benchmark results for its latest AI model, Grok 3. Igor Babuschkin, a co-founder of xAI, has strongly defended the company’s claims, insisting that the data was reported accurately. However, as is often the case with AI benchmarks, the truth appears more complex.

The controversy centres on a post on xAI’s blog, where the company shared a graph displaying Grok 3’s performance on AIME 2025. This benchmark consists of difficult maths questions from a recent invitational mathematics exam. While some experts question AIME’s suitability as a true measure of AI ability, it is still widely used to assess AI models’ maths skills.

In the graph, xAI claimed that two versions of Grok 3—Grok 3 Reasoning Beta and Grok 3 mini Reasoning—outperformed OpenAI’s best available model, o3-mini-high, on AIME 2025. However, OpenAI employees quickly pointed out that xAI failed to include an important detail: o3-mini-high’s score at “cons@64.”

The missing metric: What is cons@64?

The term “cons@64” stands for “consensus@64.” Under this setting, a model gets 64 attempts at each problem, and the answer it gives most often is taken as its final answer. This method can significantly improve a model’s benchmark score, so it matters which setting a reported number uses. Omitting this detail in a comparison graph can give the misleading impression that one model is superior when that may not be the case.
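To make the difference concrete, here is a minimal sketch of how a first-attempt score and a consensus score could be computed from the same set of sampled answers. The function names, the sample answers, and the answer key are all hypothetical, and only four attempts per problem are used for brevity. This is not the evaluation code used by xAI or OpenAI.

```python
from collections import Counter

def consensus_answer(attempts):
    """Return the answer that appears most often among the attempts."""
    # On a tie, Counter keeps the answer that was seen first.
    return Counter(attempts).most_common(1)[0][0]

def score_at_1(attempts_per_problem, answer_key):
    """Fraction of problems where the first attempt is correct."""
    correct = sum(
        attempts[0] == expected
        for attempts, expected in zip(attempts_per_problem, answer_key)
    )
    return correct / len(answer_key)

def score_cons(attempts_per_problem, answer_key):
    """Fraction of problems where the most frequent answer is correct."""
    correct = sum(
        consensus_answer(attempts) == expected
        for attempts, expected in zip(attempts_per_problem, answer_key)
    )
    return correct / len(answer_key)

# Hypothetical data: two problems, four sampled answers each.
attempts_per_problem = [
    ["204", "204", "113", "204"],  # first attempt is already correct
    ["70", "588", "588", "588"],   # first attempt is wrong, majority is right
]
answer_key = ["204", "588"]

print(score_at_1(attempts_per_problem, answer_key))
print(score_cons(attempts_per_problem, answer_key))
```

In this toy data the same model gets half the problems right on its first attempt but all of them under consensus voting, which is why a chart should state which setting each bar uses.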

On AIME 2025 at “@1” (which counts only the model’s first attempt at each question), both Grok 3 Reasoning Beta and Grok 3 mini Reasoning scored lower than OpenAI’s o3-mini-high. Grok 3 Reasoning Beta also slightly underperformed OpenAI’s o1 model running at a “medium” computing setting. Even so, xAI has marketed Grok 3 as the “world’s smartest AI.”

The broader issue of AI benchmarks

In response to the criticism, Babuschkin argued on X (formerly Twitter) that OpenAI has also presented benchmark results in ways that could be considered misleading. However, OpenAI’s past comparisons mainly involved its own models rather than direct competition with other companies.

To provide a clearer picture, an independent AI researcher compiled a more comprehensive chart showing nearly every model’s performance at cons@64. The chart allowed a more accurate comparison but highlighted another issue: the computational cost behind these results.

As AI researcher Nathan Lambert pointed out, one of the most critical factors remains unknown: the amount of computing power and money required for each model to achieve its highest score. This raises a larger concern about AI benchmarks in general. While they can provide useful insights, they often fail to fully capture a model’s true strengths and weaknesses, leaving room for misinterpretation.
