Meta's new AI model, Maverick, made headlines after it climbed to the second spot on LM Arena, a popular AI performance leaderboard where human reviewers compare and rate responses from various models. At first glance, this looks like a major success. But look closer and the story is not as clear-cut as it seems.
The version of Maverick that earned high marks in the LM Arena rankings isn't the same version that developers like you can access today. This has raised questions across the AI community about fairness, transparency, and how benchmark results are presented.
The LM Arena model is not the one you get
Meta clearly stated in its announcement that the Maverick model submitted to LM Arena was an "experimental chat version." The company went further on the official Llama website, revealing that the version tested was "Llama 4 Maverick optimised for conversationality."
In other words, Meta fine-tuned a special version of the model to perform better in chat-style interactions, which naturally gives it an edge in a test like LM Arena, where human reviewers prefer smooth, engaging conversations.
But here's the issue: this version isn't available to developers. The model you can download and use is a more standard, general-purpose version of Maverick, often called the "vanilla" variant. That means you're not getting the same results that earned Meta a top spot on the leaderboard.
Why this matters to developers
Why does this difference matter? After all, companies often tweak their products for marketing purposes. But when it comes to AI models, benchmarks like LM Arena help developers, researchers, and businesses decide which model to use.
If a company releases one version of a model for testing but provides a less capable version to the public, it skews expectations. You could end up basing your development plans on results that the model you actually get can't match.
Some researchers on X (formerly Twitter) have even pointed out that the public version of Maverick behaves noticeably differently from the LM Arena one. It doesn't use emojis as often, and its answers tend to be shorter and less conversational. These are clear signs that the two models are not the same.
Okay Llama 4 is def a littled cooked lol, what is this yap city pic.twitter.com/y3GvhbVz65
— Nathan Lambert (@natolambert) April 6, 2025
Benchmark results should reflect real-world use
The bigger concern here is about how benchmarks are used. Many in the AI field already agree that LM Arena isn't perfect. It's a valuable tool, but it doesn't always provide a full or fair picture of what a model can do in every situation.
Most companies have avoided tuning their models specifically to score better on LM Arena, or at least they haven't made it public if they have. Meta's decision to test a customised version and promote its ranking without making the same model widely available sets a worrying precedent.
for some reason, the Llama 4 model in Arena uses a lot more Emojis

on together.ai, it seems better: pic.twitter.com/f74ODX4zTt

— Tech Dev Notes (@techdevnotes) April 6, 2025
Benchmarks should help you understand a model's strengths and weaknesses across a range of tasks, not just how well it performs in one specific setup. When companies tailor their models to game these benchmarks, it can lead to confusion and disappointment.
If you plan to use Maverick, remember that the version Meta showcased isn't precisely what you'll get. It is more important to test models against your own use cases than to rely too heavily on leaderboard rankings.
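As a rough illustration, here is a minimal sketch of what that kind of check could look like: running a handful of prompts from your own workload against the publicly available Maverick model through an OpenAI-compatible API (Together AI exposes one, for example). The base URL, model ID, and prompts below are placeholders, not values confirmed by Meta or any provider.

# Minimal sketch: evaluate the public Maverick release on your own prompts
# instead of trusting leaderboard rankings. Assumes an OpenAI-compatible
# endpoint; the base URL and model ID are placeholders to swap for the
# values your provider documents.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.together.xyz/v1",  # hypothetical provider endpoint
    api_key="YOUR_API_KEY",                  # placeholder credential
)

# Prompts drawn from your actual use case, not generic chat
prompts = [
    "Summarise this bug report in two sentences: ...",
    "Write a SQL query that returns monthly active users from an events table.",
]

for prompt in prompts:
    response = client.chat.completions.create(
        model="llama-4-maverick",  # placeholder model ID
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,
    )
    answer = response.choices[0].message.content
    # Inspect length, tone, and correctness against what your application needs
    print(f"--- {prompt[:40]}\n{answer}\n")

Even a small, hand-picked prompt set like this tells you more about how the model will behave in your product than a chat-optimised leaderboard entry can.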