
Meta’s plain Maverick AI model falls behind in benchmark rankings

Meta’s plain Maverick AI model underperforms against rivals after benchmark controversy, raising concerns over testing fairness.

Earlier this week, you might have noticed that Meta found itself in hot water over how it used its Maverick AI model. The tech giant had submitted an experimental version of the model, one that hasn’t been released to the public, to a popular AI benchmark platform called LM Arena. The goal was to show off high performance, but the move didn’t sit well with many in the community.

As a result, LM Arena’s team issued an apology and quickly updated their policies. They also decided to score only the original, unmodified version of Meta’s Maverick going forward. As it turns out, that version doesn’t perform very well compared to its competitors.

Meta’s vanilla Maverick doesn’t impress

When LM Arena ranked the model’s basic version, “Llama-4-Maverick-17B-128E-Instruct”, it ended up below several other major AI models. These included OpenAI’s GPT-4o, Anthropic’s Claude 3.5 Sonnet, and Google’s Gemini 1.5 Pro. Most of those models have already been around for some time, which makes Maverick’s lower performance even more noticeable.

So, why did the experimental version do so well while the standard one lagged behind? Meta explained that the version they originally submitted—“Llama-4-Maverick-03-26-Experimental”—was tuned specifically to work well in conversations. This fine-tuning for “conversationality” likely helped the model perform better on LM Arena.

That’s because LM Arena relies on human reviewers to compare model responses and decide which is better. A model made to chat smoothly is more likely to be chosen in this kind of comparison, even if it’s not smarter or more useful in other ways.
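Arena-style leaderboards generally turn those head-to-head votes into Elo-style ratings, so a model that keeps winning chats climbs the table regardless of why it wins. The Python sketch below illustrates that mechanic in its simplest form; it is not LM Arena’s actual scoring code, and the starting ratings and K-factor are arbitrary choices for the example.

# Simplified sketch of Elo-style rating updates from pairwise human votes.
# Illustrates the general arena mechanic only, not LM Arena's implementation.

K = 32  # update step size; an arbitrary choice for this sketch

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def apply_vote(rating_a: float, rating_b: float, a_won: bool) -> tuple[float, float]:
    """Apply one human vote; A gains rating when it wins more than expected."""
    exp_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    return (rating_a + K * (score_a - exp_a),
            rating_b + K * ((1.0 - score_a) - (1.0 - exp_a)))

# Hypothetical votes: a chat-tuned model that charms reviewers climbs fast.
chatty, rival = 1000.0, 1000.0
for chatty_won in (True, True, False, True):
    chatty, rival = apply_vote(chatty, rival, chatty_won)
print(round(chatty), round(rival))  # chatty ends above its starting rating

The point of the sketch is that the rating records only who won each vote, not whether the winning answer was more accurate or more useful, which is exactly the gap a conversationally tuned model can exploit.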

Benchmark games raise questions

This situation has sparked a debate about how benchmarks are used in artificial intelligence. LM Arena, while popular, isn’t always considered a reliable way to measure how good a model is. It’s useful for some things, like comparing how well different models hold a conversation, but it doesn’t give a full picture of how they perform in real-world situations.

There’s also concern about fairness. When a company fine-tunes a model to score higher on a benchmark, it might not reflect how the model behaves in your hands as a developer or user. It can lead to false impressions and make it hard to judge whether a model is useful for your needs.

In short, scoring high on a single benchmark doesn’t always mean a model is better. It just means it’s better at that specific test.

Meta responds and looks ahead

In response to the controversy, a Meta spokesperson said that the company is always testing different versions of its models. In their words, the version that performed well was “a chat-optimised version we experimented with that also performs well on LM Arena.”

Meta has now released the open-source version of Llama 4, which includes the plain Maverick model. The company looks forward to seeing how developers customise the model to suit their needs.

“We’re excited to see what they will build and look forward to their ongoing feedback,” the spokesperson said.

Now that the original version is out in the wild, it’s up to the wider AI community to test it and see whether it can shine in areas beyond friendly chats.
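For developers who want to put that to the test, the plain checkpoint can be loaded with a recent release of Hugging Face Transformers. The snippet below is a minimal sketch rather than an official example: the repo ID simply mirrors the model name quoted above and should be confirmed on Meta’s Hugging Face organisation page, and access to the weights is gated behind Meta’s licence.

# Minimal sketch: chatting with the plain Maverick checkpoint via
# Hugging Face Transformers (Llama 4 support needs a recent release).
# The repo ID mirrors the model name quoted in the article and is an
# assumption; confirm it on the meta-llama organisation page first.
from transformers import pipeline

chat = pipeline(
    "text-generation",
    model="meta-llama/Llama-4-Maverick-17B-128E-Instruct",  # assumed repo ID
    device_map="auto",  # spread the very large model across available devices
)

messages = [
    {"role": "user", "content": "Briefly explain mixture-of-experts models."},
]
result = chat(messages, max_new_tokens=120)
# For chat-style input, generated_text holds the conversation with the
# model's reply appended as the final message.
print(result[0]["generated_text"][-1]["content"])

Bear in mind that Maverick is a large mixture-of-experts model, so running it locally demands serious hardware; hosted endpoints are the more realistic route for most developers.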
