Arena.ai

3.1K posts

@arena

Where AI meets the real world. Formerly LMArena. We measure and advance the frontier of AI through community-driven evaluation. We’re hiring → https://t.co/XBZCrseaWF

US · Joined March 2023
212 Following · 158.2K Followers
Pinned Tweet
Arena.ai@arena·
LMArena is now Arena. A name that takes us back to our roots with a powerful mission: to measure and advance the frontier of AI for real-world use. We have grown from a small PhD research project to a platform powered by a global community of millions. This rebrand has been shaped by the people who use it. 👇 Take a look inside the rebrand.
79
101
1.1K
272.3K
Arena.ai@arena·
Millions of votes a week. One tagging system.
Arena researchers Guanglei Song and I-Hung Hsu walk through the data pipeline behind Arena's category leaderboards: Databricks → Spark → a pluggable tagger framework calling LLMs to categorize every evaluation across our text, image, frontend coding, and other arenas. This metadata layer is what makes Arena data useful for research beyond just leaderboard rankings.
0:00 How Arena collects evaluation data
1:50 Pipeline architecture: Databricks and hourly Spark jobs
2:35 The pluggable tagger framework
4:35 Handling flaky LLM APIs with dynamic concurrency control
6:30 Adding new taggers without rebuilding the system
7:30 Backfilling history alongside the live stream
9:10 Cost control: filtering, idempotency, and model selection
11:10 Chunking long messages
2
1
20
2K
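The chapter list above mentions a pluggable tagger framework, idempotency for cost control, and chunking of long messages. As a rough illustration only, those three ideas could be combined as in the sketch below. Everything in it is an assumption made for the example, not Arena's actual code: the `TAGGERS` registry, the `register` decorator, the `chunk`, `cache_key`, and `run_taggers` helpers, the character-based chunk size, and the keyword-based `is_coding` tagger that stands in for an LLM call.

```python
import hashlib
from typing import Callable, Dict, List

# Hypothetical registry: tagger name -> function mapping text to a label.
TAGGERS: Dict[str, Callable[[str], str]] = {}


def register(name: str):
    """Decorator that plugs a new tagger into the registry."""
    def wrap(fn: Callable[[str], str]) -> Callable[[str], str]:
        TAGGERS[name] = fn
        return fn
    return wrap


def chunk(text: str, max_chars: int = 2000) -> List[str]:
    """Split a long message into fixed-size pieces (always at least one)."""
    pieces = [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    return pieces or [""]


def cache_key(tagger: str, text: str) -> str:
    """Stable key so a re-run or backfill never pays for the same input twice."""
    return hashlib.sha256(f"{tagger}\x00{text}".encode("utf-8")).hexdigest()


@register("is_coding")
def is_coding(text: str) -> str:
    # Stand-in for an LLM call; a real tagger would query a model here.
    return "coding" if "def " in text or "function" in text else "other"


def run_taggers(text: str, store: Dict[str, str],
                max_chars: int = 2000) -> Dict[str, List[str]]:
    """Apply every registered tagger to every chunk, skipping cached work."""
    results: Dict[str, List[str]] = {}
    for name, fn in TAGGERS.items():
        labels = []
        for piece in chunk(text, max_chars):
            key = cache_key(name, piece)
            if key not in store:          # idempotency: tag only unseen input
                store[key] = fn(piece)
            labels.append(store[key])
        results[name] = labels
    return results
```

Adding a tagger in this sketch means writing one more decorated function. The `store` dictionary stands in for whatever persistent table a real pipeline would use to remember finished work.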
Arena.ai@arena·
According to @tryramp, @AnthropicAI just overtook OpenAI in business customers (34.4% vs 32.3% this week). In the Text Arena, that flip happened in Q4 2025. Real-world signal led enterprise adoption by ~6 months. But the picture shifts fast: Codex crossed 3M+ weekly developers in April, up 5x since January (per @OpenAI). Coding agents are their own contest. More from Arena shortly.
19
24
285
32.9K
Arena.ai@arena·
US vs China update. Stanford's AI Index put the US–China gap at 2.7%. Here's what two years of real-world use from the Text Arena shows. Gap three years ago: +278. Today: +29. @AnthropicAI's Claude Opus 4.6 Thinking vs. Baidu's @ErnieforDevs Ernie 5.1 at the top. The US has never lost #1, but the race keeps closing.
23
44
361
48.6K
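To give a feel for what a score gap of +278 versus +29 could mean head to head, here is a minimal sketch. It assumes the gaps are on an Elo-style logistic scale (base 10, scale 400), which the tweet itself does not state; `expected_win_rate` is a hypothetical helper name.

```python
def expected_win_rate(gap: float, scale: float = 400.0) -> float:
    """Expected win rate of the higher-rated model, assuming an Elo-style
    logistic curve with base 10. The scale of 400 is an assumption."""
    return 1.0 / (1.0 + 10.0 ** (-gap / scale))


for gap in (278, 29):
    print(f"gap {gap:+d}: expected win rate {expected_win_rate(gap):.1%}")
```

Under that assumption, a +278 gap corresponds to the leader winning roughly 83% of decisive matchups, while a +29 gap is roughly 54%, close to a coin flip.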
EM@edwin_mccallum·
@arena @AnthropicAI @GoogleDeepMind Muse Spark being ahead of GPT-5.5 in coding doesn't make sense unless it's the coding from the chat answers (which I'm guessing is what it is)
1
0
14
1.4K
Arena.ai@arena·
The top 5 labs in Text Arena rankings by category show that frontier models have distinct strengths and tradeoffs.
#1 @AnthropicAI, Claude Opus 4.7 - The most consistently dominant model overall, leading or top-tier across nearly every major category.
#2 @GoogleDeepMind, Gemini 3.1 Pro - Well-rounded, with a notable edge in Creative Writing, but ranked below Opus 4.7 and GPT-5.5 High in Expert.
#3 @AIatMeta, Muse Spark - Particularly strong in Overall and Coding, though it lags in Expert tasks, Math, and Longer Query performance.
#4 @OpenAI, GPT-5.5 High - One of the most balanced models overall, staying competitive with the top two across most categories, with especially strong performance in Expert and Math.
#5 @xAI, Grok 4.20 - A more specialized profile, standing out primarily in Creative Writing and Hard Prompts, while lagging in Expert tasks.
54
75
583
88.6K
Arena.ai@arena·
Full category ranking breakdown in Text Arena. Claude-Opus-4.7-thinking is the only model ranked top-5 across all categories.
5
7
50
8.6K
Arena.ai@arena·
GPT-5.5 Instant by @OpenAI is in ChatGPT and has landed on Arena across multiple leaderboards. Here’s how it ranks by modality:
- Vision Arena: #11 overall, on par with Claude-Sonnet-4.6
- Text Arena: #18 overall, Multi-Turn #5
- Occupational: #5 in Life, Physical & Social Science; #9 in Legal & Government
- Document Arena (analysis & long-content reasoning): #24, on par with GPT-5.2
Congrats again to @OpenAI on this rollout!
OpenAI@OpenAI

GPT-5.5 Instant is starting to roll out in ChatGPT. It’s a big upgrade, giving you smarter, clearer, and more personalized answers in a warmer, more natural tone. And it's also more concise, which we heard you wanted. We think you'll love chatting with it.

25
26
550
93.7K
Arena.ai@arena·
Introducing 7 new leaderboard views for frontend output in Code Arena.
Aggregate leaderboards don’t tell the full story. "Best frontend coding model" depends on what you're building, so we built leaderboards that show exactly that.
After analyzing 250,000+ Code Arena prompts, we identified the major frontend web development task categories:
- Brand & Marketing
- Reference-Based Design
- Data & Analytics
- Consumer Product
- Gaming
- Simulations
- Content Creation Tools
With this release, @AnthropicAI is a big winner, with at least one model in the top 4 spots in all 7 categories. But there’s more to the story in the margins. Dig into the thread to see exactly which models are currently on top of each domain.
16
14
149
12.9K
Arena.ai@arena·
Ernie-5.1 by Baidu’s @ErnieforDevs has landed as #4 in the Search Arena! This makes Baidu a top 3 lab in Search performance, and Ernie-5.1 the only Chinese model in the top 10 overall. Congrats to the @ErnieforDevs team on this accomplishment!
ERNIE for Developers@ErnieforDevs

Another update from @arena 👀 ERNIE 5.1 is now ranked #4 in Search Arena — making ERNIE one of the top-performing labs in Search and currently the only Chinese model in the Top 10. Official release coming very soon 🚀

26
82
286
36K
Arena.ai@arena·
Gemma-4 lands in Vision Arena as #2 & #4 open models, and shifts the Pareto frontier!
@GoogleDeepMind dominates the price-performance Pareto in Vision across both proprietary and open models.
- Gemma-4-31b ranks #2 open (#20 overall)
- Gemma-4-26b-a4b ranks #4 open (#26 overall)
The Vision Arena ranks multimodal AI models capable of reasoning over visual inputs. Congrats to @GoogleDeepMind again on the open model progress!
Google DeepMind@GoogleDeepMind

Meet Gemma 4: our new family of open models you can run on your own hardware. Built for advanced reasoning and agentic workflows, we’re releasing them under an Apache 2.0 license. Here’s what’s new 🧵

10
32
311
34.1K
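For readers unfamiliar with the term, a price-performance Pareto frontier is the set of models that no other model beats on both price and score at once. The sketch below shows one common way to compute it. The `pareto_frontier` helper and the model names, prices, and scores in the usage note are made up for illustration and are not Arena data.

```python
from typing import List, Tuple

# (name, price, score): lower price is better, higher score is better.
Model = Tuple[str, float, float]


def pareto_frontier(models: List[Model]) -> List[Model]:
    """Return the models not dominated on both price and score."""
    frontier: List[Model] = []
    best_score = float("-inf")
    # Walk from cheapest to most expensive; on price ties, best score first.
    for model in sorted(models, key=lambda m: (m[1], -m[2])):
        if model[2] > best_score:   # beats every cheaper-or-equal model
            frontier.append(model)
            best_score = model[2]
    return frontier
```

With hypothetical inputs `[("a", 1.0, 1200), ("b", 2.0, 1150), ("c", 3.0, 1300), ("d", 0.5, 1100)]`, model `b` drops out because `a` is both cheaper and higher scoring, leaving `d`, `a`, and `c` on the frontier. A new model "shifts" the frontier when it joins this set or displaces a member of it.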