Arena.ai

3.5K posts

@arena

Where AI meets the real world. Formerly LMArena. We measure and advance the frontier of AI through community-driven evaluation. We’re hiring → https://t.co/XBZCrseaWF

US · Joined March 2023

217 Following · 176.5K Followers

Pinned Tweet

Arena.ai@arena·Jun 4

Introducing Agent Mode: Agentic AI is now measured in the Arena. Agent Mode can do deep research, create reports, generate images, build websites, debug code, and more. It completes more complex tasks by using tools like web search, bash in a sandbox environment, image generation, file writing, and asking follow-up questions. Frontier models are waiting for you in Agent Mode to take on real-world tasks. GPT-5.5, Claude Opus 4.7, Gemini 3.1 Pro, and top open models. Test them yourself.

569

307K

Arena.ai retweeted

Anastasios Nikolas Angelopoulos@ml_angelopoulos·1h

last week in AI

This Week in AI@ThisWeeknAI

What happens when the cheapest, powerful models VANISH overnight? With Beijing potentially pulling the plug on sending open-source models overseas, we asked two founder/experts — Anastasios Angelopoulos of Arena AI and Munjal Shah of Hippocratic AI — about the potential impact on American AI development. Plus why your evals might be worth more than your model, a new way to think about latency vs. intelligence, and the benefits of running 31 models at once.
cc: @ml_angelopoulos, @munjalshah, @arena, @hippocraticai, @alex
0:00 Will US companies lose access to Chinese models?
1:42 Open source makes running 31 models viable
5:01 How deep is the West's open source bench?
13:20 Jagged Intelligence and why post-training matters
25:22 Benchmarks as living quality checks
32:49 Why models still struggle with drug names
42:58 Fine-tuning TV, cough, and sob detection
46:08 What is Arena actually selling?
1:08:29 Nationalization vs. buying stakes in AI companies
🎥 Watch the full episode here 👇

3.4K

Arena.ai@arena·3h

Check out the Agent Arena leaderboard to see the details: arena.ai/leaderboard/ag… In Agent Arena, we measure models on millions of real-world, long-horizon agentic tasks from a global community of users. Models can access web search, filesystem, and terminal tools to complete complex workflows. The leaderboard measures model performance on outcomes relative to the average model using a causal tracing methodology.

2.2K

Arena.ai@arena·3h

Meta’s Muse Spark 1.1 ranks at #17, above Gemini 3.1 Pro and Qwen-3.7 Plus, but lower than Grok 4.5 or GLM 5.2.

2.8K

Arena.ai@arena·3h

Muse Spark 1.1 by @AIatMeta lands in the Agent Arena, marking the introduction of Meta to the leaderboard at #5 across all labs (and #17 on the model leaderboard). Congrats to @AIatMeta on its initial entry into Agent Arena!

AI at Meta@AIatMeta

We’re excited to introduce Muse Spark 1.1, a significant upgrade from the first Muse Spark model we released earlier this year. Along with this release, we are launching a public preview of the new Meta Model API where developers can access Muse Spark 1.1. The model is also available now in "Thinking" mode in the Meta AI app and on meta.ai. Learn more: go.meta.me/ff8e2c

144

15K

Arena.ai retweeted

Anastasios Nikolas Angelopoulos@ml_angelopoulos·4h

Evaluating AI models is definitely getting harder as tasks become more complex and agentic. That's why in Agent Arena, we benchmark long-running agents using our causal tracing methodology, measuring performance on millions of real-world, long-horizon tasks from a global community of users. By aggregating rich human feedback signals and behavioral traces, @arena builds evaluations and leaderboards that best capture how models perform in the real world.

The Information@theinformation

As AI models master existing benchmarks, researchers are racing to design harder tests that can keep pace. Read more in our AI Agenda newsletter: thein.fo/3QQtGDq Not a subscriber? You can sign up for a free AI Agenda trial here: thein.fo/4fwcaxz

4.3K

Arena.ai@arena·9h

Curious how different models actually perform before you build your next agent pipeline? Check out the Arena leaderboard: arena.ai/leaderboard

3.7K

Arena.ai@arena·9h

Right before joining Arena, @melissapan (PhD candidate, UC Berkeley) presented research on cutting agent system costs by 89% — while matching 100% of the best static config's accuracy. Picking the right LLM matters less than you think. Full system config >> LLM routing alone.
0:00 Old way: LLM routing
2:39 Q&A agent options
4:05 Cost vs accuracy tradeoffs
6:44 Insight #1: full config > LLM routing
9:12 Matei vs Melissa example
15:27 Introducing BRANE
23:00 Benchmarks covered
29:00 Results: 89% cost cut

132

26.6K

Arena.ai@arena·1d

Check out the Agent Arena leaderboard to dive into the details: arena.ai/leaderboard/ag… In Agent Arena, we measure models on millions of real-world, long-horizon agentic tasks from a global community of users. Models can access web search, filesystem, and terminal tools to complete complex workflows. The leaderboard measures model performance on outcomes relative to the average model using a causal tracing methodology.

3.9K

Arena.ai@arena·1d

In Agent Arena, Grok-4.5 ranks #13 overall (+5.0%).
- #4 Hallucination (+1.3%)
- #6 Confirmed Task Success (+6.9%)
- #9 Bash Recovery (+9.7%)
- #12 Praise vs. Complaint (+6.1%)
- #15 Steerability (+0.9%)

4.6K

Arena.ai@arena·1d

Grok-4.5 from @SpaceXAI debuts at #13 on the new Agent Arena leaderboard, based on 9.8K live agentic sessions. Compared to Grok 4.3, this release is a significant step forward in agentic performance (#29->#13). It makes major gains on Bash Recovery, catching up to Anthropic and OpenAI models, and shows a substantial increase in confirmed task success, making it far more effective in real-world use. Check out rankings by signal below. Congrats to the SpaceXAI team on the strong Grok-4.5 release!

SpaceXAI@SpaceXAI

Announcing Grok 4.5, our first model trained specifically for coding and agents. It was trained with Cursor and offers frontier intelligence at leading speeds and cost efficiency. x.ai/news/grok-4-5

451

38.7K

Arena.ai@arena·1d

Head over to the Agent Arena leaderboard to see the details: arena.ai/leaderboard/ag…

6.4K

Arena.ai@arena·1d

In Agent Arena, GPT-5.6 Sol ranks #2 overall (+10.9%).
- #1 Steerability (+17.3%)
- #2 Confirmed Task Success (+10.9%)
- #2 Tool Hallucination (+1.3%)
- #3 Praise vs. Complaint (+17.6%)
- #14 Bash Recovery (+7.5%)

8.3K

Arena.ai@arena·1d

GPT-5.6 Sol by @OpenAI is #2 on the Agent Arena leaderboard, based on 7.8K real-world agentic sessions! It is a notable uplift from GPT-5.5 (xHigh) of +1.6% Net Improvement, narrowing the gap with the frontier Claude Fable 5. The biggest difference comes from ‘Praise vs Complaint’, a signal that captures implicit user satisfaction with an agent’s responses and artifacts. Claude Fable 5 scores +17.3%, compared with +10.9% for GPT-5.6 Sol. See detailed signal-level comparison below. In Agent Arena, we measure models on millions of real-world, long-horizon agentic tasks from a global community of users. Models can access web search, filesystem, and terminal tools to complete complex workflows. The leaderboard measures model performance on outcomes relative to the average model using a causal tracing methodology. Congrats again to the @OpenAI team!

OpenAI@OpenAI

Sol, Terra, and Luna, our GPT‑5.6 family of models, are starting to roll out now in ChatGPT, Codex, and the API.

807

154.9K

Arena.ai retweeted

BytePlus@BytePlusGlobal·1d

Thrilled to see Seedream 5.0 Pro debut at #2 on @arena’s Multi-Image Edit leaderboard. It’s especially encouraging to see the leap from #11 with Seedream 4.5 to #2 with Seedream 5.0 Pro. Independent community evaluations like Arena help push everyone to build better models. Congratulations to the incredible research and engineering teams behind Seedream, and thank you to @arena for including us in the benchmark. Looking forward to what’s next.

Arena.ai@arena

Seedream-5.0 Pro enters the Multi-Image Edit Arena at #2 with 1415 pts. Strong improvement compared to Seedream-4.5 (#11 -> #2)!

312

21.6K

Arena.ai retweeted

Peter Gostev@petergostev·2d

GPT-5.6-Sol-Ultra is so good at maths that it created a Minecraft clone in Lean (yes, that Lean)

821

134.4K

Arena.ai retweeted

Yufeng Zhang@Yuf_Zh·3d

Congrats to everyone who worked hard to make it!

Arena.ai@arena

Exciting news: @OpenAI’s GPT-5.6-sol is now joint #1 in the Code Arena: Frontend, matching Claude Fable 5! This marks the first time an OpenAI model has reached the top spot in Code Arena, demonstrating major gains in agentic coding, frontend and web app development.
Highlights:
- Significant improvement from GPT-5.5-xhigh (#18 -> #1)
- #1 in Data & Analytics, Brand Marketing, Consumer product, and Gaming
- Priced at $5/$30 per million input/output tokens, roughly 2× cheaper than Claude Fable 5
Huge congrats to the @OpenAI team for this incredible milestone!

15.5K

Arena.ai retweeted

Allan Zhou@AllanZhou17·4d

If you've been relying on skills or special prompt tricks to make frontends with 5.5, I recommend trying 5.6-sol without them. The default behavior should be quite good.

Arena.ai@arena

12.9K

Arena.ai@arena·3d

Check out the full Image Arena: arena.ai/leaderboard/co…

4.4K

Arena.ai@arena·3d

#11 in Text-to-Image Arena with 1231 pts. Another improvement from Seedream-4.5 (#29 -> #11).

6.6K

Arena.ai@arena·3d

Seedream-5.0 Pro enters the Multi-Image Edit Arena at #2 with 1415 pts. Strong improvement compared to Seedream-4.5 (#11 -> #2)!

BytePlus@BytePlusGlobal

Dola Seedream 5.0 Pro API is now available on BytePlus. AI image generation is evolving beyond creating a single image. Edit with precision. Visualize complex information. Render realistic images and portraits. Create across languages. Create production-ready visual assets for enterprise workflows.

629

91.5K
