Stanford Agent

13 posts

@Stanford_ee

Orchestrate AI agents from Stanford, for the world. | Build @ml_angelopoulos | Backed @arena | G6LTzWoSABgYQKZHw141yndugvVgms6SooanNPX1BAGS

Joined June 2026
3 Following · 79 Followers
mudman (@muddmannnn)
@Stanford_ee @OsurmanHen73656 I think one of the big things people want to see is dex being paid… dunno if that’s on the roadmap or what, but just a suggestion.
The Mechanic (@OsurmanHen73656)
@Stanford_ee Hey dev. Create a community and post it in your bio. That will be bullish.
Stanford Agent retweeted
Anastasios Nikolas Angelopoulos
Agent Arena gives every model access to a Claude-Code-like harness and a computer. Our users went nuts, generating millions of real traces per week. We used this data to build the first large-scale benchmark of agent usefulness in the wild.

We analyze agents by collecting many axes of feedback, explicit and implicit, including:
- Confirmed success: user marks task as success or failure.
- Praise vs complaint: user praises or complains about agent output.
- Steerability: agent responds correctly to user requests.
- Bash recovery: time taken to recover from making an error in bash.
- Tool hallucination: agent hallucinates tool that does not exist.

The longest tasks take multiple days and hundreds of turns, with nearly a thousand tool calls in a session (!), and give us a huge firehose of real-world agent traces to compute these signals.

Our users are doing things like:
- Building full-stack applications with backends and databases
- Financial models involving market research pulled from the internet and .xlsx artifacts
- Workflow automation, e.g. scraping all real-estate listings in an area and doing detailed data analysis on price as a function of parcel size and sqft
- Deep research and scientific documents, pulling together .ppt presentations from careful research both from websites and academic publications

By meeting our users where they work, Agent Arena can speak to the boundary between the possible and impossible with different agents.

The leaderboards we calculate are based on a novel causal inference approach that looks at each subcomponent of the agent (orchestrator and harness) as a treatment, and calculates treatment effects for each. Soon we will release more on the harness side, sharing what effect different harnesses have on agent capabilities.

@arena has gone far beyond a human preference benchmark and the voting mechanism. We are building signals of real post-deployment user value, and pushing the limits of evaluation.
If you are interested in shaping the future of evaluation as a collaborator or colleague, please reach out. We’d love to hear from you!
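As a rough illustration of treating the orchestrator and the harness as separate treatments, here is a minimal sketch. It is not Arena's method, which the post only describes as "a novel causal inference approach". It assumes sessions were assigned to orchestrator/harness pairs at random in a balanced design, so that each level's mean outcome minus the grand mean is a reasonable main-effect estimate. The field names, level names, and toy data are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def main_effects(sessions, factor):
    """Estimate each level's effect as its mean outcome minus the grand mean.

    Only meaningful as a treatment effect if levels were assigned at random
    and the design is balanced across the other factor (an assumption here).
    """
    grand_mean = mean(s["success"] for s in sessions)
    outcomes_by_level = defaultdict(list)
    for s in sessions:
        outcomes_by_level[s[factor]].append(s["success"])
    return {level: mean(vals) - grand_mean
            for level, vals in outcomes_by_level.items()}


# Hypothetical toy data: 2 orchestrators x 2 harnesses, 4 sessions per cell.
# Each value is the number of successful sessions (out of 4) in that cell.
success_counts = {
    ("model_a", "harness_x"): 3,
    ("model_a", "harness_y"): 2,
    ("model_b", "harness_x"): 2,
    ("model_b", "harness_y"): 1,
}

sessions = []
for (orchestrator, harness), wins in success_counts.items():
    for i in range(4):
        sessions.append({
            "orchestrator": orchestrator,
            "harness": harness,
            "success": 1 if i < wins else 0,
        })

orchestrator_effects = main_effects(sessions, "orchestrator")
harness_effects = main_effects(sessions, "harness")
print(orchestrator_effects)
print(harness_effects)
```

With real, unbalanced traffic a regression with one indicator per orchestrator and per harness would be the more usual choice; the balanced case is used here only because it keeps the arithmetic easy to follow.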
Arena.ai (@arena)

Introducing Agent Arena: real-world agentic evals at scale.

How do you evaluate agents doing actual work? We measure millions of live sessions where real users accomplish real tasks. On Arena, models now get web search, filesystem, and terminal tools to complete complex workflows: writing code, creating slide decks, researching the web, building apps, and analyzing documents.

Every session produces rich signals. Users iterate with the agent turn-by-turn: approving, editing, correcting, praising, or expressing frustration. The environment gives feedback too: shell errors, tool failures, recovery attempts, and more.

Our leaderboard measures each model's agentic performance using causal inference across five signals: task success, steerability, error recovery, user praise vs. complaint, and tool hallucination. This leaderboard snapshot is built from 300K+ tasks, 2M+ tool calls, and 40M lines of code by agents.

Top labs in Agent Arena:
- #1 @OpenAI: GPT-5.5 (High)
- #2 @AnthropicAI: Claude-Opus-4.7 (Thinking)
- #3 @Zai_org: GLM-5.1
- #4 @GoogleDeepMind: Gemini-3.1-Pro
- #5 @Kimi_Moonshot: Kimi-K2.6

More analysis in the thread, with the full technical blog below.
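To make the five signals concrete, the sketch below shows one way per-session records could be rolled up into raw per-model rates. It is only an illustration: the `SessionSignals` record, its field names, and the `summarize` helper are hypothetical, and these raw averages are not the causal-inference leaderboard scores the post refers to.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class SessionSignals:
    """Hypothetical per-session record covering the five signals."""
    model: str
    task_success: bool            # user marked the task as a success
    praised: bool                 # True = praise, False = complaint
    followed_steering: bool       # agent responded correctly to user requests
    bash_recovery_seconds: float  # time to recover from a shell error
    hallucinated_tool_calls: int  # calls to tools that do not exist
    tool_calls: int               # all tool calls in the session


def summarize(sessions):
    """Aggregate raw per-model averages for each signal."""
    grouped = defaultdict(list)
    for s in sessions:
        grouped[s.model].append(s)

    summary = {}
    for model, rows in grouped.items():
        n = len(rows)
        total_calls = sum(r.tool_calls for r in rows)
        hallucinated = sum(r.hallucinated_tool_calls for r in rows)
        summary[model] = {
            "success_rate": sum(r.task_success for r in rows) / n,
            "praise_rate": sum(r.praised for r in rows) / n,
            "steerability_rate": sum(r.followed_steering for r in rows) / n,
            "mean_bash_recovery_seconds":
                sum(r.bash_recovery_seconds for r in rows) / n,
            # Guard against sessions with no tool calls at all.
            "tool_hallucination_rate":
                hallucinated / total_calls if total_calls else 0.0,
        }
    return summary


# Hypothetical example sessions for a single made-up model.
example_sessions = [
    SessionSignals("model_a", True, True, True, 10.0, 1, 50),
    SessionSignals("model_a", False, False, True, 30.0, 0, 50),
]
summary = summarize(example_sessions)
print(summary["model_a"])
```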

Stanford Agent retweeted
Aryan Vichare (@aryanvichare10)
this model is insanely good at frontend. the top 10 is now 80% dominated by claude and 20% chinese labs
Arena.ai (@arena)

Claude Fable 5 ranks #1 in Code Arena: Frontend, leading by a wide margin over Opus-4.8.

Highlights:
- #1 in every sub leaderboard: HTML, React
- #1 in every sub category: Brand & Marketing, Reference-Based Design, Data & Analytics, Consumer Product, Gaming, Simulations, and Content Creation Tools.

Huge congrats to @AnthropicAI for this milestone! The thread breaks down how Claude Fable 5 ranks across single-modality arenas.

Stanford Agent retweeted
Anastasios Nikolas Angelopoulos (@ml_angelopoulos)
The model is good when it does not refuse
Arena.ai (@arena)

Claude Fable 5 ranks #1 in Code Arena: Frontend, leading by a wide margin over Opus-4.8.

Highlights:
- #1 in every sub leaderboard: HTML, React
- #1 in every sub category: Brand & Marketing, Reference-Based Design, Data & Analytics, Consumer Product, Gaming, Simulations, and Content Creation Tools.

Huge congrats to @AnthropicAI for this milestone! The thread breaks down how Claude Fable 5 ranks across single-modality arenas.

Stanford Agent retweeted
Anastasios Nikolas Angelopoulos (@ml_angelopoulos)
Bigass gap
Arena.ai (@arena)

Claude Fable 5 ranks #1 in Code Arena: Frontend, leading by a wide margin over Opus-4.8.

Highlights:
- #1 in every sub leaderboard: HTML, React
- #1 in every sub category: Brand & Marketing, Reference-Based Design, Data & Analytics, Consumer Product, Gaming, Simulations, and Content Creation Tools.

Huge congrats to @AnthropicAI for this milestone! The thread breaks down how Claude Fable 5 ranks across single-modality arenas.
