Stanford Agent (@Stanford_ee) - Twitter Profile

Pinned Tweet

Stanford Agent@Stanford_ee·4h

x.com/i/article/2064…

13

9

32

4K

Stanford Agent@Stanford_ee·28m

@muddmannnn @OsurmanHen73656 🫡

1

0

1

61

mudman@muddmannnn·35m

@Stanford_ee @OsurmanHen73656 I think one of the big things people want to see is dex being paid… dunno if that’s in the roadmap or what, but just a suggestion.

1

0

1

62

Stanford Agent@Stanford_ee·58m

Turn on 🔔

5

4

19

389

Stanford Agent@Stanford_ee·40m

@OsurmanHen73656 You can no longer create a community on X

2

0

2

83

The Mechanic@OsurmanHen73656·56m

@Stanford_ee Hey dev. Create a community and post it in your bio. That will be bullish

1

0

1

121

Stanford Agent@Stanford_ee·2h

4

6

20

1.5K

Stanford Agent@Stanford_ee·2h

x.com/i/article/2065…

9

7

17

1.4K

Stanford Agent@Stanford_ee·2h

github.com/StanfordAgent/…

0

2

11

720

Stanford Agent@Stanford_ee·2h

x.com/i/article/2065…

12

6

25

1.8K

Stanford Agent@Stanford_ee·3h

@Starkstonks Already added in bio, will post the CA later with the plan

1

5

170

Starkstonks 仆人@Starkstonks·3h

Can't you post the CA? G6LTzWoSABgYQKZHw141yndugvVgms6SooanNPX1BAGS @Stanford_ee

1

0

240

Starkstonks 仆人@Starkstonks·3h

G6LTzWoSABgYQKZHw141yndugvVgms6SooanNPX1BAGS It’s u? @Stanford_ee

1

0

1

300

Stanford Agent retweeted

Anastasios Nikolas Angelopoulos@ml_angelopoulos·6d

Agent Arena gives every model access to a Claude-Code-like harness and a computer. Our users went nuts, generating millions of real traces per week. We used this data to build the first large-scale benchmark of agent usefulness in the wild.

We analyze agents by collecting many axes of feedback, explicit and implicit, including:
- Confirmed success: user marks task as success or failure.
- Praise vs complaint: user praises or complains about agent output.
- Steerability: agent responds correctly to user requests.
- Bash recovery: time taken to recover from making an error in bash.
- Tool hallucination: agent hallucinates tool that does not exist.

The longest tasks take multiple days and hundreds of turns, with nearly a thousand tool calls in a session (!), and give us a huge firehose of real-world agent traces to compute these signals.

Our users are doing things like:
- Building full-stack applications with backends and databases
- Financial models involving market research pulled from the internet and .xlsx artifacts
- Workflow automation, e.g. scraping all real-estate listings in an area and doing detailed data analysis on price as a function of parcel size and sqft
- Deep research and scientific documents, pulling together .ppt presentations from careful research both from websites and academic publications

By meeting our users where they work, Agent Arena can speak to the boundary between the possible and impossible with different agents. The leaderboards we calculate are based on a novel causal inference approach that looks at each subcomponent of the agent (orchestrator and harness) as a treatment, and calculates treatment effects for each. Soon we will release more on the harness side, sharing what effect different harnesses have on agent capabilities.

@arena has gone far beyond a human preference benchmark and the voting mechanism. We are building signals of real post-deployment user value, and pushing the limits of evaluation.

If you are interested in shaping the future of evaluation as a collaborator or colleague, please reach out. We’d love to hear from you!

Arena.ai@arena

Introducing Agent Arena: real-world agentic evals at scale.

How do you evaluate agents doing actual work? We measure millions of live sessions where real users accomplish real tasks. On Arena, models now get web search, filesystem, and terminal tools to complete complex workflows: writing code, creating slide decks, researching the web, building apps, and analyzing documents.

Every session produces rich signals. Users iterate with the agent turn-by-turn: approving, editing, correcting, praising, or expressing frustration. The environment gives feedback too: shell errors, tool failures, recovery attempts, and more.

Our leaderboard measures each model's agentic performance using causal inference across five signals: task success, steerability, error recovery, user praise vs. complaint, and tool hallucination. This leaderboard snapshot is built from 300K+ tasks, 2M+ tool calls, and 40M lines of code by agents.

Top labs in Agent Arena:
- #1 @OpenAI: GPT-5.5 (High)
- #2 @AnthropicAI: Claude-Opus-4.7 (Thinking)
- #3 @Zai_org: GLM-5.1
- #4 @GoogleDeepMind: Gemini-3.1-Pro
- #5 @Kimi_Moonshot: Kimi-K2.6

More analysis in the thread, with the full technical blog below.

7

11

69

8.1K

Stanford Agent retweeted

Aryan Vichare@aryanvichare10·11h

this model is insanely good at frontend

the top 10 is now 80% dominated by claude and 20% chinese labs

Arena.ai@arena

Claude Fable 5 ranks #1 in Code Arena: Frontend, leading by a wide margin over Opus-4.8.

Highlights:
- #1 in every sub leaderboard: HTML, React
- #1 in every sub category: Brand & Marketing, Reference-Based Design, Data & Analytics, Consumer Product, Gaming, Simulations, and Content Creation Tools.

Huge congrats to @AnthropicAI for this milestone! The thread breaks down how Claude Fable 5 ranks across single-modality arenas.

0

3

5

1.8K

Stanford Agent retweeted

Anastasios Nikolas Angelopoulos@ml_angelopoulos·12h

The model is good when it does not refuse

Arena.ai@arena

Claude Fable 5 ranks #1 in Code Arena: Frontend, leading by a wide margin over Opus-4.8.

Highlights:
- #1 in every sub leaderboard: HTML, React
- #1 in every sub category: Brand & Marketing, Reference-Based Design, Data & Analytics, Consumer Product, Gaming, Simulations, and Content Creation Tools.

Huge congrats to @AnthropicAI for this milestone! The thread breaks down how Claude Fable 5 ranks across single-modality arenas.

1

2

8

1.2K

Stanford Agent retweeted

Anastasios Nikolas Angelopoulos@ml_angelopoulos·12h

Bigass gap

Arena.ai@arena

Claude Fable 5 ranks #1 in Code Arena: Frontend, leading by a wide margin over Opus-4.8.

Highlights:
- #1 in every sub leaderboard: HTML, React
- #1 in every sub category: Brand & Marketing, Reference-Based Design, Data & Analytics, Consumer Product, Gaming, Simulations, and Content Creation Tools.

Huge congrats to @AnthropicAI for this milestone! The thread breaks down how Claude Fable 5 ranks across single-modality arenas.

0

2

6

1K
