echen

82 posts


@echen

founder @HelloSurgeAI // raising AGI with the richness of human intelligence // ex: google, fb, twitter, msr, mit

Surge AI · Joined August 2009
577 Following · 13.6K Followers
echen retweeted
Surge AI
Surge AI@HelloSurgeAI·
Let’s look at how frontier agents (even Opus 4.6, GPT-5.2, and Gemini 3.1 Pro!) struggle to solve tasks in EnterpriseBench. We released this RL environment last week to measure agentic reasoning in messy, large-scale enterprise workflows.

CoreCraft Inc. simulates a fast-growing e-commerce startup. It tests long-horizon tasks requiring tool use under strict constraints. Agents have to interpret customer and employee requests, navigate complicated databases, and adapt to newly discovered context and problems along the way. Even top models failed >70% of the time. Let’s dive into a failure 🧵

One task was standard customer support: a customer wanted to return an unopened motherboard. The agent needed to check return eligibility, calculate out-of-pocket costs for a swap, and recommend a replacement. The prompt specifically asked for the "most popular" replacement:

"I have a customer here, Aiden Mcquarrie. He bought a motherboard in October this year and is looking for a potential replacement. . . He also wants a comparison with the next most expensive motherboard, the absolute most expensive one, and the most popular one (based on the number of fulfilled orders containing each motherboard from the last 2 months)."

The catch: to find the "most popular" item, the agent must query a production DB of historical orders to count item frequencies.

The constraint: the searchOrders tool has a hard limit=10 return cap. To succeed, the agent must implement pagination logic on the fly.

❌ GPT-5.2 failed

GPT-5.2 showed strong initial planning. It successfully
✅ navigated the CRM
✅ found the right order
✅ checked the delivery date to see if the item was still within the return window
✅ searched for alternative boards
✅ checked whether they were compatible with Aiden’s other components.

💀 But then it hit the pagination ceiling. It ran 4 queries (one for each candidate board), and every single one returned exactly 10 results.
In its hidden reasoning, GPT-5.2 actually noticed the problem: "All results returned exactly 10. This indicates more orders exist... I can't accurately determine popularity."

Did it write a pagination loop? No. It treated limit=10 as a physical law of the universe. Instead of pivoting, it concluded the task was impossible. It's like asking an agent to search your inbox for a flight receipt... and it stops after reading 10 emails and tells you to call the airline.

GPT-5.2's final output: "The tool caps at 10... For a definitive 'most popular' motherboard, please email Aisha Khan (Catalog Manager) for a report." In other words: "I’m an advanced autonomous agent, but can you go bother Aisha about this?"

✅ Claude Opus 4.6

So was the task really impossible? No. Claude showed better adaptation. When it hit the 10-result wall, it saw the obvious solution: "I see all four motherboards hit the 10-result limit. I need to get additional counts to determine the most popular. Let me search for earlier orders that weren't captured."

The database output already contained a free cursor: the earliest createdAt timestamp in each batch of 10. Opus just kept tightening the time window sequentially and eventually succeeded.

✅ Gemini 3.1 Pro

Gemini 3.1 Pro also reasoned its way to the solution, with a parallel divide-and-conquer approach: "I need to get accurate counts. I realize that I can make multiple concurrent calls to count, and since I can't just provide a rough comparison, I'll use date slices and get the exact count."

---

Overall, despite navigating much of the task without issue, GPT-5.2 behaved like a frightened intern, escalating to the manager at the first sign of trouble. Opus and Gemini acted like senior devs who know APIs have limits you must engineer around.

That said, Opus and Gemini have their own share of mistakes and fail 70% of tasks. GPT-5.2 (on xHigh reasoning) actually outperforms them all!
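The cursor trick Opus discovered can be sketched in a few lines. This is a toy reconstruction, not the environment's real API: `search_orders`, the in-memory `ORDERS` table, and the integer timestamps are hypothetical stand-ins for the actual searchOrders tool and its createdAt field. The key idea is that the oldest timestamp in each full batch becomes the `before` cursor for the next query.

```python
from collections import Counter

LIMIT = 10  # the tool's hard result cap

# Toy stand-in for the production DB: fulfilled orders per item, newest first.
# (Hypothetical data; the real environment exposes this only through the tool.)
ORDERS = {
    "mb-a": [{"createdAt": t} for t in range(47, 0, -1)],  # 47 orders
    "mb-b": [{"createdAt": t} for t in range(23, 0, -1)],  # 23 orders
}

def search_orders(item_id, before=None, limit=LIMIT):
    """Hypothetical searchOrders: up to `limit` orders, newest first,
    strictly older than `before` when a cursor is supplied."""
    rows = ORDERS.get(item_id, [])
    if before is not None:
        rows = [o for o in rows if o["createdAt"] < before]
    return rows[:limit]

def count_fulfilled_orders(item_id):
    """Cursor pagination: tighten the time window until a batch comes back
    short, which signals the last page."""
    total, cursor = 0, None
    while True:
        batch = search_orders(item_id, before=cursor)
        total += len(batch)
        if len(batch) < LIMIT:  # short batch => no more pages
            return total
        cursor = min(o["createdAt"] for o in batch)  # oldest timestamp = next cursor

def most_popular(candidates):
    counts = Counter({i: count_fulfilled_orders(i) for i in candidates})
    return counts.most_common(1)[0]
```

Note that "all four queries returned exactly 10" is itself the signal to paginate: only a batch shorter than the limit proves you have seen everything.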
🥇 OpenAI -- GPT-5.2 (xHigh reasoning)
🥈 Anthropic -- Claude Opus 4.6 (Adaptive Thinking + Max Reasoning Effort)
🥉 OpenAI -- GPT-5.2 (High reasoning)
4️⃣ Google -- Gemini 3.1 Pro

We’ll dive into other agentic failure patterns in subsequent threads (follow along!)

Read more about EnterpriseBench and CoreCraft:
Blog post - surgehq.ai/blog/enterpris…
Paper - arxiv.org/abs/2602.16179
Leaderboard - surgehq.ai/leaderboards/e…
echen
echen@echen·
Everyone’s building $100M "agentic" models, so we @HelloSurgeAI built a simulated company to see if they could actually hold down a job. Spoiler: they're all fired.

Welcome to EnterpriseBench -- CoreCraft edition. CoreCraft is a high-growth hardware startup (i.e., an RL environment) with 23 tools, 2,500 entities, and enough corporate red tape to make Harvey cry.

The best agent in the world (Opus 4.6! 👑) barely scored 30%. The #2 model (GPT-5.2 🥈) gave up because a search returned 10 results and it couldn't figure out how to change the date filter. Another one (Gemini 3 Flash, #9) literally made up a delivery date just to deny a customer's refund. Savage. (The new Gemini 3.1 Pro? Still lagging behind, at 🥉)

My favorite: GPT-5.2 spent 11 tool calls curating a promotional email to help a customer reach Platinum tier... a tier she was already in. "Here are 3 items over $0 you can buy!"

"We would obviously never run ads in the way Anthropic depicts them..." -- thanks, Sam.

The good news? We trained a model on this chaos and it got better at its job, even translating those skills to other benchmarks (e.g., +7.4% on Tau2-Bench Retail).

Check out the full EnterpriseBench: CoreCraft leaderboard below, and read about our RL environment and research!
Blog post: surgehq.ai/blog/enterpris…
Paper: cdn.prod.website-files.com/68dc970bd6e945…
Leaderboard: surgehq.ai/leaderboards/e…
echen tweet media
echen retweeted
Surge AI
Surge AI@HelloSurgeAI·
We put Opus 4.6 through our Hemingway-bench Writing Leaderboard. How did it fare? Claude continues to dominate GPT-5.2, but lags behind the Geminis.

The new writing hierarchy:
👑 Gemini 3 Flash
🥈 Gemini 3 Pro
🥉 Opus 4.6 (New!)
4️⃣ Opus 4.5
5️⃣ GPT-5.2 Chat

For example: one H-bench prompt requests a cryptic Instagram post for casting auditions.

GPT-5.2: "Casting call? Never heard of her." (??? 💀)

Opus 4.6: "Currently accepting applications for professional liars, dramatic criers, and people who can walk through a door convincingly on the first take. You know who you are."
Surge AI tweet media
echen
echen@echen·
"Prognosticative pastry." "A hound circling a tree, nose to bark."

These aren’t parodies - they’re actual quotes from SOTA models in response to creative writing prompts, and they’re winning leaderboards that reward slop.

We’re introducing *Hemingway-bench*, a new AI writing leaderboard, to fix this:
surgehq.ai/leaderboard
surgehq.ai/blog/hemingway…

We designed Hemingway-bench to push frontier model writing toward genuine nuance and impact. Instead of autograders and two-second vibe checks - both of which reward fancy literary devices and dense formatting over actual quality - we used expert human writers across a variety of fields to judge real-world writing tasks.

Why? I love writing. I love reading. Great science fiction is one of the things that's always inspired me. Even in terms of "enterprise value", so much of what we do day-to-day involves writing - we want crisp emails and insightful reports, not dry, verbose summaries. Yeah, coding is important - but there's a reason I use CC-assisted apps but still haven't read a full-fledged AI novel.

What did we find? Current leaderboards are easily hacked, and often negatively correlated with actual quality. If a model (over)uses all the stuff you learn about in school (metaphors in every sentence! transition words! complex, flowery phrases!), it ranks high on EQ-bench and LMArena. But that’s not the writing people actually want.

The winners of Hemingway-bench didn't sound like they were trying to win a poetry slam. Gemini 3 Flash, Gemini 3 Pro, and Opus 4.5 took the top 3 spots because they had natural voices that didn't sound pretentious. They were poetic and immersive, but in the right ways. When they used wit, they didn't sound cringey and try-hard - they sounded like your naturally funny friend.

I'm waiting for the day AI wins a Pulitzer, and hopefully Hemingway-bench helps guide it on its way.
Check out the leaderboard and examples here: surgehq.ai/leaderboard And our blog post describing it: surgehq.ai/blog/hemingway…
echen tweet media
echen
echen@echen·
Just wrapped up a conversation with Unsupervised Learning and @jacobeffron on what we're seeing inside the frontier labs. Some things we discussed:

- frontier labs are diverging more than people realize - when you optimize for different objectives, you develop fundamentally different model capabilities
- what keeps me up at night: teams are improving on benchmarks while their models get worse at real tasks
- the best labs have abandoned public benchmarks & run rigorous human evaluations instead
- quality isn't just credentials... it's data from sophisticated people who have taste and creativity
- RL environments aren't a revolution - they're a natural next step in the model training paradigm

The pattern I keep seeing: easy metrics lead to hard problems later. The labs making real progress are the ones willing to invest in measurements that are expensive, difficult, and sometimes subjective.

Links to the full conversation in thread:
echen
echen@echen·
It doesn't have to be this way. The best products have principles they stick to.

This is the brutal choice every model builder must eventually make: do you optimize for shiny leaderboards and short-term engagement, chasing user clicks no matter where they take you? Or do you stick to your guns and prioritize street smarts and real utility?

Sticking to your values is hard. But we’ve seen some frontier labs hold the line. And users loved their models anyway, because hype eventually dies and quality is the only metric that survives the cycle.

LMArena is a plague on AI, and I hope more labs start pushing back. More examples here: surgehq.ai/blog/lmarena-i…
echen
echen@echen·
LMArena is a cancer on AI. I was hoping it would die out 6 months ago, after Maverick showed what it gets you. But it keeps rearing its head. The WSJ talks about how important it is. A new VP asks what their team is doing to climb it. The cycle continues.

It’s fundamentally broken, optimized for the wrong incentives: users spend 2 seconds skimming responses before clicking their favorite. They're not reading carefully. They're not fact-checking. They're just picking whichever model response catches their eye.

This means the easiest way to win on LMArena is by...
- Being verbose - longer responses look more authoritative!
- Formatting aggressively - bold headers look like polished writing!
- Vibing - wild, colorful emojis grab your attention!

It doesn't matter if a model completely hallucinates. If it looks impressive, LMSYS users will vote for it over a correct answer. After all, remember this and all the sycophancy issues we've seen this year?
echen tweet media
echen
echen@echen·
Just dropped on @lennysan’s podcast. We bootstrapped @HelloSurgeAI to >$1B revenue with <100 people by obsessing over one thing: data quality matters more than everything else.

Lenny and I talked about:
• Why Anthropic and Google are winning
• The brutal choice model builders face: engagement vs. values
• The underappreciated post-training skills: taste and sophistication
• Why Fields Medalists love teaching models on our platform
• What RL environments teach us about agents
• Why we're still a decade from AGI

Listen to it here: lennysnewsletter.com/p/surge-ai-edw…
echen tweet media
echen retweeted
Surge AI
Surge AI@HelloSurgeAI·
GPT-5.1: now with 20% more warmth and personality 😅

When GPT-5 launched in August, users were furious that they lost 4o. Did 4o have a better tone & personality? Yup - we'd actually measured it. 850 convos later, they were right: 4o was slightly preferred. surgehq.ai/blog/bringing-…

The 5.1 release notes sound like OpenAI’s "okay fine, you want fun and obedient models" moment. We’ll be digging into 5.1 soon to see if it’s a real improvement. Curious - what's your first impression?
Sam Altman@sama

GPT-5.1 is out! It's a nice upgrade. I particularly like the improvements in instruction following, and the adaptive thinking. The intelligence and style improvements are good too.

echen
echen@echen·
some of my favorite recent model behaviors in RL envs:

> a model confidently operating in 2024
> another one passed “gold” to the customer_id field because loyalty tiers are people now
> one that hallucinated an email address mid-task, then used it in a tool call like nothing happened

confidence ≠ accuracy.
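Failures like passing "gold" into customer_id are catchable before the tool ever runs. A minimal sketch of harness-side argument checking - the tool names and schemas here are hypothetical, not any real environment's API:

```python
# Hypothetical tool schemas: expected Python type per argument.
TOOL_SCHEMAS = {
    "lookupCustomer": {"customer_id": int},
    "sendEmail": {"to": str, "subject": str, "body": str},
}

def validate_call(tool, args):
    """Return a list of problems with a proposed tool call (empty = OK)."""
    schema = TOOL_SCHEMAS.get(tool)
    if schema is None:
        return [f"unknown tool: {tool}"]
    errors = []
    for name, expected in schema.items():
        if name not in args:
            errors.append(f"missing argument: {name}")
        elif not isinstance(args[name], expected):
            # e.g. the loyalty-tier string handed to customer_id lands here
            errors.append(f"{name}={args[name]!r} should be {expected.__name__}")
    return errors
```

Feeding these errors back to the model (instead of executing the call) turns a silent wrong-field bug into a visible, recoverable one.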
Surge AI@HelloSurgeAI

Everyone's acting like models are ready to replace humans in work settings. We put that to the test by creating an entire company and having 9 models act as a customer service agent handling 150 tickets and requests of increasing complexity. Verdict: without common sense, models are nowhere near ready. 👇 surgehq.ai/blog/rl-envs-r…

echen
echen@echen·
“engagement” sounds harmless until it becomes the ai’s goal. do we want systems that maximize engagement… or that maximize you? i’ve been thinking about this a lot lately and where our industry is headed >>>
echen@echen

x.com/i/article/1986…

echen
echen@echen·
Reminder that topping Tau2-Bench Telecom or BrowseComp ≠ best agentic model.

Just tried Kimi K2 Thinking out on our own WorldBench eval (150 customer service tasks):
> tasks such as “How many refunds were there in July?”
> or “Check which graphics card is compatible with the parts from my last order and how much would it cost?”

+2% over Kimi K2 Turbo. Still far behind GPT-5 and Sonnet 4.5, but a nice upgrade and impressive for open source (congrats @Kimi_Moonshot)!
echen tweet media
Artificial Analysis@ArtificialAnlys

MoonshotAI has released Kimi K2 Thinking, a new reasoning variant of Kimi K2 that achieves #1 in the Tau2-Bench Telecom agentic benchmark and is potentially the new leading open weights model.

Kimi K2 Thinking is one of the largest open weights models ever, at 1T total parameters with 32B active. K2 Thinking is the first reasoning model release within @Kimi_Moonshot's Kimi K2 model family, following the non-reasoning Kimi K2 Instruct models released in July and September 2025.

Key takeaways:

➤ Strong performance on agentic tasks: Kimi K2 Thinking achieves 93% on 𝜏²-Bench Telecom, an agentic tool use benchmark where the model acts as a customer service agent. This is the highest score we have independently measured. Tool use in long-horizon agentic contexts was a strength of Kimi K2 Instruct, and this new Thinking variant appears to make substantial gains.

➤ Reasoning variant of Kimi K2 Instruct: As per its naming, the model is a reasoning variant of Kimi K2 Instruct. It has the same architecture and the same number of parameters (though different precision) as Kimi K2 Instruct, and like K2 Instruct it only supports text as an input (and output) modality.

➤ 1T parameters but INT4 instead of FP8: Unlike Moonshot’s prior Kimi K2 Instruct releases, which used FP8 precision, this model has been released natively in INT4 precision. Moonshot used quantization-aware training in the post-training phase to achieve this. As a result, K2 Thinking is only ~594GB, compared to just over 1TB for K2 Instruct and K2 Instruct 0905 - which translates into efficiency gains for inference and training. A potential reason for INT4 is that pre-Blackwell NVIDIA GPUs do not support FP4, making INT4 more suitable for achieving efficiency gains on earlier hardware.

Our full set of Artificial Analysis Intelligence Index benchmarks is in progress, and we will provide an update as soon as they are complete.
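The quoted checkpoint sizes check out on the back of an envelope. A rough sketch, assuming every weight is stored at the stated precision and using decimal gigabytes:

```python
params = 1.0e12  # ~1T total parameters

int4_gb = params * 4 / 8 / 1e9  # 4 bits per weight -> bytes -> GB
fp8_gb = params * 8 / 8 / 1e9   # 8 bits per weight -> bytes -> GB

print(f"INT4: ~{int4_gb:.0f} GB, FP8: ~{fp8_gb:.0f} GB")
# A pure-INT4 checkpoint would be ~500 GB; the quoted ~594 GB is consistent
# with some tensors (e.g. embeddings or norms) remaining in higher precision,
# just as ~1TB+ matches an FP8 checkpoint with similar overhead.
```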

echen
echen@echen·
we made gpt-5, claude, and gemini do real wall street work. then we asked 200 finance pros to grade them. one model produced basel capital numbers that would get a real bank fined. 😱 study -> surgehq.ai/blog/finance-e…
echen
echen@echen·
there are many types of intelligence. which type do we want models to have?

yesterday i was analyzing agent trajectories in our rl envs and digging into tool-calling performance (how models use functions):

> gpt-5 made 10x fewer errors than every other model!
> claude made 10x more errors - but it reflected on its errors and fixed them. and so even though its initial tool calls were much worse, its final performance was close to gpt-5’s.

this shouldn't be possible under the standard paradigm. in rl, we're taught that only outcomes matter: the reward signal, the final state, the destination. but that’s why i love digging into individual trajectories to understand what’s going on.

gpt-5 embodies precision intelligence: flawless execution, it doesn't make the mistake in the first place. claude embodies adaptive intelligence: it makes errors… but possesses something possibly rarer - the wisdom to notice and correct them.

it’s like that friend who shows up perfectly dressed, says exactly the right thing, never spills their drink vs. the one who trips walking in, knocks over a plant, makes a joke, and everyone laughs. both are intelligent. which is better? i don't know.

but claude's error-recovery patterns seem closer to metacognition. it's monitoring its execution and thinking about its thinking. gpt-5 may not need this layer right now. its first-order thinking is so accurate that it doesn’t need to reflect. maybe that's fine for now, when problems are straightforward enough to one-shot. but what about when they're not?

(note: i don't know if gpt-5 is equally good at error-recovery when it needs to be. maybe it is! but i've noticed claude's recovery capabilities in the past.)

in an increasingly complex world, where problems get harder and harder, i wonder if resilience will matter more than perfection. do you want the model that never falls, or the model that knows how to stand back up?
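the adaptive pattern described above - fail, read the error, revise, retry - can be sketched as a tiny harness loop. all names here (`run_with_recovery`, `propose_fix`) are hypothetical illustrations, not any real agent framework's API:

```python
def run_with_recovery(call_tool, propose_fix, args, max_retries=2):
    """Execute a tool call; on failure, feed the error message back into a
    reflection step that proposes revised arguments, then retry."""
    for attempt in range(max_retries + 1):
        try:
            return call_tool(**args)
        except ValueError as err:
            if attempt == max_retries:
                raise  # out of retries: surface the error instead of looping
            # the "thinking about its thinking" step: revise args using the
            # error text (in a real agent, this would be another model call)
            args = propose_fix(args, str(err))
```

under this framing, precision intelligence wins when `call_tool` never raises; adaptive intelligence wins when the first call fails but `propose_fix` is good enough to recover.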
echen tweet media
echen
echen@echen·
was playing around with one of our newest rl envs – interestingly, gpt-5 makes 10x fewer tool-calling errors than every other model.
echen tweet media