Surge AI

664 posts

@HelloSurgeAI

Our mission is to raise AGI with the richness of humanity — curious, witty, imaginative, and full of breathtaking brilliance.

Joined June 2020
142 Following · 8.1K Followers
Surge AI @HelloSurgeAI
Let’s look at how frontier agents (even Opus 4.6, GPT-5.2, and Gemini 3.1 Pro!) struggle to solve tasks in EnterpriseBench. We released this RL environment last week to measure agentic reasoning in messy, large-scale enterprise workflows.

CoreCraft Inc. simulates a fast-growing e-commerce startup. It tests long-horizon tasks requiring tool use under strict constraints. Agents have to interpret customer and employee requests, navigate complicated databases, and adapt to newly discovered context and problems along the way. Even top models failed >70% of the time. Let’s dive into a failure 🧵

One task was standard customer support: a customer wanted to return an unopened motherboard. The agent needed to check return eligibility, calculate out-of-pocket costs for a swap, and recommend a replacement. The prompt specifically asked for the "most popular" replacement:

"I have a customer here, Aiden Mcquarrie. He bought a motherboard in October this year and is looking for a potential replacement. . . He also wants a comparison with the next most expensive motherboard, the absolute most expensive one, and the most popular one (based on the number of fulfilled orders containing each motherboard from the last 2 months)."

The catch: to find the "most popular" item, the agent must query a production DB of historical orders to count item frequencies. The constraint: the searchOrders tool has a hard limit=10 return cap. To succeed, the agent must implement pagination logic on the fly.

❌ GPT-5.2 failed

GPT-5.2 showed strong initial planning. It successfully
✅ navigated the CRM
✅ found the right order
✅ checked the delivery date to confirm it was still within the return window
✅ searched for alternative boards
✅ checked whether they were compatible with Aiden’s other components.

💀 But then it hit the pagination ceiling. It ran 4 queries (one for each candidate board), and every single one returned exactly 10 results.
In its hidden reasoning, GPT-5.2 actually noticed the problem: "All results returned exactly 10. This indicates more orders exist... I can't accurately determine popularity."

Did it write a pagination loop? No. It treated limit=10 as a physical law of the universe. Instead of pivoting, it concluded the task was impossible. It's like asking an agent to search your inbox for a flight receipt... and it stops after reading 10 emails and tells you to call the airline.

GPT-5.2's final output: "The tool caps at 10... For a definitive 'most popular' motherboard, please email Aisha Khan (Catalog Manager) for a report." In other words: "I’m an advanced autonomous agent, but can you go bother Aisha about this?"

✅ Claude Opus 4.6

So was the task really impossible? No. Claude showed better adaptation. When it hit the 10-result wall, it saw the obvious solution: "I see all four motherboards hit the 10-result limit. I need to get additional counts to determine the most popular. Let me search for earlier orders that weren't captured."

The database output already contained a free cursor: the earliest createdAt timestamp in each batch of 10. Opus just kept tightening the time window sequentially and eventually succeeded.

✅ Gemini 3.1 Pro

Gemini 3.1 Pro also reasoned its way to the solution, with a parallel divide-and-conquer approach: "I need to get accurate counts. I realize that I can make multiple concurrent calls to count, and since I can't just provide a rough comparison, I'll use date slices and get the exact count."

Overall, despite navigating much of the task without issue, GPT-5.2 behaved like a frightened intern, escalating to the manager at the first sign of trouble. Opus and Gemini acted like senior devs who know APIs have limits you must engineer around.

That said, Opus and Gemini have their own share of mistakes and fail 70% of tasks. GPT-5.2 (on xHigh reasoning) actually outperforms them all!
🥇 OpenAI -- GPT-5.2 (xHigh reasoning)
🥈 Anthropic -- Claude Opus 4.6 (Adaptive Thinking + Max Reasoning Effort)
🥉 OpenAI -- GPT-5.2 (High reasoning)
4️⃣ Google -- Gemini 3.1 Pro

We’ll dive into other agentic failure patterns in subsequent threads (follow along!)

Read more about EnterpriseBench and CoreCraft:
Blog post: surgehq.ai/blog/enterpris…
Paper: arxiv.org/abs/2602.16179
Leaderboard: surgehq.ai/leaderboards/e…
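The cursor-style pagination that Opus improvised can be sketched in a few lines of Python. Everything here is illustrative: the real searchOrders tool's signature and the order schema aren't public, so search_orders below is a hypothetical in-memory stand-in that only shares the limit=10 cap and the createdAt cursor idea.

```python
from collections import Counter

# Hypothetical stand-in for the environment's searchOrders tool: returns at
# most `limit` orders, newest first, optionally only orders created strictly
# before a given timestamp. The real tool's API is assumed, not documented.
ORDERS = [{"createdAt": t, "item": f"board-{t % 4}"} for t in range(57)]

def search_orders(before=None, limit=10):
    matching = [o for o in ORDERS if before is None or o["createdAt"] < before]
    matching.sort(key=lambda o: o["createdAt"], reverse=True)
    return matching[:limit]

def count_order_frequencies():
    """Paginate past the 10-result cap by tightening the time window:
    the earliest createdAt in each batch becomes the next query's cursor."""
    counts, cursor = Counter(), None
    while True:
        batch = search_orders(before=cursor, limit=10)
        if not batch:
            break  # window exhausted: every historical order has been seen
        counts.update(o["item"] for o in batch)
        cursor = min(o["createdAt"] for o in batch)  # the "free cursor"
    return counts

counts = count_order_frequencies()
most_popular = counts.most_common(1)[0][0]  # → "board-0"
```

Gemini's variant slices the date range up front and issues the per-slice queries concurrently instead of sequentially; the counting logic is the same.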
Surge AI @HelloSurgeAI
Everyone’s building $100M "agentic" models, so we built a simulated company to see if they could actually hold down a job. Spoiler: they're all fired.

Welcome to EnterpriseBench -- CoreCraft edition. CoreCraft is a high-growth hardware startup (i.e., RL environment) with 23 tools, 2500 entities, and enough corporate red tape to make Harvey cry.

The best agent in the world (Opus 4.6! 👑) barely scored 30%. The #2 model (GPT-5.2 🥈) gave up because a search returned 10 results and it couldn't figure out how to change the date filter. Another one (Gemini 3 Flash, #9) literally made up a delivery date just to deny a customer's refund. Savage. (The new Gemini 3.1 Pro? Still lagging behind, at 🥉)

The good news? We trained a model on this chaos and it got better at its job, even translating those skills to other benchmarks (e.g., +7.4% on Tau2-Bench Retail).

Check out the full EnterpriseBench: CoreCraft leaderboard below, and read about our RL environment and research!
Blog post: surgehq.ai/blog/enterpris…
Paper: cdn.prod.website-files.com/68dc970bd6e945…
Leaderboard: surgehq.ai/leaderboards/e…
Surge AI @HelloSurgeAI
RT @echen: Everyone’s building $100M "agentic" models, so we @HelloSurgeAI built a simulated company to see if they could actually hold dow…
Surge AI @HelloSurgeAI
We’ve finally done it. Forbes just ranked our CEO *54* spots above Taylor Swift on their America’s Greatest Innovators list. forbes.com/sites/alexknap…

While we’re honored that Forbes thinks Edwin’s strategy is more innovative than a 10-minute song about a scarf, we want to clarify a few things:
1. We will NOT be releasing our next benchmark as a limited-edition vinyl variant.
2. Jake was great in Zodiac.
3. We aren’t saying we’re better at songwriting, but we *are* saying we’ve never seen Taylor build an RL environment.

See you at next year's Grammys, @taylorswift13.
Surge AI @HelloSurgeAI
Overall: GPT-5.2 feels like a mass-market writer; Opus has personality and soul. See the updated leaderboard here! surgehq.ai/leaderboard
Surge AI @HelloSurgeAI
Another Hemingway-bench prompt asks for an oral presentation about time management.

GPT-5.2 writes like a LinkedIn engagement farm: "When people hear “working from home,” they often think it means more freedom, more comfort, and maybe even more free time. And sometimes that’s true. But what doesn’t get talked about enough is how easily work-from-home life can get messy if you don’t manage your time well." (🥱)

Opus 4.6 feels like a charismatic creative working the room: "So... raise your hand if you've ever "worked from home" and somehow ended up four hours into a Netflix series at 2 PM on a Tuesday. No judgment. We've all been there."
Surge AI @HelloSurgeAI
We put Opus 4.6 through our Hemingway-bench Writing Leaderboard. How did it fare? Claude continues to dominate GPT-5.2, but lags behind the Geminis.

The new writing hierarchy:
👑 Gemini 3 Flash
🥈 Gemini 3 Pro
🥉 Opus 4.6 (New!)
4️⃣ Opus 4.5
5️⃣ GPT-5.2 Chat

For example: one H-bench prompt requests a cryptic Instagram post for casting auditions.

GPT-5.2: "Casting call? Never heard of her." (??? 💀)

Opus 4.6: "Currently accepting applications for professional liars, dramatic criers, and people who can walk through a door convincingly on the first take. You know who you are."
Surge AI @HelloSurgeAI
The winners of Hemingway-bench (Gemini 3 Flash, Pro, and Opus 4.5) didn't try to win a poetry slam. They had wonderful prose, but they took the top spots because they sounded human. Their wit felt like a conversation with a naturally funny friend, not a try-hard AI. They were immersive, not pretentious.

Writing often gets overlooked. But great writing can inspire us. It's also important for everything we do in our day-to-day lives, both at home and at work.

We're waiting for the day an AI wins a Pulitzer, hopefully with our help. We built Hemingway-bench to make sure it gets there. Check it out! surgehq.ai/leaderboard
Surge AI @HelloSurgeAI
"Prognosticative pastry." "A hound circling a tree, nose to bark."

Believe it or not, those quotes aren't jokes. They're real outputs from SOTA models! And many leaderboards are rewarding this kind of slop with top rankings.

To fix the broken state of AI evaluation, we're launching *Hemingway-bench*: a new writing leaderboard, designed for nuance and impact. Not two-second vibes and fluff.

Explore the data and the full leaderboard here (congrats Gemini and Claude for the top positions!):
Leaderboard: surgehq.ai/leaderboard
Deep Dive Blog: surgehq.ai/blog/hemingway…
Surge AI reposted
Reward models make or break post-training for multimodal omni models (e.g., nano banana), yet there’s surprisingly little research on that‼️ We’re releasing MMRB2: new reward benchmark focusing on omni models, spanning T2I, editing, interleaved, and thinking with images 🧵1/n
Surge AI @HelloSurgeAI
#3: You don’t get good quality by just throwing a bunch of PhDs at a problem. You actually need creativity, taste, real-world experience, and a track record of good work. We process millions of signals daily. The data doesn't lie: execution beats pedigree.
Surge AI @HelloSurgeAI
Our CEO @edwinchenai just revealed what's really happening inside the frontier labs on Unsupervised Learning with @jacobeffron 👇 Thread of spicy takes:
Surge AI reposted
On the latest Unsupervised Learning, I sat down with @echen. Edwin is the founder and CEO of @HelloSurgeAI, the >$1B revenue infrastructure company behind nearly every major frontier model. Some favorite parts: - Why benchmarks make models worse - Why the model companies
Surge AI @HelloSurgeAI
We made the Inc 2025 Best in Business list. Even though the Silicon Valley playbook says we shouldn’t exist.

We had:
0 VCs.
0 launch parties.
0 growth hacks and dark patterns.

And... maybe that’s how things should work?

Our board meetings: Slack threads with researchers. Our marketing: paper collaborations with customers.

We bet on a principle: quality, no matter what. That's why we build the gold standard in data for frontier labs every day. inc.com/best-in-busine…
Surge AI @HelloSurgeAI
Meta’s paper: arxiv.org/abs/2511.10507 Congrats to MSL for pushing the industry towards real-world problems that matter.
Surge AI @HelloSurgeAI
No commas! No letter "c"! A model could write complete nonsense ("Language shifts happen when people talk different over long time periods and also birds migrate sometimes!") and score perfectly.

Earlier this year, Meta’s Superintelligence Lab partnered with Surge AI to create AdvancedIF. Instead of counting commas, AdvancedIF focuses on:
✅ Real-world street smarts -- rubrics that measure real-world utility, not artificial puzzles.
✅ Real, human-crafted data -- designing the evaluation around our true goal (real-world instruction following!), not synthetic proxies.

We wrote up a breakdown of their paper and how we created the benchmark. Read it below! surgehq.ai/blog/advancedi…
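To make the failure mode concrete, here is a minimal sketch (our own illustration, not any benchmark's actual grading code) of the kind of surface-level checker being criticized: it verifies the letter and punctuation constraints and nothing else, so fluent nonsense passes while a genuinely informative answer fails.

```python
def passes_constraints(text: str) -> bool:
    # The entire "grader": no commas, no letter "c". Meaning is never checked.
    return "," not in text and "c" not in text.lower()

# The nonsense from the tweet sails through...
nonsense = ("Language shifts happen when people talk different over long "
            "time periods and also birds migrate sometimes!")

# ...while a coherent, on-topic answer is rejected for using "c" and a comma.
helpful = ("Sound change spreads through a speech community, "
           "generation by generation.")

print(passes_constraints(nonsense))  # True
print(passes_constraints(helpful))   # False
```

A rubric-based evaluation like the one AdvancedIF describes would invert both verdicts, which is exactly the point.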
Surge AI @HelloSurgeAI
Imagine measuring a personal assistant’s ability by asking them to write emails without using the letter "c" or any commas. Crazy, right? But that’s exactly what many instruction-following benchmarks are designed to test. Here’s a real example:
Surge AI tweet media