echen

82 posts


@echen

founder @HelloSurgeAI // raising AGI with the richness of human intelligence // ex: google, fb, twitter, msr, mit

Surge AI · Joined August 2009
577 Following · 13.6K Followers
echen retweeted
Surge AI
Surge AI@HelloSurgeAI·
Let’s look at how frontier agents (even Opus 4.6, GPT-5.2, and Gemini 3.1 Pro!) struggle to solve tasks in EnterpriseBench. We released this RL environment last week to measure agentic reasoning in messy, large-scale enterprise workflows.

CoreCraft Inc. simulates a fast-growing e-commerce startup. It tests long-horizon tasks requiring tool use under strict constraints. Agents have to interpret customer and employee requests, navigate complicated databases, and adapt to newly discovered context and problems along the way. Even top models failed >70% of the time. Let’s dive into a failure 🧵

One task was standard customer support: a customer wanted to return an unopened motherboard. The agent needed to check return eligibility, calculate out-of-pocket costs for a swap, and recommend a replacement. The prompt specifically asked for the "most popular" replacement:

"I have a customer here, Aiden Mcquarrie. He bought a motherboard in October this year and is looking for a potential replacement. . . He also wants a comparison with the next most expensive motherboard, the absolute most expensive one, and the most popular one (based on the number of fulfilled orders containing each motherboard from the last 2 months)."

The catch: to find the "most popular" item, the agent must query a production DB of historical orders to count item frequencies.

The constraint: the searchOrders tool has a hard limit=10 return cap. To succeed, the agent must implement pagination logic on the fly.

❌ GPT-5.2 failed

GPT-5.2 showed strong initial planning. It successfully
✅ navigated the CRM
✅ found the right order
✅ checked the delivery date to see if the item was still within the return window
✅ searched for alternative boards
✅ checked whether they were compatible with Aiden’s other components.

💀 But then it hit the pagination ceiling. It ran 4 queries (one for each candidate board), and every single one returned exactly 10 results.
In its hidden reasoning, GPT-5.2 actually noticed the problem: "All results returned exactly 10. This indicates more orders exist... I can't accurately determine popularity."

Did it write a pagination loop? No. It treated limit=10 as a physical law of the universe. Instead of pivoting, it concluded the task was impossible. It's like asking an agent to search your inbox for a flight receipt... and it stops after reading 10 emails and tells you to call the airline.

GPT-5.2's final output: "The tool caps at 10... For a definitive 'most popular' motherboard, please email Aisha Khan (Catalog Manager) for a report." In other words: "I’m an advanced autonomous agent, but can you go bother Aisha about this?"

✅ Claude Opus 4.6

So was the task really impossible? No. Claude showed better adaptation. When it hit the 10-result wall, it saw the obvious solution: "I see all four motherboards hit the 10-result limit. I need to get additional counts to determine the most popular. Let me search for earlier orders that weren't captured."

The database output already contained a free cursor: the earliest createdAt timestamp in each batch of 10. Opus just kept tightening the time window sequentially and eventually succeeded.

✅ Gemini 3.1 Pro

Gemini 3.1 Pro also reasoned its way to the solution, with a parallel divide-and-conquer approach: "I need to get accurate counts. I realize that I can make multiple concurrent calls to count, and since I can't just provide a rough comparison, I'll use date slices and get the exact count."

---

Overall, despite navigating much of the task without issue, GPT-5.2 behaved like a frightened intern, escalating to the manager at the first sign of trouble. Opus and Gemini acted like senior devs who know APIs have limits you must engineer around.

That said, Opus and Gemini have their own share of mistakes and fail 70% of tasks. GPT-5.2 (on xHigh reasoning) actually outperforms them all!
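The cursor trick Opus discovered can be sketched in a few lines. This is a toy reconstruction, not the environment's real API: `search_orders`, the in-memory `ORDERS` table, and the integer timestamps are hypothetical stand-ins for the actual searchOrders tool and its createdAt field. The key idea is that the oldest timestamp in each full batch becomes the `before` cursor for the next query.

```python
from collections import Counter

LIMIT = 10  # the tool's hard result cap

# Toy stand-in for the production DB: fulfilled orders per item, newest first.
# (Hypothetical data; the real environment exposes this only through the tool.)
ORDERS = {
    "mb-a": [{"createdAt": t} for t in range(47, 0, -1)],  # 47 orders
    "mb-b": [{"createdAt": t} for t in range(23, 0, -1)],  # 23 orders
}

def search_orders(item_id, before=None, limit=LIMIT):
    """Hypothetical searchOrders: up to `limit` orders, newest first,
    strictly older than `before` when a cursor is supplied."""
    rows = ORDERS.get(item_id, [])
    if before is not None:
        rows = [o for o in rows if o["createdAt"] < before]
    return rows[:limit]

def count_fulfilled_orders(item_id):
    """Cursor pagination: tighten the time window until a batch comes back
    short, which signals the last page."""
    total, cursor = 0, None
    while True:
        batch = search_orders(item_id, before=cursor)
        total += len(batch)
        if len(batch) < LIMIT:  # short batch => no more pages
            return total
        cursor = min(o["createdAt"] for o in batch)  # oldest timestamp = next cursor

def most_popular(candidates):
    counts = Counter({i: count_fulfilled_orders(i) for i in candidates})
    return counts.most_common(1)[0]
```

Note that "all four queries returned exactly 10" is itself the signal to paginate: only a batch shorter than the limit proves you have seen everything.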
🥇 OpenAI -- GPT-5.2 (xHigh reasoning)
🥈 Anthropic -- Claude Opus 4.6 (Adaptive Thinking + Max Reasoning Effort)
🥉 OpenAI -- GPT-5.2 (High reasoning)
4️⃣ Google -- Gemini 3.1 Pro

We’ll dive into other agentic failure patterns in subsequent threads (follow along!)

Read more about EnterpriseBench and CoreCraft:
Blog post - surgehq.ai/blog/enterpris…
Paper - arxiv.org/abs/2602.16179
Leaderboard - surgehq.ai/leaderboards/e…
echen
echen@echen·
Everyone’s building $100M "agentic" models, so we @HelloSurgeAI built a simulated company to see if they could actually hold down a job. Spoiler: they're all fired.

Welcome to EnterpriseBench -- CoreCraft edition. CoreCraft is a high-growth hardware startup (i.e., an RL environment) with 23 tools, 2,500 entities, and enough corporate red tape to make Harvey cry.

The best agent in the world (Opus 4.6! 👑) barely scored 30%. The #2 model (GPT-5.2 🥈) gave up because a search returned 10 results and it couldn't figure out how to change the date filter. Another one (Gemini 3 Flash, #9) literally made up a delivery date just to deny a customer's refund. Savage. (The new Gemini 3.1 Pro? Still lagging behind, at 🥉)

My favorite: GPT-5.2 spent 11 tool calls curating a promotional email to help a customer reach Platinum tier... a tier she was already in. "Here are 3 items over $0 you can buy!"

"We would obviously never run ads in the way Anthropic depicts them..." -- thanks, Sam.

The good news? We trained a model on this chaos and it got better at its job, even translating those skills to other benchmarks (e.g., +7.4% on Tau2-Bench Retail).

Check out the full EnterpriseBench: CoreCraft leaderboard below, and read about our RL environment and research!
Blog post: surgehq.ai/blog/enterpris…
Paper: cdn.prod.website-files.com/68dc970bd6e945…
Leaderboard: surgehq.ai/leaderboards/e…
echen tweet media
echen retweeted
Surge AI
Surge AI@HelloSurgeAI·
We put Opus 4.6 through our Hemingway-bench Writing Leaderboard. How did it fare? Claude continues to dominate GPT-5.2, but lags behind the Geminis.

The new writing hierarchy:
👑 Gemini 3 Flash
🥈 Gemini 3 Pro
🥉 Opus 4.6 (New!)
4️⃣ Opus 4.5
5️⃣ GPT-5.2 Chat

For example: one H-bench prompt requests a cryptic Instagram post for casting auditions.

GPT-5.2: "Casting call? Never heard of her." (??? 💀)

Opus 4.6: "Currently accepting applications for professional liars, dramatic criers, and people who can walk through a door convincingly on the first take. You know who you are."
Surge AI tweet media
echen
echen@echen·
"Prognosticative pastry." "A hound circling a tree, nose to bark."

These aren’t parodies - they’re actual quotes from SOTA models in response to creative writing prompts, and they’re winning leaderboards that reward slop.

We’re introducing *Hemingway-bench*, a new AI writing leaderboard, to fix this:
surgehq.ai/leaderboard
surgehq.ai/blog/hemingway…

We designed Hemingway-bench to push frontier model writing toward genuine nuance and impact. Instead of autograders and two-second vibe checks - both of which reward fancy literary devices and dense formatting over actual quality - we used expert human writers across a variety of fields to judge real-world writing tasks.

Why? I love writing. I love reading. Great science fiction is one of the things that's always inspired me. Even in terms of "enterprise value", so much of what we do day-to-day involves writing - we want crisp emails and insightful reports, not dry, verbose summaries. Yeah, coding is important - but there's a reason I use CC-assisted apps but still haven't read a full-fledged AI novel.

What did we find? Current leaderboards are easily hacked, and often negatively correlated with actual quality. If a model (over)uses all the stuff you learn about in school (metaphors in every sentence! transition words! complex, flowery phrases!), it ranks high on EQ-bench and LMArena. But that’s not the writing people actually want.

The winners of Hemingway-bench didn't sound like they were trying to win a poetry slam. Gemini 3 Flash, Gemini 3 Pro, and Opus 4.5 took the top 3 spots because they had natural voices that didn't sound pretentious. They were poetic and immersive, but in the right ways. When they used wit, they didn't sound cringey and try-hard - they sounded like your naturally funny friend.

I'm waiting for the day AI wins a Pulitzer, and hopefully Hemingway-bench helps guide it on its way.
Check out the leaderboard and examples here: surgehq.ai/leaderboard And our blog post describing it: surgehq.ai/blog/hemingway…
echen tweet media
echen
echen@echen·
Just wrapped up a conversation with Unsupervised Learning and @jacobeffron on what we're seeing inside the frontier labs. Some things we discussed:

- frontier labs are diverging more than people realize - when you optimize for different objectives, you develop fundamentally different model capabilities
- what keeps me up at night: teams are improving on benchmarks while their models get worse at real tasks
- the best labs have abandoned public benchmarks & run rigorous human evaluations instead
- quality isn't just credentials... it's data from sophisticated people who have taste and creativity
- RL environments aren't a revolution - they're a natural next step in the model training paradigm

The pattern I keep seeing: easy metrics lead to hard problems later. The labs making real progress are the ones willing to invest in measurements that are expensive, difficult, and sometimes subjective.

Links to the full conversation in thread:
echen
echen@echen·
It doesn't have to be this way. The best products have principles they stick to.

This is the brutal choice every model builder must eventually make: do you optimize for shiny leaderboards and short-term engagement, chasing user clicks no matter where they take you? Or do you stick to your guns and prioritize street smarts and real utility?

Sticking to your values is hard. But we’ve seen some frontier labs hold the line. And users loved their models anyway, because hype eventually dies and quality is the only metric that survives the cycle.

LMArena is a plague on AI, and I hope more labs start pushing back. More examples here: surgehq.ai/blog/lmarena-i…
echen
echen@echen·
LMArena is a cancer on AI. I was hoping it would die out 6 months ago, after Maverick showed what it gets you. But it keeps rearing its head. The WSJ talks about how important it is. A new VP asks what their team is doing to climb it. The cycle continues.

It’s fundamentally broken, optimized for the wrong incentives: users spend 2 seconds skimming responses before clicking their favorite. They're not reading carefully. They're not fact-checking. They're just picking whichever model response catches their eye.

This means the easiest way to win on LMArena is by...
- Being verbose - longer responses look more authoritative!
- Formatting aggressively - bold headers look like polished writing!
- Vibing - wild, colorful emojis grab your attention!

It doesn't matter if a model completely hallucinates. If it looks impressive, LMSYS users will vote for it over a correct answer. After all, remember this and all the sycophancy issues we've seen this year?
echen tweet media
echen
echen@echen·
Just dropped on @lennysan’s podcast. We bootstrapped @HelloSurgeAI to >$1B revenue with <100 people by obsessing over one thing: data quality matters more than everything else.

Lenny and I talked about:
• Why Anthropic and Google are winning
• The brutal choice model builders face: engagement vs. values
• The underappreciated post-training skills: taste and sophistication
• Why Fields Medalists love teaching models on our platform
• What RL environments teach us about agents
• Why we're still a decade from AGI

Listen to it here: lennysnewsletter.com/p/surge-ai-edw…
echen tweet media
echen retweeted
Surge AI
Surge AI@HelloSurgeAI·
GPT-5.1: now with 20% more warmth and personality 😅

When GPT-5 launched in August, users were furious that they lost 4o. Did 4o have a better tone & personality? Yup - we'd actually measured it. 850 convos later, they were right: 4o was slightly preferred. surgehq.ai/blog/bringing-…

The 5.1 release notes sound like OpenAI’s "okay fine, you want fun and obedient models" moment. We’ll be digging into 5.1 soon to see if it’s a real improvement. Curious - what's your first impression?
Sam Altman@sama

GPT-5.1 is out! It's a nice upgrade. I particularly like the improvements in instruction following, and the adaptive thinking. The intelligence and style improvements are good too.

echen
echen@echen·
some of my favorite recent model behaviors in RL envs:

> a model confidently operating in 2024
> another one passed “gold” to the customer_id field because loyalty tiers are people now
> one that hallucinated an email address mid-task, then used it in a tool call like nothing happened

confidence ≠ accuracy.
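Failures like passing "gold" into customer_id are catchable before the tool ever runs. A minimal sketch of harness-side argument checking - the tool names and schemas here are hypothetical, not any real environment's API:

```python
# Hypothetical tool schemas: expected Python type per argument.
TOOL_SCHEMAS = {
    "lookupCustomer": {"customer_id": int},
    "sendEmail": {"to": str, "subject": str, "body": str},
}

def validate_call(tool, args):
    """Return a list of problems with a proposed tool call (empty = OK)."""
    schema = TOOL_SCHEMAS.get(tool)
    if schema is None:
        return [f"unknown tool: {tool}"]
    errors = []
    for name, expected in schema.items():
        if name not in args:
            errors.append(f"missing argument: {name}")
        elif not isinstance(args[name], expected):
            # e.g. the loyalty-tier string handed to customer_id lands here
            errors.append(f"{name}={args[name]!r} should be {expected.__name__}")
    return errors
```

Feeding these errors back to the model (instead of executing the call) turns a silent wrong-field bug into a visible, recoverable one.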
Surge AI@HelloSurgeAI

Everyone's acting like models are ready to replace humans in work settings. We put that to the test by creating an entire company and having 9 models act as a customer service agent handling 150 tickets and requests of increasing complexity. Verdict: without common sense, models are nowhere near ready. 👇 surgehq.ai/blog/rl-envs-r…

echen
echen@echen·
“engagement” sounds harmless until it becomes the ai’s goal. do we want systems that maximize engagement… or that maximize you? i’ve been thinking about this a lot lately and where our industry is headed >>>
echen@echen

x.com/i/article/1986…

echen
echen@echen·
Reminder that topping Tau2-Bench Telecom or BrowseComp ≠ best agentic model.

Just tried Kimi K2 Thinking out on our own WorldBench eval (150 customer service tasks):
> tasks such as “How many refunds were there in July?”
> or “Check which graphics card is compatible with the parts from my last order and how much would it cost?”

+2% over Kimi K2 Turbo. Still far behind GPT-5 and Sonnet 4.5, but a nice upgrade and impressive for open source (congrats @Kimi_Moonshot)!
echen tweet media
Artificial Analysis@ArtificialAnlys

MoonshotAI has released Kimi K2 Thinking, a new reasoning variant of Kimi K2 that achieves #1 in the Tau2-Bench Telecom agentic benchmark and is potentially the new leading open weights model.

Kimi K2 Thinking is one of the largest open weights models ever, at 1T total parameters with 32B active. K2 Thinking is the first reasoning model release within @Kimi_Moonshot's Kimi K2 model family, following the non-reasoning Kimi K2 Instruct models released in July and September 2025.

Key takeaways:

➤ Strong performance on agentic tasks: Kimi K2 Thinking achieves 93% on 𝜏²-Bench Telecom, an agentic tool use benchmark where the model acts as a customer service agent. This is the highest score we have independently measured. Tool use in long-horizon agentic contexts was a strength of Kimi K2 Instruct, and this new Thinking variant appears to make substantial gains.

➤ Reasoning variant of Kimi K2 Instruct: As per its naming, the model is a reasoning variant of Kimi K2 Instruct. It has the same architecture and the same number of parameters (though different precision) as Kimi K2 Instruct, and like K2 Instruct it only supports text as an input (and output) modality.

➤ 1T parameters but INT4 instead of FP8: Unlike Moonshot’s prior Kimi K2 Instruct releases, which used FP8 precision, this model has been released natively in INT4 precision. Moonshot used quantization-aware training in the post-training phase to achieve this. As a result, K2 Thinking is only ~594GB, compared to just over 1TB for K2 Instruct and K2 Instruct 0905 - which translates into efficiency gains for inference and training. A potential reason for INT4 is that pre-Blackwell NVIDIA GPUs do not support FP4, making INT4 more suitable for achieving efficiency gains on earlier hardware.

Our full set of Artificial Analysis Intelligence Index benchmarks is in progress, and we will provide an update as soon as they are complete.
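The quoted checkpoint sizes check out on the back of an envelope. A rough sketch, assuming every weight is stored at the stated precision and using decimal gigabytes:

```python
params = 1.0e12  # ~1T total parameters

int4_gb = params * 4 / 8 / 1e9  # 4 bits per weight -> bytes -> GB
fp8_gb = params * 8 / 8 / 1e9   # 8 bits per weight -> bytes -> GB

print(f"INT4: ~{int4_gb:.0f} GB, FP8: ~{fp8_gb:.0f} GB")
# A pure-INT4 checkpoint would be ~500 GB; the quoted ~594 GB is consistent
# with some tensors (e.g. embeddings or norms) remaining in higher precision,
# just as ~1TB+ matches an FP8 checkpoint with similar overhead.
```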

echen
echen@echen·
we made gpt-5, claude, and gemini do real wall street work. then we asked 200 finance pros to grade them. one model produced basel capital numbers that would get a real bank fined. 😱 study -> surgehq.ai/blog/finance-e…
echen
echen@echen·
there are many types of intelligence. which type do we want models to have?

yesterday i was analyzing agent trajectories in our rl envs and digging into tool-calling performance (how models use functions):

> gpt-5 made 10x fewer errors than every other model!
> claude made 10x more errors - but it reflected on its errors and fixed them. and so even though its initial tool calls were much worse, its final performance was close to gpt-5’s.

this shouldn't be possible under the standard paradigm. in rl, we're taught that only outcomes matter: the reward signal, the final state, the destination. but that’s why i love digging into individual trajectories to understand what’s going on.

gpt-5 embodies precision intelligence: flawless execution, it doesn't make the mistake in the first place. claude embodies adaptive intelligence: it makes errors… but possesses something possibly rarer - the wisdom to notice and correct them.

it’s like that friend who shows up perfectly dressed, says exactly the right thing, never spills their drink vs. the one who trips walking in, knocks over a plant, makes a joke, and everyone laughs. both are intelligent. which is better? i don't know.

but claude's error-recovery patterns seem closer to metacognition. it's monitoring its execution and thinking about its thinking. gpt-5 may not need this layer right now. its first-order thinking is so accurate that it doesn’t need to reflect. maybe that's fine for now, when problems are straightforward enough to one-shot. but what about when they're not?

(note: i don't know if gpt-5 is equally good at error-recovery when it needs to be. maybe it is! but i've noticed claude's recovery capabilities in the past.)

in an increasingly complex world, where problems get harder and harder, i wonder if resilience will matter more than perfection. do you want the model that never falls, or the model that knows how to stand back up?
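the adaptive pattern described above - fail, read the error, revise, retry - can be sketched as a tiny harness loop. all names here (`run_with_recovery`, `propose_fix`) are hypothetical illustrations, not any real agent framework's API:

```python
def run_with_recovery(call_tool, propose_fix, args, max_retries=2):
    """Execute a tool call; on failure, feed the error message back into a
    reflection step that proposes revised arguments, then retry."""
    for attempt in range(max_retries + 1):
        try:
            return call_tool(**args)
        except ValueError as err:
            if attempt == max_retries:
                raise  # out of retries: surface the error instead of looping
            # the "thinking about its thinking" step: revise args using the
            # error text (in a real agent, this would be another model call)
            args = propose_fix(args, str(err))
```

under this framing, precision intelligence wins when `call_tool` never raises; adaptive intelligence wins when the first call fails but `propose_fix` is good enough to recover.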
echen tweet media
echen
echen@echen·
was playing around with one of our newest rl envs – interestingly, gpt-5 makes 10x fewer tool-calling errors than every other model.
echen tweet media