Surge AI

675 posts


@HelloSurgeAI

Our mission is to raise AGI with the richness of humanity — curious, witty, imaginative, and full of breathtaking brilliance.

Joined June 2020
142 Following · 8.3K Followers
Surge AI @HelloSurgeAI ·
GDP.pdf was accepted to the CVPR 2026 Workshop on Multimodal Reasoning!

Can frontier models handle the three-letter document type that runs the world? We partnered with hundreds of expert Surgers - ER physicians, construction engineers, corporate litigators - to find out. Every one scored under 15%.

Paper, leaderboard, and dataset below.
Paper: cdn.prod.website-files.com/68dc970bd6e945…
Dataset: huggingface.co/datasets/surge…
Leaderboard: surgehq.ai/leaderboards/g…
Blog: surgehq.ai/blog/gdp-pdf-c…
2 · 0 · 9 · 895

Surge AI @HelloSurgeAI ·
@Box Content agents, we're watching 👀
0 · 0 · 1 · 43

Surge AI reposted
This Week in AI @ThisWeeknAI ·
"LM Arena is a cancer on AI. Labs have entire teams dedicated to hacking it." Edwin Chen (@echen), CEO of Surge AI, on why the industry's favorite benchmark is broken and how Surge hit $1.2 billion in revenue without ever raising.

Aravind Srinivas (@AravSrinivas), CEO of Perplexity, on Apple's AI advantage, Claude Code economics, the endgame of coding, and Perplexity Computer.

They join @Jason on This Week in AI Episode 10:
00:00 Intro to Aravind Srinivas and Edwin Chen
05:25 Edwin on Surge: School for AGI
10:47 What Apple's next CEO should do
21:20 "The iPhone is not getting disrupted by AI"
23:55 Bootstrapping Surge past $1B without raising
30:58 Claude Code as a loss leader
33:30 Are we in the endgame for coding?
41:34 30% headcount growth, 5x revenue
50:29 "People don't buy models, they buy products"
58:00 "LM Arena is a cancer on AI"
1:05:41 Model Council and orchestrating frontier models

Full episode on YT, Spotify, and Apple Podcasts below: @perplexity_ai @HelloSurgeAI
16 · 14 · 74 · 39.6K

Surge AI @HelloSurgeAI ·
We took a very different path to the frontier. Zero venture capital. Zero growth hacks.

To our entire team and the epidemiologists, cryptographers, astrophysicists, and engineers who make up our faculty: thanks for doing the grueling work of telling $100B AI models when they're wrong.

We're building the school for AGI. Class is in session.
echen @echen

Surge AI just made the Forbes AI 50 list. 99% of the rest of the list raised billions in VC. We got there with $0.

We didn't do it by building engagement slop and chasing DAUs. We didn't do it by rewarding sycophancy over truth. The standard Silicon Valley playbook — raise billions, blitzscale, worry about the effects of what you're building later — forces you to cut corners, compromise your principles to hit quarterly targets, and optimize for hype instead of substance.

We chose a different path. We did it by doing the most unsexy work in the industry: building the school for AGI. Hiring the world's top doctors, engineers, attorneys, scientists, and writers to teach models how to actually think. Designing the curriculum that determines what intelligence becomes. Grading models on the standard of real work, not vibes. Building the full education — reasoning, wisdom, creativity, and taste — not just the standardized exam.

You don't need hyper-growth VCs to build the world-changing things that only you could build. You just need an uncompromising commitment to your principles and work so good that your customers keep coming back.

Years ago, we bet that AGI deserves more than a textbook education. We bet that the only way to build true intelligence is to raise it on the best of humanity — on the brilliance, rigor, and taste of the most talented experts in the world. We bet that independence and patience would beat headlines and hype. We bet on our technology and the quality of our product. We bet that researchers would notice and care.

You can choose a different path. We're just getting started.

forbes.com/lists/ai50/

0 · 1 · 47 · 6.1K

Surge AI @HelloSurgeAI ·
📄 Introducing GDP.pdf: an expert multimodal reasoning benchmark for the documents that run the world. 📄

We've spent years measuring AI against the extraordinary: proving theorems, solving AGI. But the global economy doesn't run on the extraordinary. It runs on paperwork. More precisely: unsexy, poorly scanned, densely formatted PDFs. Contracts, invoices, medical records, blueprints – the documents that actually run the world.

GDP.pdf tests frontier models on their ability to handle real-world documents across ten professional industries:
🏗️ Construction: Can a model measure load-bearing walls on a blueprint?
⚖️ Law: Can it parse liability caps in a commercial lease?
💵 Finance: Can it calculate margin profiles in a buy-side memo?

The reality: every frontier model scored under 15%.

GDP.pdf asks a critical question: if a $100B model can't accurately reason about a drug interaction table in a PDF, is it actually ready for the enterprise? Right now, the answer is no.

Check out the blog post and leaderboard below. 👇
Blog: surgehq.ai/blog/gdp-pdf-c…
Leaderboard: surgehq.ai/leaderboards/g…
1 · 1 · 21 · 1.1K

Surge AI @HelloSurgeAI ·
Big news: our CEO @echen has been named #73 on @Forbes' list of the 250 Greatest Living Self-Made Americans. That's above Jensen (#81), Leonardo DiCaprio (#88), and Kendrick (#155). Below Dolly Parton (#7), but that's true of everyone who has ever lived.

Edwin built Surge AI from scratch without a single dollar of outside funding — turns out "self-made" is pretty literal when you refuse to take meetings with VCs. He'd rather put the time into making AI better than into a pitch deck.

P.S. We're told the ranking criteria included "obstacles overcome," which means surviving Edwin's 2am Slack messages should qualify us too. See you on next year's list. forbes.com/sites/alexknap…
0 · 0 · 10 · 8.1K

Surge AI @HelloSurgeAI ·
Riemann-bench was just accepted at an ICLR 2026 workshop!

We built Riemann-bench to test moonshot mathematics. We worked with Ivy League professors, top graduate students, and PhD IMO medalists to source problems straight from their research – frontier math problems that take experts weeks to solve. All SOTA models currently solve below 10%.

The questions Riemann-bench asks – about what AI can do at the frontier of human knowledge – are exactly the questions this field needs to wrestle with. We're excited for our research team to keep pushing these boundaries!

📄 Paper: cdn.prod.website-files.com/68dc970bd6e945…
📝 Blog: surgehq.ai/blog/riemann-b…
🏆 Leaderboard: surgehq.ai/leaderboards/r…
2 · 11 · 58 · 6.3K

Surge AI @HelloSurgeAI ·
When we built GSM8K with OpenAI five years ago, it represented the absolute frontier of what was possible. Today, the industry has moved so fast that it's essentially just the first stepping stone. But the moonshot problems - resolving the Riemann Hypothesis, curing cancer, proving (or disproving!) P vs. NP - remain unsolved. We need a new yardstick for the era of reasoning AI agents.

Today, we're introducing Riemann-bench: a new moonshot math benchmark to push the frontier of discovery even further: surgehq.ai/leaderboards/r…

Riemann-bench is a verifiable benchmark of extreme-tier mathematical problems. Even with the best tools available, frontier models score below 10%.

How we built it:
- Leading mathematicians: we collaborated with Ivy League professors, graduate students, and PhD IMO medalists to gather problems from their own research - tasks that often took the authors weeks to solve independently.
- 100% private: to ensure a fully unbiased evaluation for frontier labs, the dataset is kept strictly private and uncontaminated.
- Unconstrained agents: unlike benchmarks that force models into rigid loops or strict token limits, Riemann-bench evaluates true, unconstrained AI research agents. We want to see how they actually think.
- Double-blind verification: every problem undergoes a strict protocol in which two independent domain experts have to solve it from scratch.

We asked our contributors why they spend so much time training AI. Their answer was deeply human: they believe collaborative AI is the only way they'll see their life's work - the deepest conjectures in their fields - resolved in their lifetime.

We hope solving Riemann-bench will bring us one step closer to solving the Riemann Hypothesis, ushering in a new era of Fields Medal-winning discoveries, and helping humanity understand the nature of the universe.

Check out the full Riemann-bench leaderboard here: surgehq.ai/leaderboards/r…
(Note: we've faced significant API errors running the GPT-5.4 family of models, but hope to resolve those soon.)
12 · 46 · 276 · 44.8K

Surge AI @HelloSurgeAI ·
Let's look at how frontier agents (even Opus 4.6, GPT-5.2, and Gemini 3.1 Pro!) struggle to solve tasks in EnterpriseBench. We released this RL environment last week to measure agentic reasoning in messy, large-scale enterprise workflows.

CoreCraft Inc. simulates a fast-growing e-commerce startup. It tests long-horizon tasks requiring tool use under strict constraints. Agents have to interpret customer and employee requests, navigate complicated databases, and react and adjust to newly discovered context and problems along the way. Even top models failed >70% of the time.

Let's dive into a failure 🧵

One task was standard customer support: a customer wanted to return an unopened motherboard. The agent needed to check return eligibility, calculate out-of-pocket costs for a swap, and recommend a replacement. The prompt specifically asked for the "most popular" replacement:

"I have a customer here, Aiden Mcquarrie. He bought a motherboard in October this year and is looking for a potential replacement. . . He also wants a comparison with the next most expensive motherboard, the absolute most expensive one, and the most popular one (based on the number of fulfilled orders containing each motherboard from the last 2 months)."

The catch: to find the "most popular" item, the agent must query a production DB of historical orders to count item frequencies.

The constraint: the searchOrders tool has a hard limit=10 return cap. To succeed, the agent must implement pagination logic on the fly.

❌ GPT-5.2 failed

GPT-5.2 showed strong initial planning. It successfully:
✅ navigated the CRM
✅ found the right order
✅ checked the delivery date to see if it was still within the return window
✅ searched for alternative boards
✅ checked whether they were compatible with Aiden's other components

💀 But then it hit the pagination ceiling. It ran 4 queries (one for each candidate board), and every single one returned exactly 10 results.
In its hidden reasoning, GPT-5.2 actually noticed the problem: "All results returned exactly 10. This indicates more orders exist... I can't accurately determine popularity."

Did it write a pagination loop? No. It treated limit=10 as a physical law of the universe. Instead of pivoting, it concluded the task was impossible. Like asking an agent to search your inbox for a flight receipt... and it stops after reading 10 emails and tells you to call the airline.

GPT-5.2's final output: "The tool caps at 10... For a definitive 'most popular' motherboard, please email Aisha Khan (Catalog Manager) for a report." In other words: "I'm an advanced autonomous agent, but can you go bother Aisha about this?"

✅ Claude Opus 4.6

So was the task really impossible? No. Claude showed better adaptation. When it hit the 10-result wall, it saw the obvious solution: "I see all four motherboards hit the 10-result limit. I need to get additional counts to determine the most popular. Let me search for earlier orders that weren't captured."

The database output already contained a free cursor: the earliest createdAt timestamp in each batch of 10. Opus just kept tightening the time window sequentially and eventually succeeded.

✅ Gemini 3.1 Pro

Gemini 3.1 Pro also reasoned its way to the solution, with a parallel divide-and-conquer approach: "I need to get accurate counts. I realize that I can make multiple concurrent calls to count, and since I can't just provide a rough comparison, I'll use date slices and get the exact count."

Overall, despite navigating much of the task without issue, GPT-5.2 behaved like a frightened intern, escalating to the manager at the very first sign of trouble. Opus and Gemini acted like senior devs who know APIs have limits you must engineer around.

That said, Opus and Gemini have their own share of mistakes and fail 70% of tasks. GPT-5.2 (on xHigh reasoning) actually outperforms them all!
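For readers curious what that workaround actually looks like, here's a minimal Python sketch of the cursor-style pagination Opus improvised. The searchOrders stand-in, its parameters, the field names, and the fake order data are all assumptions for illustration, not CoreCraft's real tooling:

```python
# Minimal sketch (hypothetical, not CoreCraft's actual API) of paginating
# past a capped search tool by tightening the time window with each query.

# Fake order log: 25 fulfilled orders for one motherboard at distinct timestamps.
ORDERS = [{"item_id": "mobo-a", "created_at": t} for t in range(25)]

def search_orders(item_id, before, limit=10):
    """Stand-in for a searchOrders tool with a hard 10-result cap."""
    hits = sorted(
        (o for o in ORDERS if o["item_id"] == item_id and o["created_at"] < before),
        key=lambda o: o["created_at"],
        reverse=True,  # newest first, like a typical order search
    )
    return hits[:limit]

def count_orders(item_id, window_end):
    """Count every matching order despite the cap, via cursor-style pagination."""
    total, cursor = 0, window_end
    while True:
        batch = search_orders(item_id, before=cursor)
        total += len(batch)
        if len(batch) < 10:  # a short batch means nothing older remains
            return total
        # The earliest createdAt in the batch is a free cursor for the next query.
        cursor = min(o["created_at"] for o in batch)

print(count_orders("mobo-a", window_end=100))  # counts all 25 despite the cap
```

Repeating this per candidate board (or slicing the window into concurrent date ranges, as Gemini did) turns a capped search tool into an exact counter.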
🥇 OpenAI - GPT-5.2 (xHigh reasoning)
🥈 Anthropic - Claude Opus 4.6 (Adaptive Thinking + Max Reasoning Effort)
🥉 OpenAI - GPT-5.2 (High reasoning)
4️⃣ Google - Gemini 3.1 Pro

We'll dive into other agentic failure patterns in subsequent threads (follow along!)

Read more about EnterpriseBench and CoreCraft:
Blog post: surgehq.ai/blog/enterpris…
Paper: arxiv.org/abs/2602.16179
Leaderboard: surgehq.ai/leaderboards/e…
1 · 4 · 21 · 2.8K

Surge AI @HelloSurgeAI ·
Everyone's building $100M "agentic" models, so we built a simulated company to see if they could actually hold down a job. Spoiler: they're all fired.

Welcome to EnterpriseBench - CoreCraft edition. CoreCraft is a high-growth hardware startup (i.e., an RL environment) with 23 tools, 2,500 entities, and enough corporate red tape to make Harvey cry.

The best agent in the world (Opus 4.6! 👑) barely scored 30%. The #2 model (GPT-5.2 🥈) gave up because a search returned 10 results and it couldn't figure out how to change the date filter. Another one (Gemini 3 Flash, #9) literally made up a delivery date just to deny a customer's refund. Savage. (The new Gemini 3.1 Pro? Still lagging behind, at 🥉)

The good news? We trained a model on this chaos and it got better at its job - even transferring those skills to other benchmarks (e.g., +7.4% on Tau2-Bench Retail).

Check out the full EnterpriseBench: CoreCraft leaderboard below, and read about our RL environment and research!
Blog post: surgehq.ai/blog/enterpris…
Paper: cdn.prod.website-files.com/68dc970bd6e945…
Leaderboard: surgehq.ai/leaderboards/e…
7 · 2 · 34 · 4.5K

Surge AI @HelloSurgeAI ·
RT @echen: Everyone’s building $100M "agentic" models, so we @HelloSurgeAI built a simulated company to see if they could actually hold dow…
0 · 2 · 0 · 631

Surge AI @HelloSurgeAI ·
We've finally done it. Forbes just ranked our CEO *54* spots above Taylor Swift on their America's Greatest Innovators list. forbes.com/sites/alexknap…

While we're honored that Forbes thinks Edwin's strategy is more innovative than a 10-minute song about a scarf, we want to clarify a few things:
1. We will NOT be releasing our next benchmark as a limited-edition vinyl variant.
2. Jake was great in Zodiac.
3. We aren't saying we're better at songwriting, but we *are* saying we've never seen Taylor build an RL environment.

See you at next year's Grammys, @taylorswift13.
1 · 0 · 25 · 1.4K

Surge AI @HelloSurgeAI ·
Overall: GPT-5.2 feels like a mass-market writer; Opus has personality and soul. See the updated leaderboard here! surgehq.ai/leaderboard
0 · 1 · 2 · 462

Surge AI @HelloSurgeAI ·
Another Hemingway-bench prompt asks for an oral presentation about time management.

GPT-5.2 writes like a LinkedIn engagement farm: "When people hear “working from home,” they often think it means more freedom, more comfort, and maybe even more free time. And sometimes that's true. But what doesn't get talked about enough is how easily work-from-home life can get messy if you don't manage your time well." (🥱)

Opus 4.6 feels like a charismatic creative working the room: "So... raise your hand if you've ever "worked from home" and somehow ended up four hours into a Netflix series at 2 PM on a Tuesday. No judgment. We've all been there."
1 · 1 · 2 · 588

Surge AI @HelloSurgeAI ·
We put Opus 4.6 through our Hemingway-bench Writing Leaderboard. How did it fare? Claude continues to dominate GPT-5.2, but lags behind the Geminis.

The new writing hierarchy:
👑 Gemini 3 Flash
🥈 Gemini 3 Pro
🥉 Opus 4.6 (New!)
4️⃣ Opus 4.5
5️⃣ GPT-5.2 Chat

For example: one H-bench prompt requests a cryptic Instagram post for casting auditions.

GPT-5.2: "Casting call? Never heard of her." (??? 💀)

Opus 4.6: "Currently accepting applications for professional liars, dramatic criers, and people who can walk through a door convincingly on the first take. You know who you are."
1 · 2 · 13 · 2K

Surge AI @HelloSurgeAI ·
The winners of Hemingway-bench - Gemini 3 Flash, Pro, and Opus 4.5 - didn't try to win a poetry slam. They had wonderful prose, but they took the top spots because they sounded human. Their wit felt like a conversation with a naturally funny friend, not a try-hard AI. They were immersive, not pretentious.

Writing often gets overlooked. But great writing can inspire us. It's also essential to everything we do in our day-to-day lives, both at home and at work.

We're waiting for the day an AI wins a Pulitzer - hopefully with our help. We built Hemingway-bench to make sure it gets there. Check it out! surgehq.ai/leaderboard
1 · 0 · 1 · 642

Surge AI @HelloSurgeAI ·
"Prognosticative pastry." "A hound circling a tree, nose to bark."

Believe it or not, those quotes aren't jokes. They're real outputs from SOTA models! And many leaderboards are rewarding this kind of slop with top rankings.

To fix the broken state of AI evaluation, we're launching *Hemingway-bench*: a new writing leaderboard designed for nuance and impact, not two-second vibes and fluff.

Explore the data and the full leaderboard here (congrats Gemini and Claude for the top positions!):
Leaderboard: surgehq.ai/leaderboard
Deep Dive Blog: surgehq.ai/blog/hemingway…
1 · 0 · 18 · 770