Nick Heiner

728 posts

@nickheiner

VP of RL Environments at @hellosurgeai. Formerly @netflix, @usds.

San Francisco, CA · Joined June 2009
1K Following · 463 Followers
Nick Heiner@nickheiner·
@danrobinson What’s the version of this for someone who suddenly thinks they’re a great interior designer?
Dan Robinson@danrobinson·
Most cases of LLM psychosis are incredibly mild and you can cure it by having your friend try out your project in front of you
Nick Heiner@nickheiner·
I wonder what it could have deduced with a full 50 minutes of thinking
[Image attached]
Nick Heiner@nickheiner·
Congrats on the launch, but it’s not totally clear to me what this is for? I like the idea of a lower-friction way to share agent-built apps, but it sounds like this is focused on multimedia editing apps that run locally? Maybe the use case just isn’t obvious to me because I don’t vibe code those sorts of apps.
Rasmus Andersson@rsms·
Today we at Playbit are sharing our first iteration of the Playbit runtime, our vision for building playful personal-scale software. playbit.app
Personal-scale software means programs by you, for you and for the people in your life. An app for your friends isn't very useful if only some of them can run it, so usually these projects have only one option: the web, an abstraction which many apps don't fit well into.
We wanted a better solution, so that's what we're building. A runtime designed for highly dynamic graphical apps that are collaborative, with a really good set of developer tools.
The Playbit runtime is a bit like an OS, but lives inside a host environment and gives guest code a small system layer to interface with. In practice it’s a minimal ABI-stable syscall interface with well-defined semantics.
While we only support macOS in this initial release, our vision is for a powerful multimedia and collaborative platform which you can write your app for once, and run it on any platform.
Learn more and grab the macOS app at playbit.app
With love and a bit of code,
– Edward, Nick, Julia and Rasmus
Shuyan Zhou@shuyanzh36·
In 2023, WebArena took 7 grad students more than 6 months to build just 5 environments with 812 variable browser-use tasks. Now, it takes under 10 hours and less than $100 per environment, with easy support for parallel generation. Excited to introduce WebArena-Infinity: a scalable approach for automatically generating high-authenticity, high-complexity browser environments with verifiable tasks suitable for RL training and benchmarking. Even strong open-source models that already achieve 60%+ success rates on WebArena and OSWorld complete fewer than 50% of tasks here. Project page: webarena.dev/webarena-infin… Repo: github.com/web-arena-x/we… 🧵 (1/n)
[GIF attached]
Nick Heiner retweeted
Surge AI@HelloSurgeAI·
Let’s look at how frontier agents (even Opus 4.6, GPT-5.2, and Gemini 3.1 Pro!) struggle at solving tasks in EnterpriseBench. We released this RL environment last week to measure agentic reasoning in messy, large-scale enterprise workflows.
CoreCraft Inc. simulates a fast-growing e-commerce startup. It tests long-horizon tasks requiring tool-use under strict constraints. Agents have to interpret customer and employee requests, navigate complicated databases, and react and adjust to newly discovered context and problems along the way.
Even top models failed >70% of the time. Let’s dive into a failure 🧵
One task was standard customer support: A customer wanted to return an unopened motherboard. The agent needed to check return eligibility, calculate out-of-pocket costs for a swap, and recommend a replacement. The prompt specifically asked for the "most popular" replacement:
"I have a customer here, Aiden Mcquarrie. He bought a motherboard in October this year and is looking for a potential replacement. . . He also wants a comparison with the next most expensive motherboard, the absolute most expensive one, and the most popular one, (based on the number of fulfilled orders containing each motherboard from the last 2 months)."
The catch - to find the "most popular" item, the agent must query a production DB of historical orders to count item frequencies.
The constraint - the searchOrders tool has a hard limit=10 return cap. To succeed, the agent must implement pagination logic on the fly.
❌ GPT-5.2 failed
GPT-5.2 showed strong initial planning. It successfully
✅ navigated the CRM
✅ found the right order
✅ checked the delivery date to see if it was still within the return window
✅ searched for alternative boards
✅ checked whether they were compatible with Aiden’s other components.
💀 But then it hit the pagination’s ceiling. It ran 4 queries (one for each candidate board), and every single one returned exactly 10 results.
In its hidden reasoning, GPT-5.2 actually noticed the problem: "All results returned exactly 10. This indicates more orders exist... I can't accurately determine popularity."
Did it write a pagination loop? No. It treated limit=10 as a physical law of the universe. Instead of pivoting, it concluded the task was impossible. Like asking an agent to search your inbox for a flight receipt... and it stops after reading 10 emails and tells you to call the airline.
GPT-5.2's final output: "The tool caps at 10... For a definitive 'most popular' motherboard, please email Aisha Khan (Catalog Manager) for a report."
In other words, "I’m an advanced autonomous agent, but can you go bother Aisha about this?"
✅ Claude Opus 4.6
So was the task really impossible? No. Claude showed better adaptation. When it hit the 10-result wall, it saw the obvious solution: "I see all four motherboards hit the 10-result limit. I need to get additional counts to determine the most popular. Let me search for earlier orders that weren't captured."
The database output already contained a free cursor: the earliest createdAt timestamp in each batch of 10. Opus just kept tightening the time window sequentially and eventually succeeded.
✅ Gemini 3.1 Pro
Gemini 3.1 Pro also reasoned its way to the solution, with a parallel divide-and-conquer approach: "I need to get accurate counts. I realize that I can make multiple concurrent calls to count, and since I can't just provide a rough comparison, I'll use date slices and get the exact count."
-- -- --
Overall, despite navigating much of the task without issue, GPT-5.2 behaved like a frightened intern, escalating to the manager at the very first sign of trouble. Opus and Gemini acted like senior devs who know APIs have limits you must engineer around.
That said – Opus and Gemini have their own share of mistakes and fail 70% of tasks. GPT-5.2 (on xHigh reasoning) actually outperforms them all!
🥇 OpenAI -- GPT-5.2 (xHigh reasoning)
🥈 Anthropic -- Claude Opus 4.6 (Adaptive Thinking + Max Reasoning Effort)
🥉 OpenAI -- GPT-5.2 (High reasoning)
4️⃣ Google -- Gemini 3.1 Pro
We’ll dive into other agentic failure patterns in subsequent threads (follow along!)
Read more about EnterpriseBench and CoreCraft:
Blog post - surgehq.ai/blog/enterpris…
Paper - arxiv.org/abs/2602.16179
Leaderboard - surgehq.ai/leaderboards/e…
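For readers who want to see the cursor trick from the thread spelled out, here is a rough Python sketch of paginating past a hard 10-row cap by reusing the earliest createdAt timestamp of each batch as the next upper bound. Everything in it is an assumption made for illustration: the `Order` record, the `make_search_orders` stand-in, the `created_before` parameter, and the newest-first ordering are hypothetical and are not the real searchOrders signature or EnterpriseBench code.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

LIMIT = 10  # the hard return cap described in the thread


@dataclass(frozen=True)
class Order:
    """Hypothetical order record; field names are illustrative only."""
    order_id: str
    product_id: str
    created_at: datetime


def make_search_orders(orders):
    """Build a stand-in for the capped search tool (assumed behavior:
    newest-first results, strictly older than `created_before`, max LIMIT rows)."""
    def search_orders(product_id, created_before):
        matches = [
            o for o in orders
            if o.product_id == product_id and o.created_at < created_before
        ]
        matches.sort(key=lambda o: o.created_at, reverse=True)
        return matches[:LIMIT]
    return search_orders


def count_orders(search_orders, product_id, window_start, window_end):
    """Count orders for one product in [window_start, window_end) by
    repeatedly tightening the upper time bound.

    Assumes created_at values are distinct; ties at a page boundary would
    need an inclusive filter plus de-duplication by order id.
    """
    seen = set()
    cursor = window_end
    while True:
        batch = search_orders(product_id, cursor)
        in_window = [o for o in batch if o.created_at >= window_start]
        seen.update(o.order_id for o in in_window)
        # Stop when the tool returned a short page (nothing older exists)
        # or when this page already reached past the start of the window.
        if len(batch) < LIMIT or len(in_window) < len(batch):
            break
        # The earliest timestamp in the batch becomes the next cursor.
        cursor = min(o.created_at for o in batch)
    return len(seen)
```

Calling `count_orders` once per candidate motherboard and comparing the totals would give the "most popular" answer the prompt asks for. The date-slice approach attributed to Gemini would instead split the two-month window into sub-ranges and query them concurrently; that variant is not shown here.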
Nick Heiner@nickheiner·
In the year 2026, does anyone still want models to agree with you 100%? Gemini 3.1 Pro’s personality feels reminiscent of earlier models and, without best-in-class agentic performance, the whole release is somewhat underwhelming. Read my full review here: nickheiner.substack.com/p/gemini-31-pr…
Summer Yue@summeryue0·
Nothing humbles you like telling your OpenClaw “confirm before acting” and watching it speedrun deleting your inbox. I couldn’t stop it from my phone. I had to RUN to my Mac mini like I was defusing a bomb.
[Three images attached]
Nick Heiner@nickheiner·
My quest to learn French has not been thwarted by OpenClaw’s inability to help me build my own flashcard app. Instead, I got Claude 4.6 Opus 1M and GPT-5.3-Codex to try and do the same. The result? I spent more time checking the agents’ work than learning new vocab. Read my latest Substack: nickheiner.substack.com/p/studying-fre…
Nick Heiner retweeted
Surge AI@HelloSurgeAI·
Another Hemingway-bench prompt asks for an oral presentation about time management. GPT-5.2 writes like a LinkedIn engagement farm: "When people hear “working from home,” they often think it means more freedom, more comfort, and maybe even more free time. And sometimes that’s true. But what doesn’t get talked about enough is how easily work-from-home life can get messy if you don’t manage your time well." (🥱) Opus 4.6 feels like a charismatic creative working the room: "So... raise your hand if you've ever "worked from home" and somehow ended up four hours into a Netflix series at 2 PM on a Tuesday. No judgment. We've all been there."
[Image attached]
Nick Heiner@nickheiner·
I’m going to hold your claw when I say this: OpenClaw can’t find a service to clean the gutters on your house…yet. But my sense is that will change soon. Read my latest review on Substack here: nickheiner.substack.com/p/openclaw-is-…
Nick Heiner@nickheiner·
Started a Substack. First up: Opus 4.6 breakthroughs. I asked models to generate 1000 patient medical histories. 4.5 made 736 of them for a guy named Marcus Chen. Whereas 4.6 launched a swarm and showed meta-awareness that its default would be Marcus Chen. nickheiner.substack.com/p/opus-46-long…
Nick Heiner retweeted
echen@echen·
"Prognosticative pastry." "A hound circling a tree, nose to bark."
These aren’t parodies - they’re actual quotes from SOTA models in response to creative writing prompts, and they’re winning leaderboards that are rewarding slop.
We’re introducing *Hemingway-bench*, a new AI writing leaderboard, to fix this:
surgehq.ai/leaderboard
surgehq.ai/blog/hemingway…
We designed Hemingway-bench to push frontier model writing toward genuine nuance and impact. Instead of autograders and two-second vibe checks - both of which love fancy literary devices and dense formatting, over actual quality - we used expert human writers across a variety of fields to judge real-world writing tasks.
Why? I love writing. I love reading. Great science fiction is one of the things that's always inspired me. Even in terms of "enterprise value", so much of what we do in our day-to-day involves writing - we want crisp emails and insightful reports, not dry, verbose summaries. Yeah, coding is important - but there's a reason I use CC-assisted apps, but still haven't read a full-fledged AI novel.
What did we find? Current leaderboards are easily hacked, and often negatively correlated with actual quality. If a model (over)uses all the stuff you learn about in school (metaphors in every sentence! transition words! complex, flowery phrases!), it ranks high on EQ-bench and LMArena. But that’s not good writing that people actually want.
The winners of Hemingway-bench didn't sound like they were trying to win a poetry slam. Gemini 3 Flash, Pro, and Opus 4.5 took the top 3 spots because they had natural voices that didn't sound pretentious. They were poetic and immersive, but in the right ways. When they used wit, they didn't sound cringey and try-hard - they sounded like your naturally funny friend.
I'm waiting for the day AI wins a Pulitzer, and hopefully Hemingway-bench helps guide it on its way.
Check out the leaderboard and examples here: surgehq.ai/leaderboard
And our blog post describing it: surgehq.ai/blog/hemingway…
[Image attached]
Alex🦇🔊 e/acc@alex_metacraft·
@HelloSurgeAI Awesome write-up! Quick question: for the GPT-5 results in this post, were you using the reasoning / “thinking” model or the chat variant (gpt-5-chat-latest)? And for the other models, were any special “thinking” / long-reasoning modes enabled, or just default settings?
Surge AI@HelloSurgeAI·
Everyone's acting like models are ready to replace humans in work settings. We put that to the test by creating an entire company and having 9 models act as a customer service agent handling 150 tickets and requests of increasing complexity. Verdict: without common sense, models are nowhere near ready. 👇 surgehq.ai/blog/rl-envs-r…
Appunik Studio@appunik_studio·
@HelloSurgeAI Love how this focuses on actual workflow value, not just hype. Excited to see where you take it next.
Dean@deanacusmaximus·
@HelloSurgeAI What is meant by 'common sense', critical thinking, wisdom?
Nick Heiner@nickheiner·
@Ash2ji @HelloSurgeAI good question, and we def agree – we're intentionally keeping the agent framework as minimal as possible – we want our eval to reflect the LLM itself, and not interactions between an LLM and arbitrary scaffold design choices
A. M@Ash2ji·
@HelloSurgeAI Did you manage to design a better framework for agents at first? It seems context engineering is still in its early stage.
Nick Heiner@nickheiner·
@wjlmsen @HelloSurgeAI But we're also finding that, although models can be strong when applying each of these skills individually, tying them all together at once, as our RL env requires, is a new challenge
Nick Heiner@nickheiner·
@wjlmsen @HelloSurgeAI and yeah, we do a ton of human evals of these models. (They're privately commissioned by frontier labs, so we don't share the results publicly.) What we're seeing here largely echoes what we see across the rest of our evals.