
When you deploy an LLM-as-a-Judge, you’re shipping a classifier into production.
Each new version is a hypothesis about how the model interprets the world.
It’s data science, just expressed in natural language.
Here’s what that looked like for a recent client project where we trained an evaluator to detect a specific agent error type (labeled Category 1 failures) before release.
Dataset
Dev: 104 labeled traces (46 failures, 58 clean)
Eval: 95 labeled traces (34 failures, 61 clean)
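Scoring a judge against splits like these is just confusion-matrix math. A minimal sketch (function and argument names are hypothetical, not from the project):

```python
def score(verdicts, labels):
    """verdicts/labels: parallel lists of bools (True = Category 1 failure).

    Compares the judge's verdicts to the human labels and returns the
    classifier metrics discussed below."""
    tp = sum(v and l for v, l in zip(verdicts, labels))       # caught failures
    fp = sum(v and not l for v, l in zip(verdicts, labels))   # false alarms
    fn = sum(not v and l for v, l in zip(verdicts, labels))   # missed failures
    tn = sum(not v and not l for v, l in zip(verdicts, labels))
    return {
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
    }
```

Run it once on dev and once on eval; the two dicts are what each judge version gets graded on.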
What We Saw
v1 established a clear baseline.
v2 drove recall higher but overfit to the dev set, collapsing generalization.
v3 made surgical adjustments that clarified “when not to trigger,” improving specificity and stability.
v10 is when we started to see a step change in eval-set performance, a sign the judge was beginning to generalize.
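The v2-style failure is catchable mechanically: compare each version's dev and eval numbers and flag a large gap. A sketch, assuming metric dicts like `{"recall": 0.9, ...}` per split (the 0.10 threshold is an illustrative assumption, not a rule from the project):

```python
def generalization_gap(dev_metrics, eval_metrics, key="recall"):
    """Positive gap = the judge does better on dev than on held-out eval."""
    return dev_metrics[key] - eval_metrics[key]

def looks_overfit(dev_metrics, eval_metrics, max_gap=0.10):
    # Flag versions (like v2 above) whose dev recall doesn't transfer.
    return generalization_gap(dev_metrics, eval_metrics) > max_gap
```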
Why It Matters
I find that teams often fall into the trap of assuming the LLM judge works without verifying it against hard data. That's a big mistake: run the numbers and see for yourself. Even with careful preparation, the model still failed to correctly classify more than 80 percent of the actual labeled errors.
A few percent of overfit recall here, a small generalization gap there, and suddenly your CI isn’t filtering what you think it is.
Treat LLM judges like classifiers: versioned, measured, and tuned against held-out data.
That’s how you keep agents honest in production.
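In CI, "keeping agents honest" reduces to a promotion gate on held-out metrics. A minimal sketch (thresholds are illustrative assumptions, not the client's actual bar):

```python
def ci_gate(eval_metrics, min_recall=0.80, min_specificity=0.80):
    """Block promotion of a new judge version unless it clears the
    held-out eval thresholds on both catching failures (recall) and
    not crying wolf (specificity)."""
    return (eval_metrics["recall"] >= min_recall
            and eval_metrics["specificity"] >= min_specificity)
```

If `ci_gate` returns False, the new judge version never ships, no matter how good its dev-set numbers looked.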
@HamelHusain @sh_reya
