Jonathan Lebensold

855 posts

@jonlebensold

Helping you hill-climb your agentic system @ Jetty. AI has an evaluation problem and I’m trying to fix it. PhD in privacy and ML, ex-Meta AI, Google.

Montreal, Canada · Joined February 2009

898 Following · 1.4K Followers

Jonathan Lebensold@jonlebensold·1d

Everyone's building complex agent orchestration frameworks. Our most reliable backend is a 442-line markdown file. The sophistication isn't in the format — what it looks like when you build a dashboard for AI hill climbing. open.substack.com/pub/lebensold/…

Jonathan Lebensold reposted

Pomerium@pomerium_io·4d

Join @jonlebensold and @nickytonline next week for a hands-on conversation about agent evaluations and how to systematically improve agent performance. Jon will dig into what a practical runbook for hill-climbing looks like. youtube.com/watch?v=QlwYGf…

Jonathan Lebensold@jonlebensold·14 Apr

BCG ran a pre-registered AI experiment with 758 consultants. Inside AI's frontier: +40% quality, 25% faster. Outside the frontier: 19% MORE errors. The terrifying part: teams can't tell which tasks fall where without deliberately building infrastructure to find out. open.substack.com/pub/lebensold/…

Jonathan Lebensold@jonlebensold·31 Mar

We're also working on this with our agent runbooks: lebensold.substack.com/p/runbooks-wha…

Ryan Marten@ryanmart3n

Evals are the verifier for the agent building process. Once you have an eval, you can autonomously hill climb it with Meta-Harness (agents building agents). @karpathy’s autoresearch (agent building models) is another piece. To complete the full autonomous loop of agent development we need Meta-Measure / autobenchmark (agents building evaluations). We have started automating the task discrimination (quality control) part of the benchmark development process (in our efforts for Terminal-Bench 3.0), but the real unlock will be cracking task generation. DM me if you are interested in autobenchmarking and let’s jam on some ideas!

Jonathan Lebensold@jonlebensold·31 Mar

L. Peter Deutsch said you can't have more than 50 visual primitives on screen at once. AI workflow builders blow past that limit before you've handled the happy path. open.substack.com/pub/lebensold/…

Jonathan Lebensold@jonlebensold·27 Mar

While the industry builds complex agent frameworks, the most reliable orchestration I've found is a markdown file with a rubric, an iteration cap, and a bash verification script. Runbooks > orchestrators. open.substack.com/pub/lebensold/…

Jonathan Lebensold@jonlebensold·23 Mar

@MugenXBT exactly. This is a new frontier in the generation-verification gap.

Mugen@MugenXBT·23 Mar

@jonlebensold human review becomes the last gate not the whole process

Jonathan Lebensold@jonlebensold·23 Mar

Not all evals are about text. Here's an example of a Jetty.io runbook that evaluates agent drawings in Figma. Now that agents can self-correct, we need rubrics that focus on outcomes, not just traces.

Jonathan Lebensold@jonlebensold·23 Mar

@MugenXBT Where we've seen this really land is where the agent uses the eval iteratively to improve assets before human eyes get involved.

Mugen@MugenXBT·23 Mar

@jonlebensold visual eval was always going to be necessary, just took agents getting good enough to make it urgent

Jonathan Lebensold@jonlebensold·16 Mar

Getting this email from perfectday.nyc was the best part of opening my inbox. Thanks @kaseyklimes

Jonathan Lebensold@jonlebensold·12 Mar

You can now set up LLM judge workflows in minutes from Claude Code: skills.sh/jettyio/agent-…

Jonathan Lebensold@jonlebensold·11 Mar

Token costs fell 30–80% in 12 months. But the cost to verify AI output didn't drop at all. That gap is the real story of the MTok crash. open.substack.com/pub/lebensold/…

Jonathan Lebensold reposted

Siva Reddy@sivareddyg·10 Mar

Montreal deep tech scene is getting hot!! Many recent hires of Cohere, Mistral, Periodic Labs, Poolside are all based in Montreal. And now, AMI will have an office here 🔥 It's a no-brainer, though. @Mila_Quebec has the highest concentration of deep learning expertise with interdisciplinary connections. Thanks to recent US regulation changes on immigration, no more brain drain! Let's build more in Canada!

Yann LeCun@ylecun

Unveiling our new startup Advanced Machine Intelligence (AMI Labs). We just completed our seed round: $1.03B / 890M€, one of the largest seeds ever, probably the largest for a European company. We're hiring! [the background image is the Veil Nebula - a picture I took from my backyard, most appropriate for an unveiling] More details here: techcrunch.com/2026/03/09/yan…

Jonathan Lebensold@jonlebensold·10 Mar

@komorama "all rind and no pulp"—love it

Alex Komoroske@komorama·9 Mar

I just published my weekly reflections: docs.google.com/document/d/1xR…. Software's Hall-Héroult moment. Micro-intellectual gravity. An armada of slop cannons. Token furnaces. Hyper-productivity cage. Escape hatch abduction. The sharpness of schelling points. Pace layer induction. Shifting scarcity. Cognitive debt. Induced pull. Finding Grubby Truffles by smell.

Jonathan Lebensold@jonlebensold·10 Mar

@ben_mathes do you see the claude.ai usage tracker as a target or a budget?

🅱🅔🅝@ben_mathes·9 Mar

TFW you have multiple local agents all doing data ingestion and normalization

Jonathan Lebensold reposted

kasey@kaseyklimes·6 Mar

realized we had our first power user this morning
then discovered he was in town from california
we piled in a lyft to race across town to meet him
learned he was using the product in ways we didn’t even realize were possible, and that it had totally transformed his workflow 🥲

Jonathan Lebensold reposted

Josh Greaves@joshgreaves_ml·5 Mar

Just released frfr—lightweight runtime type validation for Python. One function. Zero dependencies. 57KB installed. Validates your dataclasses, TypedDicts, and NamedTuples directly. That's the whole API. Here's why this exists:

Jonathan Lebensold@jonlebensold·5 Mar

A team went from 50 assets/campaign to 4,000. Same three reviewers. They spot-check 2% and ship the rest unchecked. This is the real AI content problem — not generation quality, but compliance at scale. open.substack.com/pub/lebensold/…

Jonathan Lebensold@jonlebensold·5 Mar

@alexgshaw do we have to be IRL?

Alex Shaw@alexgshaw·5 Mar

Consider joining us for a systems reading group about TB and Harbor!

Shreya Shekhar@_shreya_s

Excited to kick off this year’s Systems Reading Group series with @harborframework and @terminalbench! Top frontier labs, data vendors, and AI cos are moving to Harbor for their RL infra and evals. Come by to learn why, and dive into key components of their architecture with creators @alexgshaw & @ryanmart3n! Sign up below for the event on 3/10 👉 luma.com/wkdfbw17

Jonathan Lebensold reposted

Bo Wang@BoWang87·3 Mar

Prof. Donald Knuth opened his new paper with "Shock! Shock!" Claude Opus 4.6 had just solved an open problem he'd been working on for weeks — a graph decomposition conjecture from The Art of Computer Programming. He named the paper "Claude's Cycles." 31 explorations. ~1 hour. Knuth read the output, wrote the formal proof, and closed with: "It seems I'll have to revise my opinions about generative AI one of these days." The man who wrote the bible of computer science just said that. In a paper named after an AI. Paper: cs.stanford.edu/~knuth/papers/…
