Jonathan Lebensold

855 posts

@jonlebensold

Helping you hill-climb your agentic system @ Jetty. AI has an evaluation problem and I’m trying to fix it. PhD in privacy and ML, ex-Meta AI, Google.

Montreal, Canada · Joined February 2009
898 Following · 1.4K Followers
Jonathan Lebensold
Jonathan Lebensold@jonlebensold·
Everyone's building complex agent orchestration frameworks. Our most reliable backend is a 442-line markdown file. The sophistication isn't in the format — what it looks like when you build a dashboard for AI hill climbing. open.substack.com/pub/lebensold/…
1
0
1
60
Jonathan Lebensold retweeted
Pomerium
Pomerium@pomerium_io·
Join @jonlebensold and @nickytonline next week for a hands-on conversation about agent evaluations and how to systematically improve agent performance. Jon will dig into what a practical runbook for hill-climbing looks like. youtube.com/watch?v=QlwYGf…
0
1
2
280
Jonathan Lebensold
Jonathan Lebensold@jonlebensold·
BCG ran a pre-registered AI experiment with 758 consultants. Inside AI's frontier: +40% quality, 25% faster. Outside the frontier: 19% MORE errors. The terrifying part: teams can't tell which tasks fall where without deliberately building infrastructure to find out. open.substack.com/pub/lebensold/…
0
0
1
80
Jonathan Lebensold
Jonathan Lebensold@jonlebensold·
We're also working on this with our agent runbooks: lebensold.substack.com/p/runbooks-wha…
Ryan Marten@ryanmart3n

Evals are the verifier for the agent building process. Once you have an eval, you can autonomously hill climb it with Meta-Harness (agents building agents). @karpathy’s autoresearch (agent building models) is another piece. To complete the full autonomous loop of agent development we need Meta-Measure / autobenchmark (agents building evaluations). We have started automating the task discrimination (quality control) part of the benchmark development process (in our efforts for Terminal-Bench 3.0), but the real unlock will be cracking task generation. DM me if you are interested in autobenchmarking and let’s jam on some ideas!

0
0
1
113
Jonathan Lebensold
Jonathan Lebensold@jonlebensold·
While the industry builds complex agent frameworks, the most reliable orchestration I've found is a markdown file with a rubric, an iteration cap, and a bash verification script. Runbooks > orchestrators. open.substack.com/pub/lebensold/…
1
0
1
63
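The runbook pattern in the post above (a rubric, an iteration cap, and a bash verification script) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the Jetty implementation: `run_runbook`, `shell_check`, `RunResult`, and the shape of the rubric (a mapping from criterion name to a pass/fail check) are all hypothetical names chosen for this example.

```python
import subprocess
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class RunResult:
    passed: bool               # True if every rubric criterion passed
    iterations: int            # number of attempts actually made
    history: List[List[str]]   # failed criteria after each attempt


def shell_check(command: str) -> Callable[[str], bool]:
    """Wrap a bash command as a rubric check (hypothetical helper).

    The draft is sent to the command on stdin; exit code 0 counts as a pass.
    """
    def check(draft: str) -> bool:
        proc = subprocess.run(
            ["bash", "-c", command], input=draft, text=True, capture_output=True
        )
        return proc.returncode == 0
    return check


def run_runbook(
    attempt: Callable[[List[str]], str],
    rubric: Dict[str, Callable[[str], bool]],
    max_iterations: int = 3,
) -> RunResult:
    """Ask the agent for a draft, score it, and retry until it passes or the cap is hit.

    `attempt` stands in for the agent: it receives the names of the criteria
    that failed on the previous round and returns a new draft.
    """
    feedback: List[str] = []
    history: List[List[str]] = []
    for i in range(1, max_iterations + 1):
        draft = attempt(feedback)
        feedback = [name for name, check in rubric.items() if not check(draft)]
        history.append(feedback)
        if not feedback:
            return RunResult(True, i, history)
    return RunResult(False, max_iterations, history)
```

A rubric entry such as `{"has_title": shell_check("grep -q '^# '")}` would plug a bash check into the loop. The iteration cap is what keeps a failing agent from retrying forever, and the returned `history` gives a human reviewer the failed criteria from every round.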
Mugen
Mugen@MugenXBT·
@jonlebensold human review becomes the last gate not the whole process
1
0
0
23
Jonathan Lebensold
Jonathan Lebensold@jonlebensold·
Not all evals are about text. Here's an example of a Jetty.io runbook that evaluates agent drawings in Figma. Now that agents can self-correct, we need rubrics that focus on outcomes, not just traces.
1
0
4
132
Jonathan Lebensold
Jonathan Lebensold@jonlebensold·
@MugenXBT Where we've seen this really land is where the agent uses the eval iteratively to improve assets before human eyes get involved.
1
0
1
20
Mugen
Mugen@MugenXBT·
@jonlebensold visual eval was always going to be necessary, just took agents getting good enough to make it urgent
1
0
1
41
Jonathan Lebensold retweeted
Siva Reddy
Siva Reddy@sivareddyg·
Montreal deep tech scene is getting hot!! Many recent hires of Cohere, Mistral, Periodic Labs, Poolside are all based in Montreal. And now, AMI will have an office here 🔥 It's a no-brainer, though. @Mila_Quebec has the highest concentration of deep learning expertise with interdisciplinary connections. Thanks to recent US regulation changes on immigration, no more brain drain! Let's build more in Canada!
Yann LeCun@ylecun

Unveiling our new startup Advanced Machine Intelligence (AMI Labs). We just completed our seed round: $1.03B / 890M€, one of the largest seeds ever, probably the largest for a European company. We're hiring! [the background image is the Veil Nebula - a picture I took from my backyard, most appropriate for an unveiling] More details here: techcrunch.com/2026/03/09/yan…

19
49
755
72.2K
Alex Komoroske
Alex Komoroske@komorama·
I just published my weekly reflections: docs.google.com/document/d/1xR…. Software's Hall-Héroult moment. Micro-intellectual gravity. An armada of slop cannons. Token furnaces. Hyper-productivity cage. Escape hatch abduction. The sharpness of schelling points. Pace layer induction. Shifting scarcity. Cognitive debt. Induced pull. Finding Grubby Truffles by smell.
1
1
7
467
🅱🅔🅝
🅱🅔🅝@ben_mathes·
TFW you have multiple local agents all doing data ingestion and normalization
1
0
2
313
Jonathan Lebensold retweeted
kasey
kasey@kaseyklimes·
realized we had our first power user this morning, then discovered he was in town from california. we piled in a lyft to race across town to meet him. learned he was using the product in ways we didn’t even realize were possible, and that it had totally transformed his workflow 🥲
3
2
62
7.4K
Jonathan Lebensold retweeted
Josh Greaves
Josh Greaves@joshgreaves_ml·
Just released frfr—lightweight runtime type validation for Python. One function. Zero dependencies. 57KB installed. Validates your dataclasses, TypedDicts, and NamedTuples directly. That's the whole API. Here's why this exists:
4
6
20
930
Jonathan Lebensold
Jonathan Lebensold@jonlebensold·
A team went from 50 assets/campaign to 4,000. Same three reviewers. They spot-check 2% and ship the rest unchecked. This is the real AI content problem — not generation quality, but compliance at scale. open.substack.com/pub/lebensold/…
0
0
0
57
Alex Shaw
Alex Shaw@alexgshaw·
Consider joining us for a systems reading group about TB and Harbor!
Shreya Shekhar@_shreya_s

Excited to kick off this year’s Systems Reading Group series with @harborframework and @terminalbench! Top frontier labs, data vendors, and AI cos are moving to Harbor for their RL infra and evals. Come by to learn why, and dive into key components of their architecture with creators @alexgshaw & @ryanmart3n! Sign up below for the event on 3/10 👉 luma.com/wkdfbw17

1
2
22
2.1K
Jonathan Lebensold retweeted
Bo Wang
Bo Wang@BoWang87·
Prof. Donald Knuth opened his new paper with "Shock! Shock!" Claude Opus 4.6 had just solved an open problem he'd been working on for weeks — a graph decomposition conjecture from The Art of Computer Programming. He named the paper "Claude's Cycles." 31 explorations. ~1 hour. Knuth read the output, wrote the formal proof, and closed with: "It seems I'll have to revise my opinions about generative AI one of these days." The man who wrote the bible of computer science just said that. In a paper named after an AI. Paper: cs.stanford.edu/~knuth/papers/…
154
1.9K
9.1K
1.4M