Ryan

3.3K posts

@_PaperMoose_

CTO @heynoah. Built ARC-AGI 2 evals @gregkamrad.

SF · Joined August 2017
1.6K Following · 1.3K Followers
Pinned Tweet
Ryan@_PaperMoose_·
When you deploy an LLM-as-a-Judge, you’re shipping a classifier into production. Each new version is a hypothesis about how the model interprets the world. It’s data science, just expressed in natural language.

Here’s what that looked like for a recent client project where we trained an evaluator to detect a specific agent error type (labeled Category 1 failures) before release.

Dataset
Dev: 104 labeled traces (46 failures, 58 clean)
Eval: 95 labeled traces (34 failures, 61 clean)

What We Saw
v1 established a clear baseline.
v2 drove recall higher but overfit to the dev set, collapsing generalization.
v3 made surgical adjustments that clarified “when not to trigger,” improving specificity and stability.
v10 is when we started to see a step change in eval-set performance, a sign the judge was beginning to generalize.

Why It Matters
I find that teams often fall into the trap of assuming the LLM works without verifying it through hard data. This is a big mistake! Look at the numbers below and see for yourself. Even with careful preparation, the model still fails to correctly classify more than 80 percent of actual labeled errors. A few percent of overfit recall here, a small generalization gap there, and suddenly your CI isn’t filtering what you think it is.

Treat them like classifiers: versioned, measured, and tuned against held-out data. That’s how you keep agents honest in production.

@HamelHusain @sh_reya
Ryan tweet media
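Scoring each judge version like any binary classifier can be sketched as below. The harness is illustrative, not the client project's actual code; it only assumes per-trace boolean labels and predictions (True = Category 1 failure).

```python
# Minimal sketch: score one judge version as a binary classifier.
# Labels/predictions are booleans per trace (True = Category 1 failure).
# Illustrative harness, not the client project's actual code.

def score(labels: list[bool], predictions: list[bool]) -> dict[str, float]:
    """Precision, recall, and specificity for one judge version."""
    tp = sum(l and p for l, p in zip(labels, predictions))
    fp = sum(p and not l for l, p in zip(labels, predictions))
    fn = sum(l and not p for l, p in zip(labels, predictions))
    tn = sum(not l and not p for l, p in zip(labels, predictions))
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,       # overfit recall shows up here
        "specificity": tn / (tn + fp) if tn + fp else 0.0,  # "when not to trigger"
    }

def generalization_gap(dev: dict[str, float], ev: dict[str, float]) -> float:
    """Dev-minus-eval recall; a large positive gap means the judge overfit."""
    return dev["recall"] - ev["recall"]
```

Running every version against both the dev and held-out splits, and tracking the gap between them, is what turns "the judge seems fine" into a measured claim before it gates CI.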
Ryan@_PaperMoose_·
given all the delve news today I want to share my experience with its chatbot back in feb. This instruction... was not accurate. The team tab had a green continue button that did nothing.
Ryan tweet media
Ryan@_PaperMoose_·
the ai agent amnesia problem

you dispatch 5 agents to work on different tasks. they finish. you come back and ask "what happened?" nothing. logs gone. worktrees cleaned up. zero trace they ever existed.

so you built the orchestrator but forgot to build the memory.

just shipped persistent agent history for dispatch. every agent now records its lifecycle - launch, completion, stop, cleanup - to a file that survives cleanup. the orchestrator can finally answer "what did my agents do?" without detective work through git branches and gh pr commands.

turns out the hard part of multi-agent systems isn't launching agents. it's knowing what they did after they're done.
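The persistent-history idea is simple to sketch as an append-only JSONL log that lives outside any worktree. The path and event fields below are assumptions for illustration, not dispatch's actual on-disk format.

```python
import json
import time
from pathlib import Path

# Sketch of persistent agent lifecycle history as an append-only JSONL log.
# The field names are assumptions, not dispatch's actual format.

def record(history: Path, agent_id: str, event: str, **details: str) -> None:
    """Append one lifecycle event (launch/completion/stop/cleanup).
    The log lives outside the worktree, so it survives cleanup."""
    history.parent.mkdir(parents=True, exist_ok=True)
    entry = {"ts": time.time(), "agent": agent_id, "event": event, **details}
    with history.open("a") as f:
        f.write(json.dumps(entry) + "\n")

def what_happened(history: Path) -> list[dict]:
    """Answer "what did my agents do?" without git-branch detective work."""
    if not history.exists():
        return []
    return [json.loads(line) for line in history.read_text().splitlines()]
```

Append-only JSONL is a deliberate choice here: each event is one durable write, the file is greppable, and nothing needs to hold a lock or a database open while agents come and go.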
Ryan@_PaperMoose_·
Shared this with the team today. It's amazing what we've built @HeyNoahAI
Ryan tweet media
Ryan reposted
Ashish Toshniwal@ashishtoshniwal·
Introducing the world's first SMS/Voice executive AI assistant @HeyNoahAI, designed for very busy people who deeply care about their professional relationships.

Noah waitlists 7 out of 10 people, depending on their calendar

RT + comment "NOAH" and I'll send you the VIP onboarding link for FREE.
Ryan@_PaperMoose_·
the hardest part of multi-agent coding isn't the AI. it's the plumbing.

spent this morning debugging why dispatch (our agent orchestrator) was sending prompts to the wrong terminal. an agent meant to investigate sentry timeouts ended up typing into a completely unrelated workspace that was merging a PR.

root cause: git rev-parse --show-toplevel returns the worktree root, not the main repo. so when you spawn agents from inside an agent's worktree, everything nests wrong and workspace IDs resolve to whatever's closest.

three fixes in dispatch v0.6.2:
- gitRoot() now uses --git-common-dir to always find the real repo
- createSession returns the workspace ID directly instead of re-resolving it (which is where the wrong-workspace bug lived)
- closing a tab auto-cleans the worktree. branch stays if unmerged.

the boring infra work is what makes "run 5 agents in parallel" actually reliable. nobody tweets about git plumbing but it's the difference between agents that work and agents that trash each other's state.

dispatch is open source: github.com/paperMoose/dis…
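The --show-toplevel vs --git-common-dir distinction can be sketched like this. It is a Python stand-in to show the difference between the two commands, not dispatch's actual gitRoot() implementation.

```python
import subprocess
from pathlib import Path

# Python stand-in for the gitRoot() fix, showing why --git-common-dir
# resolves the MAIN repo while --show-toplevel only finds the worktree.
# Not dispatch's actual implementation.

def main_repo_root(cwd: Path, common_dir: str) -> Path:
    """Turn `git rev-parse --git-common-dir` output into the main repo root.
    Inside a linked worktree the common dir points at the main checkout's
    .git directory; --show-toplevel would return the worktree root instead."""
    p = Path(common_dir)
    if not p.is_absolute():
        p = (cwd / p).resolve()   # e.g. plain ".git" in the main checkout
    return p.parent if p.name == ".git" else p

def git_root(cwd: str = ".") -> Path:
    """Resolve the real repo root even when invoked from inside a worktree."""
    common = subprocess.run(
        ["git", "rev-parse", "--git-common-dir"],
        cwd=cwd, capture_output=True, text=True, check=True,
    ).stdout.strip()
    return main_repo_root(Path(cwd).resolve(), common)
```

Keeping the path math in a pure helper (main_repo_root) means the nesting bug is testable without spinning up real worktrees.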
Ryan reposted
Annie ❤️‍🔥@AnnieLiao_2000·
Excited to introduce the Women in AI Booster Pack! The future of AI should be built together 🚀.

We have teamed up with your favourites in the AI ecosystem to bring this collaboration together: @NotionHQ, @v0, @Vapi_AI, @ExaAILabs, @magicpatterns, @mintlify, @composio, @gumloop, @mem0ai, @firecrawl, @PrefectIO + more!

It's $10K+ in perks, AI workshops and a closed learning community for cracked women builders. The goal: democratize access to AI upskilling and foundational AI tools to more women around the world.

Redeem below or tag a friend who should know about this 🧵
Annie ❤️‍🔥 tweet media
Ryan@_PaperMoose_·
@swyx @simonlast I’ll admit it’s been 6 months since I last used Notion AI, but my experience was pretty poor. Obsidian and Claude Code were much nicer. Have they caught up?
swyx@swyx·
We're having the Notion AI team (including at long last @simonlast) on the pod Thursday. send me all your questions on this + Notion AI! not an ad, just a fan. Notion is probably the most impt knowledge work agent lab in the world.
Sarah Sachs@sarahmsachs

x.com/i/article/2031…

Ryan@_PaperMoose_·
today i had a prod outage and 4 separate issues to investigate at the same time

instead of context switching between all of them i dispatched 4 ai agents in parallel, each in its own git worktree, each with full codebase access

one investigated stuck tickets
one traced a missed morning briefing
one diagnosed a sentry timeout
one researched a concurrency question about our infra

i just watched them work in separate terminal tabs while i handled the actual fix

this is the part of ai coding ppl don't talk about enough. it's not "ai writes my code." it's "i mass-parallelize investigation & diagnosis so i can focus on the decision that matters"

the bottleneck was never typing. it was attention.
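The fan-out pattern above is simple to sketch. The worktree paths, branch naming, and the `claude -p <task>` invocation below are assumptions for illustration, not dispatch's actual code.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from functools import partial

# Sketch of the fan-out: one git worktree + one agent per investigation.
# Worktree paths, branch names, and the `claude -p` invocation are
# assumptions for illustration, not dispatch's actual implementation.

def run_agent(repo: str, task: str) -> str:
    """Give one agent an isolated checkout, then hand it the task."""
    slug = task.lower().replace(" ", "-")[:40]
    tree = f"/tmp/worktrees/{slug}"
    subprocess.run(
        ["git", "-C", repo, "worktree", "add", "-b", f"agent/{slug}", tree],
        check=True,
    )
    done = subprocess.run(["claude", "-p", task], cwd=tree,
                          capture_output=True, text=True)
    return done.stdout

def dispatch_parallel(tasks: list[str], runner) -> list[str]:
    """Run every investigation concurrently; the human keeps the attention."""
    with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
        return list(pool.map(runner, tasks))

# e.g. dispatch_parallel(four_tasks, partial(run_agent, "~/src/app"))
```

Threads are enough here because each agent is a blocking subprocess; the isolation comes from the per-agent worktree, not from the executor.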
Ryan reposted
Santiago@svpino·
The secret nobody tells you about agents is how much they fail behind the scenes. Always-on agents are not reliable. The demos you see don't tell you the whole picture. Ask anybody who's tried one of the modern AI personal assistants, and you'll see what I mean.
swyx@swyx·
we just recorded what might be the single most impactful conversation in the history of @latentspacepod iff you take @_lopopolo seriously and literally everything about @OpenAI Frontier, Symphony and Harness Engineering. its all of a kind and the future of the AI Native Org
swyx tweet media
OpenAI Developers@OpenAIDevs

📣 Shipping software with Codex without touching code. Here’s how a small team steering Codex opened and merged 1,500 pull requests to deliver a product used by hundreds of internal users with zero manual coding. openai.com/index/harness-…

Thorsten Ball@thorstenball·
Eerie feeling: Talking to people at software companies and getting the impression that they're still acting like it's 2022. Huge teams, roadmaps, product vs. eng vs. design, "haha that'll take a while", AI seen as a "new" thing, no urgency.
Ryan@_PaperMoose_·
shipped 76 PRs today. one person.

not refactors or linting. bug fixes, new features, infra scaling, prompt tuning, sentry fixes, security patches. real production changes.

the workflow:
- user calls in with feedback about our AI assistant
- i turn it into tickets
- dispatch 4 agents in parallel, each in its own git branch
- they investigate, write code, open PRs
- i review, merge, move to the next batch

one user said our AI told her most important investor "she's not available" when she would've taken a 5am call for that person. by end of day we had a full VIP scheduling system shipped.

then our CEO slacked me "frequency isn't importance. my team was high frequency but my customers were VIP. i'd book over team for customers."

that feedback became a new ticket, dispatched to an agent, and shipping 30 minutes later.

the human's job isn't writing code anymore. it's:
1. deciding what matters
2. batching the work
3. routing agents
4. verifying output
5. communicating results

the loop is: feedback > tickets > dispatch > ship > communicate > more feedback

that loop used to take a sprint. now it takes an afternoon.
Ryan@_PaperMoose_·
i love cmux. if you're running ai coding agents you need this terminal.

cmux is a ghostty-based macos terminal built for ai agents. vertical tabs, built-in browser, and a unix socket api that lets you script everything.

the primitives are what make it awesome:
- send/send-key to type into any workspace
- set-status with icons & colors in the sidebar
- notify for attention rings on tabs
- read-screen to see what an agent is doing
- claude-hook that tracks session state automatically

i plugged these into dispatch, my multi-agent orchestrator. now when i launch 10+ claude code agents: dispatch creates a worktree, opens a cmux workspace, launches claude, waits for the prompt, types the task, hits enter. then clears its status and lets cmux's native claude hooks take over.

so the sidebar just shows me reality. which agents are running, which need input, which are done. no polling, no stale state.

before this i was cycling through tmux panes trying to remember what each agent was doing. now i glance at the sidebar.

@manaflowai built exactly the tool i needed
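A client for a unix socket API like that might look like the sketch below. The socket path and the newline-delimited JSON wire format here are guesses for illustration only; they are NOT cmux's documented protocol.

```python
import json
import socket

# Illustrative client for a unix-socket command API. The socket path and
# the newline-delimited JSON framing are assumptions, NOT cmux's actual
# protocol; consult the cmux docs for the real wire format.

CMUX_SOCKET = "/tmp/cmux.sock"  # hypothetical path

def frame(command: str, **args) -> bytes:
    """Encode one primitive (send, set-status, notify, ...) as a JSON line."""
    return (json.dumps({"cmd": command, **args}) + "\n").encode()

def send_command(command: str, **args) -> None:
    """Connect, ship one command, and close; enough to script the sidebar."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
        s.connect(CMUX_SOCKET)
        s.sendall(frame(command, **args))

# e.g. send_command("set-status", workspace="agent-3", icon="hourglass")
```

Splitting framing out of send_command keeps the encoding testable without a live socket, which is the same property that makes these primitives easy to script from an orchestrator.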