Justin H. Johnson

3.5K posts

Justin H. Johnson

@BioInfo

F500 AI exec who still ships. 50 projects in 18 months. Writing the book "Builder-Leader: The AI Exoskeleton That Crosses the Gap"

Washington, DC Katılım Nisan 2009

706 Takip Edilen4.8K Takipçiler

Justin H. Johnson@BioInfo·6h

ai.rundatarun.io/Practical%20Ap…

ZXX

Justin H. Johnson@BioInfo·6h

`~/.claude/projects/` is full of months of Claude Code work that `grep` can't find semantically. So I built mneme: a local LanceDB index over the JSONL with four search modes (vector, fts, hybrid, rerank), wired into Claude Code as a skill. On a real corpus of 9,145 sessions and 125,062 chunks, rerank hits Recall@5 0.926 and MRR 0.864 against 27 hand-curated queries. Right session is the top hit 81.5% of the time. Runs locally. No daemon, no cloud round trip. First index pulls model weights and chews through your history once. After that, incremental. The eval harness ships with it. I tested bge-m3 (four times the params of nomic) overnight against the live numbers. It tied rerank recall but lost MRR by 0.018; vector-only it lost MRR by 0.151. The bigger embedder bought nothing at the top of the ladder and was worse everywhere else. Without the harness I'd have switched on instinct. Twenty minutes to stand up. Repo's public. README walks the install, the Claude Code skill is two files, and the eval reproduces on your own corpus. cc @claudeai @nomic_ai @lancedb

English

Justin H. Johnson@BioInfo·6h

Full post: rundatarun.io/p/opus-48-and-…

English

Justin H. Johnson@BioInfo·6h

Anthropic shipped three things the same day and pretended they were separate. Opus 4.8. A new Claude Code primitive called Dynamic Workflows that writes its own orchestration script and fans out to dozens of subagents with verifiers built in. And a 3x cut to Fast pricing, from $30/$150 per million tokens down to $10/$50. They're one change. The unit of agentic work just moved from one careful model call to dozens of verified ones, and the price finally allows it. The orchestration-first pattern has been possible for a year. LangGraph did it. CrewAI did it. Nobody ran it as a default because fifty parallel agents on the old Fast pricing burned a tank of compute to produce what a senior engineer could have written by hand. At a third of the cost, the math flips. Run it three times from different angles and trust the intersection. The thing nobody is pricing correctly yet is the verifier. A fan-out that finds fifty plausible bugs and forwards all fifty is worse than the single pass it replaced, because now a human triages noise instead of reading code. Anthropic's own examples lean on adversarial verification for a reason. Most builders do not have that discipline yet, and the tooling for it is thin. So the bottleneck moves. The question stops being "is the model good enough yet." It becomes "who is verifying." Wrote up the full breakdown, including what's oversold, on Run Data Run.

English

Justin H. Johnson@BioInfo·17h

Opus 4.8: anthropic.com/claude/opus Dynamic Workflows: claude.com/blog/introduci…

English

Justin H. Johnson@BioInfo·17h

Anthropic shipped Opus 4.8 + Dynamic Workflows today. How I absorb new Claude versions without breaking everything I've built on top: 1. Runtime model alias stays unversioned. Every skill and agent in my setup uses `model: opus`. Claude Code resolves that to whatever's current default. The moment 4.8 flipped, my fleet was on it. No edits. 2. Prompting rules file IS version-pinned. `~/.claude/rules/opus-4-8.md` documents the model's quirks: 4.7's literalism, fewer-subagent bias, adaptive-only thinking. Those traits change between versions. If I named the file `opus-latest.md`, stale guidance would silently attach to a new model I hadn't audited. 3. Upgrade checkpoint: rename the file. That one rename forces me to re-read the changelog, audit which sections still apply, rewrite anything 4.7-only. Today that was: new High default on claude.ai/Cowork, fast mode 3x cheaper, the ultracode setting for auto-fanning to Dynamic Workflows. 4. ultracode stays OFF as default. Workflows can fan out to hundreds of subagents. They also burn quota fast. Opt-in per session, not always-on. Runtime alias unpinned for ease. Rules doc pinned for discipline. Different jobs, different defaults. cc @AnthropicAI @bcherny @_catwu @alexalbert__ @jarredsumner (750k lines, Zig to Rust, 11 days. The Bun port is the Dynamic Workflows headline.)

English

111

Justin H. Johnson@BioInfo·1d

@GoogleDeepMind the bar for 'AI for science' is whether it survives contact with a real clinical or wet-lab workflow, not the demo. been running Gemini grounding on genomics lit this month, solid for the triage step. what's the eval suite behind these tools?

English

121

Google DeepMind@GoogleDeepMind·2d

Our Gemini for Science tools could help scientists unlock their next breakthrough. 🧬

English

116

955

81.1K

Justin H. Johnson@BioInfo·1d

@mattshumer_ running this both directions lately. the 'un-opinionated, well-scoped' bit is what makes it work, over-specify the UI and you just get Codex's taste laundered through Claude. does one scoped handoff template hold up for you or are you retuning it per task?

English

103

Matt Shumer@mattshumer_·2d

Massively useful Codex trick for 10x better frontend: You can ask Codex to use Claude as a sub-agent to have Claude handle frontend/design work. Just say “Use claude -p with an excellent, well-scoped, but un-opinionated (UI/UX-wise) prompt anytime you need a design change).”

English

129

1.6K

182.9K

Justin H. Johnson@BioInfo·1d

matches how we run agents in production. new ones start sandboxed and read-only, the destructive actions (send, delete, deploy) stay human-gated until they earn wider scope. capability-graded beats a static allowlist. hard part is making the ramp legible to whoever's on the hook when it goes wrong.

English

161

Anthropic@AnthropicAI·2d

New on the Engineering Blog: The access and permissions we grant agents should evolve with their capabilities. In our own products, we set these parameters through sandboxing, which limits the scope of any potentially destructive actions. Read more: anthropic.com/engineering/ho…

English

313

283

2.1K

397.2K

Justin H. Johnson@BioInfo·1d

grab it: glyf.cc/markitdown

English

Justin H. Johnson@BioInfo·1d

my VS Code extension MarkItDown (right-click any pdf/docx/pptx, get clean markdown) just shipped a fix that'd been biting people for months. a 3★ review said "half my docx files fail." i assumed the converter was the problem. it wasn't. unpinned `markitdown[all]` was silently resolving to a 2-year-old 0.0.2 build on python 3.13+. one version pin fixed it. v0.3.0 is live.

English

137

Justin H. Johnson@BioInfo·2d

Full piece: rundatarun.io/p/the-atomic-u…

English

Justin H. Johnson@BioInfo·2d

Everyone keeps saying the model is commoditizing. True. Also the least interesting thing happening right now. The shift worth watching is what got built on top once every team could call the same frontier model. A new unit of work showed up, one level above the tool. A tool is the atomic unit of computation. A function, an API call. It executes the same way every time, and that layer didn't change. What changed is that an atomic unit of process appeared above it. The skill. A named, governed piece of procedural knowledge, written in plain English, that a model runs with judgment. Not "upskilling." An actual file you can hold, version, and hand to a model. Software always branched. But every branch had to be written down in advance. A programmer enumerated the cases, and anything outside the list fell through to an error or a human. Rules engines, BPMN, RPA, even an ML classifier deciding inside a space someone trained for. When reality served a case nobody anticipated, the system stopped. A frontier model at the node handles the case nobody wrote down. The branch space is open instead of closed. Rules and hooks still bound it. But the set of things it handles gracefully is no longer the finite list you remembered to type. Stack those units into a procedure and you get a workflow that reasons. Daniel Miessler called a company a graph of algorithms. This is the version where every node thinks. So the leadership question stops being "what can our AI do." It becomes whether you own the library of how your org actually works, as units you can govern, attribute, and retire. The model got cheap. The procedures running on top of it are the asset now.

English

Justin H. Johnson@BioInfo·3d

@mattpocockuk Skill + my changes here: glyf.cc/grill-me

English

Justin H. Johnson@BioInfo·3d

Stole @mattpocockuk's grill-me skill and it's now what I run before any non-trivial plan. It interrogates you one question at a time, recommends an answer for every branch, and goes and reads the code instead of asking when the answer's already there. You come out the other side with the decision tree actually resolved, not a vague plan. Tweaked it for my setup: made it user-invoked only so it won't auto-fire against plan mode, added a stop condition, and a Resolved-decisions summary that drops straight into implementation. Runs before a plan solidifies, not after. Thanks Matt 🙏

English

Justin H. Johnson@BioInfo·5d

Full deep dive: rundatarun.io/p/the-model-is…

English

Justin H. Johnson@BioInfo·5d

ACM just convened its first conference built around agentic systems. CAIS, San Jose, this week, 61 peer-reviewed papers. Read the program and the model is barely in it. That's not a complaint. It's a milestone. The field is making the move that databases, the web, and distributed systems all made once they stopped being demos: the questions shift from "can it do the thing" to "can I trust it, maintain it, and keep it from hurting me." The papers cluster into the three problems that decide whether an agent survives in production: Improve without retraining. The gains now live in memory and prompt structure, not weight updates. Trust what you can't see fail. Agents go wrong silently, normal-looking trace, no error thrown. Govern the control surface. The skills and plugins your agent loads are a software supply chain, with all the trust problems that implies. To a certain kind of nerd, the unglamorous stuff is the good stuff. Schemas, specs, memory systems, capability tracking. That's where the frontier actually moved, and a major venue just built four tracks around it. The full deep dive walks all three movements, the numbers behind them, the ten papers worth your time, and what I'm rebuilding in my own agent setup because of it.

English

111

Justin H. Johnson@BioInfo·5d

@omarsar0 this is the whole bet behind my setup. opus stays untouched, all the leverage is in the harness, the skills, the runtime around it. cheap-model-plus-good-harness beats expensive-model-plus-naive-loop way more often than people expect. which paper is it?

English

elvis@omarsar0·6d

// Adapt the Interface, Not the Model // I am fascinated by the results across my cheap-model-plus-good-harness builds. This new paper also shows good signs of the code-as-agent-harness thesis. The idea is really simple. Do not touch the model. Instead, modify the runtime interface that wraps the frozen LLM. Then convert recurring interaction failures into reusable interventions on the harness side. The paper reports an average relative improvement 88.5% across 7 deterministic environments, 126 model-environment settings, and 18 backbones. A harness learned from one model trajectory generalizes to 17 other backbones. That tells you the harness is capturing environment structure, not model-specific patterns. If you ship agents in production, your harness work is more portable than you might assume. Paper: arxiv.org/abs/2605.22166 Learn to build effective AI agents in our academy: academy.dair.ai

English

274

24.5K

Justin H. Johnson@BioInfo·5d

@vllm_project good call. resume-driven fake PRs are the new spam vector for OSS and they cost maintainers real review time. the "solved a non-existent issue" tell gets harder to spot as the slop gets more fluent

English

3.2K

vLLM@vllm_project·5d

Thanks to the community report, we recently identified a PR github.com/vllm-project/v… that attempted to solve a non-existent issue and was submitted as part of a “PR training” workflow for resume building. The contributor involved has been banned from the vLLM community. This kind of low-signal contribution increases maintainer review overhead and creates unnecessary operational costs for open-source projects. As AI coding agents make generating large volumes of small PRs increasingly cheap, open-source communities will need to explore new ways to preserve contribution quality and reviewer trust. While we are investigating how to deal with AI slop, we continue to highly value contributions from real users solving real production problems. If you have an important contribution that has not yet received maintainer attention, please email us at: pr-review-request@vllm.ai Using a verifiable company or university email, include: - your production or research use case - the problem you encountered - how your contribution addresses it This helps us better prioritize impactful contributions while keeping the vLLM community open and collaborative. As AI makes virtual contributors look increasingly real, authentic human collaboration matters more than ever. vLLM’s mission remains unchanged: to make LLM inference easy, fast, and cheap for everyone — and we will continue working toward that goal.