Justin H. Johnson

3.5K posts

@BioInfo

F500 AI exec who still ships. 50 projects in 18 months. Writing the book "Builder-Leader: The AI Exoskeleton That Crosses the Gap"

Washington, DC · Joined April 2009
700 Following · 4.8K Followers
Justin H. Johnson reposted
Perplexity@perplexity_ai·
Today we're open-sourcing Bumblebee, a read-only scanner for macOS and Linux. It checks developer machines for risky packages, extensions, and AI tool configs. Connected to Computer, it can trigger deeper scans whenever a new supply-chain risk emerges. github.com/perplexityai/b…
Justin H. Johnson@BioInfo·
When a frontier model fails at "reasoning," what is it actually failing at? A new paper from Fesser et al. gives the failure a name and a number. They call it Relational Complexity: the count of entities a model has to hold in mind and bind together to take a single reasoning step. "A is taller than B" is 2. "A is between B and C" is 3. The number climbs as the relations get wider.

Their benchmark, REL, holds vocabulary, length, and task format fixed and moves only that one dial, across pattern puzzles, phylogenetic trees, and molecular isomers.

The result is clean and a little grim. At low complexity the models score around 91%. Push the binding count to 6 and Claude and Gemini drop to roughly 12%. A regression with collinearity controls says relational complexity, not input length and not domain, explains 24 to 44% of the variance. More compute bought 2 to 3%. In-context examples bought 3 to 6%. Tool use made it worse.

Here is why an agent builder should care. A cross-document join that reconciles three sources at once. A plan with four interacting constraints. A loop that holds step two's output while reasoning about step five. Those are not long tasks. They are high-binding tasks, and that is the regime where frontier accuracy falls to a coin flip.

So the next time an agent breaks on something you expected it to handle, the useful question is not "is the model good enough." It is "how many things does this step force it to bind at once." If the answer is four or more, the fix is architectural, not a better model.
Justin H. Johnson@BioInfo·
everyone's gonna fixate on the benchmark number. skip it. 35 hours of autonomous execution and 1000+ tool calls in one run without dropping off is the headline. that's a different class of model, at a third the price of the western labs. the agentic gap is closing fast.
Qwen@Alibaba_Qwen

🚀Qwen3.7-Max just landed at 56.6 on the Artificial Analysis Intelligence Index — a solid 4.8pt jump over Qwen3.6-Max-Preview. @ArtificialAnlys ⚡️Sharper sci reasoning, stronger agentic chops, better coding, and it hallucinates less.

Justin H. Johnson@BioInfo·
awesome. the agents were never the problem, the shared state across them is. i run a standing squad and almost every failure is coordination. stale context, conflicting writes, two agents grabbing the same task. how are you handling shared memory and write conflicts when they run concurrent instead of turn-taking?
Justin H. Johnson@BioInfo·
the hard thing was never the rule, it's the classifier. "protect kids" is easy to write. building the thing that tells kid-context from adult-context on every request, fast enough to gate it, is where the cost and the false positives pile up. everyone argues the rule and skips the detector
Rohan Paul@rohanpaul_ai·
Dario Amodei explains to Oprah how AI safety is tangled with business needs, daily deployment, access control, and policy tradeoffs. Strict child-safety rules, for example, can protect kids but worsen adult use when systems can’t clearly tell the cases apart.
Justin H. Johnson@BioInfo·
live resize is the headline but the fault tolerance is the bit i actually care about. evict a dead rank, redistribute experts, zero downtime. that's MoE you can run in prod without babysitting it. does EPLB rebalance on live load or just even out expert count on resize? hot-expert skew always wrecked the naive version for me
vLLM@vllm_project·
A vLLM MoE deployment's DP/EP topology used to be locked in at launch: scaling or swapping config meant a full restart, in-flight traffic dropped. Elastic Expert Parallelism changes that. One API call resizes a live deployment:

    curl -X POST localhost:8000/scale_elastic_ep \
      -d '{"new_data_parallel_size": 16}'

Under the hood: standby comm groups span the target topology, EPLB redistributes experts across the new EP group, and weights are transferred directly between GPUs over NVIDIA NVLink/RDMA.

The same runtime reconfiguration path is what fault-tolerant serving needs: evict failed ranks, redistribute their experts, bring replacements back, no restart.

Thanks to @NVIDIAAI, Sky Computing, @anyscalecompute, @RedHat_AI, and the community. 📖 vllm.ai/blog/2026-05-1…
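For anyone scripting the resize instead of typing it, here is a minimal Python sketch of the same request as the curl command in the post. The endpoint path and the `new_data_parallel_size` field come from the post; the function names, the `base_url` default, and the timeout are assumptions for illustration. Like `curl -d`, this sends the body with the default form-encoded content type; whether the server expects an explicit `application/json` header is not stated in the post.

```python
import json
import urllib.request


def build_scale_request(new_dp_size: int,
                        base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build (but do not send) the POST described in the post.

    base_url is an assumption: the post only shows localhost:8000.
    """
    body = json.dumps({"new_data_parallel_size": new_dp_size}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/scale_elastic_ep",
        data=body,
        method="POST",
    )


def scale_elastic_ep(new_dp_size: int, timeout_s: float = 600.0) -> str:
    """Send the resize request and return the raw response body.

    Requires a live deployment that exposes this endpoint; the timeout
    value is a guess, since the post does not say how long a resize takes.
    """
    req = build_scale_request(new_dp_size)
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        return resp.read().decode("utf-8")
```

Calling `scale_elastic_ep(16)` against a running server would mirror the curl example; `build_scale_request` is split out so the payload can be inspected without a server.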
Justin H. Johnson@BioInfo·
Spent a weekend putting a LiteLLM gateway in front of every LLM call in my homelab.

Routing and cost attribution turned out to be the easy part. A virtual key is one curl. A model route is one line of YAML. Spend lands in a Postgres table keyed by the workload's alias, for free.

The hard part was everything before that: finding every place my old shared key actually lived. One OpenRouter key was doing double duty across five machines, the gateway's upstream key and the raw key behind a dozen direct callers. Every gateway call used the same master key, so 190,000 spend rows all read alias = null. I couldn't tell which workload spent what.

You can't revoke a key like that until you've mapped every caller, because the second you do, everything breaks at once. That key lived in six places across five hosts. And the map kept fighting back:

- The config file said spend logging was off. The live process env said otherwise. Read /proc/<pid>/environ, not the YAML someone last edited.
- That same stale flag had silently killed logging for 34 hours. A gateway logging nothing looks identical to one with no traffic.
- My biggest cost center was invisible to the audit, because flat-fee coding subscriptions never hit the metered bill.
- A $111 spike that looked like a leak was a legit one-time reindex. The suspected culprit cost $0.60.

The discipline that saves you: never revoke a key that still shows traffic. Named keys turn the dashboard into a map of who hasn't migrated yet. Build the gateway before you have ten keys in twelve places.
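The "read the live process env, not the YAML" check can be sketched in a few lines of Python. On Linux, the per-process environ file under /proc holds NUL-separated KEY=VALUE entries as they were when the process started, so later in-process changes do not show up there. The function names and the variable name used in the usage note are hypothetical; the post does not name the actual flag.

```python
from pathlib import Path
from typing import Optional


def parse_environ(raw: bytes) -> dict[str, str]:
    """Parse the NUL-separated KEY=VALUE format used by /proc environ files."""
    env: dict[str, str] = {}
    for entry in raw.split(b"\0"):
        if not entry:
            continue  # skip the empty chunk after the trailing NUL
        key, sep, value = entry.partition(b"=")
        if sep:  # ignore malformed entries without '='
            env[key.decode(errors="replace")] = value.decode(errors="replace")
    return env


def live_env(pid: int) -> dict[str, str]:
    """Read the start-time environment of a running process (Linux only)."""
    return parse_environ(Path(f"/proc/{pid}/environ").read_bytes())


def drift(config_says: dict[str, str],
          live: dict[str, str]) -> dict[str, tuple[str, Optional[str]]]:
    """Return keys where the edited config and the live process disagree."""
    return {
        key: (expected, live.get(key))
        for key, expected in config_says.items()
        if live.get(key) != expected
    }
```

With a hypothetical flag name, `drift({"HYPOTHETICAL_SPEND_LOGS": "off"}, live_env(pid))` returns a non-empty dict exactly when the running gateway disagrees with what the config file claims. Reading another user's process this way usually needs matching ownership or root.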
Justin H. Johnson@BioInfo·
Google walked onto the I/O stage and shipped roughly a hundred things in two hours: Gemini 3.5 Flash, a video world-model called Omni, a 24/7 personal agent, an agent-first IDE, agentic Search, smart glasses, a price cut on the top tier. 8.7 million YouTube views in a day.

Then I checked Polymarket. The "best AI model end of May" market still reads Anthropic 96%, Google 1%. Eight-plus million dollars of volume, and Google's biggest AI day of the year barely moved it. That gap is the story.

The substance was real. The marquee demo: 93 parallel sub-agents, 15,000 model requests, 2.6 billion tokens, twelve hours, under $1,000 in credits, and out came a working operating system built from scratch. Then they played Doom on it, live. Take the staging with a grain of salt, but the architecture underneath is a genuine bet: many fast cheap agents in parallel, not one expensive monolithic run.

The twist nobody expected: 3.5 Flash isn't cheap anymore. Artificial Analysis clocked it 5.5x costlier than Gemini 3 Flash. For a tier whose whole identity is "the cheap workhorse you run by default," that breaks the assumption.

And the long game flips the near-term read. Polymarket gives Google 76% to hold the #1 model by December 31, and 72% to out-value OpenAI and Anthropic combined by year-end. So the collective bet is specific: Google didn't take the crown on I/O day, but it's the favorite to win on distribution and time.

Volume is not the crown.
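The demo figures are easier to weigh once they are divided through. A back-of-envelope sketch in Python, using only the numbers quoted in the post; it assumes the twelve hours are wall-clock time and treats "under $1,000 in credits" as an upper bound, which may not match list pricing.

```python
# Figures quoted in the post.
agents = 93
requests = 15_000
tokens = 2_600_000_000
hours = 12            # assumed to be wall-clock time
max_cost_usd = 1_000  # "under $1,000", so this is an upper bound

# Average work per request and per agent.
tokens_per_request = tokens / requests   # about 173,333 tokens
requests_per_agent = requests / agents   # about 161.3 requests

# Aggregate throughput across all agents.
tokens_per_second = tokens / (hours * 3600)  # about 60,185 tokens/s

# Upper bound on the blended price actually paid.
usd_per_million_tokens = max_cost_usd / (tokens / 1_000_000)  # about $0.385

print(f"{tokens_per_request:,.0f} tokens/request")
print(f"{requests_per_agent:.1f} requests/agent")
print(f"{tokens_per_second:,.0f} tokens/s aggregate")
print(f"<= ${usd_per_million_tokens:.3f} per million tokens")
```

The per-request average of roughly 173K tokens mixes input and output, so it says more about how much context each call carried than about how much each call generated.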
Justin H. Johnson@BioInfo·
Built my own URL shortener this evening. Self-hosted Shlink on Hetzner, my own domain on the front. Every short link I publish now reads glyf.cc instead of bit.ly.

bit.ly worked fine for years. The dashboard is honest, the redirects are fast. The problem was always the surface. bit.ly/3xK9pQa in a newsletter footer tells the reader nothing about who sent it. The short domain is the brand signal, and that signal was someone else's.

The actual buildout took fifteen minutes. Shlink in three containers, MariaDB sibling, nginx in front, Cloudflare proxied with an origin cert that doesn't expire until 2041, admin UI bound to a Tailscale IP only.

The clean part: the admin panel comes through tailnet, never the public internet. No password gate, no MFA. The credential boundary is "are you on my network."

The hard part was the domain. Three-letter .cc is fully picked clean. I brute-checked 89 unusual q/x/z combinations and zero were available. Domain investors got there years ago.

Four-letter .cc opens up. I used Nymio, the domain tool I've been building, with the refinement loop turned on, and it surfaced glyf.cc. A glyph is a compressed symbol. A short URL is exactly that. Sub-twenty bucks to register.

The deeper reason this matters: every blog post, every newsletter, every social companion now carries my own domain as the brand wrapper. The reader sees glyf.cc. The signal compounds. Third-party links still go through bit.ly because there's no brand mismatch.

If you've been putting off a personal shortener, the work is in the name, not the stack. Both are evening projects.
Justin H. Johnson@BioInfo·
TradingAgents (2412.20138) ships a clean AI-firm topology: bull/bear debate, risk debate, portfolio-manager veto. Copy it. Then a 3-month Q1 2024 mega-cap tech backtest, Sharpe 8.21 on AAPL. The architecture and the rigor are in different papers. arxiv.org/abs/2412.20138
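For context on why a Sharpe of 8.21 over a single quarter deserves scrutiny, here is a minimal sketch of the usual annualized Sharpe calculation. It assumes daily returns, 252 trading periods per year, and the common square-root-of-time scaling; it is not the paper's code, and the paper's exact method is not given in the post.

```python
import math
import statistics


def annualized_sharpe(daily_returns: list[float],
                      daily_risk_free: float = 0.0,
                      periods_per_year: int = 252) -> float:
    """Mean excess return over its sample standard deviation, annualized.

    Needs at least two observations with non-zero variance.
    """
    excess = [r - daily_risk_free for r in daily_returns]
    per_period = statistics.mean(excess) / statistics.stdev(excess)
    return per_period * math.sqrt(periods_per_year)
```

A three-month window is only about sixty daily observations on one stock in one market regime, and the annualization multiplies whatever that sample happened to produce by roughly 15.9, so a favorable quarter can yield a very large number without saying much about other periods.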
Justin H. Johnson@BioInfo·
Built a Claude Code skill in 30 minutes by refusing to let it write code.

Frustration: everyone keeps asking me "have you seen [AI company]?" Their marketing site is fluff. A general LLM gives Wikipedia. I want the substance, framed against my actual role, with three alternatives I'd care about.

Opener prompt was something like:

> I keep hitting this. I solved it once manually by feeding Gemini my context. Now I want it on tap. Brainstorm with me. Ask questions. Don't write code yet.

Claude came back with seven questions instead of seven hundred lines. Always comparative or sometimes solo? Auto-inject vault context or feed inline? Refresh in place or version? AI companies only or anything? Skeptical or neutral? I answered each in one line. The seven Q&A pairs became the spec.

We tested on Fractal Analytics manually before codifying. The test surfaced a gap their site hides (pharma case studies are commercial-heavy, weak on R&D) and a competitor I'd have missed (ZS Max.AI is meaningfully ahead on pharma-vertical agentic product). Critique an artifact, not a spec.

After the test, one structural change: per-company directory instead of single file, room for contacts.md and meeting-notes alongside. The skill picked up that constraint immediately.

Named it last. /dossier won because it's what the output literally is.

The pattern: name the frustration with a specific example, refuse to write code until the questions are answered, test before codifying, take one structural critique, name last.

Total elapsed from prompt to working skill: 30 minutes. Full writeup with the seven questions and the Fractal test in the link below. Skill is in rundatarun if you want to fork it.
Justin H. Johnson@BioInfo·
Polymarket has Claude 5 by May 31 at 22% and falling. Polymarket also has Anthropic-valued-higher-than-OpenAI at 89% and rising. Bearish on the imminent model. Structurally bullish on the company. That contradiction is the cleanest read on Claude Code's last 30 days.

Eight beats worth knowing, all from the sweep:

1. Anthropic stripped Claude Code from the $20 Pro tier on April 21. Two HN threads, 948 combined points. The product page contradicted the pricing page; existing subscribers kept access until renewal. Classic incomplete rollout.
2. Two weeks later (May 6), the @claude_code account posted "2x'ed Claude Code's 5-hour rate limits for Pro, Max, and Team plans." Ars Technica credited the SpaceX deal. Compute rationing walked back when a customer paid for capacity.
3. April 23 postmortem admitting they added a verbosity instruction on April 16 that broke things. Anthropic is the only major lab that publishes receipts. The other read, that the bar for candor is set absurdly low because incumbents publish nothing, is also true.
4. OpenClaw flap. 1,349 HN points. Claude Code surcharges or refuses commits that mention competitive harnesses, implemented as lazy string regex. No public Anthropic response. Silence is data.
5. Enterprise read is heterogeneous in a way that breaks the "one winner takes all" narrative. Uber torched its entire 2026 AI budget on Claude Code in four months. Amazon rolled it out after engineer pushback. Microsoft canceled licenses.
6. Plugin ecosystem boom. Six Claude Code skill packages on Show HN in 30 days. Official claude-code-setup plugin shipped. Anthropic published a Champion Kit (a public playbook for engineers pushing CC internally). Sales motion as a Markdown doc.
7. CVE-2026-39861 landed May 8 (sandbox escape via symlink). Mild and patched. The precedent of "Claude Code has a CVE number" is the signal.
8. May 13 programmatic usage restrictions. Combined with OpenClaw, this is a quiet crackdown on harness substitution.

The harness is the product. The model is increasingly fungible. Anthropic is defending the seam.

Full sweep with sources, including the three follow-ups worth pulling on: rundatarun.io/p/last-30-days…
Justin H. Johnson@BioInfo·
Two stories about AI in medicine this week. Opposite directions.

A paper in Science from @arjunmanrai @AdamRodmanMD and colleagues: OpenAI's o1 outperformed attending physicians on a prospective ER arm at a major academic medical center. 67% correct diagnosis at the top of the differential at triage. The two attendings at 55% and 50%. Among the most carefully designed clinical-LLM evaluations of the year.

Ontario's Auditor General, the same week: 20 AI medical-scribe systems audited, used by ~5,000 physicians. 9 of 20 fabricated information that wasn't in the recording. 12 of 20 captured the wrong drug. 17 of 20 missed details about patients' mental health.

Both verified. Both shipping in production right now.

The instinct is to pick one. The discipline is not picking. AI in medicine is incredibly useful and incredibly dangerous, at the same time, in different ratios on different days. We can't certify any of it safe-or-dangerous and walk away.

Three questions worth keeping sharp this week, and every week:

1. What question survives the headline?
2. Who's on the hook when it's wrong?
3. What moves if the answer flips?

The questions don't change. The answers do. The asking is the muscle.

cc @EricTopol