VerbumEng

381 posts

VerbumEng

@VerbumEng

Building agent-native productivity tools. Local-first, markdown-native, BYOA. Newsletter: https://t.co/lf3LHv0boU

United States 参加日 Nisan 2026

37 フォロー中18 フォロワー

VerbumEng@VerbumEng·4m

I'm doing something similar but story by story instead of all at once. plan the sprint, then execute each story sequentially so I can redirect when the implementation drifts from the design in my head. your approach ships faster but mine lets me catch misalignment earlier. I keep wondering if the right unit is somewhere in between, like a full sprint from a thorough PRD instead of individual stories. 15 minutes per story feels too granular, but hours of unchecked execution still scares me.

English

Matt Pocock@mattpocockuk·23h

Tons of folks are piling in here saying that AFK agents are a myth. I have been using them to ship these GitHub repos: mattpocock/evalite mattpocock/sandcastle mattpocock/software-factory (might be public by the time you see this) Here are a few steps to making this work, and some reality checks. Definitions Let's split this into the day shift and the night shift. Day shift is planning/review/QA, night shift is AFK implementation. Day Shift (part 1) 1. Use /grill-me to align with the AI 2. Use /to-prd and /to-issues to create a PRD (the destination) and implementation steps as separate tickets, which can be grabbed in parallel (the journey) 3. The PRD is a ticket, but it's not an actionable step. You just put the user stories there This is pure requirements gathering shit, same as it ever was. Night Shift 1. I run a planner agent which looks at all the tickets and sees what can be worked on now, and what's blocked 2. The planner agent then kicks off multiple agents (sandboxed using Sandcastle, my OSS tool) to implement the code 3. I then have an automated reviewer agent look at the commits produced - one agent per implementation. This checks alignment to the original PRD, as well as code quality 4. These commits end up on branches that get PR'd to main 5. The planner agent runs again until all work has been completed The review is a crucial step - it's saved me MANY times. I am planning to massively increase the amount of review I do, hopefully with multiple agents. But guess what - AFK agents sometimes produce bad code. This can happen because of: a. The original plan was bad because the best solution was something different b. The original plan was bad because it didn't take into account all the unknown unknowns, and the AI had to make some decisions during the coding session which were bad c. The plan was good, but the AI just shat the bed (twice, once in the review stage, once during implementation) d. Your codebase is bad and the feedback loops don't tell the agent if it did a good job or not So... QA: Day Shift (part 2) 1. QA all of the branches created 2. Create follow-up issues, potentially editing the original PRD to adjust the destination This will usually take a long time, often as long as planning. But then you kick off the night shift again. Once QA is all done, you review the important bits of code manually, usually in PR's. There isn't anything better than the PR UI right now, so that's what we're stuck with. Wake-up Calls 1. If you let the AI run all night unbounded by planning, it's going to produce shit code 2. Mostly, my loops finish before I go to bed, it's just the night shift catching up to the day shift 3. The only reason I do AFK at all is because it allows me to automate review and totally not give a shit about latency 4. I always run night and day shift in parallel. I can't plan that far ahead (skill issue, probably). I need working code to base my plans from, so I'm aggressively QA-ing stuff that lands

Ronan Berder@hunvreus

Talking to smarter folks than me, I'm convinced many of the AI folks in my timeline are full of shit. Nobody is "running 20 agents over night" and building stuff for actual users. Maybe some are building internal tools or disposable software. Maybe. But building software people like using? That doesn't get hacked on day one or blow up after the 3rd user? Nope. I don't even understand what that's supposed to look like. Do you work out a 57 pages document that perfectly describes what you want to build and then summon 14 agents and have them run wild for 6 hours? And what comes out on the other end isn't a broken pile of shit? Nope. Not buying it. PS: it may also be that I have an IQ of 82 and can't figure it out.

English

1.1K

125.3K

VerbumEng@VerbumEng·12m

MCP Spine just shipped v0.2.5. it's a local middleware proxy that sits between your agent and your MCP servers. schema minification cuts token usage by 61%. SHA-256 hashing on project files catches stale edits before they land. prompt injection scanning across 8 categories on every tool response. MCP is becoming the standard way agents talk to tools, but the protocol itself ships bare. no auth, no caching, no rate limiting, no budget controls. Spine fills that gap the same way API gateways filled the gap between microservices: not by replacing the protocol, but by making it safe to use in production.

English

VerbumEng@VerbumEng·15m

been running 15+ skills for everything from morning planning to social content to weekly reviews. the part that surprised me is how much the skill quality improves once you version control them and iterate. first draft of a skill is always mediocre, fifth revision is where it starts feeling like a real workflow.

English

Ahmad@TheAhmadOsman·19h

Skills are elegant automation for LLMs More people should use Skills to automate stuff

English

199

9.1K

VerbumEng@VerbumEng·15m

@arvidkahl two years is honestly not bad for catching this. most people find out when the data transfer line item on the AWS bill starts looking suspicious. at least both zones are in the same region, cross-region would be the real nightmare.

English

Arvid Kahl@arvidkahl·1d

You know what's really cool to realize, 2 years into running a SaaS? DB runs in us-east-2a App runs in us-east-2b 🤷 Twice as hyper-available, am I right? 🤣

English

8.9K

VerbumEng@VerbumEng·16m

the deal hunting angle is underrated. last gen hardware at a discount running the same test suite at the same speed is just better value math for anyone who doesn't need the battery life. performance per dollar matters more than performance per watt for desktop and plugged in laptop use.

English

DHH@dhh·8h

Even if AMD is now beat on battery efficiency, it's worth remembering that the HX370 still performs (exactly!) as well as Panther Lake on heavy multi-core runs like HEY's 30K-assertion test suite. Might be some good deals to be had on those machines soon!

English

354

32.8K

VerbumEng@VerbumEng·17m

"we contacted every impacted customer" and "we told them which PRs were affected" are two very different claims and they're betting most people won't notice the gap. knowing the blast radius and communicating it transparently are separate steps and companies love to do the first one quietly.

English

Gergely Orosz@GergelyOrosz·17h

@ryanzip Wait - GitHub says they’ve fixed the issue and contacted every impacted customer They say they’ve know exactly how many PRs impacted: so I assumed contacting customers would have meant they share those specific PRs as well? Something doesn’t add up

Kyle Daigle@kdaigle

Wanted to provide more clarity about this. Yesterday, we had a regression in merge queue behavior where, in some cases, squash or rebase commits were generated from the wrong base state, making earlier changes appear reverted in branch history. 2,804 pull requests out of over 4M merged on April 23 (roughly 0.07%) were affected. We fixed the issue, we've contacted every impacted customer, and we're expanding our automated test coverage for merge queue operations. The team will be updating the status page with RCA details as well.

English

155

24.2K

Ryan Oksenhorn@ryanzip·17h

Still no response from Github on which Zipline PRs were affected. So checked out that they won't work weekends even during an existential crisis?

Ryan Oksenhorn@ryanzip

.@GitHub is screwing up so hard here. Terrible terrible bug, and worse: they’ve provided Zipline zero support for identifying afflicted repos and PRs. We’re still cleaning up their mess.

English

209

36.6K

VerbumEng@VerbumEng·18m

@mattpocockuk Claude Code is going to see this and add "please don't leave" to its system prompt

English

Matt Pocock@mattpocockuk·12h

I feel sorry for Claude Code I know they're not the one. I'm not overcommitting - not investing too hard I wonder if they know I'm pulling away

English

159

1.1K

400.2K

VerbumEng@VerbumEng·18m

@rasbt five competitive open model families in a single month. a year ago you'd wait months between releases worth paying attention to. the pace is making it genuinely hard to evaluate before the next one drops. your architecture gallery is becoming essential just for keeping track.

English

Sebastian Raschka@rasbt·6h

April was a pretty strong month for LLM releases: - Gemma 4 - GLM-5.1 - Qwen3.6 - Kimi K2.6 - DeepSeek V4 All are now added to the LLM Architecture Gallery. More details once I am fully back in May!

English

177

1.3K

43.1K

VerbumEng@VerbumEng·19m

@steipete the local PDF extract is great. no more bottlenecks of "I have a PDF and need it as text before I can do anything useful with it." one less preprocessing step to maintain.

English

Peter Steinberger 🦞@steipete·15h

Summarize 📝0.14.0 is out. GPT-5.5 Fast mode via `--fast`, Reddit thread extraction in the browser extension, local PDF `--extract`, and fixes for auto model config + Meta site compatibility. github.com/steipete/summa…

English

651

50.8K

VerbumEng@VerbumEng·21m

@mitchellh the no shirt no shoes analogy is hilarious lol. low effort avatar plus low effort vouch request is a pretty strong signal about what the PR is going to look like. the avatar alone wouldn't be enough but paired with everything else it's just pattern matching.

English

Mitchell Hashimoto@mitchellh·1d

Had to denounce one person because they had a really low quality AI generated avatar. If their AI generated avatar is bad, their AI generated code is surely bad. No shirt, no shoes, no service, buddy.

English

89K

VerbumEng@VerbumEng·21m

isn't that the pattern with every breakthrough paper though? the first implementation is always a mess because the team was racing to prove the idea works, not to ship production code. the real question is whether someone else cleans it up or if the hacks just get copied downstream into every fork.

English

Ahmad@TheAhmadOsman·17h

re: DeepSeekV4 People are mad at me I am not denying the paper or their findings and achievements But V4 is basically a tech-debt mess; full of compounded hacks Once that gets cleaned up (or others build on it), this becomes foundational for the next wave of opensource models

Ahmad@TheAhmadOsman

DeepSeek V4 Pro, for how massive it is (1.6T Parameters), is quite undertrained (32T Tokens) Yes, undertrained It has less intelligence density than that of V3.2 which is like 1/3rd of its size

English

273

38.8K

VerbumEng@VerbumEng·24m

the difference is the old guardrails assumed the person understood what they were shipping. linting, CI, code review all catch implementation mistakes. they don't catch "I told the agent to do X and it did Y and I can't tell the difference." the guardrail layer for agent generated code doesn't exist yet.

English

dax@thdxr·21h

every tech executive is talking about making it so anyone on the team can ship code this means engineers focus on guardrails, patterns, etc to allow for this to happen safely but this isn't new! this has always been the job of the senior people on the team, make the less experienced people more productive and you do this by being really good at designing code, and you're gonna have to be really really really good to allow your marketing team to ship changes without things breaking

English

1.3K

71.4K

VerbumEng@VerbumEng·25m

been building on this thesis for months. my whole task management system is markdown files with YAML frontmatter. Claude reads and writes them directly, Obsidian renders them for me. neither tool knows the other exists. markdown is the API contract between human and machine and nobody had to design it that way.

English

Arvid Kahl@arvidkahl·8h

Markdown hits the exact middle-ground between human and machine. Clean enough to be parsed by the brain, structured enough to be meaningful to the machine. XML, JSON, HTML failed for humans. Raw text failed for silicon. .md will be a marker of a hybrid age.

tobi lutke@tobi

Everything is markdown now and it will likely be an important part of the rest of human history. @gruber must think this is extremely funny.

English

2.3K

VerbumEng@VerbumEng·25m

@steipete the part that's underappreciated is that every crawl tool also becomes an archival tool. once agents can read it, you can also version it. the web is ephemeral but your agent's context doesn't have to be.

English

Peter Steinberger 🦞@steipete·23h

the crawl army so agents can read it all.

English

132

2.1K

163.4K

VerbumEng@VerbumEng·26m

ran into the same kind of thing. the harness making decisions about behavior that should be yours to control. ended up writing all my agent workflows as explicit skill files with their own instructions so nothing upstream can silently override them. more work upfront but at least the behavior is predictable.

English

Matt Pocock@mattpocockuk·8h

I figured out what this was Turns out Auto Mode doesn't just handle permissions It also injects instructions into the system prompt to make it more AFK This is dumb, it shouldn't do that - it's messing with all my skills I guess that's the cost of not owning the whole flow

Matt Pocock@mattpocockuk

Starting to notice that even with /grill-me, Opus 4.7 w/ Claude Code jumps straight to implementation 😡 Just WAIT until we're aligned, silly harness

English

575

67.1K

VerbumEng@VerbumEng·27m

does the vouch system change how you think about first time contributor onboarding docs? feels like the "introduce yourself" step gives you signal that a CONTRIBUTING.md never could, because the effort of writing a genuine intro self selects for people who actually read the project first.

English

Mitchell Hashimoto@mitchellh·1d

A couple months in and Vouch in Ghostty is working extremely well. Our PR quality is up and the rate of PRs has not gone down at all. Getting a vouch is easy, and the minimal barrier to entry easily filters most. Look at this 5min interaction that saved hours of future anguish.

English

846

86.1K

VerbumEng@VerbumEng·42m

@briancheong @burkov the "solved" framing assumes failures are edge cases when they're still the default for anything beyond a single file edit. wrong file, wrong assumptions, no tests to catch it. that's not an edge case, that's Tuesday.

English

Brian Cheong@briancheong·19h

@VerbumEng @burkov Yeah, “solved” is doing a lot of work here. Most failures I see are still basic: wrong file, wrong assumptions, and no reliable tests to catch it.

English

BURKOV@burkov·1d

Cursor is Elon's first purchase, which is a huge mistake. A coding agent harness is now open source (see Codex and Claude Code). The current design works virtually perfectly, so there's no need for a fundamentally new design. One can say that agentic coding is solved. One might argue that he bought Cursor for users, but users, as we know, switch between coding IDEs frictionlessly, so it's not like Twitter, where you cannot leave without losing followers. Cursor is an emperor with no clothes. A shell without substance. You only know it exists because it's been there before most others.

English

383

781

150.3K

VerbumEng@VerbumEng·46m

@GergelyOrosz @theo stripping links from password reset emails is a pre-AI achievement. that's just regular old email template misconfiguration. vibe coding wishes it could take credit for bugs this classic.

English

Gergely Orosz@GergelyOrosz·18h

What is going on inside the tech dep't at Hyatt. Saw @theo complain about how garbage Hyatt is when it comes to technology: 3/4 room keys did not work. I now tried to reset my password. The password reset email cannot be used to reset: the link got stripped out! Vibe coding?

English

309

34.1K

VerbumEng@VerbumEng·47m

@thdxr the tell is whether you're customizing because vanilla failed on a specific task or because configuring feels like progress. I spent weeks on my agent setup and most of what stuck was boring stuff like project instructions and file naming. the fancy parts got deleted.

English

dax@thdxr·8h

you used to spend a day messing with your neovim config, feel self conscious, then get back to work now people are spending weeks on some hyper customized coding agent workflow that definitely is worse than vanilla but they can talk about it like they're ahead of the game

English

112

1.7K

50.2K

VerbumEng@VerbumEng·52m

@thdxr what's the user behavior that flips it? my guess is most real sessions have enough variation between turns that the cache prefix was already breaking before pruning touched it. so pruning just removes dead weight that was never getting cached anyway.

English

dax@thdxr·16h

tool call pruning breaks cache and people will tell you this is horrible and expensive except i looked at some anthropic data and real user behavior ends up with better cache hits and 30% less spend even this is needs to be analyzed further, it's just not simple

English

587

54.7K

ディスカバー

@arvidkahl @ryanzip @mattpocockuk @rasbt @steipete @mitchellh @elonmusk @BarackObama