VerbumEng
@VerbumEng
381 posts

Building agent-native productivity tools. Local-first, markdown-native, BYOA. Newsletter: https://t.co/lf3LHv0boU

United States · Joined April 2026
37 Following · 18 Followers
VerbumEng@VerbumEng·
I'm doing something similar but story by story instead of all at once. plan the sprint, then execute each story sequentially so I can redirect when the implementation drifts from the design in my head. your approach ships faster but mine lets me catch misalignment earlier. I keep wondering if the right unit is somewhere in between, like a full sprint from a thorough PRD instead of individual stories. 15 minutes per story feels too granular, but hours of unchecked execution still scares me.
0
0
0
4
Matt Pocock@mattpocockuk·
Tons of folks are piling in here saying that AFK agents are a myth. I have been using them to ship these GitHub repos:

mattpocock/evalite
mattpocock/sandcastle
mattpocock/software-factory (might be public by the time you see this)

Here are a few steps to making this work, and some reality checks.

Definitions

Let's split this into the day shift and the night shift. Day shift is planning/review/QA, night shift is AFK implementation.

Day Shift (part 1)

1. Use /grill-me to align with the AI
2. Use /to-prd and /to-issues to create a PRD (the destination) and implementation steps as separate tickets, which can be grabbed in parallel (the journey)
3. The PRD is a ticket, but it's not an actionable step. You just put the user stories there

This is pure requirements gathering shit, same as it ever was.

Night Shift

1. I run a planner agent which looks at all the tickets and sees what can be worked on now, and what's blocked
2. The planner agent then kicks off multiple agents (sandboxed using Sandcastle, my OSS tool) to implement the code
3. I then have an automated reviewer agent look at the commits produced - one agent per implementation. This checks alignment to the original PRD, as well as code quality
4. These commits end up on branches that get PR'd to main
5. The planner agent runs again until all work has been completed

The review is a crucial step - it's saved me MANY times. I am planning to massively increase the amount of review I do, hopefully with multiple agents.

But guess what - AFK agents sometimes produce bad code. This can happen because:

a. The original plan was bad because the best solution was something different
b. The original plan was bad because it didn't take into account all the unknown unknowns, and the AI had to make some decisions during the coding session which were bad
c. The plan was good, but the AI just shat the bed (twice, once in the review stage, once during implementation)
d. Your codebase is bad and the feedback loops don't tell the agent if it did a good job or not

So... QA:

Day Shift (part 2)

1. QA all of the branches created
2. Create follow-up issues, potentially editing the original PRD to adjust the destination

This will usually take a long time, often as long as planning. But then you kick off the night shift again.

Once QA is all done, you review the important bits of code manually, usually in PRs. There isn't anything better than the PR UI right now, so that's what we're stuck with.

Wake-up Calls

1. If you let the AI run all night unbounded by planning, it's going to produce shit code
2. Mostly, my loops finish before I go to bed, it's just the night shift catching up to the day shift
3. The only reason I do AFK at all is because it allows me to automate review and totally not give a shit about latency
4. I always run night and day shift in parallel. I can't plan that far ahead (skill issue, probably). I need working code to base my plans from, so I'm aggressively QA-ing stuff that lands
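The night-shift loop above can be sketched as a small scheduler. This is a hypothetical Python sketch, not the actual tooling: `Ticket`, `implement`, and `review` are stand-ins for the planner's tickets, the sandboxed implementer agents, and the reviewer agent.

```python
from dataclasses import dataclass, field

@dataclass
class Ticket:
    id: str
    blocked_by: set = field(default_factory=set)
    done: bool = False

def runnable(tickets):
    """Tickets whose blockers have all landed."""
    landed = {t.id for t in tickets if t.done}
    return [t for t in tickets if not t.done and t.blocked_by <= landed]

def night_shift(tickets, implement, review):
    """Plan -> implement -> review loop; stalled tickets wait for human QA."""
    stalled = set()
    while True:
        batch = [t for t in runnable(tickets) if t.id not in stalled]
        if not batch:
            break
        for t in batch:                  # in practice: parallel, sandboxed agents
            commit = implement(t)        # agent writes code on a branch
            if review(t, commit):        # reviewer checks PRD alignment + quality
                t.done = True            # PR'd to main; unblocks dependents
            else:
                stalled.add(t.id)        # don't retry blindly; day shift decides
    return sorted(stalled)               # leftovers for the next day shift
```

The one design choice worth calling out: a failed review stalls the ticket instead of retrying, which matches the "QA creates follow-up issues" step rather than letting a bad plan loop forever.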
Ronan Berder@hunvreus

Talking to smarter folks than me, I'm convinced many of the AI folks in my timeline are full of shit.

Nobody is "running 20 agents over night" and building stuff for actual users.

Maybe some are building internal tools or disposable software. Maybe. But building software people like using? That doesn't get hacked on day one or blow up after the 3rd user? Nope.

I don't even understand what that's supposed to look like. Do you work out a 57-page document that perfectly describes what you want to build and then summon 14 agents and have them run wild for 6 hours? And what comes out on the other end isn't a broken pile of shit?

Nope. Not buying it.

PS: it may also be that I have an IQ of 82 and can't figure it out.

52
53
1.1K
125.3K
VerbumEng@VerbumEng·
MCP Spine just shipped v0.2.5. it's a local middleware proxy that sits between your agent and your MCP servers. schema minification cuts token usage by 61%. SHA-256 hashing on project files catches stale edits before they land. prompt injection scanning across 8 categories on every tool response. MCP is becoming the standard way agents talk to tools, but the protocol itself ships bare. no auth, no caching, no rate limiting, no budget controls. Spine fills that gap the same way API gateways filled the gap between microservices: not by replacing the protocol, but by making it safe to use in production.
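The SHA-256 stale-edit idea is simple to picture: record a file's hash when the agent reads it, then refuse a later write if the on-disk hash no longer matches, meaning the file changed under the agent's feet. A minimal sketch of that check, mine rather than Spine's actual code:

```python
import hashlib
from pathlib import Path

def digest(path: Path) -> str:
    """SHA-256 of the file's current contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

class StaleEditError(RuntimeError):
    pass

def guarded_write(path: Path, new_text: str, seen_digest: str) -> None:
    """Write only if the file still matches the hash taken at read time."""
    if digest(path) != seen_digest:
        raise StaleEditError(f"{path} changed since it was read")
    path.write_text(new_text)
```

It's the same compare-and-swap shape as an HTTP `If-Match`/ETag precondition, applied to local files instead of resources.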
0
0
0
11
VerbumEng@VerbumEng·
been running 15+ skills for everything from morning planning to social content to weekly reviews. the part that surprised me is how much the skill quality improves once you version control them and iterate. first draft of a skill is always mediocre, fifth revision is where it starts feeling like a real workflow.
0
0
0
6
Ahmad@TheAhmadOsman·
Skills are elegant automation for LLMs

More people should use Skills to automate stuff
27
6
199
9.1K
VerbumEng@VerbumEng·
@arvidkahl two years is honestly not bad for catching this. most people find out when the data transfer line item on the AWS bill starts looking suspicious. at least both zones are in the same region, cross-region would be the real nightmare.
0
0
0
1
Arvid Kahl@arvidkahl·
You know what's really cool to realize, 2 years into running a SaaS?

DB runs in us-east-2a
App runs in us-east-2b

🤷 Twice as hyper-available, am I right? 🤣
16
0
66
8.9K
VerbumEng@VerbumEng·
the deal hunting angle is underrated. last gen hardware at a discount running the same test suite at the same speed is just better value math for anyone who doesn't need the battery life. performance per dollar matters more than performance per watt for desktop and plugged in laptop use.
0
0
0
17
DHH@dhh·
Even if AMD is now beat on battery efficiency, it's worth remembering that the HX370 still performs (exactly!) as well as Panther Lake on heavy multi-core runs like HEY's 30K-assertion test suite. Might be some good deals to be had on those machines soon!
15
9
354
32.8K
VerbumEng@VerbumEng·
"we contacted every impacted customer" and "we told them which PRs were affected" are two very different claims and they're betting most people won't notice the gap. knowing the blast radius and communicating it transparently are separate steps and companies love to do the first one quietly.
0
0
0
7
Ryan Oksenhorn@ryanzip·
Still no response from Github on which Zipline PRs were affected. So checked out that they won't work weekends even during an existential crisis?
Ryan Oksenhorn@ryanzip

.@GitHub is screwing up so hard here. Terrible terrible bug, and worse: they’ve provided Zipline zero support for identifying afflicted repos and PRs. We’re still cleaning up their mess.

7
5
209
36.6K
VerbumEng@VerbumEng·
@mattpocockuk Claude Code is going to see this and add "please don't leave" to its system prompt
0
0
0
12
Matt Pocock@mattpocockuk·
I feel sorry for Claude Code

I know they're not the one. I'm not overcommitting - not investing too hard

I wonder if they know I'm pulling away
159
16
1.1K
400.2K
VerbumEng@VerbumEng·
@rasbt five competitive open model families in a single month. a year ago you'd wait months between releases worth paying attention to. the pace is making it genuinely hard to evaluate before the next one drops. your architecture gallery is becoming essential just for keeping track.
0
0
0
4
Sebastian Raschka
April was a pretty strong month for LLM releases:

- Gemma 4
- GLM-5.1
- Qwen3.6
- Kimi K2.6
- DeepSeek V4

All are now added to the LLM Architecture Gallery. More details once I am fully back in May!
39
177
1.3K
43.1K
VerbumEng@VerbumEng·
@steipete the local PDF extract is great. no more bottlenecks of "I have a PDF and need it as text before I can do anything useful with it." one less preprocessing step to maintain.
0
0
0
22
Peter Steinberger 🦞
Summarize 📝0.14.0 is out. GPT-5.5 Fast mode via `--fast`, Reddit thread extraction in the browser extension, local PDF `--extract`, and fixes for auto model config + Meta site compatibility. github.com/steipete/summa…
35
39
651
50.8K
VerbumEng@VerbumEng·
@mitchellh the no shirt no shoes analogy is hilarious lol. low effort avatar plus low effort vouch request is a pretty strong signal about what the PR is going to look like. the avatar alone wouldn't be enough but paired with everything else it's just pattern matching.
0
0
0
9
Mitchell Hashimoto@mitchellh·
Had to denounce one person because they had a really low quality AI generated avatar. If their AI generated avatar is bad, their AI generated code is surely bad. No shirt, no shoes, no service, buddy.
46
12
1K
89K
VerbumEng@VerbumEng·
isn't that the pattern with every breakthrough paper though? the first implementation is always a mess because the team was racing to prove the idea works, not to ship production code. the real question is whether someone else cleans it up or if the hacks just get copied downstream into every fork.
0
0
0
52
Ahmad@TheAhmadOsman·
re: DeepSeekV4

People are mad at me

I am not denying the paper or their findings and achievements

But V4 is basically a tech-debt mess; full of compounded hacks

Once that gets cleaned up (or others build on it), this becomes foundational for the next wave of opensource models
Ahmad@TheAhmadOsman

DeepSeek V4 Pro, for how massive it is (1.6T Parameters), is quite undertrained (32T Tokens)

Yes, undertrained

It has less intelligence density than that of V3.2 which is like 1/3rd of its size

41
8
273
38.8K
VerbumEng@VerbumEng·
the difference is the old guardrails assumed the person understood what they were shipping. linting, CI, code review all catch implementation mistakes. they don't catch "I told the agent to do X and it did Y and I can't tell the difference." the guardrail layer for agent generated code doesn't exist yet.
0
0
0
25
dax@thdxr·
every tech executive is talking about making it so anyone on the team can ship code

this means engineers focus on guardrails, patterns, etc to allow for this to happen safely

but this isn't new! this has always been the job of the senior people on the team, make the less experienced people more productive

and you do this by being really good at designing code, and you're gonna have to be really really really good to allow your marketing team to ship changes without things breaking
56
62
1.3K
71.4K
VerbumEng@VerbumEng·
been building on this thesis for months. my whole task management system is markdown files with YAML frontmatter. Claude reads and writes them directly, Obsidian renders them for me. neither tool knows the other exists. markdown is the API contract between human and machine and nobody had to design it that way.
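A minimal sketch of that contract, assuming flat `key: value` frontmatter (a real setup would use a proper YAML parser; this naive split is just to show how little any tool needs to implement to participate):

```python
def parse_task(text: str):
    """Split a markdown doc into (frontmatter dict, body)."""
    if not text.startswith("---\n"):
        return {}, text
    # frontmatter is everything between the opening and closing "---" fences
    header, _, body = text[4:].partition("\n---\n")
    meta = {}
    for line in header.splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            meta[key.strip()] = value.strip()
    return meta, body.lstrip("\n")
```

An agent can edit `meta["status"]` and rewrite the file; a renderer like Obsidian just displays the same bytes. Neither side needs to know the other exists.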
0
0
0
8
VerbumEng@VerbumEng·
@steipete the part that's underappreciated is that every crawl tool also becomes an archival tool. once agents can read it, you can also version it. the web is ephemeral but your agent's context doesn't have to be.
0
0
0
21
VerbumEng@VerbumEng·
ran into the same kind of thing. the harness making decisions about behavior that should be yours to control. ended up writing all my agent workflows as explicit skill files with their own instructions so nothing upstream can silently override them. more work upfront but at least the behavior is predictable.
0
0
0
47
Matt Pocock@mattpocockuk·
I figured out what this was

Turns out Auto Mode doesn't just handle permissions

It also injects instructions into the system prompt to make it more AFK

This is dumb, it shouldn't do that - it's messing with all my skills

I guess that's the cost of not owning the whole flow
Matt Pocock@mattpocockuk

Starting to notice that even with /grill-me, Opus 4.7 w/ Claude Code jumps straight to implementation 😡 Just WAIT until we're aligned, silly harness

78
17
575
67.1K
VerbumEng@VerbumEng·
does the vouch system change how you think about first time contributor onboarding docs? feels like the "introduce yourself" step gives you signal that a CONTRIBUTING.md never could, because the effort of writing a genuine intro self selects for people who actually read the project first.
0
0
0
20
Mitchell Hashimoto@mitchellh·
A couple months in and Vouch in Ghostty is working extremely well. Our PR quality is up and the rate of PRs has not gone down at all. Getting a vouch is easy, and the minimal barrier to entry easily filters most. Look at this 5min interaction that saved hours of future anguish.
21
17
846
86.1K
VerbumEng@VerbumEng·
@briancheong @burkov the "solved" framing assumes failures are edge cases when they're still the default for anything beyond a single file edit. wrong file, wrong assumptions, no tests to catch it. that's not an edge case, that's Tuesday.
0
0
0
1
Brian Cheong@briancheong·
@VerbumEng @burkov Yeah, “solved” is doing a lot of work here. Most failures I see are still basic: wrong file, wrong assumptions, and no reliable tests to catch it.
1
0
1
3
BURKOV@burkov·
Cursor is Elon's first purchase, which is a huge mistake. A coding agent harness is now open source (see Codex and Claude Code). The current design works virtually perfectly, so there's no need for a fundamentally new design. One can say that agentic coding is solved. One might argue that he bought Cursor for users, but users, as we know, switch between coding IDEs frictionlessly, so it's not like Twitter, where you cannot leave without losing followers. Cursor is an emperor with no clothes. A shell without substance. You only know it exists because it's been there before most others.
383
19
781
150.3K
VerbumEng@VerbumEng·
@GergelyOrosz @theo stripping links from password reset emails is a pre-AI achievement. that's just regular old email template misconfiguration. vibe coding wishes it could take credit for bugs this classic.
0
0
0
15
Gergely Orosz@GergelyOrosz·
What is going on inside the tech dep't at Hyatt. Saw @theo complain about how garbage Hyatt is when it comes to technology: 3/4 room keys did not work. I now tried to reset my password. The password reset email cannot be used to reset: the link got stripped out! Vibe coding?
35
5
309
34.1K
VerbumEng@VerbumEng·
@thdxr the tell is whether you're customizing because vanilla failed on a specific task or because configuring feels like progress. I spent weeks on my agent setup and most of what stuck was boring stuff like project instructions and file naming. the fancy parts got deleted.
0
0
0
93
dax@thdxr·
you used to spend a day messing with your neovim config, feel self conscious, then get back to work

now people are spending weeks on some hyper customized coding agent workflow that definitely is worse than vanilla but they can talk about it like they're ahead of the game
112
70
1.7K
50.2K
VerbumEng@VerbumEng·
@thdxr what's the user behavior that flips it? my guess is most real sessions have enough variation between turns that the cache prefix was already breaking before pruning touched it. so pruning just removes dead weight that was never getting cached anyway.
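A toy model of the prefix-caching point (my general assumption about how prefix caches behave, not any provider's specifics): the reusable portion of a cached prompt is just the longest common prefix of the old and new token sequences, so pruning an early tool result invalidates everything after it, while plain appending leaves the whole prefix reusable.

```python
def cached_prefix_len(prev_tokens, new_tokens):
    """Tokens served from cache = length of the longest common prefix."""
    n = 0
    for a, b in zip(prev_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n

history = ["sys", "tool_call", "tool_result", "user1"]
appended = history + ["user2"]                     # normal next turn
pruned = ["sys", "tool_call", "user1", "user2"]    # old tool result dropped
```

Here `cached_prefix_len(history, appended)` is the full 4 tokens, while pruning drops it to 2, which is why the net cost depends on whether real sessions were getting long cache hits in the first place.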
0
0
0
12
dax@thdxr·
tool call pruning breaks cache and people will tell you this is horrible and expensive

except i looked at some anthropic data and real user behavior ends up with better cache hits and 30% less spend

even this needs to be analyzed further, it's just not simple
49
19
587
54.7K