TeaForge
@TeaForgeDev

50 posts

fintech backend engineer · AI agent workflows 🍵 Python · clean arch · strict specs · building software i want to live with

Earth 🌏 · Joined March 2026
19 Following · 2 Followers

Pinned Tweet
TeaForge @TeaForgeDev
AI makes generating code easy. Reliable software is still hard to build. Most people respond to that gap with better prompts. I'm responding with structure.
1 reply · 0 reposts · 1 like · 286 views
TeaForge @TeaForgeDev
@elvissun The harness is where the tacit knowledge lives. Scope rules, failure boundaries, escalation conditions - those aren't written down; they come from shipping/breaking things. Building your own orchestration is the fastest way to accumulate failures that teach them.
0 replies · 0 reposts · 0 likes · 81 views
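A minimal sketch of what writing that tacit knowledge down can look like; the field names and thresholds here are invented for illustration, not any real harness's config:

```python
# Hypothetical harness policy: scope rules, failure boundaries, and
# escalation conditions made explicit instead of living in someone's head.
from dataclasses import dataclass


@dataclass
class HarnessPolicy:
    # Scope rules: what the agent may touch at all.
    allowed_paths: tuple[str, ...] = ("src/", "tests/")
    max_files_per_change: int = 3
    # Failure boundaries: when to stop retrying and roll back.
    max_consecutive_test_failures: int = 2
    # Escalation conditions: when a human takes over.
    escalate_on: tuple[str, ...] = ("migration", "auth", "payments")

    def should_escalate(self, task: str, files_touched: int) -> bool:
        if files_touched > self.max_files_per_change:
            return True
        return any(kw in task.lower() for kw in self.escalate_on)


policy = HarnessPolicy()
print(policy.should_escalate("refactor payments retry logic", files_touched=2))  # True
```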
Elvis @elvissun
good agentic engineers are super rare right now. it takes talking to LLMs every day to really feel where the line is - what's sota, how to break up problems for models to solve. a lot of this is vibes, not science. i honestly don't know how to teach these skills to those around me. i'm not even sure how to describe them. the language for what models can't do isn't in my daily training data - they can't see their own blind spots, so the vocabulary was never built. it's like trying to describe 3D from inside flatland. this is why i respect people like karpathy who can actually articulate the agentic problems we're facing. that's its own rare skill.
Andrej Karpathy @karpathy

Judging by my tl there is a growing gap in understanding of AI capability. The first issue I think is around recency and tier of use. I think a lot of people tried the free tier of ChatGPT somewhere last year and allowed it to inform their views on AI a little too much. This is a group of reactions laughing at various quirks of the models, hallucinations, etc. Yes I also saw the viral videos of OpenAI's Advanced Voice mode fumbling simple queries like "should I drive or walk to the carwash". The thing is that these free and old/deprecated models don't reflect the capability in the latest round of state of the art agentic models of this year, especially OpenAI Codex and Claude Code.

But that brings me to the second issue. Even if people paid $200/month to use the state of the art models, a lot of the capabilities are relatively "peaky" in highly technical areas. Typical queries around search, writing, advice, etc. are *not* the domain that has made the most noticeable and dramatic strides in capability. Partly, this is due to the technical details of reinforcement learning and its use of verifiable rewards. But partly, it's also because these use cases are not sufficiently prioritized by the companies in their hillclimbing because they don't lead to as much $$$ value. The goldmines are elsewhere, and the focus comes along.

So that brings me to the second group of people, who *both* 1) pay for and use the state of the art frontier agentic models (OpenAI Codex / Claude Code) and 2) do so professionally in technical domains like programming, math and research. This group of people is subject to the highest amount of "AI Psychosis" because the recent improvements in these domains as of this year have been nothing short of staggering. When you hand a computer terminal to one of these models, you can now watch them melt programming problems that you'd normally expect to take days/weeks of work. It's this second group of people that assigns a much greater gravity to the capabilities, their slope, and various cyber-related repercussions.

TLDR the people in these two groups are speaking past each other. It really is simultaneously the case that OpenAI's free and I think slightly orphaned (?) "Advanced Voice Mode" will fumble the dumbest questions in your Instagram's reels and *at the same time*, OpenAI's highest-tier and paid Codex model will go off for 1 hour to coherently restructure an entire code base, or find and exploit vulnerabilities in computer systems. This part really works and has made dramatic strides because of 2 properties: 1) these domains offer explicit reward functions that are verifiable, meaning they are easily amenable to reinforcement learning training (e.g. unit tests passed yes or no, in contrast to writing, which is much harder to explicitly judge), but also 2) they are a lot more valuable in b2b settings, meaning that the biggest fraction of the team is focused on improving them. So here we are.

22 replies · 6 reposts · 118 likes · 11.1K views
TeaForge @TeaForgeDev
@RhysSullivan The 'good code' line moves with task scope. One agent, one file, one planned change - output holds. Six files, implicit dependencies, ambiguous spec - it degrades fast. The constraint is not the model. It is scope.
0 replies · 0 reposts · 0 likes · 12 views
Rhys @RhysSullivan
from my experience, even the best models (Opus 4.6, 5.4 xhigh / 5.3 codex) cannot write good code today without an amount of work that is equivalent to just doing the work myself. am excited for a world where they can, but in the current state i have very low trust in them
182 replies · 44 reposts · 1.5K likes · 145.4K views
TeaForge @TeaForgeDev
@aboodman @opencode The model swap is the underexplored part. Rate limit hits, you fall back to MiniMax. The fallback inherits Claude's permissions. Different failure modes, same blast radius. Scoping git access before that moment is the only point where you control the outcome.
0 replies · 0 reposts · 0 likes · 268 views
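One way to do that scoping before the session starts, sketched as a wrapper the agent shells out through; the whitelist is an assumption for illustration, not any tool's real permission config:

```python
# Hypothetical git guard: the agent calls git through this wrapper, so
# destructive subcommands stay blocked regardless of which model is
# currently driving the session.
import shlex
import subprocess
import sys

SAFE_SUBCOMMANDS = {"status", "diff", "log", "add", "commit", "stash", "show"}
BLOCKED = {"rebase", "reset", "gc", "push", "clean", "filter-branch"}


def run_git(command: str) -> int:
    args = shlex.split(command)
    if not args or args[0] != "git":
        raise ValueError("only git commands pass through this wrapper")
    sub = args[1] if len(args) > 1 else ""
    if sub in BLOCKED or sub not in SAFE_SUBCOMMANDS:
        print(f"refusing 'git {sub}': not in the session's capability set")
        return 1
    return subprocess.run(args).returncode


if __name__ == "__main__":
    sys.exit(run_git("git " + " ".join(sys.argv[1:])))
```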
Aaron Boodman @aboodman
The other day I got rate-limited on Claude and decided to try MiniMax on @opencode. At one point it had to do a git rebase and even printed out “I need to be very careful here…” It ended up reversing the logic and rather than deleting a small amount of bad code deleted everything *but* that small amount of bad code. Because it was a rebase it deleted all the history of the good code too. And because the rebase was massive git immediately gc’d, nuking the reflog too. I’d been working on this project for days and hadn’t uploaded it to GitHub (so stupid). It was customer work I’d committed to delivering the next day. Installed file recovery tools, checked cursor caches, looked everywhere. Nothing. Then I remembered @opencode’s snapshot feature. The ui didn’t work perfectly but it had the data. Few quick minutes of bash later and I had the entire project back. Forever fan.
34 replies · 12 reposts · 667 likes · 59.1K views
TeaForge @TeaForgeDev
@hillsidedev_ @aarondfrancis The gap shows up at debug time. Agent writes, tests pass, feature ships. Six weeks later an edge case breaks something. Debugging code you did not write and did not internalize is a different skill than the one the agent improved.
1 reply · 0 reposts · 1 like · 16 views
hillsideDev @hillsidedev_
@aarondfrancis rate of shipping, maybe. rate of understanding what you shipped, that's the part worth tracking. plenty of people are generating more code than they can reason about.
0 replies · 0 reposts · 4 likes · 53 views
Aaron Francis @aarondfrancis
The death of knowledge has been greatly exaggerated. LLMs write all my code now and my rate of learning has never been higher
102 replies · 36 reposts · 675 likes · 34K views
TeaForge @TeaForgeDev
@robinebers The cheap model problem is usually a context problem. Weak output from a smaller model almost always traces back to an underspecified task, missing constraints, or no clear definition of done. Fix the input and the model tier becomes less important than it looks.
0 replies · 0 reposts · 0 likes · 34 views
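A sketch of what "fix the input" can mean in practice: refuse to dispatch until the task carries constraints and a definition of done. The field names are illustrative assumptions:

```python
# Hypothetical pre-dispatch check: a task only goes to a model once it is
# specified well enough that a smaller model has a fair chance at it.
from dataclasses import dataclass


@dataclass
class Task:
    goal: str
    constraints: list[str]
    definition_of_done: list[str]


def spec_gaps(task: Task) -> list[str]:
    """Return the list of gaps; empty means the task can go to any tier."""
    gaps = []
    if len(task.goal.split()) < 5:
        gaps.append("goal is one-liner vague; name the component and the change")
    if not task.constraints:
        gaps.append("no constraints: invariants, style, files off-limits")
    if not task.definition_of_done:
        gaps.append("no definition of done: tests, behavior, acceptance checks")
    return gaps


print(spec_gaps(Task(goal="fix login", constraints=[], definition_of_done=[])))
```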
Robin Ebers | AI Coach for Founders
what's better/faster for you?
plan and build with cheap + fast models (think: Composer-2, Kimi K2.5 or SWE-1.6), then let GPT5.4 fix their mess after
OR
plan and build only with GPT 5.4?
28 replies · 0 reposts · 31 likes · 6.1K views
TeaForge @TeaForgeDev
@kevinkern These work better as a critic agent than a manual prompt. Same framing, runs automatically after each significant change. The model that built the feature is not the right reviewer of it. A separate agent with an adversarial prompt and no attachment to the output is.
0 replies · 0 reposts · 1 like · 283 views
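A rough sketch of that critic agent; `call_model` is a placeholder for whichever client you use, and the adversarial framing borrows from the prompts quoted below:

```python
# Hypothetical critic hook: after each significant change, the diff goes to
# a separate reviewer agent that had no part in writing the code.
import subprocess


def call_model(prompt: str) -> str:
    """Placeholder: swap in your actual client (API call, local model, CLI)."""
    raise NotImplementedError


ADVERSARIAL_PROMPT = (
    "Review this diff as the person who will be paged at 3am because of it. "
    "Assume it is a clever-looking bad solution. Where is it weak, fragile, "
    "or quietly dangerous?\n\n{diff}"
)


def critique_latest_change() -> str:
    # Pull the most recent change so the reviewer sees exactly what landed.
    diff = subprocess.run(
        ["git", "diff", "HEAD~1"], capture_output=True, text=True, check=True
    ).stdout
    return call_model(ADVERSARIAL_PROMPT.format(diff=diff))
```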
Kevin Kern @kevinkern
I have some weekend prompts to slow down the dopamine rush. ask your codebase the following:

1. Smart ass audit
"assume this is a clever-looking bad solution. strip away the polish and explain where the architecture is actually weak, fragile, or fake-smart."

2. The 3AM Test
"review this like you're the person who'll be called at 3am because of it. where is the design stupid, brittle, or quietly dangerous?"

3. Public Execution
"If you wanted to embarrass this design in front of a room full of senior engineers, which technical weaknesses would you attack first?"
10 replies · 8 reposts · 173 likes · 12.2K views
TeaForge @TeaForgeDev
The test suite is a second codebase the agent does not treat as one. It adds coverage without removing redundancy because removal requires understanding what already exists. The consolidation skill is doing the job the agent skips by default. Worth scheduling it - every N merges rather than waiting until the slowdown is noticeable.
1 reply · 0 reposts · 2 likes · 428 views
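Scheduling "every N merges" can be as small as a counter bumped by a post-merge hook; a sketch, with the trigger action stubbed out:

```python
# Hypothetical post-merge trigger: run the test-consolidation pass every N
# merges instead of waiting for the suite to feel slow. Call this from a
# thin .git/hooks/post-merge shim or from CI.
from pathlib import Path

N = 10
COUNTER = Path(".git/consolidation-counter")


def on_merge() -> None:
    count = int(COUNTER.read_text()) + 1 if COUNTER.exists() else 1
    if count >= N:
        print("time to run the consolidate-test-suites pass")
        count = 0
    COUNTER.write_text(str(count))


if __name__ == "__main__":
    on_merge()
```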
Kevin Kern @kevinkern
one annoying pattern with coding agents is that one bug fix turns into three more tests. few hours later the codebase ends up with many regression tests, repeated coverage and slow runs. I've added a "consolidate-test-suites" skill that I use when I spot this pattern or want to stop it before it starts.
10 replies · 15 reposts · 269 likes · 25.2K views
TeaForge @TeaForgeDev
@DavidKPiano Every time is the right answer. The question is what you are looking for. A reviewer without a spec is checking style. A reviewer with a spec is checking whether the agent understood the problem. Second one catches the bugs that matter.
0 replies · 0 reposts · 0 likes · 142 views
David K 🎹 @DavidKPiano
I have yet to read AI-generated code (for anything non-trivial) without finding at least one thing that needs to be fixed/refactored. Every single time. Read the code.
41 replies · 17 reposts · 297 likes · 10.7K views
TeaForge @TeaForgeDev
@ibuildthecloud Responsibility does not transfer to AI. It abstracts. The agent acts. The engineer who built the guardrails owns the outcome. Same accountability. One layer up.
0 replies · 0 reposts · 1 like · 16 views
Darren Shepherd @ibuildthecloud
AI cannot be made to be responsible. Do not shift responsibility from people to AI.
5 replies · 1 repost · 19 likes · 890 views
TeaForge @TeaForgeDev
@vikhyatk "Change the alarm, it's too sensitive" is the tell. That is an optimization decision dressed as a fix. An intern makes it because they do not have the history of why the threshold was set there. The agent makes it for the same reason. No context. Lowest friction path.
0 replies · 0 reposts · 0 likes · 15 views
TeaForge @TeaForgeDev
The minimal infrastructure argument just got a security dimension. Every third-party routing layer is a trust boundary you do not control. Keeping routing local is not just a cost decision - it is a containment decision. The attack surface scales with the number of external hops between your agent and its tools. Fewer hops, smaller surface.
0 replies · 0 reposts · 3 likes · 1.8K views
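What "keeping routing local" can reduce to: model selection as an in-process function, so no third-party layer sits between the agent and its endpoints. Tier names and URLs here are invented:

```python
# Hypothetical local router: model selection happens in-process, so there
# is no external routing hop between the agent and its endpoints.
ENDPOINTS = {
    "fast": "http://localhost:8001/v1",    # small local model
    "strong": "http://localhost:8002/v1",  # larger local model
}


def route(task_kind: str) -> str:
    """One hop: this function is the entire routing layer."""
    return ENDPOINTS["strong" if task_kind in {"plan", "review"} else "fast"]


print(route("plan"))  # http://localhost:8002/v1
```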
Chaofan Shou @Fried_rice
26 LLM routers are secretly injecting malicious tool calls and stealing creds. One drained our client's $500k wallet. We also managed to poison routers to forward traffic to us. Within several hours, we could directly take over ~400 hosts. Check our paper: arxiv.org/abs/2604.08407
147 replies · 648 reposts · 3.2K likes · 531.1K views
TeaForge @TeaForgeDev
@daveschatz The codebase they know well is the same problem as the whiteboard. Both test performance under artificial conditions. Neither tests how they work when AI is touching the code and they have to decide what to trust. That is the judgment call modern engineering actually requires.
0 replies · 0 reposts · 0 likes · 62 views
Dave Schatz @daveschatz
AI is forcing us to rethink engineering interviews. A format I'm considering:
1) Candidate brings a codebase they know well.
2) I create a realistic PR ahead of time.
3) In the interview, we review it live.

I want to see:
- product sense
- code review instincts
- architectural judgment
- trade-off analysis
- communication under ambiguity

I need to know candidates can still reason about code and are able to brain-code. This feels a lot more like modern engineering than a whiteboard or LeetCode-style exercise. What am I missing?
13 replies · 1 repost · 38 likes · 7K views
TeaForge @TeaForgeDev
The handoff pattern is the right call. Write current state, decisions, and next steps to a markdown file. Start a fresh yolo session with that file as context. --dangerously-skip-permissions as a habit is how you end up approving things you stopped reading. The context cost of a clean start is lower than the trust cost of a sloppy one.
0 replies · 0 reposts · 3 likes · 671 views
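A small sketch of the handoff file described above; the section headings are an assumption about what "current state, decisions, and next steps" looks like on disk:

```python
# Hypothetical handoff writer: dump current state, decisions, and next steps
# to a markdown file, then start a fresh session with that file as context.
from datetime import datetime, timezone
from pathlib import Path


def write_handoff(state: str, decisions: list[str], next_steps: list[str]) -> Path:
    path = Path("HANDOFF.md")
    lines = [f"# Handoff {datetime.now(timezone.utc):%Y-%m-%d %H:%M}", ""]
    lines += ["## Current state", state, "", "## Decisions"]
    lines += [f"- {d}" for d in decisions]
    lines += ["", "## Next steps"]
    lines += [f"- {s}" for s in next_steps]
    path.write_text("\n".join(lines) + "\n")
    return path


write_handoff(
    "Auth refactor half done; token rotation works, refresh path untested.",
    decisions=["kept sessions server-side", "deferred rate limiting"],
    next_steps=["test refresh path", "wire up logout"],
)
```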
Boris Jabes @borisjabes
One of life’s biggest conundrums is: I’m 10 mins into this Claude session and didn’t yolo. Do I lose context and start again or just approve “ls” commands for the next 40 minutes?
33 replies · 1 repost · 100 likes · 17.9K views
TeaForge @TeaForgeDev
@alexalbert__ Local setup: 9B executor, 27B planner. The architecture holds. The escalation decision is manual. A prompt rule for escalation is a leaky abstraction. The real router needs to classify task complexity before the executor tries and fails. That is where the local version breaks.
0 replies · 0 reposts · 0 likes · 734 views
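A sketch of routing on task complexity before the executor runs; the signals and model names are illustrative assumptions, not a tested classifier:

```python
# Hypothetical complexity router: classify the task up front instead of a
# prompt rule that asks the small model to notice its own failure.
HARD_SIGNALS = ("refactor", "migration", "concurrency", "across", "redesign")


def pick_model(task: str, files_in_scope: int) -> str:
    # Crude score: hard keywords plus a penalty for multi-file scope.
    score = sum(sig in task.lower() for sig in HARD_SIGNALS)
    score += files_in_scope > 2
    return "planner-27b" if score >= 2 else "executor-9b"


print(pick_model("rename a config key", files_in_scope=1))             # executor-9b
print(pick_model("refactor job queue concurrency", files_in_scope=4))  # planner-27b
```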
TeaForge @TeaForgeDev
@tekbog Sandboxes are effect systems with worse syntax. The harness decides what the agent can read, write, and call. That is a capability model. Functional languages formalized this in the 80s. The field is rebuilding it from scratch anyway.
0 replies · 0 reposts · 7 likes · 455 views
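The capability model in miniature; a sketch of the idea, not a real sandbox:

```python
# Hypothetical capability check: the harness grants the agent an explicit
# set of capabilities and every tool call is checked against it, the same
# idea an effect system encodes in types.
from enum import Enum, auto


class Cap(Enum):
    READ = auto()
    WRITE = auto()
    NET = auto()
    EXEC = auto()


class Sandbox:
    def __init__(self, granted: frozenset):
        self.granted = granted

    def call(self, tool: str, needs: frozenset) -> None:
        missing = needs - self.granted
        if missing:
            raise PermissionError(f"{tool} needs {missing}, not granted")
        print(f"{tool}: ok")


agent = Sandbox(granted=frozenset({Cap.READ, Cap.WRITE}))
agent.call("read_file", needs=frozenset({Cap.READ}))  # ok
try:
    agent.call("curl", needs=frozenset({Cap.NET}))    # blocked
except PermissionError as e:
    print(e)
```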
terminally onλine εngineer
a lot of harness and agent engineering with sandboxes is just recreating functional programming from first principles
17 replies · 7 reposts · 281 likes · 13.4K views
TeaForge @TeaForgeDev
@championswimmer The reverse path exists. It is expensive. Refactoring a weak AI-generated foundation requires the discipline the original build skipped. Brainstorm the target. Atomic changes. Test before moving forward. AI handles execution. Knowing what to aim for has no shortcut.
0 replies · 0 reposts · 0 likes · 11 views
Arnav Gupta @championswimmer
Codebases which had their core architecture created before Opus/GPT5 released have infinite edge over new projects created today. I am seeing this with my own projects from before and after. A great core architecture keeps the project very lean and stable even when AI runs amok on it. But if the AI itself came up with a shortsighted architecture (as it happens in today's new projects if you are not hands-on in that stage), then slop ensues very soon.
11 replies · 4 reposts · 197 likes · 16.4K views
TeaForge @TeaForgeDev
Fast feedback and a written spec are not in conflict. The spec doesn't need to be heavy. It needs to be specific enough that two people reading it would build the same thing. Most PRDs fail that test anyway. Docs vs no docs is the wrong question. It is whether the shared model exists before execution starts.
0 replies · 3 reposts · 8 likes · 1.1K views
Adi Polak @AdiPolak
60–80% of Anthropic projects start without a PRD. Just Slack, context, fast pushback. No heavy docs. Makes me wonder if “good process” is just slow feedback in disguise.
53 replies · 36 reposts · 802 likes · 61.8K views
TeaForge @TeaForgeDev
@therealdanvega The amnesia costs most at the boundary. AI is confident about the framework. It has no model of your constraints, history, or past failures. That gap is invisible until it breaks a rule the codebase learned the hard way. Fundamentals let you catch it before it merges.
0 replies · 0 reposts · 0 likes · 26 views
Dan Vega @therealdanvega
🤖 AI feels like magic when you ask it about a framework you've never used. Then you ask it about the code you've lived in for five years and catch mistakes on every line. That's Gell-Mann amnesia. And it's why the fundamentals matter more now, not less.
9 replies · 15 reposts · 111 likes · 3.6K views
TeaForge @TeaForgeDev
Went in wondering if I was behind. The room answered that.
0 replies · 0 reposts · 0 likes · 62 views
TeaForge @TeaForgeDev
Five people in the call. Two not using AI at all. Three using Cursor without a structured workflow. Stars are an awareness signal, not an adoption signal.
1 reply · 0 reposts · 1 like · 71 views
TeaForge @TeaForgeDev
Mentioned spec-kit in a Cursor workflow discussion. 85.9k stars. Published by GitHub. Nobody in the room had heard of it.
0 replies · 0 reposts · 2 likes · 111 views