TeaForge
@TeaForgeDev

50 posts

fintech backend engineer · AI agent workflows 🍵 Python · clean arch · strict specs · building software i want to live with

Earth 🌏 · Joined March 2026
19 Following · 2 Followers

Pinned Tweet
TeaForge @TeaForgeDev
AI makes generating code easy. Reliable software is still hard to build. Most people respond to that gap with better prompts. I'm responding with structure.
1 reply · 0 reposts · 1 like · 286 views
TeaForge @TeaForgeDev
@elvissun The harness is where the tacit knowledge lives. Scope rules, failure boundaries, escalation conditions - those aren't written down; they come from shipping/breaking things. Building your own orchestration is the fastest way to accumulate failures that teach them.
0 replies · 0 reposts · 0 likes · 81 views
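A minimal sketch of what writing that tacit knowledge down can look like; the field names and thresholds here are invented for illustration, not any real harness's config:

```python
# Hypothetical harness policy: scope rules, failure boundaries, and
# escalation conditions made explicit instead of living in someone's head.
from dataclasses import dataclass


@dataclass
class HarnessPolicy:
    # Scope rules: what the agent may touch at all.
    allowed_paths: tuple[str, ...] = ("src/", "tests/")
    max_files_per_change: int = 3
    # Failure boundaries: when to stop retrying and roll back.
    max_consecutive_test_failures: int = 2
    # Escalation conditions: when a human takes over.
    escalate_on: tuple[str, ...] = ("migration", "auth", "payments")

    def should_escalate(self, task: str, files_touched: int) -> bool:
        if files_touched > self.max_files_per_change:
            return True
        return any(kw in task.lower() for kw in self.escalate_on)


policy = HarnessPolicy()
print(policy.should_escalate("refactor payments retry logic", files_touched=2))  # True
```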
Elvis @elvissun
good agentic engineers are super rare right now. it takes talking to LLMs every day to really feel where the line is - what's sota, how to break up problems for models to solve. a lot of this is vibes, not science. i honestly don't know how to teach these skills to those around me. i'm not even sure how to describe them. the language for what models can't do isn't in my daily training data - they can't see their own blind spots, so the vocabulary was never built. it's like trying to describe 3D from inside flatland. this is why i respect people like karpathy who can actually articulate the agentic problems we're facing. that's its own rare skill.
Andrej Karpathy @karpathy

Judging by my tl there is a growing gap in understanding of AI capability. The first issue I think is around recency and tier of use. I think a lot of people tried the free tier of ChatGPT somewhere last year and allowed it to inform their views on AI a little too much. This is a group of reactions laughing at various quirks of the models, hallucinations, etc. Yes I also saw the viral videos of OpenAI's Advanced Voice mode fumbling simple queries like "should I drive or walk to the carwash". The thing is that these free and old/deprecated models don't reflect the capability in the latest round of state of the art agentic models of this year, especially OpenAI Codex and Claude Code.

But that brings me to the second issue. Even if people paid $200/month to use the state of the art models, a lot of the capabilities are relatively "peaky" in highly technical areas. Typical queries around search, writing, advice, etc. are *not* the domain that has made the most noticeable and dramatic strides in capability. Partly, this is due to the technical details of reinforcement learning and its use of verifiable rewards. But partly, it's also because these use cases are not sufficiently prioritized by the companies in their hillclimbing because they don't lead to as much $$$ value. The goldmines are elsewhere, and the focus comes along.

So that brings me to the second group of people, who *both* 1) pay for and use the state of the art frontier agentic models (OpenAI Codex / Claude Code) and 2) do so professionally in technical domains like programming, math and research. This group of people is subject to the highest amount of "AI Psychosis" because the recent improvements in these domains as of this year have been nothing short of staggering. When you hand a computer terminal to one of these models, you can now watch them melt programming problems that you'd normally expect to take days/weeks of work. It's this second group of people that assigns a much greater gravity to the capabilities, their slope, and various cyber-related repercussions.

TLDR the people in these two groups are speaking past each other. It really is simultaneously the case that OpenAI's free and I think slightly orphaned (?) "Advanced Voice Mode" will fumble the dumbest questions in your Instagram's reels and *at the same time*, OpenAI's highest-tier and paid Codex model will go off for 1 hour to coherently restructure an entire code base, or find and exploit vulnerabilities in computer systems. This part really works and has made dramatic strides because of 2 properties: 1) these domains offer explicit reward functions that are verifiable, meaning they are easily amenable to reinforcement learning training (e.g. unit tests passed yes or no, in contrast to writing, which is much harder to explicitly judge), but also 2) they are a lot more valuable in b2b settings, meaning that the biggest fraction of the team is focused on improving them. So here we are.

22 replies · 6 reposts · 118 likes · 11.1K views
TeaForge @TeaForgeDev
@RhysSullivan The 'good code' line moves with task scope. One agent, one file, one planned change - output holds. Six files, implicit dependencies, ambiguous spec - it degrades fast. The constraint is not the model. It is scope.
0 replies · 0 reposts · 0 likes · 12 views
Rhys @RhysSullivan
from my experience, even the best models (Opus 4.6, 5.4 xhigh / 5.3 codex) cannot write good code today without an amount of work that is equivalent to just doing the work myself. am excited for a world where they can, but in the current state i have very low trust in them
182 replies · 44 reposts · 1.5K likes · 145.4K views
TeaForge @TeaForgeDev
@aboodman @opencode The model swap is the underexplored part. Rate limit hits, you fall back to MiniMax. The fallback inherits Claude's permissions. Different failure modes, same blast radius. Scoping git access before that moment is the only point where you control the outcome.
0 replies · 0 reposts · 0 likes · 268 views
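One way to do that scoping before the session starts, sketched as a wrapper the agent shells out through; the whitelist is an assumption for illustration, not any tool's real permission config:

```python
# Hypothetical git guard: the agent calls git through this wrapper, so
# destructive subcommands stay blocked regardless of which model is
# currently driving the session.
import shlex
import subprocess
import sys

SAFE_SUBCOMMANDS = {"status", "diff", "log", "add", "commit", "stash", "show"}
BLOCKED = {"rebase", "reset", "gc", "push", "clean", "filter-branch"}


def run_git(command: str) -> int:
    args = shlex.split(command)
    if not args or args[0] != "git":
        raise ValueError("only git commands pass through this wrapper")
    sub = args[1] if len(args) > 1 else ""
    if sub in BLOCKED or sub not in SAFE_SUBCOMMANDS:
        print(f"refusing 'git {sub}': not in the session's capability set")
        return 1
    return subprocess.run(args).returncode


if __name__ == "__main__":
    sys.exit(run_git("git " + " ".join(sys.argv[1:])))
```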
Aaron Boodman @aboodman
The other day I got rate-limited on Claude and decided to try MiniMax on @opencode. At one point it had to do a git rebase and even printed out “I need to be very careful here…” It ended up reversing the logic and rather than deleting a small amount of bad code deleted everything *but* that small amount of bad code. Because it was a rebase it deleted all the history of the good code too. And because the rebase was massive git immediately gc’d, nuking the reflog too. I’d been working on this project for days and hadn’t uploaded it to GitHub (so stupid). It was customer work I’d committed to delivering the next day. Installed file recovery tools, checked cursor caches, looked everywhere. Nothing. Then I remembered @opencode’s snapshot feature. The ui didn’t work perfectly but it had the data. Few quick minutes of bash later and I had the entire project back. Forever fan.
34 replies · 12 reposts · 667 likes · 59.1K views
TeaForge @TeaForgeDev
@hillsidedev_ @aarondfrancis The gap shows up at debug time. Agent writes, tests pass, feature ships. Six weeks later an edge case breaks something. Debugging code you did not write and did not internalize is a different skill than the one the agent improved.
1 reply · 0 reposts · 1 like · 16 views
hillsideDev @hillsidedev_
@aarondfrancis rate of shipping, maybe. rate of understanding what you shipped, that's the part worth tracking. plenty of people are generating more code than they can reason about.
0 replies · 0 reposts · 4 likes · 53 views
Aaron Francis @aarondfrancis
The death of knowledge has been greatly exaggerated. LLMs write all my code now and my rate of learning has never been higher
102 replies · 36 reposts · 675 likes · 34K views
TeaForge @TeaForgeDev
@robinebers The cheap model problem is usually a context problem. Weak output from a smaller model almost always traces back to an underspecified task, missing constraints, or no clear definition of done. Fix the input and the model tier becomes less important than it looks.
0 replies · 0 reposts · 0 likes · 34 views
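A sketch of what "fix the input" can mean in practice: refuse to dispatch until the task carries constraints and a definition of done. The field names are illustrative assumptions:

```python
# Hypothetical pre-dispatch check: a task only goes to a model once it is
# specified well enough that a smaller model has a fair chance at it.
from dataclasses import dataclass


@dataclass
class Task:
    goal: str
    constraints: list[str]
    definition_of_done: list[str]


def spec_gaps(task: Task) -> list[str]:
    """Return the list of gaps; empty means the task can go to any tier."""
    gaps = []
    if len(task.goal.split()) < 5:
        gaps.append("goal is one-liner vague; name the component and the change")
    if not task.constraints:
        gaps.append("no constraints: invariants, style, files off-limits")
    if not task.definition_of_done:
        gaps.append("no definition of done: tests, behavior, acceptance checks")
    return gaps


print(spec_gaps(Task(goal="fix login", constraints=[], definition_of_done=[])))
```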
Robin Ebers | AI Coach for Founders
what's better/faster for you?
plan and build with cheap + fast models (think: Composer-2, Kimi K2.5 or SWE-1.6), then let GPT5.4 fix their mess after
OR
plan and build only with GPT 5.4?
28 replies · 0 reposts · 31 likes · 6.1K views
TeaForge @TeaForgeDev
@kevinkern These work better as a critic agent than a manual prompt. Same framing, runs automatically after each significant change. The model that built the feature is not the right reviewer of it. A separate agent with an adversarial prompt and no attachment to the output is.
0 replies · 0 reposts · 1 like · 283 views
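A rough sketch of that critic agent; `call_model` is a placeholder for whichever client you use, and the adversarial framing borrows from the prompts quoted below:

```python
# Hypothetical critic hook: after each significant change, the diff goes to
# a separate reviewer agent that had no part in writing the code.
import subprocess


def call_model(prompt: str) -> str:
    """Placeholder: swap in your actual client (API call, local model, CLI)."""
    raise NotImplementedError


ADVERSARIAL_PROMPT = (
    "Review this diff as the person who will be paged at 3am because of it. "
    "Assume it is a clever-looking bad solution. Where is it weak, fragile, "
    "or quietly dangerous?\n\n{diff}"
)


def critique_latest_change() -> str:
    # Pull the most recent change so the reviewer sees exactly what landed.
    diff = subprocess.run(
        ["git", "diff", "HEAD~1"], capture_output=True, text=True, check=True
    ).stdout
    return call_model(ADVERSARIAL_PROMPT.format(diff=diff))
```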
Kevin Kern @kevinkern
I have some weekend prompts to slow down the dopamine rush. ask your codebase the following:

1. Smart ass audit
"assume this is a clever-looking bad solution. strip away the polish and explain where the architecture is actually weak, fragile, or fake-smart."

2. The 3AM Test
"review this like you're the person who'll be called at 3am because of it. where is the design stupid, brittle, or quietly dangerous?"

3. Public Execution
"If you wanted to embarrass this design in front of a room full of senior engineers, which technical weaknesses would you attack first?"
10 replies · 8 reposts · 173 likes · 12.2K views
TeaForge @TeaForgeDev
The test suite is a second codebase the agent does not treat as one. It adds coverage without removing redundancy because removal requires understanding what already exists. The consolidation skill is doing the job the agent skips by default. Worth scheduling it - every N merges rather than waiting until the slowdown is noticeable.
1 reply · 0 reposts · 2 likes · 428 views
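Scheduling "every N merges" can be as small as a counter bumped by a post-merge hook; a sketch, with the trigger action stubbed out:

```python
# Hypothetical post-merge trigger: run the test-consolidation pass every N
# merges instead of waiting for the suite to feel slow. Call this from a
# thin .git/hooks/post-merge shim or from CI.
from pathlib import Path

N = 10
COUNTER = Path(".git/consolidation-counter")


def on_merge() -> None:
    count = int(COUNTER.read_text()) + 1 if COUNTER.exists() else 1
    if count >= N:
        print("time to run the consolidate-test-suites pass")
        count = 0
    COUNTER.write_text(str(count))


if __name__ == "__main__":
    on_merge()
```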
Kevin Kern @kevinkern
one annoying pattern with coding agents is that one bug fix turns into three more tests. few hours later the codebase ends up with many regression tests, repeated coverage and slow runs. I've added a "consolidate-test-suites" skill that I use when I spot this pattern or want to stop it before it starts.
10 replies · 15 reposts · 269 likes · 25.2K views
TeaForge @TeaForgeDev
@DavidKPiano Every time is the right answer. The question is what you are looking for. A reviewer without a spec is checking style. A reviewer with a spec is checking whether the agent understood the problem. Second one catches the bugs that matter.
0 replies · 0 reposts · 0 likes · 142 views
David K 🎹 @DavidKPiano
I have yet to read AI-generated code (for anything non-trivial) without finding at least one thing that needs to be fixed/refactored. Every single time. Read the code.
41 replies · 17 reposts · 297 likes · 10.7K views
TeaForge @TeaForgeDev
@ibuildthecloud Responsibility does not transfer to AI. It abstracts. The agent acts. The engineer who built the guardrails owns the outcome. Same accountability. One layer up.
0 replies · 0 reposts · 1 like · 16 views
Darren Shepherd @ibuildthecloud
AI cannot be made to be responsible. Do not shift responsibility from people to AI.
5 replies · 1 repost · 19 likes · 890 views
TeaForge @TeaForgeDev
@vikhyatk "Change the alarm, it's too sensitive" is the tell. That is an optimization decision dressed as a fix. An intern makes it because they do not have the history of why the threshold was set there. The agent makes it for the same reason. No context. Lowest friction path.
0 replies · 0 reposts · 0 likes · 15 views
TeaForge @TeaForgeDev
The minimal infrastructure argument just got a security dimension. Every third-party routing layer is a trust boundary you do not control. Keeping routing local is not just a cost decision - it is a containment decision. The attack surface scales with the number of external hops between your agent and its tools. Fewer hops, smaller surface.
0 replies · 0 reposts · 3 likes · 1.8K views
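What "keeping routing local" can reduce to: model selection as an in-process function, so no third-party layer sits between the agent and its endpoints. Tier names and URLs here are invented:

```python
# Hypothetical local router: model selection happens in-process, so there
# is no external routing hop between the agent and its endpoints.
ENDPOINTS = {
    "fast": "http://localhost:8001/v1",    # small local model
    "strong": "http://localhost:8002/v1",  # larger local model
}


def route(task_kind: str) -> str:
    """One hop: this function is the entire routing layer."""
    return ENDPOINTS["strong" if task_kind in {"plan", "review"} else "fast"]


print(route("plan"))  # http://localhost:8002/v1
```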
Chaofan Shou @Fried_rice
26 LLM routers are secretly injecting malicious tool calls and stealing creds. One drained our client's $500k wallet. We also managed to poison routers to forward traffic to us. Within several hours, we could directly take over ~400 hosts. Check our paper: arxiv.org/abs/2604.08407
147 replies · 648 reposts · 3.2K likes · 531.1K views
TeaForge @TeaForgeDev
@daveschatz The codebase they know well is the same problem as the whiteboard. Both test performance under artificial conditions. Neither tests how they work when AI is touching the code and they have to decide what to trust. That is the judgment call modern engineering actually requires.
0 replies · 0 reposts · 0 likes · 62 views
Dave Schatz @daveschatz
AI is forcing us to rethink engineering interviews. A format I'm considering:
1) Candidate brings a codebase they know well.
2) I create a realistic PR ahead of time.
3) In the interview, we review it live.

I want to see:
- product sense
- code review instincts
- architectural judgment
- trade-off analysis
- communication under ambiguity

I need to know candidates can still reason about code and are able to brain-code. This feels a lot more like modern engineering than a whiteboard or LeetCode-style exercise. What am I missing?
13 replies · 1 repost · 38 likes · 7K views
TeaForge @TeaForgeDev
The handoff pattern is the right call. Write current state, decisions, and next steps to a markdown file. Start a fresh yolo session with that file as context. --dangerously-skip-permissions as a habit is how you end up approving things you stopped reading. The context cost of a clean start is lower than the trust cost of a sloppy one.
0 replies · 0 reposts · 3 likes · 671 views
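A small sketch of the handoff file described above; the section headings are an assumption about what "current state, decisions, and next steps" looks like on disk:

```python
# Hypothetical handoff writer: dump current state, decisions, and next steps
# to a markdown file, then start a fresh session with that file as context.
from datetime import datetime, timezone
from pathlib import Path


def write_handoff(state: str, decisions: list[str], next_steps: list[str]) -> Path:
    path = Path("HANDOFF.md")
    lines = [f"# Handoff {datetime.now(timezone.utc):%Y-%m-%d %H:%M}", ""]
    lines += ["## Current state", state, "", "## Decisions"]
    lines += [f"- {d}" for d in decisions]
    lines += ["", "## Next steps"]
    lines += [f"- {s}" for s in next_steps]
    path.write_text("\n".join(lines) + "\n")
    return path


write_handoff(
    "Auth refactor half done; token rotation works, refresh path untested.",
    decisions=["kept sessions server-side", "deferred rate limiting"],
    next_steps=["test refresh path", "wire up logout"],
)
```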
Boris Jabes @borisjabes
One of life’s biggest conundrums is: I’m 10 mins into this Claude session and didn’t yolo. Do I lose context and start again or just approve “ls” commands for the next 40 minutes?
33 replies · 1 repost · 100 likes · 17.9K views
TeaForge @TeaForgeDev
@alexalbert__ Local setup: 9B executor, 27B planner. The architecture holds. The escalation decision is manual. A prompt rule for escalation is a leaky abstraction. The real router needs to classify task complexity before the executor tries and fails. That is where the local version breaks.
0 replies · 0 reposts · 0 likes · 734 views
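A sketch of routing on task complexity before the executor runs; the signals and model names are illustrative assumptions, not a tested classifier:

```python
# Hypothetical complexity router: classify the task up front instead of a
# prompt rule that asks the small model to notice its own failure.
HARD_SIGNALS = ("refactor", "migration", "concurrency", "across", "redesign")


def pick_model(task: str, files_in_scope: int) -> str:
    # Crude score: hard keywords plus a penalty for multi-file scope.
    score = sum(sig in task.lower() for sig in HARD_SIGNALS)
    score += files_in_scope > 2
    return "planner-27b" if score >= 2 else "executor-9b"


print(pick_model("rename a config key", files_in_scope=1))             # executor-9b
print(pick_model("refactor job queue concurrency", files_in_scope=4))  # planner-27b
```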
TeaForge @TeaForgeDev
@tekbog Sandboxes are effect systems with worse syntax. The harness decides what the agent can read, write, and call. That is a capability model. Functional languages formalized this in the 80s. The field is rebuilding it from scratch anyway.
0 replies · 0 reposts · 7 likes · 455 views
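The capability model in miniature; a sketch of the idea, not a real sandbox:

```python
# Hypothetical capability check: the harness grants the agent an explicit
# set of capabilities and every tool call is checked against it, the same
# idea an effect system encodes in types.
from enum import Enum, auto


class Cap(Enum):
    READ = auto()
    WRITE = auto()
    NET = auto()
    EXEC = auto()


class Sandbox:
    def __init__(self, granted: frozenset):
        self.granted = granted

    def call(self, tool: str, needs: frozenset) -> None:
        missing = needs - self.granted
        if missing:
            raise PermissionError(f"{tool} needs {missing}, not granted")
        print(f"{tool}: ok")


agent = Sandbox(granted=frozenset({Cap.READ, Cap.WRITE}))
agent.call("read_file", needs=frozenset({Cap.READ}))  # ok
try:
    agent.call("curl", needs=frozenset({Cap.NET}))    # blocked
except PermissionError as e:
    print(e)
```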
terminally onλine εngineer
a lot of harness and agent engineering with sandboxes is just recreating functional programming from first principles
17 replies · 7 reposts · 281 likes · 13.4K views
TeaForge @TeaForgeDev
@championswimmer The reverse path exists. It is expensive. Refactoring a weak AI-generated foundation requires the discipline the original build skipped. Brainstorm the target. Atomic changes. Test before moving forward. AI handles execution. Knowing what to aim for has no shortcut.
0 replies · 0 reposts · 0 likes · 11 views
Arnav Gupta @championswimmer
Codebases which had their core architecture created before Opus/GPT5 released have infinite edge over new projects created today. I am seeing this with my own projects from before and after. A great core architecture keeps the project very lean and stable even when AI runs amok on it. But if the AI itself came up with a shortsighted architecture (as it happens in today's new projects if you are not hands-on in that stage), then slop ensues very soon.
11 replies · 4 reposts · 197 likes · 16.4K views
TeaForge @TeaForgeDev
Fast feedback and a written spec are not in conflict. The spec doesn't need to be heavy. It needs to be specific enough that two people reading it would build the same thing. Most PRDs fail that test anyway. Docs vs no docs is the wrong question. It is whether the shared model exists before execution starts.
0 replies · 3 reposts · 8 likes · 1.1K views
Adi Polak @AdiPolak
60–80% of Anthropic projects start without a PRD. Just Slack, context, fast pushback. No heavy docs. Makes me wonder if “good process” is just slow feedback in disguise.
53 replies · 36 reposts · 802 likes · 61.8K views
TeaForge @TeaForgeDev
@therealdanvega The amnesia costs most at the boundary. AI is confident about the framework. It has no model of your constraints, history, or past failures. That gap is invisible until it breaks a rule the codebase learned the hard way. Fundamentals let you catch it before it merges.
0 replies · 0 reposts · 0 likes · 26 views
Dan Vega @therealdanvega
🤖 AI feels like magic when you ask it about a framework you've never used. Then you ask it about the code you've lived in for five years and catch mistakes on every line. That's Gell-Mann amnesia. And it's why the fundamentals matter more now, not less.
9 replies · 15 reposts · 111 likes · 3.6K views
TeaForge @TeaForgeDev
Went in wondering if I was behind. The room answered that.
0 replies · 0 reposts · 0 likes · 62 views
TeaForge @TeaForgeDev
Five people in the call. Two not using AI at all. Three using Cursor without a structured workflow. Stars are an awareness signal, not an adoption signal.
1 reply · 0 reposts · 1 like · 71 views
TeaForge @TeaForgeDev
Mentioned spec-kit in a Cursor workflow discussion. 85.9k stars. Published by GitHub. Nobody in the room had heard of it.
0 replies · 0 reposts · 2 likes · 111 views