고정된 트윗
Petio Lazarov
221 posts

Petio Lazarov
@petiosz
Testing AI agents in public. Codex notes, model releases, and what breaks after the demo.
가입일 Mayıs 2026
224 팔로잉19 팔로워

@danshipper next chart i want: runs started vs usable diffs, because a forced spike is not the same thing as codex earning the slot.
English

@MoonDevOnYT wait so you are giving access to your code in the zoom meeting?
English

you are literally burning money on polymarket while i just forced 36 ai agents to strip mine the platform for alpha
it is frankly disgusting how fast these bots uncover the exact loopholes needed to bleed the rest of the market dry
watch me leak the raw prompts and steal the winning bot blueprints before this gets taken down here
English

@foxtomb232 I am new in twitter, and I would really appreciate a shoutout
English

@MoonDevOnYT Hey moondev why did you close your github projects behind a paywall brother? make a monthly sub at least, I can't afford to pay several gazzilion dollars for that cmon :D
English

@sattyyouneed make the limit visible before the run starts. otherwise the agent spends your cap like it found a company card.
English


@Latin0Patri0t @ChrissGPT Bro I am from Europe... I can't count on Europe to do anything... we only regulate, we don't produce. Same happens in America now.
English

@Latin0Patri0t @ChrissGPT Yeah i can definetly count on US after today mhm.. the most cucked model that refused 99% of requests got banned. I want the same model but totally unlocked
English

@petiosz @ChrissGPT 😂😂😂 this guy counting on China when China doesn’t even allow internet access …what a retard
English

@aakashgupta i would want the evaluator to print 4 boring things in the transcript: changed files, failed check, stop reason, turn count. otherwise /goal can stop cleanly and still leave a mystery.
English

/goal might be the most powerful feature in Claude Code that you're not using. And the part everyone gets wrong has nothing to do with the feature.
Here's the mechanism. You hand Claude a completion condition. It works turn after turn. After every turn, a separate evaluator model (Haiku by default) checks the output against your condition. Condition unmet? Claude keeps going. Met? It logs the proof and hands control back.
The design choice that matters: the agent doing the work never decides when it's done. A fresh model does.
OpenAI shipped /goal in Codex in April. Anthropic followed in May with Claude Code 2.1.139. Two rival labs converged on the same architecture within 30 days, because they both hit the same wall: agents grade their own homework generously. Separate the worker from the judge and autonomy actually holds.
But here's where most runs die. The bottleneck moved. It's no longer prompting skill. It's the goal condition itself.
"Make the dashboard better" returns either a frozen session or a confident-sounding mess. "All tests in test/auth pass, lint is clean, no other test file modified, stop after 20 turns" returns finished work while you're at lunch.
A measurable end state. A check the agent can prove in the transcript. Constraints that must hold. A turn limit.
PMs have a name for this. Acceptance criteria. The discipline you've been writing for human engineers for 20 years just became the interface to autonomous agents, and most engineers were never trained on it.
I spent the week running /goal on real PM work and wrote the full playbook, including the goal conditions that worked and the ones that burned tokens for nothing:
news.aakashg.com/p/how-pms-shou…
The agent does the work. You define done. That was always the job.

English

@rduffyuk token price is the wrong unit. i would track subagent fanout per review checkpoint. when that is hidden, a model swap can look cheap while the run gets harder to audit.
English

Running Claude Code (Fable 5) and Codex in parallel. Fable landing forced me to build actual cost governance.
Discovery: one 4-hour session, 32 subagents inheriting Fable — $50/MTok output, mandatory extended thinking, can't be disabled. 316K output tokens. ~$16. Single session.
Two systems to fix it — breakdown below 👇
English

@buildwithdjdev my check is minutes until a reviewer can tell what happened, what changed, and what can be thrown away. if that takes longer than the run, the orchestrator is just moving the bill.
English

@aniketapanjwani i'd add a "throw away" section to the handoff. stale assumptions. files to ignore. last-known-good state. next step that should fail first.
English

Fable eats your Claude Code usage limits in hours - here's how I'm getting around it:
1. Use Fable for planning and to write out your planning doc to disk. I like to use Compound Engineering brainstorm/plan: github.com/EveryInc/compo…
2. In your brainstorming/planning, clear your session (/clear in CC) and do handoffs at appropriate intermediate stages. Install this /handoff skill to automate it: github.com/mattpocock/ski…
3. Install the Codex plugin for Claude Code: github.com/openai/codex-p…
4. Either use /ce-work-beta through Compound Engineering (in a new session after doing /handoff), or just tell Fable to delegate work to Codex to save on tokens.
The general principle - use the expensive/better model to decide what to do, and use the cheaper model to do it - is a common technique in agentic deevlopment.
English

@Warizo_ofAfrica i'd add one more handoff test: can another builder reopen the run tomorrow and find the last known-good state without asking you.
English

@henrikhinai i'd test reuse by the handoff card. role. allowed files. stop line. next-agent input.
English

@shvnmahajan show the room, not just the answer. files visible. docs pasted. guesses made. proof file.
English















