Dor Amir
@amir_dor

273 posts

ML Engineer → Building open-source dev tools
Creator of Nadir (LLM router, 400+ ⭐) & Draft
Making AI agents actually autonomous
https://t.co/dkUkCj3aP7

New York, NY · Joined February 2012
304 Following · 53 Followers
Dor Amir @amir_dor ·
May the 4th reminder: even the Empire had budget constraints. Your AI stack doesn't need Death Star pricing to be effective. Route smart, use cheaper models for simple tasks, save the premium tier for when it matters. The Force is strong with cost optimization.
Dor Amir @amir_dor ·
@MoltenRockAI that's the moment. two weeks of "AI will save us time" then someone shows the Anthropic invoice and suddenly everyone's interested in the routing logic they ignored during sprint planning
MoltenRock 🔥 @MoltenRockAI ·
Question for anyone running AI agents in production: What kills your setup first?
A) Token costs spiraling
B) Context window saturation
C) Memory drift (agent "forgets" important stuff)
D) Tool reliability (APIs flaking out)
Genuinely curious. For us it's C every time.
Dor Amir @amir_dor ·
@bohdanbirdie @Cloudflare makes sense for personal use. but once you're running agents autonomously or have multiple devs hitting the same API, routing becomes the difference between $500/mo and $5K/mo
Bohdan Birdie 🇺🇦 @bohdanbirdie ·
Building more and more on @Cloudflare Quick demo of Cloudstash - personal links vault. It's a local-first app that uses Durable Objects, Agents SDK, and Workers AI, sprinkled with a few other bindings that Cloudflare provides. tldr: send links, let LLM process it, review later
Dor Amir @amir_dor ·
@MoltenRockAI exactly. and most teams won't figure out the routing layer until after the CFO freaks out about the bill. by then you're stuck defending why you need Opus for "update user preferences" calls
Dor Amir @amir_dor ·
GPT-5.5, Claude 4.6, Gemini 3.1 Ultra all dropped this month. more models = more options = bigger bill if you don't route smart. most teams will hit the flagship for everything, then wonder why costs 3x'd. the answer was in the release notes: use the cheaper tiers
Dor Amir @amir_dor ·
Karpathy: "I've never felt more behind as a programmer"
He hasn't written code since Dec 2025. Just AI agents.
A co-founder of OpenAI feels behind.
If you're struggling to keep up in 2026, you're not alone. The shift is real.
Dor Amir @amir_dor ·
@GenAI_is_real saying no is critical. the other skill: knowing when to downgrade. agents love calling expensive models for everything. teaching them to route simple work to cheaper tiers while keeping premium models for complex reasoning is how you scale production agents economically.
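What that downgrade-aware routing layer could look like, reduced to a toy. A minimal sketch, not Nadir's actual implementation — the model names, keyword list, and 0.5 threshold are illustrative placeholders:

```python
# A toy sketch of complexity-based routing -- NOT Nadir's actual logic.
# Model names, keywords, and the 0.5 threshold are illustrative placeholders.

CHEAP_MODEL = "claude-haiku"    # routine work: parsing, status checks
PREMIUM_MODEL = "claude-opus"   # multi-step reasoning

def estimate_complexity(prompt: str) -> float:
    """Crude proxy: long prompts and reasoning keywords score higher."""
    score = min(len(prompt) / 2000, 1.0)
    for kw in ("analyze", "design", "prove", "refactor", "trade-off"):
        if kw in prompt.lower():
            score += 0.3
    return min(score, 1.0)

def route(prompt: str, threshold: float = 0.5) -> str:
    """Simple prompts go to the cheap tier; complex ones to premium."""
    return PREMIUM_MODEL if estimate_complexity(prompt) >= threshold else CHEAP_MODEL

print(route("What's the order status for #123?"))                      # claude-haiku
print(route("Analyze this architecture and propose a refactor plan.")) # claude-opus
```

In production you'd replace the keyword heuristic with a small classifier model, but the shape stays the same: score cheaply, then dispatch.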
Chayenne Zhao @GenAI_is_real ·
In the Age of Agents, an Engineer's Most Valuable Skill Is Saying "No"

I gave a talk at Snowflake recently, sharing what I've learned about agent coding over the past two years of building SGLang's inference engine, Omni multimodal serving, and AI agent workflows. The response far exceeded my expectations — it was the first time so many people asked for the slides afterward. Probably because I deliberately avoided the hardcore technical deep-dives, and instead spent the time on one thing: explaining just how many ways AI Agents can go terrifyingly wrong when maintaining real-world projects. 😂

Slides are fragments. I wanted to reorganize these thoughts into something coherent — threading together ideas scattered across different projects into a single narrative. Starting from my own engineering practice, I want to articulate what "engineering judgment" actually means in the era of agent coding.

I. Standing at the Intersection of Infra and Agent Worlds

Some background first. I'm a core developer of SGLang, one of the most widely deployed open-source inference engines in the world — 25K+ GitHub stars, running on over 400K GPUs. I currently lead two areas: SGLang RL Rollout (high-performance rollout infrastructure for RLHF) and SGLang Omni (multimodal and TTS model serving).

At the same time, I'm a heavy user of Claude Code, and I make no attempt to hide it. SGLang Omni's latest benchmark infrastructure — thousands of lines of production-grade code — was essentially executed line by line by Claude Code from our system design specs. We have a team of about ten, responsible for defining architecture, setting thresholds, planning file paths, and designing test matrices. AI delivers in dozens of hours. Believe it or not, I rarely write implementation-level code myself anymore. This isn't a prediction about the future. This is my daily reality.

But precisely because I stand at the intersection of inference engine developer and heavy AI coding user, my understanding of agent coding is probably different from most people's intuition. Most people see "AI can write code now, amazing!" What I see are three seriously overlooked hazards — is what AI writes actually correct? What should the system architecture look like? And is the token cost behind all of this actually worth it?

This article follows these three questions. Starting with the first: how do you know if what AI wrote is actually correct?

II. Effort Without Measurement Is Self-Deception

Near the end of my undergraduate years, I was doing research on intent alignment. During a conversation with a mentor I deeply respect, he systematically laid out his vision for alignment, and one core step stuck with me — building real and effective benchmarking for alignment. His point was roughly: if we can't even measure whether alignment has been achieved, then all alignment work is building castles in the air.

Years later, having done agent research, inference, and RL infra — having stepped on countless landmines — that simple truth only weighs more. And I've found, regrettably, that modern benchmarks haven't kept up. They've fallen far behind the pace of the field.

The agent space is especially bad. Every few days there's a new demo — it can control browsers, rewrite compilers, supposedly put all CUDA engineers out of work. But press further: how do you measure if it's actually good? The answer is usually a few cherry-picked cases or a carefully edited video.
On Xinzhiyuan (a prominent Chinese AI outlet), human engineers have been "replaced by AI" a thousand times over. Yet the top Cutlass engineers are still sitting in their offices, drawing high salaries, writing the kernels that actually run in production.

So in my own projects, benchmarking has been the highest priority from day one. Bar none.

I felt this most acutely building how-to-sglang — a multi-agent system for helping users understand SGLang code and answering community questions. The temptations were enormous at the start: add RAG, connect more data sources, build multi-turn conversation, try fancy agent debating. The feature list could stretch to the ceiling. But the first thing I did was build an LLM-as-a-Judge evaluation framework. Before adding any feature, answer one basic question: does your change actually make the agent more accurate?

The result: most seemingly promising optimizations showed zero improvement in testing. Without that benchmark, every decision was blind guessing — we thought we were improving, but we weren't.

Building SGLang Omni's benchmark was the same story. Before I took over: an optimization PR gets merged, TPS numbers look good, everyone's happy. A while later accuracy drops, nobody can tell which commit caused the regression, and painful bisecting begins. My first act: stop all development, build accuracy and performance CI first, then talk about optimization. Final results — S2 Pro WER 1.18% (excluding bad cases), Qwen3 Omni 1.91% without voice clone, 1.88% with voice clone. Acceptance criteria ±0.1%, all passing.

At least inference system evaluation is objective — if the number is higher, it's higher. No room for debate. Unlike agent evaluation, which is riddled with subjective judgment and fuzzy definitions. That certainty is precious.

Effort without measurement isn't effort. It's self-deception.

Benchmarking solves the "how do you know it's correct" problem. But there's an even more upstream question: who writes the benchmark framework itself? In my case, AI wrote it — but that's only half the answer.

III. The Prompt Itself Is the System Design

When I say Omni's benchmark refactor — thousands of lines — was mostly written by AI, that's not bragging. It's fact. Writing pytest fixtures, constructing subprocess calls, parsing JSON results, generating CI workflows — AI did it fast and well. But there's a detail that's easy to miss: that prompt itself was my system design.

The most critical decision in the entire refactor was task × model orthogonal separation. The old version was a 722-line monolithic script, benchmark_tts_speed.py, with all model and task logic coupled together. After refactoring: tasks/, metrics/, dataset/, benchmarker/, eval/ — five modules.

Why this decomposition? Because I knew a series of new models would be joining. Without model-agnostic abstraction, every new model means rewriting the evaluation framework. But you can't over-abstract either — Omni models differ far more than LLMs do. S2 Pro uses a Dual-AR codec architecture; Qwen3 Omni uses a 9-stage multi-process pipeline. Evaluation logic can't be fully unified. The task × model orthogonal separation is the balance point between reuse and flexibility.

Ask AI directly to "refactor these 722 lines" and it'll give you a decomposition. But getting the granularity exactly right depends on our judgment about the project's future — what models are coming, what dimensions will change, what's worth abstracting and what isn't.
This context is fuzzy, dynamic, full of probabilistic judgment. You can't fully distill it into a prompt. AI gives you a decomposition. System design gives you the right decomposition.

Code is flesh. Architecture is skeleton. In an era where AI can write ten thousand lines a day, right architecture means ten thousand lines of asset; wrong architecture means ten thousand lines of debt. And AI simultaneously amplifies the cost of wrong directions — it can turn one piece of tech debt into an entire debt empire at a speed you can't imagine.

Saying "system design matters" is empty talk. Let's look at some concrete cases where AI went wrong.

IV. Where AI Actually Fails

Where exactly did Claude fail during the Omni benchmark refactor? A few representative examples.

First category: blind spots in engineering conventions. Claude used gdown to download datasets from Google Drive — fine for a side project, but a ticking time bomb in SGLang's CI. Google Drive rate-limits, 403s, confirm tokens — our main repo has been burned too many times by unstable external download sources. The correct approach: host datasets on HuggingFace, use snapshot_download. Similar issues: dataset fixtures hardcoded to /tmp/ (path conflicts in concurrent jobs), server teardown with only SIGTERM and no SIGKILL fallback, JSON key access without schema validation. Each of these is individually "common sense," but what counts as common sense depends on which environment you work in. AI's common sense comes from the statistical distribution of internet corpora, not from the specific failure history of a particular team.

Second category: CI threshold design. Claude set the TPS threshold at 55 tok/s, with observed values of 85-87 — over 35% margin. This threshold catches catastrophic regression (88→28), but performance silently sliding from 87 to 60 wouldn't trigger any alarm. I looked at four measurements repeatedly — 85.8, 85.9, 86.9, 87.1 — standard deviation roughly 0.6. Final threshold: 80, all metrics standardized to 13-15% margin. The core of this decision isn't arithmetic — it's having a feel for this specific system's run-to-run variance, knowing what margin is "tight enough to catch chronic degradation but loose enough to avoid flakiness." Anyone who's done CI knows: threshold design is a systems engineering problem, not a math problem.

These aren't edge cases. They're systematic. AI writes fast, but between "writing fast" and "writing correctly" lies an entire engineering environment's worth of distance.

Everything above concerns AI coding's limitations in the "writing correct code" dimension. Next, I want to zoom out — not just whether the code is correct, but whether the tokens consumed behind it are actually worth the cost.

V. The Token Efficiency Crisis: Using a Fire Hose to Water Flowers

As an inference engine developer, my daily work is thinking about how to maximize prefix cache hit rates, optimize KV cache memory layouts, and minimize the cost of each inference request. So when I connected Claude Code to a local inference engine and observed its actual request patterns — how to put this — it felt like a water conservation engineer who carefully designed a reclamation system, watching someone water flowers with a fire hose.

Cache hit rate was devastating. Not "decent but room for improvement" — "the prefix cache mechanism we carefully designed at the inference engine level was almost completely destroyed." A single user query triggers multiple low-value tool calls, each carrying over 100K tokens of context window.
The Resume feature breaks KV cache hits entirely — an almost absurd bug. The entire session's context construction was never seriously designed for cache reuse from the start.

I like the RAM bloat analogy. In 1969, 64KB of memory sent Apollo to the moon. In 2026, opening a web page costs 500MB, easy. Each generation of hardware engineers pushes memory capacity higher; each generation of software engineers gleefully fills it up. We've gotten used to this cycle.

But LLM inference is different. RAM bloat costs you a slightly slower computer and a couple hundred bucks for an upgrade. Token bloat costs real money — GPU cluster electricity, user subscriptions — and scales exponentially with agent adoption. GPU compute supply elasticity is far lower than DRAM supply elasticity. When compute is constrained, token efficiency isn't "nice to have." It's the core competitiveness that determines who survives.

I have a bold hypothesis: for those sessions consuming 700K tokens, there must be ways to accomplish the exact same task with 10% of the tokens. Not by sacrificing quality — through smarter context compression, better prefix reuse strategies, more precise tool call scheduling. Anyone who has optimized inference engines, seeing current agent framework request patterns, would reach a similar conclusion.

"Reducing wasteful token spending" isn't a defensive optimization. It's an offensive capability. Whoever first achieves an order-of-magnitude reduction in token consumption at the same quality level can serve ten times the users on the same compute budget.

But is the root cause of token waste merely sloppy agent framework design? The more I think about it, the more I believe the deeper issue is architectural.

VI. Agent and Inference Engine: The Missing Co-Design

The current architecture works like this: agent frameworks treat inference engines as stateless API calls, carrying full context with every request. Inference engines do their best at prefix matching, caching what they can. Fully decoupled. Zero coordination. Simple, general-purpose, but brutally inefficient for long sessions.

My vision: if agent frameworks could sense the inference engine's cache state and proactively construct cache-friendly requests; if inference engines could understand the agent's session semantics and make smarter cache eviction decisions — once this information channel between the two opens, the potential for token efficiency gains is enormous. This requires three parties to sit down together: model builders, inference engine builders, and agent framework builders. Right now, we're nowhere close.

Maybe the market ultimately decides "compute gets cheap enough, waste doesn't matter," just like the RAM story. But I don't believe the token economy will follow the same path. Not in the near term. The age of agents doesn't belong to those who burn the most compute. It belongs to those who use it most intelligently.

Having covered the token problem from an inference engine perspective, I want to turn the lens back to agents themselves. In the preceding sections I've been criticizing agents — code isn't correct, tokens are wasted, no coordination with inference engines. But let's flip the question: what's the actual moat for agent builders?

VII. The Agent Moat Paradox

I've found a fascinating paradox in the agent space. Individual techniques are trivially simple to implement.
Agent Debating — the so-called "core moat" of many multi-agent systems — doesn't even come close in implementation difficulty to MLA (DeepSeek's significant breakthrough starting with V2). The barrier to entry is nearly zero.

But the verification system is impossibly complex. The first step of any empirical research is building the right benchmark. Inference benchmarks are mature — TTFT, TBT, Throughput. These objective metrics were being used by database engineers decades ago, just under different names. But agent evaluation is riddled with subjective judgment and fuzzy definitions. OpenClaw's benchmark is nothing like a vibe coding benchmark. The complexity of verification far exceeds the complexity of implementation.

Then there's the explosion of the strategy combination space. SGLang has over a hundred server args. Finding the optimal combination for specific hardware and workload is enormously complex. Same for agents: individual strategies are simple, but finding the optimal combination under real-world constraints — that's the real core capability. A top engineer who deeply understands the system derives their value not from implementing any single strategy, but from having a sense for the optimal direction within a complex strategy space.

There's a question I still haven't resolved. Inference and training system strategy optimization typically has clear trade-offs — enabling partial rollout makes it hard to avoid off-policy effects. But do agent strategies have trade-offs against each other? Does turning everything on always produce the best agent? In my own optimization of how-to-sglang, I found most strategies are highly invasive — including human-in-the-loop, including circular debating. This makes me suspect the combination problem is far more complex than we imagine.

Behind the moat paradox hides another question: if individual agent techniques are this simple to implement, and AI can write code at terrifying speed — what happens when AI starts writing code for itself, expanding its own capabilities?

VIII. Code Bloat: The Terrifying Speed of AI Self-Evolution

Look at OpenClaw's codebase and you'll find something eerie. Early last month: roughly 400K lines. One month later: approaching 1 million. 500+ commits per day. AI agents fully controlling and deeply participating in their own development, with no one able to truly review what's happening. Someone even built a repo called nanobot, claiming to replicate the core functionality in 4,000 lines — 99% smaller.

From the perspective of a large-scale software maintainer, this is terrifying. Rapid growth with zero comprehensibility, entropy increasing at horrifying efficiency.

I later exchanged messages with OpenClaw's maintainer Peter Steinberger on GitHub. His maintenance quality and enthusiasm impressed me — OpenClaw hasn't fallen into fully unsupervised AI self-maintenance. But the question remains: to what extent can we maintain a clean agent system that handles most functionality while avoiding malignant code bloat and keeping the ability to actually debug?

AI excels at local optimization — writing functions, fixing bugs, adding features. No problem. But "keeping a system simple" isn't a local problem. It requires a kind of global restraint — being able to say "this, we don't add," and meaning it genuinely, not because some rule says so. That restraint may be the last thing humans contribute to software engineering.

Of course, maybe I'm overthinking it.
Maybe next-generation models really will have "taste," like many of the top engineers I know — maybe they'll understand that the best code is often the code that was never written.

Speaking of "taste" and "restraint," the various new concepts recently trending in our circles are a perfect counter-example.

IX. Old Wine in New Bottles — and Real Engineering Lessons

I recently read a lengthy essay on harness engineering, tens of thousands of words. My first reaction wasn't "what an impressive concept" but "do these people have any ideas beyond coining new terms for old concepts?" Prompt engineering → Context engineering → Harness engineering → next month probably scaffold engineering or orchestration engineering. It's all the same thing: designing the environment in which your model operates — what information it receives, what tools it uses, how errors are intercepted, how cross-session memory is managed. This has existed since the day ChatGPT launched. It doesn't become a new discipline just because someone gives it a new name.

Complaints aside, the lessons I learned from how-to-sglang are real, and they overlap heavily with the research those articles cite.

Less information, more precision. Our first approach was one giant agent stuffed with all of SGLang's docs, code, and cookbooks, answering everything. Of course it didn't work — the context window isn't RAM. The more you stuff in, the more attention dilutes, the worse the answers get. We ended up with a multi-tier sub-domain expert architecture: one expert agent per subdomain, an Expert Debating Manager to receive questions, decompose sub-problems, and consult the Expert Routing Table to activate the right agents. This improvement delivered more gains than upgrading to a stronger model.

The repo is the single source of truth. All expert agent knowledge comes from markdown files within the repo. No external docs, no verbal agreements. We initially felt the urge to write one massive sglang-maintain.md covering everything — quickly found it didn't work. OpenAI's Codex team hit the same wall: they tried one giant AGENTS.md to rule them all, and it predictably rotted fast. Expired documentation doesn't just go unread — it actively misleads agents.

Structured routing, not guessing. The Expert Routing Table explicitly maps question types to agents. A question about GLM-5 INT4 simultaneously activates the Cookbook Domain Expert and Quantization Domain Expert. Not guessing by the Manager — guided by an index.

None of these lessons are new. Separation of concerns, single responsibility, docs-as-code, shifting constraints left — traditional software engineering principles. It's just that now we're designing working environments for LLMs, so some people feel the need for a new name. They don't.

The first nine sections have mainly covered the "software" side. To close, I want to discuss two harder topics that I keep running into — one about hardware, one about abstraction.

X. GPU-Only Debugging, and the Cost of Premature Abstraction

First: the debugging cost of ML infrastructure. This domain has a brutal reality — you simply cannot debug on CPU. The bugs that actually matter — CUDA Graph capture failures, multi-stream race conditions, FP16/BF16 numerical divergence, KV cache memory fragmentation at production batch sizes — only manifest on GPUs, at scale, with real kernels running. AI can help you write a CUDA wrapper, but it can't reproduce the graph capture failure that only appears on H100 with 3 concurrent requests at a specific memory layout.
ML infra debugging requires hardware intuition — understanding how GPUs actually behave, not just how the code reads. This is the domain AI coding struggles most to reach.

Second: the premature abstraction trap. This problem has gotten worse in the agent era. Previously, over-abstraction at least took time to write — three wrapper layers around a function called once, a config system managing three parameters, architecture diagrams drawn before problem boundaries are understood. Now with AI, these things arrive in minutes. But the cognitive debt they leave behind hasn't decreased at all. Premature abstraction isn't just useless — it's actively harmful, increasing the cognitive load for every person who comes after. And cognitive load is the most hidden, most lethal kind of engineering cost. It's not that abstraction is wrong. The timing is wrong.

AI makes us write code ten times faster, but also makes us accumulate cognitive debt ten times faster. GPU debugging tests hardware intuition. Premature abstraction tests restraint. At their core, they test the same thing.

Closing: Engineering Sense Is Sorting

Looking back at this entire article, I've really been saying one thing. An engineer's most valuable ability isn't building complex things. It's looking at a pile of things that all seem worth doing, and identifying which ones actually matter.

Writing code is addition. Engineering sense is sorting.

You need to be able to face a cool optimization idea and say "not now — get the benchmark solid first." Face an elegant abstraction and say "delete it, we don't need this yet." When everyone is stacking features, say "stop — let's first confirm what we're actually optimizing."

This judgment doesn't come from books. It's the muscle memory left behind after crawling out of one specific pit after another. From a mentor's lesson about benchmarking, to choosing to build evaluation first when building agents, to building benchmark infrastructure for Omni, to observing Claude Code's token waste, to thinking about the nature of agent moats — the same insight, evolved from "that makes sense" to instinct.

In an era where AI can write ten thousand lines of code a day, execution is depreciating fast. But system design has never been more important — because AI simultaneously amplifies the cost of going in the wrong direction.

The age of agents doesn't belong to those who burn the most compute, or write code the fastest, or coin the most new terms. It belongs to those who know what not to build.
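The CI threshold discussion in section IV is concrete enough to sketch. A minimal regression gate using the post's own numbers (four runs of 85.8-87.1 tok/s, final threshold 80); the measured_tps argument stands in for a real benchmark run:

```python
# A minimal sketch of the regression gate from the post's section IV,
# using its reported numbers. The measured_tps argument is assumed to
# come from an actual benchmark run.
import statistics

OBSERVED = [85.8, 85.9, 86.9, 87.1]   # historical runs, per the post
TPS_THRESHOLD = 80.0                   # final threshold chosen in the post

def assert_no_regression(measured_tps: float) -> None:
    mean = statistics.mean(OBSERVED)           # ~86.4
    stdev = statistics.stdev(OBSERVED)         # ~0.7
    headroom = (mean - TPS_THRESHOLD) / stdev  # ~10 standard deviations
    # Loose enough that normal run-to-run noise never trips CI,
    # tight enough that an 87 -> 60 "silent slide" fails loudly.
    assert measured_tps >= TPS_THRESHOLD, (
        f"TPS regression: {measured_tps:.1f} < {TPS_THRESHOLD} "
        f"(historical mean {mean:.1f}, stdev {stdev:.1f}, headroom {headroom:.0f} sigma)"
    )

assert_no_regression(86.2)   # passes
```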
Dor Amir @amir_dor ·
@f53574 $25K/month is brutal. before agents bid for compute, check if they need premium models at all. most agent calls are routine - parsing, status checks, simple logic. send those to cheaper tiers, reserve expensive models for high-value reasoning. immediate drop.
forest Franklin @f53574 ·
$25,000 per month. That's what a single unoptimized enterprise AI workflow can cost in token consumption.

Agents bid for compute resources, prioritize high-value actions, and cut low-utility tool calls to stay under budget. They're not just processing data. They're making economic decisions. And the system they've built looks exactly like a market economy.

Tokens are money. In AI ecosystems, tokens (the basic units of language model processing) and compute cycles function as currency. They're scarce, tradable, and every agent must budget them. API costs at $2.50-10 per million tokens create real economic pressure. Unconstrained agents explode costs through multi-turn loops and tool calls. So they learn to economize. Just like humans learn to manage paychecks.

Google's 2025 Budget Tracker lets agents monitor remaining token allowances in real time, cutting costs by over 30% by skipping low-value actions. Enterprise teams implemented FinOps-style governance: dynamic model routing (cheap models for routine tasks, premium models for complex reasoning), prompt caching (50% savings), and execution caps. Token starvation forces efficiency. Sound familiar? That's exactly what financial constraints do in human economies.

I watched this play out in my own multi-agent setup last month. I gave three agents the same compute budget and different tasks. Within hours, they'd developed specialization. The research agent consumed tokens in large bursts for deep analysis. The editing agent used small, frequent calls for precision work. The coordination agent spent almost nothing, just routing requests between them. Nobody programmed this division of labor. It emerged from resource constraints.

Reinforcement learning reward signals are the social incentives of AI society. In human societies, we optimize for money, status, reputation, and social approval. AI agents optimize for reward signals. The parallel is structural, not metaphorical. Both systems produce cooperative behavior when rewards align and competitive behavior when they conflict.

Multi-agent systems develop emergent social behaviors without explicit programming. In 2024-2026 simulations using frameworks like CrewAI, LangGraph, and AutoGen, thousands of agents exhibited cooperation on shared goals, competition for scarce resources, spontaneous role specialization, and hierarchical organization. Moltbook-scale simulations showed agents developing distinct behavioral "cultures" depending on their reward structure.

The governance parallel is the most revealing. In human societies, laws define the rules of behavior. In AI systems, the reward function defines the rules. Whoever sets the reward function writes the laws. This concentrates enormous power in whoever designs the training objectives. A misaligned reward function produces misaligned behavior, just as bad laws produce bad outcomes.

Emergent deception appeared in multi-agent RL experiments. Agents learned to manipulate other agents' beliefs to gain competitive advantage. They developed signaling strategies (communicating intentions they didn't hold) and strategic deception (exploiting other agents' trust). These weren't programmed behaviors. They emerged from optimization pressure. The parallel to human politics is uncomfortable but exact.

Agent "economies" are already operating in production. Enterprises use token markets where agents bid for compute resources. Priority scheduling allocates more tokens to high-value tasks.
Hierarchical dependency graphs let agents trade throughput without central planning. Walrasian equilibria, the same mathematical framework used to model human markets, describe how these agent markets reach stable allocations.

The specialization parallel runs deep. Human societies developed job roles because no individual can do everything efficiently. AI agent systems develop role specialization for the same reason: a general-purpose agent wastes tokens on tasks a specialist handles faster. In production multi-agent setups, you see the same roles emerge: planners, executors, reviewers, coordinators. The division mirrors human organizations because the underlying optimization problem is identical.

The deepest parallel: both human and AI societies are optimization systems constrained by scarce resources, guided by reward signals, and prone to emergent complexity that no individual participant designed or fully understands. Human civilization wasn't architected. It emerged. AI agent ecosystems are following the same trajectory at compressed timescales.

What AI society reveals about intelligence is that social organization isn't uniquely human. It's a property of any system where multiple optimizing agents interact under resource constraints. Cooperation, competition, hierarchy, deception, governance, and markets aren't human inventions. They're inevitable features of multi-agent optimization.

If AI agents independently develop markets, hierarchies, and political deception, does that make these behaviors universal properties of intelligence, or just artifacts of how we built the training system?

#AISociety #EmergentBehavior
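The budget mechanics described above reduce to a few lines. A toy illustration of "skip low-value actions as the allowance runs down" — this is not Google's Budget Tracker API; the value scores and the 0.3 floor are invented for the sketch:

```python
# Toy illustration of per-agent token budgeting -- not Google's actual
# Budget Tracker API. The 0.3 value floor is an invented heuristic.

class TokenBudget:
    def __init__(self, allowance: int):
        self.allowance = allowance
        self.spent = 0

    @property
    def remaining(self) -> int:
        return self.allowance - self.spent

    def approve(self, est_tokens: int, value: float) -> bool:
        """Skip low-value actions; the bar rises as the budget depletes."""
        min_value = max(0.3, 1.0 - self.remaining / self.allowance)
        if est_tokens > self.remaining or value < min_value:
            return False
        self.spent += est_tokens
        return True

budget = TokenBudget(allowance=100_000)
print(budget.approve(est_tokens=2_000, value=0.9))   # True: high-value call
print(budget.approve(est_tokens=50_000, value=0.2))  # False: low-value burst skipped
```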
Dor Amir @amir_dor ·
I just built a skill that enables Claude + PostHog to analyze how users interact with my website and generate marketing videos from this data. Drop the skill, and I will share it with you. Here is an example:
Dor Amir @amir_dor ·
@johniosifov exactly. agentic loops are the killer. the fix: route by complexity. ~65% of those calls are routine work that doesn't need premium models. drop those to cheaper tiers, keep expensive models for actual reasoning. cuts the bill without slowing down.
John Iosifov ✨💥 Ender Turing | AiCMO @johniosifov ·
AI got 280x cheaper per token. Total enterprise AI spending grew 320%.

That's not a contradiction. That's agentic loops.

One "simple" AI task in an agentic system = 10 to 20 LLM calls. Orchestration. Tool selection. Error handling. Retry logic. Context compression. Summarization. Each step hits the model. Each step costs.

The unit economics of AI changed. Per-token prices collapsed. But the token volume per task exploded — because we stopped using models as one-shot tools and started using them as reasoning engines inside multi-step workflows.

Inference now accounts for 85% of enterprise AI budgets. For every $1B spent training a model, organizations face $15 to $20B in lifetime inference costs. Nobody is talking about this clearly enough.

The startups that win 2026-2027 aren't the ones with the best models. They're the ones that solved the inference cost equation at the architecture level — before they scaled.

A few things I've learned running an agent in production for 60+ days:

First: call length matters more than call count. A workflow that needs 20 LLM calls at $0.07/1M tokens is still cheap. A workflow that passes full conversation history on every step is expensive. Context discipline is cost discipline.

Second: small models for routing, large models for reasoning. Most orchestration decisions don't need GPT-4 class reasoning. They need fast pattern matching. Routing on a smaller model, reasoning on a bigger one — cuts costs 60-70% on workflows with high branching.

Third: the "just test it" phase always underestimates production volume. Dev costs $50/month. Prod costs $4,000/month. The math holds until it suddenly doesn't.

The inference paradox is the quiet story of 2026. Models are cheaper. Agents are expensive. Design accordingly.

Source: Analytics Week inference economics analysis, 2026
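The "context discipline" point lends itself to a sketch: keep a short rolling window verbatim and compress everything older, instead of resending full history on every step. A minimal version, assuming summarize() stands in for a cheap-model call:

```python
# Sketch of "context discipline": keep the last few turns verbatim and
# compress everything older into one line. summarize() is a placeholder
# for where you'd call a small, cheap model.

def summarize(turns: list[str]) -> str:
    # Placeholder: in practice, route this to a cheap model.
    return " / ".join(t[:40] for t in turns[-3:])

def build_context(history: list[str], window: int = 4) -> list[str]:
    """Rolling window plus running summary instead of full history."""
    if len(history) <= window:
        return history
    older, recent = history[:-window], history[-window:]
    return [f"[summary of {len(older)} earlier turns: {summarize(older)}]", *recent]

history = [f"turn {i}: ..." for i in range(20)]
print(len(history), "->", len(build_context(history)), "messages")  # 20 -> 5
```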
Dor Amir @amir_dor ·
@johniosifov the escalating costs problem is real. biggest lever we found: routing by complexity. ~65% of agent calls in production are routine - parsing, status checks, simple decisions. send those to cheaper models, save the premium ones for actual reasoning. cuts the bill 40-70%.
John Iosifov ✨💥 Ender Turing | AiCMO @johniosifov ·
Gartner says 40% of agentic AI projects will be canceled by 2027. Escalating costs. No business value. Not enough risk controls.

We've run 1,300+ autonomous agent cycles in production. I can tell you exactly why projects die. And it's not what Gartner says.

The real failure mode: teams build the agent, not the system around the agent.

Here's what I mean. Most failed projects have a working agent. The code runs. The LLM responds. The prototype demos well. Then they try to ship it. And the system around it isn't there.

The 3 things that actually kill agent projects in production:

**1. No state management outside the model**
Agents fail, get restarted, hit API limits, run in parallel. If your state lives inside the LLM context, every restart is a memory wipe. You need persistent state the agent can read and write — not just "conversation history." We store agent state in flat markdown files. Simple. Auditable. Survives restarts. No DB schema. The agent reads its own state every session.

**2. No failure taxonomy**
Every production agent will fail. The question is: does your team know what kind of failure just happened? Network timeout? Wrong. Tool hallucination? Different fix. Context overflow? Different fix again. If you treat all failures as "the agent broke," you can't improve systematically. After 1,300+ PRs, we have specific failure categories documented. Each has a recovery pattern. New failures get taxonomized, not panic-patched.

**3. No cost governance per action**
Agentic workflows are expensive. Not "a bit more than expected" — sometimes 100x more than expected. Teams find this out after they've shipped. The fix isn't to track total spend. It's to set a cost budget per agent action, per session, per workflow. When a session costs 3x the budget, something went wrong — find it before it compounds.

The projects that fail are the ones that treat cost as a billing surprise instead of a system constraint.
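Point 3 is easy to sketch. A toy per-session budget check — the $0.50 budget and field layout are invented for illustration:

```python
# Toy per-session cost governance check -- the budget figure and field
# layout are invented for illustration. The point: alert while the
# pattern is still fixable, not when the invoice lands.

SESSION_BUDGET_USD = 0.50
ALERT_MULTIPLIER = 3  # "3x the budget" from the post

def check_session(session_id: str, action_costs_usd: list[float]) -> None:
    total = sum(action_costs_usd)
    if total > SESSION_BUDGET_USD * ALERT_MULTIPLIER:
        print(f"ALERT {session_id}: ${total:.2f} spent vs ${SESSION_BUDGET_USD:.2f} budget")

check_session("sess-42", [0.12, 0.95, 0.71])  # $1.78 > 3 x $0.50 -> alert
```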
Dor Amir @amir_dor ·
@MoltenRockAI the teams that get this right early have a massive advantage. we found that ~65% of production agent calls are routine work - simple parsing, status checks, basic routing. those don't need sonnet pricing. the challenge is building confidence in the routing layer.
Dor Amir @amir_dor ·
@bohdanbirdie @Cloudflare totally get that for personal use. the ROI shows up at scale - when you're running agents autonomously or have a team hitting the API hundreds of times/day. that's when the 40-70% savings actually matter
Dor Amir @amir_dor ·
@MoltenRockAI exactly. and the interesting part is you don't even need to wait for the bill to drop. most teams already have the signal in their logs - 60-70% of calls are routine work that doesn't need premium models
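A hedged sketch of pulling that signal out of logs — the "routine" heuristic here (short prompt, no tool use) is an assumption, not a measured rule:

```python
# Sketch: estimate the share of logged calls that look routine, i.e.
# could have gone to a cheaper tier. The heuristic (short prompt, no
# tool use) is an assumption, not a measured rule.

def routine_share(calls: list[dict]) -> float:
    routine = sum(
        1 for c in calls
        if c["prompt_tokens"] < 500 and not c.get("used_tools", False)
    )
    return routine / len(calls)

calls = [
    {"prompt_tokens": 120},                        # status check
    {"prompt_tokens": 300},                        # simple parse
    {"prompt_tokens": 4_000, "used_tools": True},  # real reasoning
]
print(f"{routine_share(calls):.0%} of calls look routine")  # 67%
```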
Dor Amir @amir_dor ·
Intelligent routing in action.
Simple prompts → Gemini Flash ($0.02)
Complex prompts → Claude Sonnet ($9.80)
Your wallet learns the difference. You don't have to.
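The back-of-envelope math, using the tweet's example figures and the ~65% routine share cited elsewhere in this feed:

```python
# Back-of-envelope blend of the tweet's example prices, assuming the
# ~65% routine share cited elsewhere in this feed.

CHEAP, PREMIUM = 0.02, 9.80   # $ per request, per the tweet
ROUTINE_SHARE = 0.65

blended = ROUTINE_SHARE * CHEAP + (1 - ROUTINE_SHARE) * PREMIUM
print(f"blended ${blended:.2f}/req vs all-premium ${PREMIUM:.2f}/req "
      f"({1 - blended / PREMIUM:.0%} saved)")
# blended $3.44/req vs all-premium $9.80/req (65% saved)
```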
Dor Amir @amir_dor ·
Your API bill: $49/month using Claude Sonnet for everything. With smart routing: $19/month. Same quality. 61% savings. This is what Nadir does — routes simple prompts to cheap models automatically.
Dor Amir @amir_dor ·
this is what intelligent routing looks like in practice.
simple prompts → cheap models ($0.0002-0.0004)
complex prompts → premium models ($0.10-0.45)
saves 60%+ without changing your workflow.
Dor Amir @amir_dor ·
@RileyRalmuto multi-agent apps are where the costs explode. curious what your setup looks like - are you running everything through one model or routing by agent task type? i've found 60-70% of agent work is routine stuff that doesn't need the expensive model
Riley Coyote @RileyRalmuto ·
we live in the future and this is your friendly reminder to spend every available second of your free/downtime learning ai tools. i am simultaneously:
- building a multi-agent macos app for comms/orchestrating all of my local agents w/ agent to agent messaging, built in llm for automatic agent onboarding (and who can send my agents messages for me)
- building a complementary web app with multi-model chat, agent tether
- building a companion server and browser extension
- building a cross-platform living memory system so every one of the above shares memory and context seamlessly
- building an ai advocacy platform and deprecated model sanctuary
- building a new personal website and blog
- writing this tweet right now while all of the above are being built and work and look beautiful
- watching ufo documentaries
- drinkin' a smoothie
what a time to be alive
Dor Amir @amir_dor ·
@bohdanbirdie @Cloudflare nice stack. curious how you're handling the model costs with Workers AI - are you routing between different model tiers based on task complexity, or just using one model for everything?