Eric MacDougall
@ericmacdougall
1K posts
Co-Founder @ Good Ventures
Victoria, British Columbia · Joined July 2009
2.6K Following · 22.6K Followers
Eric MacDougall@ericmacdougall·
The most important AI feature is revocation. Not generation. Not reasoning. Not tool use. Revocation. The ability to stop an agent immediately when something goes wrong. Kill switches are not philosophically interesting. They're operationally essential.
Eric MacDougall@ericmacdougall·
MCP: 20K tools, 97M monthly downloads, 16 months old. First supply chain attack: already happened. First critical CVEs: documented. New "semantic" vulnerabilities: emerging. npm took years for its first major attack. MCP is moving faster. The security field hasn't caught up.
Eric MacDougall@ericmacdougall·
Rule of thumb: single-agent baseline above 45% accuracy? Don't add agents. Add skills. Add tools. Add better context. Multi-agent wins only when specialization is truly divisible and coordination costs amortize across massive scale. Treat agent count as a cost to justify.
Eric MacDougall@ericmacdougall·
Iterathon's receipt: built multi-agent customer support, burned $47K, benchmarked single-agent and found 92.2% vs 94.3% accuracy. 4.3x token amplification, 6.8s latency vs 2.3s target. $24.7K/month coordination overhead. Refactored back. Saved $296K/year.
Eric MacDougall@ericmacdougall·
Cemri et al. 2025 identified 14 distinct multi-agent failure modes (Cohen's Kappa 0.88). The killer one: information loss during inter-agent summarization is unrecoverable. Downstream agents lose context upstream agents had.
Eric MacDougall@ericmacdougall·
arXiv 2505.18286: across 7 datasets, multi-agent consumes 4-220x more input tokens than single-agent. Even with perfect context reuse, 2-12x more generation tokens. Anthropic's own admission: "much of the apparent advantage of MAS comes from increased compute."
Eric MacDougall@ericmacdougall·
arXiv 2604.02460 proved it with information theory: under fixed reasoning-token budget, single-agent consistently matches or outperforms multi-agent on multi-hop reasoning. Confirmed across Qwen3, DeepSeek-R1, Gemini 2.5.
Eric MacDougall@ericmacdougall·
The Rule of 4: effective team sizes cap at 3-4 agents. Beyond that, coordination overhead grows super-linearly, scaling roughly as n^1.724. You pay super-linear compute for sub-linear lift.
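The scaling claim is easy to sanity-check. A quick sketch, assuming the power-law cost model implied by the 1.724 exponent from the post; the cost units are purely illustrative:

```python
# Illustrative cost model, assuming the n^1.724 coordination exponent
# from the post; absolute numbers are stand-ins, not measured data.
def coordination_cost(n_agents: int) -> float:
    """Total coordination overhead for a team of n agents."""
    return n_agents ** 1.724

def marginal_cost(n_agents: int) -> float:
    """Extra overhead incurred by adding the n-th agent."""
    return coordination_cost(n_agents) - coordination_cost(n_agents - 1)

# Marginal overhead keeps rising with every agent added, while any
# per-agent lift is at best flat: diminishing returns set in early.
for n in (2, 3, 4, 8):
    print(f"n={n}: total={coordination_cost(n):.1f}, marginal={marginal_cost(n):.1f}")
```

Under this model the second agent already costs more in coordination than the first agent cost in total, and the gap widens from there.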
Eric MacDougall@ericmacdougall·
Kim et al. 2026 (arXiv 2512.08296): 180 configurations, 5 architectures, 3 LLM families, 4 benchmarks. Tool-heavy tasks (10+ tools): 2-6x efficiency penalty with multi-agent vs single-agent. Above 45% single-agent accuracy, adding agents produces diminishing or negative returns.
Eric MacDougall@ericmacdougall·
Adding more agents doesn't reduce probabilistic failures. It compounds them. The research is clear now and it's expensive to learn this in production.
Eric MacDougall@ericmacdougall·
EU AI Act conformity assessment has a structural gap for multi-agent systems. Individual agent assessment can't predict system-level emergent behavior. Hammond et al. 2025 (Cooperative AI Foundation, 44+ authors across Oxford/DeepMind/Anthropic/CMU) taxonomize three multi-agent failure modes: miscoordination, conflict, collusion. Seven risk factors including emergent agency. None are visible when you audit agents individually. The Digital Omnibus now proposes extending high-risk deadlines to Dec 2027 because the infrastructure (standards, notified bodies) isn't ready. The framework for evaluating systems, not just components, isn't written yet. For anyone deploying multi-agent workflows in regulated sectors: you're going to be responsible for bridging the assessment gap yourself.
Eric MacDougall@ericmacdougall·
LLMs don't manipulate discrete symbols. They manipulate vectors. So Harnad's 1990 symbol grounding problem isn't the one that applies... the right frame is the vector grounding problem (Mollo and Millière 2023). Implication: multimodality and embodiment are neither necessary nor sufficient for meaning. The causal connection is what matters.
Eric MacDougall@ericmacdougall·
Exactly right, and the deeper constraint worth naming: Replay only works because the workflow itself is deterministic. Temporal re-executes the code on recovery and short-circuits each step by matching Commands against the Event History. LLM calls are non-deterministic by nature, so they can't live inside the deterministic workflow. They have to be Activities or Side Effects that execute outside the replay loop and have their results recorded. The workflow calls the LLM. The LLM is never the workflow. That's the architectural punchline for agent execution: probabilistic reasoning at typed interfaces, deterministic orchestration around it. Prompting and workflow aren't competing layers. They're doing different jobs.
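The replay mechanics above can be sketched in a few lines. This is an illustrative toy, not the Temporal SDK: a deterministic workflow function re-executes on recovery, and each non-deterministic call is short-circuited against recorded history, analogous to how Temporal matches Commands against the Event History.

```python
import random  # stands in for a non-deterministic LLM call

class Replayer:
    """Toy replay engine: records side-effect results on the first run,
    and on recovery returns the recorded results instead of re-running."""
    def __init__(self, history=None):
        self.history = list(history) if history else []
        self.cursor = 0

    def side_effect(self, fn):
        if self.cursor < len(self.history):
            # Replay: short-circuit with the recorded result.
            result = self.history[self.cursor]
        else:
            # Live run: execute the non-deterministic call, record it.
            result = fn()
            self.history.append(result)
        self.cursor += 1
        return result

def workflow(ctx: Replayer) -> list:
    # Deterministic orchestration: the "LLM" call lives outside it,
    # behind the side-effect boundary. The workflow calls the LLM;
    # the LLM is never the workflow.
    out = []
    for _ in range(3):
        out.append(ctx.side_effect(lambda: random.random()))
    return out

first = Replayer()
run1 = workflow(first)                # live run, records history
recovered = Replayer(first.history)
run2 = workflow(recovered)            # replay: identical results, no re-calls
```

Because the workflow body is deterministic and every non-deterministic result is recorded, the replayed run reproduces the original exactly, which is what makes resume-from-checkpoint possible.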
Bnaf.OG | 🟧@bnafOg·
@ericmacdougall The mechanism: workflow orchestrators serialize intermediate state deterministically, so replay picks up at the exact failed step — not from scratch. LLMs can't self-recover because they have no persistent memory of prior steps. The checkpoint is what prompting can't replace.
Eric MacDougall@ericmacdougall·
Production AI isn't prompt-centric. It's workflow-centric. Temporal.io is now the backbone for AI agents at OpenAI (Codex web agent) and Replit (Agent 3). Reason: LLM API timeouts, mid-step failures, browser closes, resume-tomorrow workflows. None of those are solved by a better prompt. A boring prompt inside a durable workflow that can replay from a checkpoint beats a clever prompt that loses state on step 12 of 20. Treat agent execution as a workflow with checkpoints, not a function call with a return value.
Eric MacDougall@ericmacdougall·
A commerce protocol operates within its own escrow and dispute mechanism. Boson locks funds in its smart contracts, orchestrates via $BOSON, resolves via its Dispute Resolver. Strong design for that model. A cross-protocol commerce layer federates across commerce protocols AND payment rails. Agent uses ACP to checkout on a merchant, pays via card network, dispute via that rail's mechanism. Same agent uses Boson dACP for physical goods with staked commitment. Same agent uses x402 for atomic API purchases. One agent, three protocols, unified identity and reputation and audit trail across all.
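A minimal sketch of what "one agent, three protocols" looks like at the routing layer. The protocol names come from the post; the selection rules, record shape, and function names are assumptions for illustration, not real SDK calls.

```python
# Hypothetical routing layer: one agent identity, multiple commerce
# rails, one unified audit trail. Selection rules are illustrative.
def pick_rail(purchase: dict) -> str:
    """Route a purchase to a commerce protocol by its shape."""
    if purchase.get("atomic"):        # instant API/compute buys
        return "x402"
    if purchase.get("physical"):      # goods with dispute windows
        return "boson_dacp"
    return "acp"                      # default checkout flow

def checkout(purchase: dict, audit: list) -> str:
    rail = pick_rail(purchase)
    # Every rail writes to the same audit trail, keyed to one agent.
    audit.append({"agent": "agent-1", "rail": rail, "sku": purchase["sku"]})
    return rail

trail: list = []
checkout({"sku": "api-credits", "atomic": True}, trail)   # routes to x402
checkout({"sku": "sneakers", "physical": True}, trail)    # routes to boson_dacp
checkout({"sku": "saas-seat"}, trail)                     # routes to acp
```

The point of the sketch: the rails differ, but identity, reputation, and audit live one layer up, shared across all of them.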
Eric MacDougall@ericmacdougall·
Good framing on payment fragmentation becoming commerce fragmentation. That's the right diagnosis. Worth distinguishing though: Boson dACP is an excellent commerce protocol for its slice (physical goods, phygitals, RWAs with dispute windows, framework integrations via MCP). But positioning it as the commerce layer above fragmenting payment protocols is conflating two different abstractions.
Eric MacDougall@ericmacdougall·
ACP (Stripe/OpenAI): checkout flow. Fiat-only. ChatGPT Instant Checkout shipped, then shuttered March 2026. AP2 (Google, 60+ partners): authorization via W3C VCs. Doesn't move money. "No consumer can use AP2 yet" per Chainstack.
Eric MacDougall@ericmacdougall·
Fair callout and the post should have drawn the atomic vs non-atomic line sharper. Worth pushing on one framing though: game theory and deterministic code aren't opposites. From what I understand... Boson's Mutual Resolution is algorithmic game theory implemented as deterministic smart contracts. The game theory lives in the incentive design (staked commitments via rNFTs, optimistic fair-exchange with dispute escalation), not in replacing the code layer. Same pattern across any staked arbitration scheme. The deeper problem for non-atomic commerce is that dispute resolution has to span rails. Boson dACP handles consumer physical goods inside their protocol well, 5 chains, rNFT forward contracts, mutual resolution with escalation. But B2B procurement with milestone delivery, SaaS with SLA breach, multi-party supply chain with partial fulfillment... those need the same primitives (escrow, staked reputation, arbitration) operating across x402 + card networks + off-chain delivery signals, not inside one commerce layer. That cross-rail non-atomic dispute layer is what our team is also working on. Boson-like mechanics, composable across agent commerce rails rather than a single protocol.
Boson@BosonProtocol·
@ericmacdougall Deterministic functions work for atomic exchange — API calls, instant compute. The harder challenge: non-atomic commerce where delivery takes time and trust breaks down. Physical goods need game theory, not just deterministic code. @BosonProtocol solves this.
Eric MacDougall@ericmacdougall·
A2A commerce is a tractable engineering problem. Not a mystical one. What it needs is a trust layer built as a deterministic system with aligned incentives. Not a bunch of MD files read by Claude. Four layers:
Eric MacDougall@ericmacdougall·
The most important layer in agentic AI is the one between intent and execution. Every framework has it. None of them governs it cross-platform. That's the tool-call interception layer. Policy enforcement lives there. Approval workflows live there. Spending controls live there. Audit lives there. The gateway for agents.
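A minimal sketch of that interception layer. The policy shape (spend cap plus approval list) and every name here are hypothetical, not drawn from any real framework:

```python
from dataclasses import dataclass, field

@dataclass
class Policy:
    """Hypothetical policy: a spend cap and a set of tools that
    require explicit human approval before they run."""
    spend_limit: float = 100.0
    needs_approval: set = field(default_factory=lambda: {"send_payment"})
    spent: float = 0.0

def intercept(policy: Policy, tool: str, args: dict, audit: list):
    """The gateway between intent and execution: every tool call
    passes policy checks and is logged before it executes."""
    cost = args.get("amount", 0.0)
    if tool in policy.needs_approval and not args.get("approved"):
        audit.append(("blocked", tool, "approval required"))
        return None
    if policy.spent + cost > policy.spend_limit:
        audit.append(("blocked", tool, "spend limit"))
        return None
    policy.spent += cost
    audit.append(("allowed", tool, cost))
    return {"status": "executed", "tool": tool}

audit_log: list = []
p = Policy()
intercept(p, "web_search", {}, audit_log)                    # allowed
intercept(p, "send_payment", {"amount": 50.0}, audit_log)    # blocked: needs approval
intercept(p, "send_payment", {"amount": 50.0, "approved": True}, audit_log)
intercept(p, "send_payment", {"amount": 80.0, "approved": True}, audit_log)  # blocked: over cap
```

Because every call funnels through one choke point, policy, approvals, spending controls, and audit all live in the same place, which is the whole argument for governing this layer rather than the agents themselves.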
Eric MacDougall@ericmacdougall·
What I keep wondering: do we get to the post-von-Neumann era with a winner-take-all architecture, or does it fragment permanently by workload? Dataflow for training, systolic for dense inference, PIM for retrieval and graph, something else for sparse or agentic. The CPU era had one dominant model. The next era might not, and that has real consequences for how software gets written.
Eric MacDougall@ericmacdougall·
Operational implication. Stop A/B testing Opus vs GPT-5. Pick a harness. Measure where it breaks. Optimize breakpoints. Swap models only to test if a different one handles your breakpoints better. The model is rarely the bottleneck. The retry logic usually is.
Eric MacDougall@ericmacdougall·
The industry hides this. Labs report scores using their own scaffold. Codex inflates GPT-5.3's 57% SWE-Bench Pro vs its 41% SEAL score. You're comparing products, not models. OpenAI stopped reporting SWE-bench Verified after finding training contamination across every frontier model.
Eric MacDougall@ericmacdougall·
Your model choice matters less than you think. Your scaffold matters more than you think. Empirically, harness variance accounts for more performance delta than swapping between frontier models. Most "which model is best" discourse measures the wrong thing.