
Simon Bourdon
@bourdon_simon
Breaking rules with AI, blockchain & gaming | Founder @ Just a Company | If it’s crazy, we’ll ship it



Such a clear read

Too much AI-agent hype, not enough real operators sharing what actually works. This is one of those rare signal posts - practical, tactical, and worth bookmarking if you’re building in public.

It turns out that researching Alzheimer's patient care has significant implications for designing robust agent memory systems.

The MCP Network: The Next Billion Dollar AI Idea?

I'm going to share an idea that I don't have the bandwidth to pursue (I'm busy building @AgentDotAi ).

You may have heard of MCP (Model Context Protocol), which is on 🔥 in terms of adoption. MCP is an open standard for AI applications (called "MCP Clients") to connect with MCP Servers. You can think of these servers as providing a set of "tools" that the AI application has access to. By using a standard protocol, any MCP Client can make use of the hundreds and thousands of MCP servers that will be out there. It adds an immense amount of power to tools like Claude, ChatGPT -- and AgentAI. This kind of widely adopted open standard is what enables really fast innovation and brilliant breakthroughs.

One big problem: right now, finding the right MCP Servers and plugging them into something like ChatGPT is messy and scary. It's the wild, wild west out there. Most of the servers are shared as a GitHub repo, and you'd have to self-host them to use them. Ick!

So, here's the idea: build a centralized network of MCP Servers that makes it frictionless to get going and provides fast time to joy. I'd call it MCP .net (yes, I own the domain name). The network would be more than a directory of MCP servers; it would allow:

* Anyone to submit an MCP Server and have it hosted on the network.
* Ratings/reviews of MCP servers.
* Semantic search to *find* the right tools (embedded within the server).
* A way for users to request access to specific functionality that is not yet on the network.
* Creating server "remixes" that combine tools from different servers.
* and much, much more...

Think of it as the Hugging Face of MCP: a way to discover and connect to MCP Servers. I know there are already a few emerging directories out there, and Anthropic is working on a "registry," but I'm thinking about something well beyond that.

Now you're wondering why I don't just go do this. Couldn't I just position it as a "Professional Network for MCP Servers" (playing off the positioning of @AgentDotAi , which is the professional network for AI Agents)? Yes, I could. And come to think of it, maybe I *should* just go do that. I could probably Vibe Code the first 50%. 😀

I'm hoping someone with street cred is already doing this. I hope so. That way, I can focus on the thing I should be doing. If you're that person, or know of something, please leave a comment.
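The semantic-search piece of the idea could be prototyped without any ML infrastructure at all. Here's a minimal sketch: a hypothetical registry of server descriptions (names and descriptions invented for illustration) ranked by cosine similarity of plain term-frequency vectors. A real version would use embeddings, but the shape of the feature is the same.

```python
import math
import re
from collections import Counter

# Hypothetical registry: server name -> free-text description of its tools.
REGISTRY = {
    "github-server": "create issues, open pull requests, search repositories",
    "slack-server": "send messages, list channels, search conversation history",
    "calendar-server": "schedule meetings, list events, send invites",
}

def _tf(text: str) -> Counter:
    """Term-frequency vector over lowercase word tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def search(query: str, registry: dict[str, str], top_k: int = 3) -> list[str]:
    """Rank servers by cosine similarity between query and description."""
    q = _tf(query)
    scored = []
    for name, desc in registry.items():
        d = _tf(desc)
        dot = sum(q[t] * d[t] for t in q)
        norm = (math.sqrt(sum(v * v for v in q.values()))
                * math.sqrt(sum(v * v for v in d.values())))
        scored.append((dot / norm if norm else 0.0, name))
    # Keep only servers with nonzero overlap, best match first.
    return [name for score, name in sorted(scored, reverse=True)[:top_k] if score > 0]

print(search("open pull requests", REGISTRY))  # github-server ranks first
```

Swapping `_tf` for an embedding model call is the only change needed to make this "semantic" in the full sense.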


First large-scale study of AI agents actually running in production. The hype says agents are transforming everything. The data tells a different story.

Researchers surveyed 306 practitioners and conducted 20 in-depth case studies across 26 domains. What they found challenges common assumptions about how production agents are built. The reality: production agents are deliberately simple and tightly constrained.

1) Patterns & Reliability
- 68% execute at most 10 steps before requiring human intervention.
- 47% complete fewer than 5 steps.
- 70% rely on prompting off-the-shelf models without any fine-tuning.
- 74% depend primarily on human evaluation.
Teams intentionally trade autonomy for reliability.

Why the constraints? Reliability remains the top unsolved challenge. Practitioners can't verify agent correctness at scale. Public benchmarks rarely apply to domain-specific production tasks. 75% of interviewed teams evaluate without formal benchmarks, relying on A/B testing and direct user feedback instead.

2) Model Selection
The model selection pattern surprised researchers. 17 of 20 case studies use closed-source frontier models like Claude Sonnet 4, Claude Opus 4.1, and GPT o3. Open-source adoption is rare and driven by specific constraints: high-volume workloads where inference costs become prohibitive, or regulatory requirements preventing data sharing with external providers. For most teams, runtime costs are negligible compared to the human experts the agent augments.

3) Agent Frameworks
Framework adoption shows a striking divergence. 61% of survey respondents use third-party frameworks like LangChain/LangGraph. But 85% of interviewed teams with production deployments build custom implementations from scratch. The reason: core agent loops are straightforward to implement with direct API calls. Teams prefer minimal, purpose-built scaffolds over dependency bloat and abstraction layers.
4) Agent Control Flow
Production architectures favor predefined static workflows over open-ended autonomy. 80% of case studies use structured control flow. Agents operate within well-scoped action spaces rather than freely exploring environments. Only one case allowed unconstrained exploration, and that system runs exclusively in sandboxed environments with rigorous CI/CD verification.

5) Agent Adoption
What drives agent adoption? Simply the productivity gains. 73% deploy agents primarily to increase efficiency and reduce time spent on manual tasks. Organizations tolerate agents taking minutes to respond because that still outperforms human baselines by 10x or more. 66% allow response times of minutes or longer.

6) Agent Evaluation
The evaluation challenge runs deeper than expected. Agent behavior breaks traditional software testing. Three case study teams report attempting but struggling to integrate agents into existing CI/CD pipelines. The challenge: nondeterminism and the difficulty of judging outputs programmatically. Creating benchmarks from scratch took one team six months to reach roughly 100 examples.

7) Human-in-the-loop
Human-in-the-loop evaluation dominates at 74%. LLM-as-a-judge follows at 52%, but every interviewed team using LLM judges also employs human verification. The pattern: LLM judges assess confidence on every response, automatically accepting high-confidence outputs while routing uncertain cases to human experts. Teams also sample 5% of production runs even when the judge expresses high confidence.

In summary, production agents succeed through deliberate simplicity, not sophisticated autonomy. Teams constrain agent behavior, rely on human oversight, and prioritize controllability over capability. The gap between research prototypes and production deployments reveals where the field actually stands.

Paper: arxiv.org/abs/2512.04123

Learn design patterns and how to build real-world AI agents in our academy: dair-ai.thinkific.com
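The judge-plus-audit pattern in point 7 can be sketched in a few lines. Everything here is an assumption except the 5% sampling rate, which comes from the post: `judge_confidence` stands in for an LLM-as-a-judge call, and the 0.9 threshold is an invented cutoff.

```python
import random

CONFIDENCE_THRESHOLD = 0.9  # assumed cutoff; the study doesn't specify one
AUDIT_RATE = 0.05           # 5% of high-confidence runs still go to humans

def route(response: str, judge_confidence: float, rng: random.Random) -> str:
    """Return 'auto_accept' or 'human_review' for one agent response."""
    if judge_confidence < CONFIDENCE_THRESHOLD:
        return "human_review"        # uncertain output -> escalate to an expert
    if rng.random() < AUDIT_RATE:
        return "human_review"        # random audit of confident outputs
    return "auto_accept"

# Simulate 1000 high-confidence responses: roughly 5% are still audited.
rng = random.Random(0)
decisions = [route("ok", 0.95, rng) for _ in range(1000)]
print(decisions.count("human_review"))
```

The key property is that human attention scales with judge uncertainty, while the audit sample keeps the judge itself honest.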

Many are trying to code with agents to boost velocity. But at what cost?

The default assumption is that AI coding tools are additive: IDE assistants help, and autonomous agents help more. Stack them together, get more productivity. But nobody had measured whether this is actually true in production repositories.

This new research presents the first large-scale causal study of autonomous coding agent adoption in open-source projects, analyzing repository-level outcomes across development velocity and software quality.

The methodology: staggered difference-in-differences with matched controls using the AIDev dataset. Repositories are split into two groups: agent-first (AF), where agents are the first AI tool adopted, and IDE-first (IF), where repositories already used AI IDEs like Copilot or Cursor before adopting agents.

AF repositories see massive front-loaded gains: +36% commits and +77% lines added on average. At the adoption month, the spike hits +111% commits and +216% lines added. These gains persist. But IF repositories see almost nothing: +4% commits and +1% lines added. The short-lived bump at adoption quickly fades, and by month 6, lines added turn negative (-45%).

The quality findings are worse. Regardless of prior AI exposure, agent adoption increases static-analysis warnings by ~18% and cognitive complexity by ~35%. These effects are persistent. AF repositories reach +49% complexity by month 5. IF repositories hit +44-51% and stay there. Autonomous agents introduce complexity debt even when velocity advantages fade. Teams already using AI IDEs face coordination and integration bottlenecks that limit throughput, but still accumulate the maintainability risks.

Coding agents are powerful but risky accelerators. Substantial velocity gains materialize only when agents are a project's first AI tool. Prior AI IDE exposure moderates the benefits but not the quality risks. Selective deployment and strong quality safeguards are essential.
Paper: arxiv.org/abs/2601.13597 Learn to build with AI agents in our academy: dair-ai.thinkific.com
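For intuition about the estimator behind these numbers: difference-in-differences compares the change in an agent-adopting repository against the change in a matched control over the same window, netting out the shared background trend. A toy two-period sketch with invented commit counts (the paper's actual design is staggered across many adoption dates):

```python
# Toy 2x2 difference-in-differences. All numbers are illustrative, not
# from the paper: average monthly commits before/after agent adoption.
treated_pre, treated_post = 100, 136   # repo that adopted a coding agent
control_pre, control_post = 100, 102   # matched repo that did not

# Effect = (change in treated) - (change in control).
did = (treated_post - treated_pre) - (control_post - control_pre)
print(f"DiD estimate: +{did} commits/month attributable to adoption")  # +34
```

The matched control is what lets the study call the velocity and complexity effects causal rather than correlational.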

Holy shit. The biggest unsolved problem in AI agents isn't reasoning; it's memory. Your agent forgets everything between sessions. MemFactory just open-sourced the first unified framework for training agents to manage their own memory via reinforcement learning. Extract. Update. Retrieve. All trainable. All modular. Runs on one GPU.

Every AI agent built today is amnesiac by design. It can reason. It can plan. It can use tools. But the moment a session ends, everything it learned about you, your preferences, your context, and your history disappears. The next conversation starts from zero. This is not a minor inconvenience; it is the fundamental barrier between AI assistants and AI agents that actually work over days, weeks, and months.

The field has known this for years. The solutions have been fragmented, task-specific, and impossible to combine. Memory-R1 handles structured CRUD operations on a memory bank. MemAgent compresses history into a fixed-length recurrent state. RMM optimizes retrieval through retrospective reflection. Each works. None can be combined. Each lives in its own repository with its own data format, its own training pipeline, and its own set of assumptions. MemFactory ends that fragmentation.

The core insight is that memory management is a decision problem, not a retrieval problem. Current systems treat memory as a database: store things, look things up. MemFactory treats memory as a policy: an agent that learns when to extract new information, when to update existing memories, when to delete contradicted facts, and what to retrieve for any given query. That policy is trained via reinforcement learning, specifically Group Relative Policy Optimization (GRPO), which eliminates the need for a separate critic model and cuts training memory requirements in half. This matters because memory-augmented agents already have saturated context windows from dialogue history and retrieved content. The last thing they need is a training algorithm that doubles the memory footprint.

The architecture is four layers that compose like Lego blocks. The Module Layer decomposes memory into atomic operations:
> The Extractor parses raw conversations into structured memory entries.
> The Updater decides whether each new piece of information should be added, should modify an existing entry, should delete a contradiction, or should be left alone.
> The Retriever fetches relevant memories using semantic search or LLM-based reranking.

The Agent Layer assembles these modules into a complete memory policy and executes rollout trajectories during training. The Environment Layer standardizes any dataset into the format the agent needs and computes reward signals: format rewards for structural compliance, LLM-as-a-judge scores for quality. The Trainer Layer runs GRPO to update the memory policy based on those rewards. Every module plugs into every other module through standardized interfaces. You can swap the retriever in Memory-R1 for an LLM-based reranker without touching anything else.

The results from training a MemAgent-style architecture through MemFactory on two base models:
→ Qwen3-1.7B base: average score 0.3118 across three evaluation sets
→ Qwen3-1.7B after MemFactory RL: 0.3581, a 14.8% relative improvement
→ Qwen3-4B-Instruct base: average score 0.6146
→ Qwen3-4B-Instruct after MemFactory RL: 0.6595, a 7.3% relative improvement
→ The 4B model's gains hold on out-of-distribution benchmarks: the memory policy transfers to unseen tasks
→ The entire training and evaluation pipeline runs on a single NVIDIA A800 80GB GPU
→ 250 training steps on simplified long-context data; no massive compute cluster required
→ Three ready-to-use agent architectures out of the box: MemoryR1Agent, MemoryAgent, MemoryRMMAgent

> The out-of-distribution result is the one that matters most. The 1.7B model improved on in-domain tasks but slightly degraded on the OOD benchmark: the learned policy was too specific to the training distribution. The 4B model improved on both. This is the capability threshold at which a memory policy becomes genuinely general: large enough to abstract principles about what information is worth keeping, not just pattern-match on training examples. A memory agent that only remembers the right things in familiar situations is not much better than no memory at all. The 4B result suggests that threshold is reachable with models that fit on a single consumer GPU.

> The fragmentation problem MemFactory solves is deeper than it looks. When every memory implementation has its own pipeline, researchers cannot compare approaches fairly. Two systems that nominally differ by one design choice (say, CRUD operations versus recurrent state compression) actually differ simultaneously in data format, reward structure, training algorithm, and evaluation protocol. Nobody knows which choice caused which outcome. MemFactory puts all three major paradigms under the same training loop, the same reward computation, and the same evaluation framework. Now you can actually isolate what matters.

Your agent forgets everything. This is the infrastructure to fix that.

we've done $200k in automation work for a single PE firm. last week they asked us to automate appointment reminders for a hairdresser they own. that one request changed how i think about this entire industry. full breakdown on why small businesses are the biggest AI opportunity nobody's talking about - and the exact playbook to land them

Princeton Unveils Landmark Framework for AI Agent Reliability – The Zero-Human Company Adopts Core Metrics to Power Truly Autonomous Systems

Raw benchmark scores for ever-more-capable AI agents have long masked a critical shortfall: reliability. On February 24, 2026, Princeton University's Center for Information Technology Policy (CITP) released a groundbreaking paper, "Towards a Science of AI Agent Reliability," that changes the conversation. Authored by Stephan Rabanser, Sayash Kapoor, Peter Kirgis, Kangheng Liu, Saiteja Utpala, and Arvind Narayanan, the work draws lessons from nuclear engineering, aviation, and automotive safety to define what "reliable" actually means for autonomous AI.

The paper evaluates 14 frontier models from OpenAI, Google, and Anthropic on two demanding agent benchmarks: GAIA (a general-assistant benchmark requiring web browsing, file manipulation, and multi-step reasoning) and τ-bench (a realistic customer-service simulation with strict policy constraints). The verdict is clear: capability has surged, but reliability has improved only modestly over 18 months. Consistency scores hover in the 30–75% range, self-prediction of failures remains weak, and larger models sometimes increase variability rather than reduce it.

To close this gap, the authors introduce a comprehensive reliability profile built on four engineering-inspired dimensions and 12 computable metrics. These go far beyond single-shot accuracy, measuring how agents behave under repeated use, stress, uncertainty, and failure.

The Four Dimensions and 12 Metrics – Why They Matter

1. Consistency – Do agents deliver the same result (and follow the same logical path) every time under identical conditions? Unpredictable variability destroys trust in automation.
- Outcome Consistency (C_out): Fraction of runs producing identical correct (or incorrect) final outcomes. Why important: prevents agents from approving a refund one run and denying it the next on the exact same request.
- Trajectory Distribution Consistency (C_d_traj): Similarity in the types of actions taken across runs. Why important: ensures auditors or users see consistent strategies even if the exact order varies.
- Trajectory Sequence Consistency (C_s_traj): Similarity in the order of actions (measured via normalized Levenshtein distance). Why important: chaotic ordering can trigger different downstream failures.
- Resource Consistency (C_res): Low variance in tokens, time, and cost across runs. Why important: unpredictable billing or latency makes production deployment financially and operationally untenable.

2. Robustness – How gracefully does the agent handle imperfect real-world conditions? Real deployments are never pristine.
- Fault Robustness (R_fault): Performance when tools or APIs fail (timeouts, crashes). Why important: agents must recover without abandoning tasks.
- Environment Robustness (R_env): Resilience to interface or format changes (e.g., reordered JSON, date-format shifts). Why important: APIs evolve; agents must adapt without breaking.
- Prompt Robustness (R_prompt): Performance on semantically equivalent but rephrased instructions. Why important: users never phrase requests identically; sensitivity here is a leading cause of production failures.

3. Predictability – Can the agent (and its users) reliably know when it is likely to fail? Overconfidence is the silent killer of trust.
- Calibration (P_cal): How well self-reported confidence matches actual correctness (1 – Expected Calibration Error). Why important: users need honest uncertainty signals to decide whether to trust or override.
- Discrimination (P_AUROC): Ability of confidence scores to separate success from failure cases. Why important: enables selective operation, e.g., auto-approving high-confidence tasks and escalating others.
- Brier Score (P_brier): Proper scoring rule combining calibration and discrimination. Why important: a single holistic measure of predictive quality.
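Two of the metrics above are simple enough to compute directly. A sketch of outcome consistency (one reasonable reading of C_out: the fraction of repeated runs agreeing with the modal final outcome) and the Brier score, over invented run data; the paper's exact formulations may differ.

```python
from collections import Counter

def outcome_consistency(outcomes: list[str]) -> float:
    """Fraction of runs agreeing with the most common final outcome."""
    if not outcomes:
        return 0.0
    return Counter(outcomes).most_common(1)[0][1] / len(outcomes)

def brier_score(confidences: list[float], successes: list[bool]) -> float:
    """Mean squared error between self-reported confidence and outcome
    (lower is better; 0 means perfectly calibrated and discriminative)."""
    return sum((c - float(s)) ** 2
               for c, s in zip(confidences, successes)) / len(confidences)

# Five repeated runs of the same refund request (invented data):
runs = ["approve", "approve", "deny", "approve", "approve"]
print(outcome_consistency(runs))                          # -> 0.8
print(brier_score([0.9, 0.8, 0.6], [True, True, False]))  # overconfident on the failure
```

An agent scoring 0.8 on C_out is exactly the "approves one run, denies the next" failure mode the Consistency dimension is meant to surface.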

Memory has been solved for a while

MCP vs RAG vs AI Agents

To understand modern AI systems, you need to understand how these three pieces fit together.

𝗥𝗔𝗚 = “𝗚𝗶𝘃𝗲 𝘁𝗵𝗲 𝗺𝗼𝗱𝗲𝗹 𝗯𝗲𝘁𝘁𝗲𝗿 𝗮𝗻𝘀𝘄𝗲𝗿𝘀”
RAG retrieves relevant data, injects it into the prompt, and generates a grounded response. It’s best when your problem is answering questions using your docs, reducing hallucinations, or showing sources and citations. RAG improves what the model knows, not what it can do. If you’re building with these patterns, here's a great guide on scaling multi-agent RAG systems: lucode.co/multi-agent-ra…

𝗠𝗖𝗣 = “𝗦𝘁𝗮𝗻𝗱𝗮𝗿𝗱𝗶𝘇𝗲𝗱 𝘁𝗼𝗼𝗹 𝗮𝗻𝗱 𝗱𝗮𝘁𝗮 𝗮𝗰𝗰𝗲𝘀𝘀”
MCP is a standardized interface between LLMs and external systems like APIs, databases, and apps. Use it when your model needs to query data, call services, or interact with real systems (Slack, GitHub, etc.). MCP doesn’t decide actions; it defines how tools are exposed.

𝗔𝗜 𝗔𝗴𝗲𝗻𝘁𝘀 = “𝗠𝗮𝗸𝗲 𝘁𝗵𝗲 𝗺𝗼𝗱𝗲𝗹 𝘁𝗮𝗸𝗲 𝗮𝗰𝘁𝗶𝗼𝗻”
Agents operate in a loop: observe → plan → act → repeat, often using tools and memory. Use them when your problem requires multi-step reasoning, tool usage with verification, or full task execution. Agents start where RAG stops, turning decisions into actions and outcomes.

The simple mental model:
RAG → knowledge layer
MCP → tool layer
Agents → execution layer

Not every system needs all three explicitly, but complex ones often combine them. If you want to see what this looks like in practice, this guide walks you through building a scalable multi-agent RAG system. Check it out: lucode.co/multi-agent-ra…

What else would you add?

♻️ Repost to help others learn AI.
🙏 Thanks to @Oracle for sponsoring this post.
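The observe → plan → act loop can be sketched in a few lines. Here `llm_plan` is a stand-in for a model call and the calculator tool is invented; the point is the loop's shape, not any real API.

```python
def llm_plan(observation: str) -> tuple[str, str]:
    """Stand-in for an LLM call: pick the next tool and its input.
    A real agent would prompt a model here instead of matching strings."""
    if "42" in observation:
        return ("finish", observation)        # goal reached -> stop the loop
    return ("calculator", "6 * 7")

TOOLS = {"calculator": lambda expr: str(eval(expr))}  # toy tool layer

def run_agent(task: str, max_steps: int = 5) -> str:
    observation = task
    for _ in range(max_steps):                # observe -> plan -> act -> repeat
        action, arg = llm_plan(observation)
        if action == "finish":
            return arg
        observation = TOOLS[action](arg)      # act, then observe the result
    return observation

print(run_agent("compute six times seven"))   # -> 42
```

In the layered mental model above, `TOOLS` is where MCP would sit (standardized tool exposure), a retriever tool would supply the RAG layer, and the loop itself is the agent.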

GPT-5.3-Codex (X-high reasoning) from @OpenAI ran uninterrupted for 25 hours to help me build a sophisticated design tool. The key was durable “project memory” so it could stay coherent over a long horizon:
1. Prompt.md (goals, spec, deliverables)
2. Plans.md (milestones + validations)
3. Architecture.md (principles + constraints)
4. Implement.md (prompt that references the plan)
5. Documentation.md (milestone status + decisions)
Result: ~13M tokens, ~50k LOC

This is where coding agents start to feel like teammates: they can run for hours, follow a plan, and ship high-quality work.
