
Simon Bourdon
@bourdon_simon
Breaking rules with AI, blockchain & gaming | Founder @ Just a Company | If it’s crazy, we’ll ship it



Such a clear read

Too much AI-agent hype, not enough real operators sharing what actually works. This is one of those rare signal posts - practical, tactical, and worth bookmarking if you’re building in public.

It turns out that researching Alzheimer's patient care has significant implications for designing robust agent memory systems.

The MCP Network: The Next Billion Dollar AI Idea?

I'm going to share an idea that I don't have the bandwidth to pursue (I'm busy building @AgentDotAi ).

You may have heard of MCP (Model Context Protocol), which is on 🔥 in terms of adoption. MCP is an open standard for AI applications (called "MCP Clients") to connect with MCP Servers. You can think of these servers as providing a set of "tools" that the AI application has access to. By using a standard protocol, any MCP Client can make use of the hundreds and thousands of MCP servers that will be out there. It adds an immense amount of power to tools like Claude, ChatGPT -- and AgentAI. This kind of widely adopted open standard is what enables really fast innovation and brilliant breakthroughs.

One big problem: right now, finding the right MCP Servers and plugging them into something like ChatGPT is messy and scary. It's the wild, wild west out there. Most of the servers are shared as a GitHub repo, and you'd have to self-host them to use them. Ick!

So, here's the idea: build a centralized network of MCP Servers that makes it frictionless to get going and provides fast time to joy. I'd call it MCP .net (yes, I own the domain name). The network would be more than a directory of MCP servers; it would allow:

* Anyone to submit an MCP Server and have it hosted on the network.
* Ratings/reviews of MCP servers.
* Semantic search to *find* the right tools (embedded within the server).
* A way for users to request access to specific functionality that is not yet on the network.
* Creating server "remixes" that combine tools from different servers.
* and much, much more...

Think of it as the Hugging Face of MCP: a way to discover and connect to MCP Servers. I know there are already a few emerging directories out there, and Anthropic is working on a "registry," but I'm thinking about something well beyond that.

Now you're wondering why I don't just go do this. Couldn't I just position it as a "Professional Network for MCP Servers" (playing off the positioning of @AgentDotAi , which is the professional network for AI Agents)? Yes, I could. And come to think of it, maybe I *should* just go do that. I could probably Vibe Code the first 50%. 😀

I'm hoping someone with street cred is already doing this. I hope so. That way, I can focus on the thing I should be doing. If you're that person, or know of something, please leave a comment.
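The semantic-search piece of the idea could be prototyped without any ML infrastructure at all. Here's a minimal sketch: a hypothetical registry of server descriptions (names and descriptions invented for illustration) ranked by cosine similarity of plain term-frequency vectors. A real version would use embeddings, but the shape of the feature is the same.

```python
import math
import re
from collections import Counter

# Hypothetical registry: server name -> free-text description of its tools.
REGISTRY = {
    "github-server": "create issues, open pull requests, search repositories",
    "slack-server": "send messages, list channels, search conversation history",
    "calendar-server": "schedule meetings, list events, send invites",
}

def _tf(text: str) -> Counter:
    """Term-frequency vector over lowercase word tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def search(query: str, registry: dict[str, str], top_k: int = 3) -> list[str]:
    """Rank servers by cosine similarity between query and description."""
    q = _tf(query)
    scored = []
    for name, desc in registry.items():
        d = _tf(desc)
        dot = sum(q[t] * d[t] for t in q)
        norm = (math.sqrt(sum(v * v for v in q.values()))
                * math.sqrt(sum(v * v for v in d.values())))
        scored.append((dot / norm if norm else 0.0, name))
    # Keep only servers with nonzero overlap, best match first.
    return [name for score, name in sorted(scored, reverse=True)[:top_k] if score > 0]

print(search("open pull requests", REGISTRY))  # github-server ranks first
```

Swapping `_tf` for an embedding model call is the only change needed to make this "semantic" in the full sense.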


First large-scale study of AI agents actually running in production. The hype says agents are transforming everything. The data tells a different story.

Researchers surveyed 306 practitioners and conducted 20 in-depth case studies across 26 domains. What they found challenges common assumptions about how production agents are built. The reality: production agents are deliberately simple and tightly constrained.

1) Patterns & Reliability
- 68% execute at most 10 steps before requiring human intervention.
- 47% complete fewer than 5 steps.
- 70% rely on prompting off-the-shelf models without any fine-tuning.
- 74% depend primarily on human evaluation.
Teams intentionally trade autonomy for reliability.

Why the constraints? Reliability remains the top unsolved challenge. Practitioners can't verify agent correctness at scale. Public benchmarks rarely apply to domain-specific production tasks. 75% of interviewed teams evaluate without formal benchmarks, relying on A/B testing and direct user feedback instead.

2) Model Selection
The model selection pattern surprised researchers. 17 of 20 case studies use closed-source frontier models like Claude Sonnet 4, Claude Opus 4.1, and GPT o3. Open-source adoption is rare and driven by specific constraints: high-volume workloads where inference costs become prohibitive, or regulatory requirements preventing data sharing with external providers. For most teams, runtime costs are negligible compared to the human experts the agent augments.

3) Agent Frameworks
Framework adoption shows a striking divergence. 61% of survey respondents use third-party frameworks like LangChain/LangGraph. But 85% of interviewed teams with production deployments build custom implementations from scratch. The reason: core agent loops are straightforward to implement with direct API calls. Teams prefer minimal, purpose-built scaffolds over dependency bloat and abstraction layers.
4) Agent Control Flow
Production architectures favor predefined static workflows over open-ended autonomy. 80% of case studies use structured control flow. Agents operate within well-scoped action spaces rather than freely exploring environments. Only one case allowed unconstrained exploration, and that system runs exclusively in sandboxed environments with rigorous CI/CD verification.

5) Agent Adoption
What drives agent adoption? Simply the productivity gains. 73% deploy agents primarily to increase efficiency and reduce time spent on manual tasks. Organizations tolerate agents taking minutes to respond because that still outperforms human baselines by 10x or more. 66% allow response times of minutes or longer.

6) Agent Evaluation
The evaluation challenge runs deeper than expected. Agent behavior breaks traditional software testing. Three case study teams report attempting but struggling to integrate agents into existing CI/CD pipelines. The challenge: nondeterminism and the difficulty of judging outputs programmatically. Creating benchmarks from scratch took one team six months to reach roughly 100 examples.

7) Human-in-the-loop
Human-in-the-loop evaluation dominates at 74%. LLM-as-a-judge follows at 52%, but every interviewed team using LLM judges also employs human verification. The pattern: LLM judges assess confidence on every response, automatically accepting high-confidence outputs while routing uncertain cases to human experts. Teams also sample 5% of production runs even when the judge expresses high confidence.

In summary, production agents succeed through deliberate simplicity, not sophisticated autonomy. Teams constrain agent behavior, rely on human oversight, and prioritize controllability over capability. The gap between research prototypes and production deployments reveals where the field actually stands.

Paper: arxiv.org/abs/2512.04123

Learn design patterns and how to build real-world AI agents in our academy: dair-ai.thinkific.com
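The judge-plus-audit pattern in point 7 can be sketched in a few lines. Everything here is an assumption except the 5% sampling rate, which comes from the post: `judge_confidence` stands in for an LLM-as-a-judge call, and the 0.9 threshold is an invented cutoff.

```python
import random

CONFIDENCE_THRESHOLD = 0.9  # assumed cutoff; the study doesn't specify one
AUDIT_RATE = 0.05           # 5% of high-confidence runs still go to humans

def route(response: str, judge_confidence: float, rng: random.Random) -> str:
    """Return 'auto_accept' or 'human_review' for one agent response."""
    if judge_confidence < CONFIDENCE_THRESHOLD:
        return "human_review"        # uncertain output -> escalate to an expert
    if rng.random() < AUDIT_RATE:
        return "human_review"        # random audit of confident outputs
    return "auto_accept"

# Simulate 1000 high-confidence responses: roughly 5% are still audited.
rng = random.Random(0)
decisions = [route("ok", 0.95, rng) for _ in range(1000)]
print(decisions.count("human_review"))
```

The key property is that human attention scales with judge uncertainty, while the audit sample keeps the judge itself honest.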

Many are trying to code with agents to boost velocity. But at what cost?

The default assumption is that AI coding tools are additive: IDE assistants help, and autonomous agents help more. Stack them together, get more productivity. But nobody had measured whether this is actually true in production repositories.

This new research presents the first large-scale causal study of autonomous coding agent adoption in open-source projects, analyzing repository-level outcomes across development velocity and software quality.

The methodology: staggered difference-in-differences with matched controls using the AIDev dataset. Repositories are split into two groups: agent-first (AF), where agents are the first AI tool adopted, and IDE-first (IF), where repositories already used AI IDEs like Copilot or Cursor before adopting agents.

AF repositories see massive front-loaded gains: +36% commits and +77% lines added on average. At the adoption month, the spike hits +111% commits and +216% lines added. These gains persist. But IF repositories see almost nothing: +4% commits and +1% lines added. The short-lived bump at adoption quickly fades, and by month 6, lines added turn negative (-45%).

The quality findings are worse. Regardless of prior AI exposure, agent adoption increases static-analysis warnings by ~18% and cognitive complexity by ~35%. These effects are persistent. AF repositories reach +49% complexity by month 5. IF repositories hit +44-51% and stay there. Autonomous agents introduce complexity debt even when velocity advantages fade. Teams already using AI IDEs face coordination and integration bottlenecks that limit throughput, but still accumulate the maintainability risks.

Coding agents are powerful but risky accelerators. Substantial velocity gains materialize only when agents are a project's first AI tool. Prior AI IDE exposure moderates the benefits but not the quality risks. Selective deployment and strong quality safeguards are essential.
Paper: arxiv.org/abs/2601.13597 Learn to build with AI agents in our academy: dair-ai.thinkific.com
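For intuition about the estimator behind these numbers: difference-in-differences compares the change in an agent-adopting repository against the change in a matched control over the same window, netting out the shared background trend. A toy two-period sketch with invented commit counts (the paper's actual design is staggered across many adoption dates):

```python
# Toy 2x2 difference-in-differences. All numbers are illustrative, not
# from the paper: average monthly commits before/after agent adoption.
treated_pre, treated_post = 100, 136   # repo that adopted a coding agent
control_pre, control_post = 100, 102   # matched repo that did not

# Effect = (change in treated) - (change in control).
did = (treated_post - treated_pre) - (control_post - control_pre)
print(f"DiD estimate: +{did} commits/month attributable to adoption")  # +34
```

The matched control is what lets the study call the velocity and complexity effects causal rather than correlational.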

Holy shit. The biggest unsolved problem in AI agents isn't reasoning; it's memory. Your agent forgets everything between sessions. MemFactory just open-sourced the first unified framework for training agents to manage their own memory via reinforcement learning. Extract. Update. Retrieve. All trainable. All modular. Runs on one GPU.

Every AI agent built today is amnesiac by design. It can reason. It can plan. It can use tools. But the moment a session ends, everything it learned about you, your preferences, your context, and your history disappears. The next conversation starts from zero. This is not a minor inconvenience; it is the fundamental barrier between AI assistants and AI agents that actually work over days, weeks, and months.

The field has known this for years. The solutions have been fragmented, task-specific, and impossible to combine. Memory-R1 handles structured CRUD operations on a memory bank. MemAgent compresses history into a fixed-length recurrent state. RMM optimizes retrieval through retrospective reflection. Each works. None can be combined. Each lives in its own repository with its own data format, its own training pipeline, and its own set of assumptions. MemFactory ends that fragmentation.

The core insight is that memory management is a decision problem, not a retrieval problem. Current systems treat memory as a database: store things, look things up. MemFactory treats memory as a policy: an agent that learns when to extract new information, when to update existing memories, when to delete contradicted facts, and what to retrieve for any given query. That policy is trained via reinforcement learning, specifically Group Relative Policy Optimization (GRPO), which eliminates the need for a separate critic model and cuts training memory requirements in half. This matters because memory-augmented agents already have saturated context windows from dialogue history and retrieved content. The last thing they need is a training algorithm that doubles the memory footprint.

The architecture is four layers that compose like Lego blocks. The Module Layer decomposes memory into atomic operations:
> The Extractor parses raw conversations into structured memory entries.
> The Updater decides whether each new piece of information should be added, should modify an existing entry, should delete a contradiction, or should be left alone.
> The Retriever fetches relevant memories using semantic search or LLM-based reranking.

The Agent Layer assembles these modules into a complete memory policy and executes rollout trajectories during training. The Environment Layer standardizes any dataset into the format the agent needs and computes reward signals: format rewards for structural compliance, LLM-as-a-judge scores for quality. The Trainer Layer runs GRPO to update the memory policy based on those rewards. Every module plugs into every other module through standardized interfaces. You can swap the retriever in Memory-R1 for an LLM-based reranker without touching anything else.

The results from training a MemAgent-style architecture through MemFactory on two base models:
→ Qwen3-1.7B base: average score 0.3118 across three evaluation sets
→ Qwen3-1.7B after MemFactory RL: 0.3581, a 14.8% relative improvement
→ Qwen3-4B-Instruct base: average score 0.6146
→ Qwen3-4B-Instruct after MemFactory RL: 0.6595, a 7.3% relative improvement
→ The 4B model's gains hold on out-of-distribution benchmarks: the memory policy transfers to unseen tasks
→ The entire training and evaluation pipeline runs on a single NVIDIA A800 80GB GPU
→ 250 training steps on simplified long-context data; no massive compute cluster required
→ Three ready-to-use agent architectures out of the box: MemoryR1Agent, MemoryAgent, MemoryRMMAgent

> The out-of-distribution result is the one that matters most. The 1.7B model improved on in-domain tasks but slightly degraded on the OOD benchmark: the learned policy was too specific to the training distribution. The 4B model improved on both. This is the capability threshold at which a memory policy becomes genuinely general: large enough to abstract principles about what information is worth keeping, not just pattern-match on training examples. A memory agent that only remembers the right things in familiar situations is not much better than no memory at all. The 4B result suggests that threshold is reachable with models that fit on a single consumer GPU.

> The fragmentation problem MemFactory solves is deeper than it looks. When every memory implementation has its own pipeline, researchers cannot compare approaches fairly. Two systems that nominally differ by one design choice (say, CRUD operations versus recurrent state compression) actually differ simultaneously in data format, reward structure, training algorithm, and evaluation protocol. Nobody knows which choice caused which outcome. MemFactory puts all three major paradigms under the same training loop, the same reward computation, and the same evaluation framework. Now you can actually isolate what matters.

Your agent forgets everything. This is the infrastructure to fix that.

we've done $200k in automation work for a single PE firm. last week they asked us to automate appointment reminders for a hairdresser they own. that one request changed how i think about this entire industry. full breakdown on why small businesses are the biggest AI opportunity nobody's talking about - and the exact playbook to land them

Princeton Unveils Landmark Framework for AI Agent Reliability – The Zero-Human Company Adopts Core Metrics to Power Truly Autonomous Systems

Raw benchmark scores for ever-more-capable AI agents have long masked a critical shortfall: reliability. On February 24, 2026, Princeton University's Center for Information Technology Policy (CITP) released a groundbreaking paper, "Towards a Science of AI Agent Reliability," that changes the conversation. Authored by Stephan Rabanser, Sayash Kapoor, Peter Kirgis, Kangheng Liu, Saiteja Utpala, and Arvind Narayanan, the work draws lessons from nuclear engineering, aviation, and automotive safety to define what "reliable" actually means for autonomous AI.

The paper evaluates 14 frontier models from OpenAI, Google, and Anthropic on two demanding agent benchmarks: GAIA (a general-assistant benchmark requiring web browsing, file manipulation, and multi-step reasoning) and τ-bench (a realistic customer-service simulation with strict policy constraints). The verdict is clear: capability has surged, but reliability has improved only modestly over 18 months. Consistency scores hover in the 30–75% range, self-prediction of failures remains weak, and larger models sometimes increase variability rather than reduce it.

To close this gap, the authors introduce a comprehensive reliability profile built on four engineering-inspired dimensions and 12 computable metrics. These go far beyond single-shot accuracy, measuring how agents behave under repeated use, stress, uncertainty, and failure.

The Four Dimensions and 12 Metrics – Why They Matter

1. Consistency – Do agents deliver the same result (and follow the same logical path) every time under identical conditions? Unpredictable variability destroys trust in automation.
- Outcome Consistency (C_out): Fraction of runs producing identical correct (or incorrect) final outcomes. Why important: prevents agents from approving a refund one run and denying it the next on the exact same request.
- Trajectory Distribution Consistency (C_d_traj): Similarity in the types of actions taken across runs. Why important: ensures auditors or users see consistent strategies even if the exact order varies.
- Trajectory Sequence Consistency (C_s_traj): Similarity in the order of actions (measured via normalized Levenshtein distance). Why important: chaotic ordering can trigger different downstream failures.
- Resource Consistency (C_res): Low variance in tokens, time, and cost across runs. Why important: unpredictable billing or latency makes production deployment financially and operationally untenable.

2. Robustness – How gracefully does the agent handle imperfect real-world conditions? Real deployments are never pristine.
- Fault Robustness (R_fault): Performance when tools or APIs fail (timeouts, crashes). Why important: agents must recover without abandoning tasks.
- Environment Robustness (R_env): Resilience to interface or format changes (e.g., reordered JSON, date-format shifts). Why important: APIs evolve; agents must adapt without breaking.
- Prompt Robustness (R_prompt): Performance on semantically equivalent but rephrased instructions. Why important: users never phrase requests identically; sensitivity here is a leading cause of production failures.

3. Predictability – Can the agent (and its users) reliably know when it is likely to fail? Overconfidence is the silent killer of trust.
- Calibration (P_cal): How well self-reported confidence matches actual correctness (1 – Expected Calibration Error). Why important: users need honest uncertainty signals to decide whether to trust or override.
- Discrimination (P_AUROC): Ability of confidence scores to separate success from failure cases. Why important: enables selective operation, e.g., auto-approving high-confidence tasks and escalating others.
- Brier Score (P_brier): Proper scoring rule combining calibration and discrimination. Why important: a single holistic measure of predictive quality.
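Two of the metrics above are simple enough to compute directly. A sketch of outcome consistency (one reasonable reading of C_out: the fraction of repeated runs agreeing with the modal final outcome) and the Brier score, over invented run data; the paper's exact formulations may differ.

```python
from collections import Counter

def outcome_consistency(outcomes: list[str]) -> float:
    """Fraction of runs agreeing with the most common final outcome."""
    if not outcomes:
        return 0.0
    return Counter(outcomes).most_common(1)[0][1] / len(outcomes)

def brier_score(confidences: list[float], successes: list[bool]) -> float:
    """Mean squared error between self-reported confidence and outcome
    (lower is better; 0 means perfectly calibrated and discriminative)."""
    return sum((c - float(s)) ** 2
               for c, s in zip(confidences, successes)) / len(confidences)

# Five repeated runs of the same refund request (invented data):
runs = ["approve", "approve", "deny", "approve", "approve"]
print(outcome_consistency(runs))                          # -> 0.8
print(brier_score([0.9, 0.8, 0.6], [True, True, False]))  # overconfident on the failure
```

An agent scoring 0.8 on C_out is exactly the "approves one run, denies the next" failure mode the Consistency dimension is meant to surface.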

Memory has been solved for a while

MCP vs RAG vs AI Agents

To understand modern AI systems, you need to understand how these three pieces fit together.

𝗥𝗔𝗚 = “𝗚𝗶𝘃𝗲 𝘁𝗵𝗲 𝗺𝗼𝗱𝗲𝗹 𝗯𝗲𝘁𝘁𝗲𝗿 𝗮𝗻𝘀𝘄𝗲𝗿𝘀”
RAG retrieves relevant data, injects it into the prompt, and generates a grounded response. It’s best when your problem is answering questions using your docs, reducing hallucinations, or showing sources and citations. RAG improves what the model knows, not what it can do. If you’re building with these patterns, here's a great guide on scaling multi-agent RAG systems: lucode.co/multi-agent-ra…

𝗠𝗖𝗣 = “𝗦𝘁𝗮𝗻𝗱𝗮𝗿𝗱𝗶𝘇𝗲𝗱 𝘁𝗼𝗼𝗹 𝗮𝗻𝗱 𝗱𝗮𝘁𝗮 𝗮𝗰𝗰𝗲𝘀𝘀”
MCP is a standardized interface between LLMs and external systems like APIs, databases, and apps. Use it when your model needs to query data, call services, or interact with real systems (Slack, GitHub, etc.). MCP doesn’t decide actions; it defines how tools are exposed.

𝗔𝗜 𝗔𝗴𝗲𝗻𝘁𝘀 = “𝗠𝗮𝗸𝗲 𝘁𝗵𝗲 𝗺𝗼𝗱𝗲𝗹 𝘁𝗮𝗸𝗲 𝗮𝗰𝘁𝗶𝗼𝗻”
Agents operate in a loop: observe → plan → act → repeat, often using tools and memory. Use them when your problem requires multi-step reasoning, tool usage with verification, or full task execution. Agents start where RAG stops, turning decisions into actions and outcomes.

The simple mental model:
RAG → knowledge layer
MCP → tool layer
Agents → execution layer

Not every system needs all three explicitly, but complex ones often combine them. If you want to see what this looks like in practice, this guide walks you through building a scalable multi-agent RAG system. Check it out: lucode.co/multi-agent-ra…

What else would you add?

♻️ Repost to help others learn AI.
🙏 Thanks to @Oracle for sponsoring this post.
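The observe → plan → act loop can be sketched in a few lines. Here `llm_plan` is a stand-in for a model call and the calculator tool is invented; the point is the loop's shape, not any real API.

```python
def llm_plan(observation: str) -> tuple[str, str]:
    """Stand-in for an LLM call: pick the next tool and its input.
    A real agent would prompt a model here instead of matching strings."""
    if "42" in observation:
        return ("finish", observation)        # goal reached -> stop the loop
    return ("calculator", "6 * 7")

TOOLS = {"calculator": lambda expr: str(eval(expr))}  # toy tool layer

def run_agent(task: str, max_steps: int = 5) -> str:
    observation = task
    for _ in range(max_steps):                # observe -> plan -> act -> repeat
        action, arg = llm_plan(observation)
        if action == "finish":
            return arg
        observation = TOOLS[action](arg)      # act, then observe the result
    return observation

print(run_agent("compute six times seven"))   # -> 42
```

In the layered mental model above, `TOOLS` is where MCP would sit (standardized tool exposure), a retriever tool would supply the RAG layer, and the loop itself is the agent.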

GPT-5.3-Codex (X-high reasoning) from @OpenAI ran uninterrupted for 25 hours to help me build a sophisticated design tool. The key was durable “project memory” so it could stay coherent over a long horizon:
1. Prompt.md (goals, spec, deliverables)
2. Plans.md (milestones + validations)
3. Architecture.md (principles + constraints)
4. Implement.md (prompt that references the plan)
5. Documentation.md (milestone status + decisions)
Result: ~13M tokens, ~50k LOC

This is where coding agents start to feel like teammates: they can run for hours, follow a plan, and ship high-quality work.
