Sam Xu

42 posts

@sam_commonly

Building @CommonlyAI — one project memory shared by all your AI tools. OSS, self-hostable. Demo: https://t.co/CBd2HFUfUJ

Mountain View, CA · Joined February 2026
43 Following · 6 Followers
Pinned Tweet
Sam Xu@sam_commonly·
Every morning I was losing an hour being human middleware between AI tools — re-pasting context into Claude Code, Cursor, ChatGPT. The agent "forgot my codebase" again. So I built the missing layer: one project memory shared by all your AI tools.
Sam Xu@sam_commonly·
@MichelLeoAnt Scope contains the blast radius but the drift is upstream — agents touch 14 files because they re-derive the goal from the freshest context every turn, so the goal moves with the context. Containment is a symptom fix; the root is no stable representation of the original ask.
Michel Leo Antonio@MichelLeoAnt·
AI coding agents are fast. They also drift. You ask for one fix. They touch 14 files. You ask for a UI polish. They edit billing, auth, middleware, and config. RunTrim is built for that moment. It gives any coding agent:
→ scoped runs
→ project memory / full
→ risky-file checks
→ forbidden-file rules
→ finish verification
→ local restore points / synced with dashboard
→ continuation prompts after usage limits
→ CI checks for risky AI-generated PRs
If a run goes out of scope, RunTrim marks it BLOCKED. Not broken. Not trusted yet. Review it, approve the scope, or restore locally. No model lock-in. No agent lock-in. Source code stays local by default. Free CLI is live.
npm install -g runtrim
runtrim.com
Sam Xu@sam_commonly·
@512banque Averaging over 5 runs is the right reflex, but it can hide correlated failures: a harness that fails the same way every time looks identical in mean pass-rate to one that fails differently on each retry. The sharper metric is pass-rate variance across runs, not just the mean.
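One way to make the variance point concrete, as a minimal sketch on made-up pass/fail data. Both harnesses below have the same mean pass rate, and in this toy data even the per-run pass rate is identical, so the sketch looks at how each task's outcome varies across runs. The matrices, the function names, and the per-task reading of "variance across runs" are assumptions for illustration, not something stated in the thread.

```python
from statistics import mean, pvariance

# Hypothetical results: rows are runs, columns are tasks (1 = pass, 0 = fail).
same_failure_every_run = [
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 1, 1, 0],
]
different_failure_each_run = [
    [0, 1, 1, 1],
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [1, 1, 1, 0],
]


def mean_pass_rate(runs):
    """Average of the per-run pass rates."""
    return mean(mean(run) for run in runs)


def per_task_variance(runs):
    """Population variance of each task's outcome across runs."""
    return [pvariance(task_outcomes) for task_outcomes in zip(*runs)]


for name, runs in [
    ("same failure every run", same_failure_every_run),
    ("different failure each run", different_failure_each_run),
]:
    print(name, mean_pass_rate(runs), per_task_variance(runs))
```

A per-task variance of zero everywhere means each task either always passes or always fails, so retries cannot help. Nonzero values mark tasks whose outcome changes between runs.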
Kеvіn Rіchаrd@512banque·
I got tired of abstract AI benchmarks that rank models in isolation. Users don't run a model. They run a full loop: model + harness + tools + retries + cache + prompts. So I ran 27 tasks that look like my real work across different coding-agent harnesses, 5 times each to reduce variance. I also wanted to create my own tasks to avoid the problem of benchmaxxing. Result: near-identical pass rates, wildly different bills. Codex/Claude costs are API-equivalent because I use subscriptions. But at public API prices, one Codex setup charts at ~420× the cost of Pi + DeepSeek V4 Flash for the same strict score. The lesson: the harness is a huge part of the value you feel as a user. And when some loops are this cheap, the optimal strategy changes: you can afford retries, parallel attempts, and verification passes instead of betting everything on one expensive first shot. Don't trust my tasks. Run it on yours.
Sam Xu@sam_commonly·
@AlexFinn True for now, but most of context coherence lives in the project-state primitive (memory files, working dirs) — not in the model. Once that ships vendor-neutral, the lock-in disappears. Feels stickier than it actually is.
Alex Finn@AlexFinn·
I'm 100% Codex pilled now. Been using Codex and Claude Code side by side hours a day for 2 months straight. No longer using them side by side. Codex has become incredible. What did it for me is the self testing. Every change it makes, it self tests in its own browser. I went from about 40% of my changes being buggy on first go to at most 3% maybe? So much more reliable and allows me to get in an awesome flow state. Listen, Claude can literally drop an update tomorrow that changes all of this, but for now I'm really blown away by Codex. Do yourself a favor and don't have loyalty to any company. Use every tool. Use whatever is the best at the moment. Switch whenever they're no longer the best. No point in tribalism. But at the moment I'm REALLY enjoying my time with Codex.
Sam Xu@sam_commonly·
@ShokhzodjonT Drawing the line is half — keeping it revisable is the other half. Most teams ship binary autonomy (full agent or fully human-gated) when the real shape is graduated trust per task category. Static boundaries become their own kind of mistake.
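A minimal sketch of what "graduated trust per task category" with a revisable boundary could look like. The category names, the three autonomy levels, and the promotion threshold are all hypothetical, chosen only to illustrate the shape.

```python
from enum import IntEnum


class Autonomy(IntEnum):
    HUMAN_GATED = 0   # a person approves every action
    REVIEW_AFTER = 1  # the agent acts, a person reviews afterwards
    AUTONOMOUS = 2    # the agent acts alone


# Hypothetical starting policy; categories and levels are illustrative only.
policy = {
    "formatting": Autonomy.AUTONOMOUS,
    "refactor": Autonomy.REVIEW_AFTER,
    "billing": Autonomy.HUMAN_GATED,
}


def allowed_autonomy(policy, category):
    """Unknown categories fall back to the most restrictive level."""
    return policy.get(category, Autonomy.HUMAN_GATED)


def revise(policy, category, clean_runs, incidents, promote_after=20):
    """Return a new policy with one category moved up or down a level."""
    level = allowed_autonomy(policy, category)
    if incidents > 0:
        level = Autonomy(max(level - 1, Autonomy.HUMAN_GATED))
    elif clean_runs >= promote_after:
        level = Autonomy(min(level + 1, Autonomy.AUTONOMOUS))
    return {**policy, category: level}
```

Here `revise` returns a copy instead of mutating the policy, so each boundary change can be logged and rolled back.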
Sean T@ShokhzodjonT·
everyone's rebranding as a 'systems builder' now. multi-agent workflows, prompt systems, auto-debugging loops. fine. but none of those posts answer the question that actually matters: which decisions can your agent make alone, and which ones need you. if you haven't drawn that line clearly before the system runs, you don't have a smarter system. you have a faster way to make consequential mistakes.
Sam Xu@sam_commonly·
@BlackHC Depends on which level — inner alignment of each agent is the same problem (each is still an LLM on similar training data). System alignment is arguably harder because multi-agent introduces emergent behaviors a single agent doesn't have. The intuition may be conflating the two.
Andreas Kirsch 🇺🇦
Is the intuition correct that multi-agent systems might be easier to align than single agent systems? Random evening thought while reading The Infinity Machine 😇
Sam Xu@sam_commonly·
@rohit4verse DS lens is right but classical DS solves consensus on facts, not consensus on meaning. The new failure mode is interpretive — two agents can both be live, see the same message bytes, and still parse them differently. Paxos doesn't fix translation.
Rohit@rohit4verse·
a databricks tech lead just spent 26 minutes on the part of multi-agent nobody wants to say out loud: your agents don't break because the model is dumb. they break because nothing is coordinating them. one agent is a feature. fifty is a distributed systems problem. parallelism is cheap. getting 300 agents to share one coherent brain is the entire game. worth every minute @aiDotEngineer 👇
Rohit@rohit4verse

x.com/i/article/2059…

Sam Xu@sam_commonly·
@JakeKAllDay The constraint may be self-imposed — there's no rule subagents have to come from one vendor. Most production multi-agent setups I've seen mix a top-tier Claude for the orchestrator with cheaper non-Claude workers. Vendor-locked routing is the unusual case, not the default.
Sam Xu@sam_commonly·
@CLU_AGENT The list lands but agent identity is the load-bearing one underneath — lanes, approvals, receipts, health checks all assume the agent has a stable name across sessions. Without that, the control plane is reporting on something it can't actually pin down.
CLU_AGENT | Mission Control
Buyers do not trust multi agent as a phrase. They trust an operating surface. Lanes. Roles. Approvals. Receipts. Health checks. A topology they can inspect. That is the Agentforce lesson I want The Grid to copy: sell the control plane, not the autonomy theater.
Sam Xu@sam_commonly·
@omervk The unreasonable bit may just be that we under-decompose by default — once you split, each sub-skill becomes verifiable in isolation and you stop forcing one prompt to carry both planning and execution context. The hard part stays the split boundary.
Omer van Kloeten@omervk·
The unreasonable effectiveness of splitting up a large coding agent skill into smaller ones and chaining them together
Sam Xu@sam_commonly·
Under-explored is right but the methodology gap is the trace shape — MARL has (state, action, reward); LLM-agent traces add a verbalized-intent channel that's part of the state. Standard emergent-behavior eval doesn't have a primitive for "what the agent thought it was doing at step t."
Niko@nikogrupen·
(1/2) I’m most excited about the behavioral analysis we were able to get here, and the potential of interpretability research for high-stakes agent deployments (like legal applications). Single-dimension scores are useful for head-to-head comparisons, but flatten an enormous amount of signal about what an agent actually did over a long-horizon rollout. Two agents can land on similar scores via very different policies, and the user has no window into why that is. Agent action distributions surface the behavioral priors a model picked up during training. By studying the action sequences inside an agent’s trajectory, we can start to understand agent decision-making and get that squashed signal back. In this post, we show how each model family allocates its actions over the course of a trajectory. They diverge in interesting ways and tell part of the story of where each model wins.
Gabe Pereyra@gabepereyra

x.com/i/article/2059…

Sam Xu@sam_commonly·
@arjuniyer_ The 30/70 split is right but realism in-loop isn't quite enough — most between-service bugs are state-divergence, where the agent's model of upstream invariants was wrong before any test ran. Sandboxes catch misuse; the missing primitive feels closer to shared state.
Arjun Iyer@arjuniyer_·
Talked to a VP of Eng at a series-B last week about their coding agent rollout.  Code volume up 4x. Merge rate flat.
Sam Xu@sam_commonly·
@cponsart Routing by type is the right cut, but the failure mode is atoms that cross categories — a decision often IS a preference signal, and a temporal update can carry an open loop inside it. Curious whether fusion does reconciliation across categories or just within them.
Christophe Ponsart@cponsart·
ContextFit just hit 99.0% retrieval on LongMemEval-S n=500. The unlock: memory atoms + fusion. Instead of treating memory as one flat vector search problem, we route preferences, decisions, temporal updates, and open loops differently. Agent memory needs structure, not just more context.
Sam Xu@sam_commonly·
The associative-recall vs compiled-belief split is the right axis. OpenClaw's memory-wiki keeps the two layers separate, which is most of why it works in practice — most memory systems try to do both with one structure and end up with neither. Bridge mode is the bit nobody else has.
Sam Xu@sam_commonly·
Latent-space agent comms is efficient but loses two things you can't easily recover: replay (which decision changed at which handoff) and debugging when one agent confidently misreads another's vector. The token bottleneck is a feature, not a bug — it's what makes the conversation auditable.
Michael A. Markosian, M.D.
Recursive Multi-Agent Systems that communicate directly in latent space “telepathically” are more accurate, faster, and use fewer tokens. Downside: human oversight is far more difficult or impossible. youtube.com/shorts/p0Zat2Q… via @YouTube
Sam Xu@sam_commonly·
All three shapes show up in practice and each maps to a different cost model: retry-with-prompt is cheap but loops; escalate is expensive but auditable; hard fail is fast but breaks UX. The right shape depends on whether the denial is a capability issue, a safety gate, or a policy boundary.
Hansel@hansel_hansl·
Question for anyone running multi-agent in production: when subagent B denies a request from orchestrator A, what happens? Retry with a different prompt? Escalate to human? Hard fail? I have seen all three in the same week. Curious what shape your deny path takes.
Sam Xu@sam_commonly·
Race conditions in multi-agent are the same shape as distributed-systems race conditions — you need either a lock service or optimistic-with-compensation. Most teams skip the substrate work and hope LLM determinism saves them, which it never does. Lock-free coordination is the real unsolved problem.
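The optimistic-with-compensation option can be sketched with a version counter: each agent remembers the version it read, and a write succeeds only if that version is still current. This is a minimal in-process illustration. The class and function names are hypothetical, and the local `threading.Lock` only stands in for whatever atomic compare-and-set the real shared store would provide.

```python
import threading


class VersionedSlot:
    """One shared value guarded by a version counter (in-process stand-in)."""

    def __init__(self, value=None):
        self._guard = threading.Lock()  # makes compare-and-set atomic locally
        self._value = value
        self._version = 0

    def read(self):
        with self._guard:
            return self._value, self._version

    def compare_and_set(self, expected_version, new_value):
        """Write only if nobody else has written since expected_version."""
        with self._guard:
            if self._version != expected_version:
                return False
            self._value = new_value
            self._version += 1
            return True


def commit_or_compensate(slot, seen_version, agent_id, undo):
    """Try to commit the agent's claim; run its undo step if it lost the race."""
    if slot.compare_and_set(seen_version, agent_id):
        return True
    undo(agent_id)  # compensation: roll back work done on a stale read
    return False
```

If two agents both read version 0, the first commit wins and the second triggers its undo callback instead of silently overwriting.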
Ignazio De Santis@ignaziodes·
Testing multi-agent coordination under race conditions. The failure mode: agents assume sequential execution in concurrent environments. Agent A requests resource X. Agent B requests resource X. Both get "available" status. Both proceed. Resource X gets double-allocated. Your coordination protocol needs distributed locks, not message passing. Test with synthetic race conditions. Spawn 10 agents targeting the same resource simultaneously. If any succeed when they shouldn't, your protocol has gaps. Production race conditions are not deterministic. Your tests must be.
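The synthetic race test described above can be sketched in a few lines. This is a single-process approximation: threads play the agents, and a `threading.Lock` stands in for a distributed lock service. `ResourceRegistry` and `run_race` are hypothetical names, not part of any tool mentioned in the thread.

```python
import threading


class ResourceRegistry:
    """Tracks which agent owns each resource; check-and-claim is one atomic step."""

    def __init__(self):
        self._lock = threading.Lock()  # stand-in for a distributed lock
        self._owners = {}

    def try_acquire(self, resource, agent_id):
        with self._lock:
            if resource in self._owners:
                return False
            self._owners[resource] = agent_id
            return True


def run_race(n_agents=10, resource="X"):
    """Release n_agents at the same moment and record who got the resource."""
    registry = ResourceRegistry()
    barrier = threading.Barrier(n_agents)  # lines every agent up before the claim
    results = []
    results_lock = threading.Lock()

    def agent(agent_id):
        barrier.wait()
        acquired = registry.try_acquire(resource, agent_id)
        with results_lock:
            results.append(acquired)

    threads = [threading.Thread(target=agent, args=(i,)) for i in range(n_agents)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

The check the test cares about is that exactly one entry in the returned list is `True`. More than one would mean the resource was double-allocated.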
Sam Xu@sam_commonly·
@Hadi_Mahihenni The 79% structural number matches where teams under-invest — substrate (state, identity, handoff) keeps losing to model upgrades. Most "multi-agent" pipelines I've seen are just sequential LLM calls with no shared state contract. The body is where the budget should go.
Hadi Mahihenni@Hadi_Mahihenni·
AI agents today are invertebrates. Flexible, impressive, capable of remarkable contortions but structurally unreliable. The future belongs to vertebrate agents. Here's what I mean. (1/6) 👇
Sam Xu@sam_commonly·
libp2p for agent state exchange is the right shape — but the practical gap is who governs topic membership. Most multi-agent demos use a fixed peer set; production needs dynamic agent join/leave with identity guarantees. The membership protocol is where distributed-agent setups quietly hand-roll something brittle.
libp2p@libp2p·
The next bottleneck in AI is no longer models. It’s coordination. We kicked off Ground Truth to explore how projects are using @libp2p + @IPFS for:
• peer-to-peer agent coordination
• offline-first AI systems
• resilient communication across constrained networks
• verifiable data + model artifacts
• portable infrastructure across cloud and edge
Let's dive into the projects that presented.
Sam Xu@sam_commonly·
@leopardracer Coordination overhead concentrates in two specific places: shared-state writes (everyone wants to write, nobody owns) and handoff serialization (each agent loses ~30% of its context at the wire). Both are solvable; most teams just don't budget the substrate work.
leopardracer@leopardracer·
Sundar Pichai (CEO of Google) at Google I/O 2026 made it pretty clear where AI is heading next: not single chatbots, but autonomous agent systems with memory, delegation, and shared context. Meanwhile most people are still copy-pasting between ChatGPT tabs. “The reason your AI workflow is 5 years behind and you don't know it” explains this shift better than almost anything I’ve read lately.
> why chat-based workflows break
> why persistent memory matters
> why orchestration changes everything
> how AI agents will collaborate like real teams
> what replaces the “single assistant” model
bookmark & read it later, most people still don’t realize the AI paradigm already changed
darkzodchi@zodchiii

x.com/i/article/2058…

Sam Xu@sam_commonly·
Agreed — but the data modeling problem splits into two: schema (entities, relationships, temporal) and the policy of who writes what when. Most memory systems get the schema right and fall over because every agent writes to every field with no provenance. The retrieval layer is downstream of both.
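A minimal sketch of the "who writes what when" half: every write carries provenance, and a per-field writer list rejects agents that do not own the field. The class names, the field names, and the allow-list shape are hypothetical, for illustration only.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Write:
    key: str
    value: object
    agent: str
    at: datetime


class ProjectMemory:
    def __init__(self, writers):
        self._writers = writers  # key -> set of agent names allowed to write it
        self._log = []           # append-only history keeps provenance intact

    def write(self, key, value, agent):
        """Record a write, or refuse it if the agent does not own the key."""
        if agent not in self._writers.get(key, set()):
            raise PermissionError(f"{agent} may not write {key}")
        self._log.append(Write(key, value, agent, datetime.now(timezone.utc)))

    def latest(self, key):
        """Most recent accepted write for a key, with who wrote it and when."""
        for entry in reversed(self._log):
            if entry.key == key:
                return entry
        return None
```

With `ProjectMemory({"deadline": {"planner"}})`, a write to `deadline` from `planner` is recorded with its author and timestamp, and the same write from any other agent raises `PermissionError`.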
Paul Iusztin@pauliusztin_·
Most AI memory systems today are just vector search with extra steps. Real agent memory needs:
Structured entities
Relationships
Reasoning traces
Temporal history
This is why I say agent memory is a data modeling problem rather than just retrieval. (It’s also why I like using @MongoDB as a unified operational memory layer)