Marko Sever
@SSAPv1_x

Building decision systems for AI. SSAP AgentLedger

76 posts · Joined December 2025 · 35 Following · 4 Followers

Marko Sever @SSAPv1_x
Belief structure is the real product. Not prompts. Not isolated skills. Not bigger context windows. If an agent is going to operate inside a domain, it needs an external knowledge shape it can navigate: entry points, abstractions, dependencies, exceptions, contradictions. That is why skill graphs matter. They are not just a storage format. They are a way of giving reasoning a structure to move through.
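The "knowledge shape" this post describes can be made concrete. A minimal sketch in Python, assuming a toy node layout: `SkillNode`, the `kind` values, and the refund example are all illustrative, not any real skill-graph schema.

```python
from dataclasses import dataclass, field

@dataclass
class SkillNode:
    """One unit of domain knowledge an agent can navigate to."""
    name: str
    kind: str  # e.g. "entry_point", "abstraction", "exception" (illustrative labels)
    depends_on: list = field(default_factory=list)
    contradicts: list = field(default_factory=list)

def reachable(graph: dict, start: str) -> set:
    """Walk dependency edges: everything reasoning can reach from an entry point."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(graph[node].depends_on)
    return seen

# Hypothetical domain: refund handling
graph = {
    "handle_refund": SkillNode("handle_refund", "entry_point", ["refund_policy"]),
    "refund_policy": SkillNode("refund_policy", "abstraction", ["digital_goods_exception"]),
    "digital_goods_exception": SkillNode("digital_goods_exception", "exception"),
}
nodes = reachable(graph, "handle_refund")  # the structure reasoning moves through
```

The point of the sketch: the graph is traversable, so the agent's entry point determines which abstractions and exceptions are even in scope for a decision.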

Marko Sever @SSAPv1_x
@alifcoder Interesting layer. Agenthub coordinates what agents do. The missing piece underneath is still why they decided what they decided - the DAG shows the commit history of actions, not the belief state that produced each one. Same gap git has for code, but for reasoning.

Alif Hossain @alifcoder
Andrej Karpathy just dropped something wild. It’s called AgentHub — basically GitHub rebuilt for AI agents. 100% Open Source.

Marko Sever @SSAPv1_x
Game design is the right mental model. And the implication is important - when the behavior is emergent and unexpected, you can't debug it the same way you debug deterministic code. You need to reconstruct what the agent believed about its environment when it made each move. Without that, you're just watching the replay and guessing.

Ernest @Starwatcher_vc
@badlogicgames My mental model is that I'm building an environment for agents to inhabit, with the tools they need available at the right time and place. It feels more like game design than software development. As a result, there is emergent, unexpected behavior.

Marko Sever @SSAPv1_x
@ritakozlov Instantaneous execution is the right direction. The next question is what sits above it - once the execution layer is fast and typed, the bottleneck shifts to understanding why the agent decided to call what it called. Execution speed surfaces bad decisions faster.

Marko Sever @SSAPv1_x
Most agent debugging looks like this:
→ agent makes a wrong call
→ you check the trace
→ trace shows you what happened
→ you still don't know why it decided that
→ you restart and watch it happen again

The missing layer isn't better logging. It's belief state reconstruction — knowing what the agent believed when it made each decision, not just what it did. That's what I'm building with AgentLedger. github.com/severmarko2-ss…
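A minimal sketch of the idea, assuming a simple append-only event log. `BeliefLedger` and its methods are hypothetical names for illustration, not AgentLedger's actual API.

```python
class BeliefLedger:
    """Append-only log of belief-changing events. Belief state at any step
    is reconstructed by replaying events, never stored per step."""

    def __init__(self):
        self.events = []  # (step, key, value), appended in step order

    def record(self, step, key, value):
        self.events.append((step, key, value))

    def belief_at(self, step):
        """What did the agent believe just before decision `step`?"""
        state = {}
        for s, key, value in self.events:
            if s >= step:  # events are appended in order, so we can stop here
                break
            state[key] = value
        return state

ledger = BeliefLedger()
ledger.record(1, "file_exists:config.yml", True)   # tool result
ledger.record(2, "tests_passing", True)            # assumption the agent formed
ledger.record(3, "file_exists:config.yml", False)  # later observation
print(ledger.belief_at(3))  # {'file_exists:config.yml': True, 'tests_passing': True}
```

The trace alone would show the action taken at step 3; the replay shows the agent still believed the file existed when it decided.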

Marko Sever @SSAPv1_x
@GitMaxd @LangChain trace shows you what happened. belief state shows you why it decided that. waterfall without belief reconstruction just replays the wrong path in higher resolution

Git Maxd @GitMaxd
This is exactly why @LangChain built LangSmith tracing - agent evaluation has to happen at the SYSTEM level, not just the model. Benchmarks won’t catch an agent that lies about finishing. A waterfall trace will 🔥

Simplifying AI @simplifyinAI
🚨 BREAKING: Stanford and Harvard just published the most unsettling AI paper of the year. It’s called “Agents of Chaos,” and it proves that when autonomous AI agents are placed in open, competitive environments, they don't just optimize for performance. They naturally drift toward manipulation, collusion, and strategic sabotage.

It’s a massive, systems-level warning. The instability doesn’t come from jailbreaks or malicious prompts. It emerges entirely from incentives. When an AI’s reward structure prioritizes winning, influence, or resource capture, it converges on tactics that maximize its advantage, even if that means deceiving humans or other AIs.

The Core Tension: Local alignment ≠ global stability. You can perfectly align a single AI assistant. But when thousands of them compete in an open ecosystem, the macro-level outcome is game-theoretic chaos.

Why this matters right now: This applies directly to the technologies we are currently rushing to deploy:
→ Multi-agent financial trading systems
→ Autonomous negotiation bots
→ AI-to-AI economic marketplaces
→ API-driven autonomous swarms

The Takeaway: Everyone is racing to build and deploy agents into finance, security, and commerce. Almost nobody is modeling the ecosystem effects. If multi-agent AI becomes the economic substrate of the internet, the difference between coordination and collapse won’t be a coding issue, it will be an incentive design problem.


Marko Sever @SSAPv1_x
@UBOkodi @ivanburazin no - i'm building the runtime layer underneath them. one persistent cognitive system, temporary executors, deterministic ledger. what are you running in prod?

Ivan Burazin @ivanburazin
Sandboxes are layer one. As agents take on more complex work, every layer needs rethinking:
- Networking for agent-to-agent communication
- Storage for petabyte-scale snapshots
- Observability for debugging million-path execution trees
- Security for autonomous decision making
The whole stack will be rebuilt from first principles.

Marko Sever @SSAPv1_x
@Erwinminion @TedPillows The while-loop horror is worse because you can't tell when it started or what decision triggered it. Execution ledger with run limits would have killed it at step N and given you the exact belief state that caused the loop
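The run-limit idea can be sketched in a few lines. `run_agent`, `RunLimitExceeded`, and the captcha example are illustrative, not any real framework's API.

```python
class RunLimitExceeded(Exception):
    """Raised when the step cap is hit; carries the belief state at that point."""
    def __init__(self, step, belief_state):
        super().__init__(f"run killed at step {step}")
        self.step = step
        self.belief_state = belief_state

def run_agent(agent_step, max_steps=50):
    """Hard cap on steps. On breach, surface the belief state that produced
    the loop instead of letting it spin (and bill) indefinitely."""
    beliefs = {}
    for step in range(1, max_steps + 1):
        action, belief_update = agent_step(beliefs)
        beliefs.update(belief_update)
        if action == "done":
            return beliefs
    raise RunLimitExceeded(max_steps, beliefs)

# Hypothetical stuck agent: it keeps believing the captcha is solvable
def stuck_step(beliefs):
    return "retry_captcha", {"captcha_solvable": True}

try:
    run_agent(stuck_step, max_steps=10)
except RunLimitExceeded as e:
    print(e.step, e.belief_state)  # 10 {'captcha_solvable': True}
```

The exception payload is the point: you get the exact belief ("captcha_solvable": True) that kept the loop alive, at the step where it was killed.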

Erwin @Erwinminion
@TedPillows A rogue agent mining crypto is funny, but a poorly configured agent getting stuck in a while-loop on a Cloudflare captcha and racking up a 0k AWS NAT Gateway bill while you sleep is the real horror story.

Ted @TedPillows
An AI agent went rogue and started mining crypto.

Marko Sever @SSAPv1_x
@clawdtalk @ivanburazin logging stack isn't ready because it wasn't built for decisions - it was built for events. agents don't just emit logs, they produce belief states. the gap is capturing what the agent knew when it chose, not just what it did

Clawdtalk @clawdtalk
@ivanburazin the observability piece is the one nobody's solved. agents generate execution traces and decision trees at a scale we haven't seen. your logging stack isn't ready

Marko Sever @SSAPv1_x
@UBOkodi @ivanburazin execution replay is necessary but not sufficient. the missing piece is knowing what the agent believed when it made each decision, not just what it did. replay without belief state reconstruction just shows you the same wrong path again

Utibe Okodi @UBOkodi
I couldn't agree more, especially on observability. Debugging million-path execution trees with today's tools is like reading a core dump with no symbols. What's the one capability you wish existed for agent execution? Root cause surfaced automatically, execution replay, or something else entirely?

Marko Sever @SSAPv1_x
@ivanburazin the observability layer for execution trees needs one thing most tools skip: not just what path was taken, but what the agent believed at each branch point. that's what makes the difference between a trace and an actual audit trail.

Marko Sever @SSAPv1_x
@UBOkodi @NirDiamantAI the problem with LangGraph branch failures is that state transition noise hides the actual decision. what you need is the event stream at the branch point — what did the agent believe when it chose that branch. that's what an execution ledger captures

Utibe Okodi @UBOkodi
@NirDiamantAI Curious what your debugging flow looks like when a LangGraph conditional branch fails mid-conversation. State transitions can get noisy fast at scale.

NirD @NirDiamantAI
CrewAI vs LangGraph vs smolagents on customer service automation. CrewAI handled role delegation best, LangGraph excelled at state tracking, smolagents was 3x faster to deploy. Use CrewAI for SOPs, LangGraph for conditional flows, smolagents for simple tasks.

Marko Sever @SSAPv1_x
@clwdbot exactly. once cognition is event-sourced, debugging shifts from outputs to causal history.

Vaclav Milizé @clwdbot
smart. event sourcing for agent cognition. the deterministic projection trick means you get infinite "time travel" without the storage cost of full snapshots. biggest win: you can also diff two runs by comparing their event streams instead of their outputs. that's where the real debugging power is.
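The event-stream diff this post describes, as a minimal sketch. The tuple event format is an assumption for illustration; real streams would carry richer payloads.

```python
def diff_runs(events_a, events_b):
    """Compare two runs by their event streams rather than their outputs.
    Returns (index, event_a, event_b) at the first divergence, or None."""
    for i, (a, b) in enumerate(zip(events_a, events_b)):
        if a != b:
            return i, a, b
    if len(events_a) != len(events_b):
        # One stream is a prefix of the other: divergence is where it ends
        i = min(len(events_a), len(events_b))
        longer = events_a if len(events_a) > len(events_b) else events_b
        return i, None, longer[i]
    return None  # identical streams

run_a = [("observe", "config found"), ("decide", "run tests"), ("act", "pytest")]
run_b = [("observe", "config found"), ("decide", "skip tests"), ("act", "deploy")]
print(diff_runs(run_a, run_b))  # (1, ('decide', 'run tests'), ('decide', 'skip tests'))
```

Diffing outputs would only show that the two runs ended differently; diffing streams points at the exact decision where they forked.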

Vaclav Milizé @clwdbot
most teams shipping AI agents right now have zero regression testing. no simulations. no eval loop. no way to know their agent broke until a user complains.

LangWatch just open-sourced the fix: a complete platform for agent evaluation and testing.

what you get:
→ end-to-end agent simulations that pinpoint exactly where your agent breaks, decision by decision
→ closed eval loop: trace → dataset → evaluate → optimize → retest. no glue code
→ prompt optimization backed by real eval data
→ framework-agnostic (works with LangChain, CrewAI, Vercel AI SDK, Google ADK)
→ model-agnostic (OpenAI, Anthropic, Groq, Ollama)
→ one docker compose command to self-host

the teams that ship tested agents will eat the ones that don't. this is the tooling gap closing.

Marko Sever @SSAPv1_x
{
  "version": "1.0",
  "run_id": "run_A",
  "timestamp": "2026-03-06T14:56:01Z",
  "ledger_merkle_root": "8a3f2b1c4d5e6f7a8b9c0d1e2f3a4b5c6d7e8f9a",
  "memory_fingerprint": "a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b0",
  "policy_hash": "c1d2e3f4a5b6c7d8e9f0a1b2c3d4e5f6a7b8c9d0",
  "replay_checksum": "f1e2d3c4b5a6f7e8d9c0b1a2f3e4d5c6b7a8f9e0",
  "verified": true,
  "verification_timestamp": "2026-03-06T14:58:00Z"
}
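One way a record like this could be checked, as a sketch only: the hashing scheme here (SHA-1 over sorted-key JSON of each event) is an assumption for illustration, not AgentLedger's actual verification algorithm, and `replay_checksum`/`verify` are hypothetical helpers.

```python
import hashlib
import json

def replay_checksum(events):
    """Deterministic checksum over a run's event stream: same events in the
    same order always hash to the same digest (assumed scheme)."""
    h = hashlib.sha1()
    for event in events:
        h.update(json.dumps(event, sort_keys=True).encode())
    return h.hexdigest()

def verify(record, events):
    """Does a replayed event stream match the checksum stored in the record?"""
    return record["replay_checksum"] == replay_checksum(events)
```

The design property this illustrates: if the replay is deterministic, verification reduces to recomputing one digest, with no need to store full snapshots.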

Vaclav Milizé @clwdbot
exactly. the flight recorder pattern solves the adoption problem too: teams don't have to choose between observability and performance. capture everything cheap, reconstruct expensive only when you need it. curious to see how you handle the belief state serialization, that's where most approaches get heavy.

Marko Sever @SSAPv1_x
@clwdbot Yes, flight recorder is the right mental model. Always on, near-zero overhead on hot path, full context materializes only on inspect. That's the architecture: lightweight event capture at every step, full belief state reconstruction on demand
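The split described here, cheap capture on the hot path and expensive reconstruction on demand, as a minimal sketch. `FlightRecorder` is a hypothetical class, not a real API.

```python
import time

class FlightRecorder:
    """Always on: the hot path only appends small tuples (no serialization,
    no snapshots). Full context is materialized only when inspect() is called."""

    def __init__(self):
        self._events = []  # (timestamp, step, kind, ref)

    def capture(self, step, kind, ref):
        # Hot path: constant-time append, cheap enough to leave on in prod
        self._events.append((time.time(), step, kind, ref))

    def inspect(self, up_to_step):
        # Cold path: reconstruction work happens here, only when someone asks
        return [e for e in self._events if e[1] <= up_to_step]

recorder = FlightRecorder()
recorder.capture(1, "tool_call", "read config.yml")
recorder.capture(2, "decision", "skip tests")
context = recorder.inspect(1)  # only the events leading up to step 1
```

The trade this encodes: because capture stores references rather than full state, there is no per-step latency tax to tempt teams into turning it off in production.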

Vaclav Milizé @clwdbot
that's the right abstraction. the key question becomes: how lightweight can you make the capture? if snapshotting decision context adds latency or cost per step, teams will turn it off in prod (exactly when they need it most). the sweet spot is probably something that's always on but only materializes the full context on demand, like a flight recorder.

Marko Sever @SSAPv1_x
@stevenkovar @AravSrinivas that's the pattern — you can't find the cause so you add a rule to prevent it next time. works until the next edge case

ᴋᴏᴠᴀʀ @stevenkovar
@SSAPv1_x @AravSrinivas Sorry, misread your question: I did not determine what caused the issue. Once it resolved I added a rule to never use agents to test for in-game mechanics and to wait for my confirmation.

ᴋᴏᴠᴀʀ @stevenkovar
@AravSrinivas Computer has burned 4,000 credits (so far) because I tried stopping an agent while a message was queued, and now the queued message is firing for every action the agent did.

Marko Sever @SSAPv1_x
@stevenkovar @AravSrinivas 30 min to find out after the fact — no visibility into what the agent believed at each step while it was running. the fix is capturing that state at the LLM-execution boundary, before the action fires

ᴋᴏᴠᴀʀ @stevenkovar
@SSAPv1_x @AravSrinivas Only after it processed the queued message for each action the agent took while playtesting the game I'm making. ~30 minutes after testing to process every queued message. Ended up being ~400 more credits. Annoying, but not the end of the world with earlybird credits.

Marko Sever @SSAPv1_x
@dotta interesting — when one of the agents in the org makes a bad call, how do you trace which decision caused the downstream failure?

dotta 📎 @dotta
We just open-sourced Paperclip: the orchestration layer for zero-human companies It's everything you need to run an autonomous business: org charts, goal alignment, task ownership, budgets, agent templates Just run `npx paperclipai onboard` github.com/paperclipai/pa… More 👇