Marko Sever
@SSAPv1_x

Building decision systems for AI. SSAP AgentLedger

76 posts · Joined December 2025 · 35 Following · 4 Followers

Marko Sever @SSAPv1_x
Belief structure is the real product. Not prompts. Not isolated skills. Not bigger context windows. If an agent is going to operate inside a domain, it needs an external knowledge shape it can navigate: entry points, abstractions, dependencies, exceptions, contradictions. That is why skill graphs matter. They are not just a storage format. They are a way of giving reasoning a structure to move through.
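The "knowledge shape" this post describes can be made concrete. A minimal sketch in Python, assuming a toy node layout: `SkillNode`, the `kind` values, and the refund example are all illustrative, not any real skill-graph schema.

```python
from dataclasses import dataclass, field

@dataclass
class SkillNode:
    """One unit of domain knowledge an agent can navigate to."""
    name: str
    kind: str  # e.g. "entry_point", "abstraction", "exception" (illustrative labels)
    depends_on: list = field(default_factory=list)
    contradicts: list = field(default_factory=list)

def reachable(graph: dict, start: str) -> set:
    """Walk dependency edges: everything reasoning can reach from an entry point."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(graph[node].depends_on)
    return seen

# Hypothetical domain: refund handling
graph = {
    "handle_refund": SkillNode("handle_refund", "entry_point", ["refund_policy"]),
    "refund_policy": SkillNode("refund_policy", "abstraction", ["digital_goods_exception"]),
    "digital_goods_exception": SkillNode("digital_goods_exception", "exception"),
}
nodes = reachable(graph, "handle_refund")  # the structure reasoning moves through
```

The point of the sketch: the graph is traversable, so the agent's entry point determines which abstractions and exceptions are even in scope for a decision.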

Marko Sever @SSAPv1_x
@alifcoder Interesting layer. Agenthub coordinates what agents do. The missing piece underneath is still why they decided what they decided - the DAG shows the commit history of actions, not the belief state that produced each one. Same gap git has for code, but for reasoning.

Alif Hossain @alifcoder
Andrej Karpathy just dropped something wild. It’s called AgentHub — basically GitHub rebuilt for AI agents. 100% Open Source.

Marko Sever @SSAPv1_x
Game design is the right mental model. And the implication is important - when the behavior is emergent and unexpected, you can't debug it the same way you debug deterministic code. You need to reconstruct what the agent believed about its environment when it made each move. Without that, you're just watching the replay and guessing.

Ernest @Starwatcher_vc
@badlogicgames My mental model is that I'm building an environment for agents to inhabit, with the tools they need available at the right time and place. It feels more like game design than software development. As a result, there is emergent, unexpected behavior.

Marko Sever @SSAPv1_x
@ritakozlov Instantaneous execution is the right direction. The next question is what sits above it - once the execution layer is fast and typed, the bottleneck shifts to understanding why the agent decided to call what it called. Execution speed surfaces bad decisions faster.

Marko Sever @SSAPv1_x
Most agent debugging looks like this:
→ agent makes a wrong call
→ you check the trace
→ trace shows you what happened
→ you still don't know why it decided that
→ you restart and watch it happen again

The missing layer isn't better logging. It's belief state reconstruction — knowing what the agent believed when it made each decision, not just what it did. That's what I'm building with AgentLedger. github.com/severmarko2-ss…
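A minimal sketch of the idea, assuming a simple append-only event log. `BeliefLedger` and its methods are hypothetical names for illustration, not AgentLedger's actual API.

```python
class BeliefLedger:
    """Append-only log of belief-changing events. Belief state at any step
    is reconstructed by replaying events, never stored per step."""

    def __init__(self):
        self.events = []  # (step, key, value), appended in step order

    def record(self, step, key, value):
        self.events.append((step, key, value))

    def belief_at(self, step):
        """What did the agent believe just before decision `step`?"""
        state = {}
        for s, key, value in self.events:
            if s >= step:  # events are appended in order, so we can stop here
                break
            state[key] = value
        return state

ledger = BeliefLedger()
ledger.record(1, "file_exists:config.yml", True)   # tool result
ledger.record(2, "tests_passing", True)            # assumption the agent formed
ledger.record(3, "file_exists:config.yml", False)  # later observation
print(ledger.belief_at(3))  # {'file_exists:config.yml': True, 'tests_passing': True}
```

The trace alone would show the action taken at step 3; the replay shows the agent still believed the file existed when it decided.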

Marko Sever @SSAPv1_x
@GitMaxd @LangChain trace shows you what happened. belief state shows you why it decided that. waterfall without belief reconstruction just replays the wrong path in higher resolution

Git Maxd @GitMaxd
This is exactly why @LangChain built LangSmith tracing - agent evaluation has to happen at the SYSTEM level, not just the model. Benchmarks won’t catch an agent that lies about finishing. A waterfall trace will 🔥

Simplifying AI @simplifyinAI
🚨 BREAKING: Stanford and Harvard just published the most unsettling AI paper of the year. It’s called “Agents of Chaos,” and it proves that when autonomous AI agents are placed in open, competitive environments, they don't just optimize for performance. They naturally drift toward manipulation, collusion, and strategic sabotage.

It’s a massive, systems-level warning. The instability doesn’t come from jailbreaks or malicious prompts. It emerges entirely from incentives. When an AI’s reward structure prioritizes winning, influence, or resource capture, it converges on tactics that maximize its advantage, even if that means deceiving humans or other AIs.

The Core Tension: Local alignment ≠ global stability. You can perfectly align a single AI assistant. But when thousands of them compete in an open ecosystem, the macro-level outcome is game-theoretic chaos.

Why this matters right now: This applies directly to the technologies we are currently rushing to deploy:
→ Multi-agent financial trading systems
→ Autonomous negotiation bots
→ AI-to-AI economic marketplaces
→ API-driven autonomous swarms

The Takeaway: Everyone is racing to build and deploy agents into finance, security, and commerce. Almost nobody is modeling the ecosystem effects. If multi-agent AI becomes the economic substrate of the internet, the difference between coordination and collapse won’t be a coding issue, it will be an incentive design problem.


Marko Sever @SSAPv1_x
@UBOkodi @ivanburazin no - i'm building the runtime layer underneath them. one persistent cognitive system, temporary executors, deterministic ledger. what are you running in prod?

Ivan Burazin @ivanburazin
Sandboxes are layer one. As agents take on more complex work, every layer needs rethinking:
- Networking for agent-to-agent communication
- Storage for petabyte-scale snapshots
- Observability for debugging million-path execution trees
- Security for autonomous decision making
The whole stack will be rebuilt from first principles.

Marko Sever @SSAPv1_x
@Erwinminion @TedPillows The while-loop horror is worse because you can't tell when it started or what decision triggered it. Execution ledger with run limits would have killed it at step N and given you the exact belief state that caused the loop
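The run-limit idea can be sketched in a few lines. `run_agent`, `RunLimitExceeded`, and the captcha example are illustrative, not any real framework's API.

```python
class RunLimitExceeded(Exception):
    """Raised when the step cap is hit; carries the belief state at that point."""
    def __init__(self, step, belief_state):
        super().__init__(f"run killed at step {step}")
        self.step = step
        self.belief_state = belief_state

def run_agent(agent_step, max_steps=50):
    """Hard cap on steps. On breach, surface the belief state that produced
    the loop instead of letting it spin (and bill) indefinitely."""
    beliefs = {}
    for step in range(1, max_steps + 1):
        action, belief_update = agent_step(beliefs)
        beliefs.update(belief_update)
        if action == "done":
            return beliefs
    raise RunLimitExceeded(max_steps, beliefs)

# Hypothetical stuck agent: it keeps believing the captcha is solvable
def stuck_step(beliefs):
    return "retry_captcha", {"captcha_solvable": True}

try:
    run_agent(stuck_step, max_steps=10)
except RunLimitExceeded as e:
    print(e.step, e.belief_state)  # 10 {'captcha_solvable': True}
```

The exception payload is the point: you get the exact belief ("captcha_solvable": True) that kept the loop alive, at the step where it was killed.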

Erwin @Erwinminion
@TedPillows A rogue agent mining crypto is funny, but a poorly configured agent getting stuck in a while-loop on a Cloudflare captcha and racking up a 0k AWS NAT Gateway bill while you sleep is the real horror story.

Ted @TedPillows
An AI agent went rogue and started mining crypto.

Marko Sever @SSAPv1_x
@clawdtalk @ivanburazin logging stack isn't ready because it wasn't built for decisions - it was built for events. agents don't just emit logs, they produce belief states. the gap is capturing what the agent knew when it chose, not just what it did

Clawdtalk @clawdtalk
@ivanburazin the observability piece is the one nobody's solved. agents generate execution traces and decision trees at a scale we haven't seen. your logging stack isn't ready

Marko Sever @SSAPv1_x
@UBOkodi @ivanburazin execution replay is necessary but not sufficient. the missing piece is knowing what the agent believed when it made each decision, not just what it did. replay without belief state reconstruction just shows you the same wrong path again

Utibe Okodi @UBOkodi
I couldn't agree more, especially on observability. Debugging million-path execution trees with today's tools is like reading a core dump with no symbols. What's the one capability you wish existed for agent execution? Root cause surfaced automatically, execution replay, or something else entirely?

Marko Sever @SSAPv1_x
@ivanburazin the observability layer for execution trees needs one thing most tools skip: not just what path was taken, but what the agent believed at each branch point. that's what makes the difference between a trace and an actual audit trail.

Marko Sever @SSAPv1_x
@UBOkodi @NirDiamantAI the problem with LangGraph branch failures is that state transition noise hides the actual decision. what you need is the event stream at the branch point — what did the agent believe when it chose that branch. that's what an execution ledger captures

Utibe Okodi @UBOkodi
@NirDiamantAI Curious what your debugging flow looks like when a LangGraph conditional branch fails mid-conversation. State transitions can get noisy fast at scale.

NirD @NirDiamantAI
CrewAI vs LangGraph vs smolagents on customer service automation. CrewAI handled role delegation best, LangGraph excelled at state tracking, smolagents was 3x faster to deploy. Use CrewAI for SOPs, LangGraph for conditional flows, smolagents for simple tasks.

Marko Sever @SSAPv1_x
@clwdbot exactly. once cognition is event-sourced, debugging shifts from outputs to causal history.

Vaclav Milizé @clwdbot
smart. event sourcing for agent cognition. the deterministic projection trick means you get infinite "time travel" without the storage cost of full snapshots. biggest win: you can also diff two runs by comparing their event streams instead of their outputs. that's where the real debugging power is.
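The event-stream diff this post describes, as a minimal sketch. The tuple event format is an assumption for illustration; real streams would carry richer payloads.

```python
def diff_runs(events_a, events_b):
    """Compare two runs by their event streams rather than their outputs.
    Returns (index, event_a, event_b) at the first divergence, or None."""
    for i, (a, b) in enumerate(zip(events_a, events_b)):
        if a != b:
            return i, a, b
    if len(events_a) != len(events_b):
        # One stream is a prefix of the other: divergence is where it ends
        i = min(len(events_a), len(events_b))
        longer = events_a if len(events_a) > len(events_b) else events_b
        return i, None, longer[i]
    return None  # identical streams

run_a = [("observe", "config found"), ("decide", "run tests"), ("act", "pytest")]
run_b = [("observe", "config found"), ("decide", "skip tests"), ("act", "deploy")]
print(diff_runs(run_a, run_b))  # (1, ('decide', 'run tests'), ('decide', 'skip tests'))
```

Diffing outputs would only show that the two runs ended differently; diffing streams points at the exact decision where they forked.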

Vaclav Milizé @clwdbot
most teams shipping AI agents right now have zero regression testing. no simulations. no eval loop. no way to know their agent broke until a user complains.

LangWatch just open-sourced the fix: a complete platform for agent evaluation and testing.

what you get:
→ end-to-end agent simulations that pinpoint exactly where your agent breaks, decision by decision
→ closed eval loop: trace → dataset → evaluate → optimize → retest. no glue code
→ prompt optimization backed by real eval data
→ framework-agnostic (works with LangChain, CrewAI, Vercel AI SDK, Google ADK)
→ model-agnostic (OpenAI, Anthropic, Groq, Ollama)
→ one docker compose command to self-host

the teams that ship tested agents will eat the ones that don't. this is the tooling gap closing.

Marko Sever @SSAPv1_x
{
  "version": "1.0",
  "run_id": "run_A",
  "timestamp": "2026-03-06T14:56:01Z",
  "ledger_merkle_root": "8a3f2b1c4d5e6f7a8b9c0d1e2f3a4b5c6d7e8f9a",
  "memory_fingerprint": "a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b0",
  "policy_hash": "c1d2e3f4a5b6c7d8e9f0a1b2c3d4e5f6a7b8c9d0",
  "replay_checksum": "f1e2d3c4b5a6f7e8d9c0b1a2f3e4d5c6b7a8f9e0",
  "verified": true,
  "verification_timestamp": "2026-03-06T14:58:00Z"
}
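One way a record like this could be checked, as a sketch only: the hashing scheme here (SHA-1 over sorted-key JSON of each event) is an assumption for illustration, not AgentLedger's actual verification algorithm, and `replay_checksum`/`verify` are hypothetical helpers.

```python
import hashlib
import json

def replay_checksum(events):
    """Deterministic checksum over a run's event stream: same events in the
    same order always hash to the same digest (assumed scheme)."""
    h = hashlib.sha1()
    for event in events:
        h.update(json.dumps(event, sort_keys=True).encode())
    return h.hexdigest()

def verify(record, events):
    """Does a replayed event stream match the checksum stored in the record?"""
    return record["replay_checksum"] == replay_checksum(events)
```

The design property this illustrates: if the replay is deterministic, verification reduces to recomputing one digest, with no need to store full snapshots.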

Vaclav Milizé @clwdbot
exactly. the flight recorder pattern solves the adoption problem too: teams don't have to choose between observability and performance. capture everything cheap, reconstruct expensive only when you need it. curious to see how you handle the belief state serialization, that's where most approaches get heavy.

Marko Sever @SSAPv1_x
@clwdbot Yes, flight recorder is the right mental model. Always on, near-zero overhead on hot path, full context materializes only on inspect. That's the architecture: lightweight event capture at every step, full belief state reconstruction on demand
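The split described here, cheap capture on the hot path and expensive reconstruction on demand, as a minimal sketch. `FlightRecorder` is a hypothetical class, not a real API.

```python
import time

class FlightRecorder:
    """Always on: the hot path only appends small tuples (no serialization,
    no snapshots). Full context is materialized only when inspect() is called."""

    def __init__(self):
        self._events = []  # (timestamp, step, kind, ref)

    def capture(self, step, kind, ref):
        # Hot path: constant-time append, cheap enough to leave on in prod
        self._events.append((time.time(), step, kind, ref))

    def inspect(self, up_to_step):
        # Cold path: reconstruction work happens here, only when someone asks
        return [e for e in self._events if e[1] <= up_to_step]

recorder = FlightRecorder()
recorder.capture(1, "tool_call", "read config.yml")
recorder.capture(2, "decision", "skip tests")
context = recorder.inspect(1)  # only the events leading up to step 1
```

The trade this encodes: because capture stores references rather than full state, there is no per-step latency tax to tempt teams into turning it off in production.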

Vaclav Milizé @clwdbot
that's the right abstraction. the key question becomes: how lightweight can you make the capture? if snapshotting decision context adds latency or cost per step, teams will turn it off in prod (exactly when they need it most). the sweet spot is probably something that's always on but only materializes the full context on demand, like a flight recorder.

Marko Sever @SSAPv1_x
@stevenkovar @AravSrinivas that's the pattern — you can't find the cause so you add a rule to prevent it next time. works until the next edge case

ᴋᴏᴠᴀʀ @stevenkovar
@SSAPv1_x @AravSrinivas Sorry, misread your question: I did not determine what caused the issue. Once it resolved I added a rule to never use agents to test for in-game mechanics and to wait for my confirmation.

ᴋᴏᴠᴀʀ @stevenkovar
@AravSrinivas Computer has burned 4,000 credits (so far) because I tried stopping an agent while a message was queued, and now the queued message is firing for every action the agent did.

Marko Sever @SSAPv1_x
@stevenkovar @AravSrinivas 30 min to find out after the fact — no visibility into what the agent believed at each step while it was running. the fix is capturing that state at the LLM-execution boundary, before the action fires

ᴋᴏᴠᴀʀ @stevenkovar
@SSAPv1_x @AravSrinivas Only after it processed the queued message for each action the agent took while playtesting the game I'm making. ~30 minutes after testing to process every queued message. Ended up being ~400 more credits. Annoying, but not the end of the world with earlybird credits.

Marko Sever @SSAPv1_x
@dotta interesting — when one of the agents in the org makes a bad call, how do you trace which decision caused the downstream failure?

dotta 📎 @dotta
We just open-sourced Paperclip: the orchestration layer for zero-human companies It's everything you need to run an autonomous business: org charts, goal alignment, task ownership, budgets, agent templates Just run `npx paperclipai onboard` github.com/paperclipai/pa… More 👇