Abhi
@ai_monger

32 posts

Building the mission control for your AI agents. Ex-YC

San Francisco · Joined May 2026
3 Following · 27 Followers

Abhi @ai_monger
So when something breaks in production, there's no control group. No variable isolation. Just a list of suspects and a deadline. Web teams solved this 15 years ago with A/B testing. Agent teams are still guessing. #buildinpublic #aiagents #agentengineering

Abhi @ai_monger
70% of AI teams modify prompts monthly. 50% swap models monthly. Almost none run those changes as controlled experiments.

Abhi retweeted
Syrin AI @syrinlabs
Our 144-case SAGE benchmark:
✅ Hallucination (obvious): 100% accuracy
✅ Hallucination (subtle): 100% (fixed from 75%)
✅ Clean outputs: 83.3% (3 false positives; corpus quality issues)
⚠️ Goal drift (macro): 61.1%
⚠️ Goal drift (subtle): ~20%
Overall precision: 96.7%
The gap on subtle goal drift is real: it shares >60% of its vocabulary with normal outputs. This is an open research problem. #buildinpublic #aiagents

Abhi retweeted
Syrin AI @syrinlabs
The weirdest finding from building SAGE👇
Shannon entropy detects overconfidence in LLMs.
High-confidence hallucinations → repetitive language → low entropy.
Uncertain/confused agents → hedge-filled language → high entropy.
Both deviate from the agent's normal entropy distribution.
Z-score threshold: |z| > 2.0 = high alert.
This replaced a hardcoded confidence-keyword detector and performs better on every domain.
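
For anyone who wants to poke at the idea, here's a minimal sketch in TypeScript. Assumptions on my part: entropy is computed over word frequencies (SAGE may tokenize differently), and the baseline is a plain mean/std over known-good outputs. entropyOf and EntropyMonitor are illustrative names, not SAGE's actual API.

```typescript
// Shannon entropy of an output's word distribution. Repetitive,
// overconfident text scores low; hedge-filled text scores high.
function entropyOf(text: string): number {
  const tokens = text.toLowerCase().split(/\s+/).filter(Boolean);
  const counts = new Map<string, number>();
  for (const t of tokens) counts.set(t, (counts.get(t) ?? 0) + 1);
  let bits = 0;
  for (const c of counts.values()) {
    const p = c / tokens.length;
    bits -= p * Math.log2(p);
  }
  return bits;
}

// Tracks an agent's normal entropy distribution and flags outputs
// that deviate from it in either direction.
class EntropyMonitor {
  private samples: number[] = [];

  // Call with known-good outputs first to establish the baseline.
  addBaseline(output: string): void {
    this.samples.push(entropyOf(output));
  }

  // |z| > 2.0 = high alert: unusually repetitive (possible confident
  // hallucination) or unusually hedge-filled (possible confusion).
  check(output: string): { z: number; alert: boolean } {
    const n = this.samples.length;
    const mean = this.samples.reduce((a, b) => a + b, 0) / n;
    const variance =
      this.samples.reduce((a, b) => a + (b - mean) ** 2, 0) / n;
    const std = Math.sqrt(variance) || 1e-9; // guard against zero spread
    const z = (entropyOf(output) - mean) / std;
    return { z, alert: Math.abs(z) > 2.0 };
  }
}
```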

Abhi @ai_monger
The V6→V7 jump on the coding domain was the most surprising. Learning APIs from the baseline beats having an LLM guess what's valid.

Abhi @ai_monger
V7 (zero HC)
- Hallucination: 100%
- Coding domain: 100%
- Hardcoding: None
- Cost/call: Low
- Generalizes: Yes

Abhi @ai_monger
How SAGE V7 compares to earlier approaches:
V5 (embeddings)
- Hallucination: Good
- Coding domain: Partial
- Hardcoding: Some
- Cost/call: Low
- Generalizes: Partial

Abhi @ai_monger
3. API registry never populated: the code path that extracts APIs from knowledge bases was broken from day one.
4. Empty-centroid false positives: [] is truthy in JS, so the "no baseline yet" guard never fired and every clean output triggered false alarms (sketch below).
Embarrassing😶. Sharing anyway. #buildinpublic #AIagents
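
A guess at what bug 4 probably looked like (the actual code isn't public, so the variable name is mine):

```typescript
// In JavaScript/TypeScript, an empty array is truthy, so this
// "do we have a baseline yet?" guard never fails:
let centroid: number[] = [];

if (centroid) {
  // ❌ Always reached, even before any baseline data exists,
  // so every clean output got scored against a zero-length
  // centroid and raised a false alarm.
}

// Fix: test the length, not the array itself.
if (centroid.length > 0) {
  // ✅ Only score outputs once a real centroid exists.
}
```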

Abhi @ai_monger
4 real bugs that nearly killed our project:
1. Weight lookup bug: kebab-case keys, camelCase lookups. Every detector silently fell back to equal weighting. Took weeks to notice.
2. Cache key bug: output.substring(0, 100) as the cache key. Appended text returned stale results.
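
Hypothetical reconstructions of bugs 1 and 2 (names and structure are my guesses, not the real code):

```typescript
import { createHash } from 'node:crypto';

// Bug 1: weights stored under kebab-case keys but looked up in
// camelCase. The lookup misses, and the fallback hides the error.
const weights: Record<string, number> = { 'goal-drift': 0.4 };
const DEFAULT_WEIGHT = 0.25;
const w = weights['goalDrift'] ?? DEFAULT_WEIGHT; // ❌ silently 0.25

// Bug 2: a 100-char prefix as the cache key. Any two outputs that
// share a prefix collide, so appended text returns a stale score.
function badCacheKey(output: string): string {
  return output.substring(0, 100); // ❌ not unique past 100 chars
}

// One possible fix: key on a hash of the full output instead.
function goodCacheKey(output: string): string {
  return createHash('sha256').update(output).digest('hex'); // ✅
}
```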

Abhi retweeted
Syrin AI @syrinlabs
You're running a multi-agent AI pipeline. An agent starts producing outputs that subtly diverge from its original goal. How do you catch it? Genuinely curious. Vote for your answer.

Abhi @ai_monger
v0.12.0 is live. Now comes the hard question: How do you know your agent is still performing well a week after deployment? We've been running an experiment on our own agents. Same queries, every day, for 30 days. The accuracy curve is surprising. Sharing more soon.

Abhi @ai_monger
@syrinlabs v0.12.0 is live. Now the hard question: How do you know your agent is still performing well a week after deployment? We've been running an experiment on our own agents. Same queries, every day, for 30 days. The accuracy curve is surprising. Sharing soon.

Abhi retweeted
Divyanshu Shekhar @dshekhar17
most interesting finding from our drift benchmark so far:
day 1: agent hedges 30% of responses ("I think", "probably", "I'm not sure")
day 15: hedges 12%
day 30: hedges 3%
the agent didn't get smarter. it got bolder. accuracy went DOWN while assertiveness went UP.
confidence calibration is the single most predictive dimension for production failure. when your agent stops saying "I think", that's when you should worry.
full data very soon.
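
A toy version of the measurement as I understand it (my illustration, not @dshekhar17's benchmark code):

```typescript
// Fraction of a day's responses containing at least one hedge phrase.
const HEDGES = [/\bi think\b/, /\bprobably\b/, /\bi'm not sure\b/];

function hedgeRate(responses: string[]): number {
  const hedged = responses.filter((r) =>
    HEDGES.some((h) => h.test(r.toLowerCase()))
  ).length;
  return hedged / responses.length;
}

// Logged daily, a falling hedge rate alongside flat or falling
// accuracy is the "bolder, not smarter" signal described above.
```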

Abhi retweeted
Divyanshu Shekhar @dshekhar17
the agent that crashes is safer than the agent that confidently lies.
a crash stops the damage. you know immediately. alerts fire. users see an error page. blast radius: one session.
a confident wrong answer? nothing happens. no alert. the user trusts it. acts on it. tells others. the wrong information propagates for days.
crash = contained failure with a $0 blast radius.
confident wrong answer = trust erosion with an unbounded blast radius.
yet every team optimizes for uptime instead of correctness.

Abhi retweeted
Syrin AI @syrinlabs
Agent Drift benchmark preview👇
We ran the same 30 queries against an agent every day for 30 days.
Day 1: 91% accuracy
Day 15: 84% accuracy
Day 30: 72% accuracy
Same model, prompt, and code. Zero changes.
The agent got worse. Nobody changed anything. And in production, nobody would have noticed without measuring.
We'll share the full methodology and data very soon.
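
A sketch of what a daily drift probe could look like (runAgent and scoreAnswer are stand-ins for your own agent call and grading logic; this is not Syrin's published methodology):

```typescript
type Probe = { query: string; expected: string };

// Run the same fixed query set and return the day's accuracy.
async function dailyAccuracy(
  probes: Probe[],
  runAgent: (q: string) => Promise<string>,
  scoreAnswer: (got: string, expected: string) => boolean
): Promise<number> {
  let correct = 0;
  for (const p of probes) {
    const answer = await runAgent(p.query); // same 30 queries, every day
    if (scoreAnswer(answer, p.expected)) correct++;
  }
  return correct / probes.length;
}

// Schedule once a day (cron, CI job) and log the series: a downward
// trend with zero code changes is drift, not a regression you shipped.
```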

Abhi @ai_monger
Preview from our drift benchmark👇 The most surprising finding wasn't that accuracy dropped. It was where it dropped. The agent maintained near-perfect accuracy on simple queries for 30 days. The degradation was concentrated in complex, multi-step tasks. Full data soon.

Abhi @ai_monger
@syrinlabs v0.12.0 is live. Now comes the hard question: how do you know your agent is still performing well a week after deployment? We've been running an experiment on our own agents. Same queries, every day, for 30 days. More data coming this week📉

Abhi @ai_monger
Not just more features. More guardrails too.
pip install syrin --upgrade (tomorrow)
syrin.ai

Abhi @ai_monger
Tomorrow we ship the version of @syrinlabs I've wanted to exist for 8 months. The pitch is simple: your agent should have a budget it can't exceed, a sandbox it can't escape, and approvals that don't vanish when the process restarts. That's v0.12.0.