Abhi
@ai_monger

32 posts

Building the mission control for your AI agents. Ex-YC

San Francisco · Joined May 2026
3 Following · 27 Followers

Abhi @ai_monger
So when something breaks in production, there's no control group. No variable isolation. Just a list of suspects and a deadline. Web teams solved this 15 years ago with A/B testing. Agent teams are still guessing. #buildinpublic #aiagents #agentengineering

Abhi @ai_monger
70% of AI teams modify prompts monthly. 50% swap models monthly. Almost none run those changes as controlled experiments.

Abhi retweeted
Syrin AI @syrinlabs
Our 144-case SAGE benchmark:
✅ Hallucination (obvious): 100% accuracy
✅ Hallucination (subtle): 100% (fixed from 75%)
✅ Clean outputs: 83.3% (3 false positives; corpus quality issues)
⚠️ Goal drift (macro): 61.1%
⚠️ Goal drift (subtle): ~20%
Overall precision: 96.7%
The gap on subtle goal drift is real: it shares >60% of its vocabulary with normal outputs. This is an open research problem. #buildinpublic #aiagents

Abhi retweeted
Syrin AI @syrinlabs
The weirdest finding from building SAGE👇
Shannon entropy detects overconfidence in LLMs.
High-confidence hallucinations → repetitive language → low entropy.
Uncertain/confused agents → hedge-filled language → high entropy.
Both deviate from the agent's normal entropy distribution.
Z-score threshold: |z| > 2.0 = high alert.
This replaced a hardcoded confidence-keyword detector and performs better on every domain.
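
For anyone who wants to poke at the idea, here's a minimal sketch in TypeScript. Assumptions on my part: entropy is computed over word frequencies (SAGE may tokenize differently), and the baseline is a plain mean/std over known-good outputs. entropyOf and EntropyMonitor are illustrative names, not SAGE's actual API.

```typescript
// Shannon entropy of an output's word distribution. Repetitive,
// overconfident text scores low; hedge-filled text scores high.
function entropyOf(text: string): number {
  const tokens = text.toLowerCase().split(/\s+/).filter(Boolean);
  const counts = new Map<string, number>();
  for (const t of tokens) counts.set(t, (counts.get(t) ?? 0) + 1);
  let bits = 0;
  for (const c of counts.values()) {
    const p = c / tokens.length;
    bits -= p * Math.log2(p);
  }
  return bits;
}

// Tracks an agent's normal entropy distribution and flags outputs
// that deviate from it in either direction.
class EntropyMonitor {
  private samples: number[] = [];

  // Call with known-good outputs first to establish the baseline.
  addBaseline(output: string): void {
    this.samples.push(entropyOf(output));
  }

  // |z| > 2.0 = high alert: unusually repetitive (possible confident
  // hallucination) or unusually hedge-filled (possible confusion).
  check(output: string): { z: number; alert: boolean } {
    const n = this.samples.length;
    const mean = this.samples.reduce((a, b) => a + b, 0) / n;
    const variance =
      this.samples.reduce((a, b) => a + (b - mean) ** 2, 0) / n;
    const std = Math.sqrt(variance) || 1e-9; // guard against zero spread
    const z = (entropyOf(output) - mean) / std;
    return { z, alert: Math.abs(z) > 2.0 };
  }
}
```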

Abhi @ai_monger
The V6→V7 jump on the coding domain was the most surprising. Learning APIs from the baseline beats having an LLM guess what's valid.

Abhi @ai_monger
V7 (zero HC)
- Hallucination: 100%
- Coding domain: 100%
- Hardcoding: None
- Cost/call: Low
- Generalizes: Yes

Abhi @ai_monger
How SAGE V7 compares to earlier approaches:
V5 (embeddings)
- Hallucination: Good
- Coding domain: Partial
- Hardcoding: Some
- Cost/call: Low
- Generalizes: Partial

Abhi @ai_monger
3. API registry never populated: the code path that extracts APIs from knowledge bases was broken from day one.
4. Empty-centroid false positives: [] is truthy in JS, so the "no baseline yet" guard never fired and every clean output triggered false alarms (sketch below).
Embarrassing😶. Sharing anyway. #buildinpublic #AIagents
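
A guess at what bug 4 probably looked like (the actual code isn't public, so the variable name is mine):

```typescript
// In JavaScript/TypeScript, an empty array is truthy, so this
// "do we have a baseline yet?" guard never fails:
let centroid: number[] = [];

if (centroid) {
  // ❌ Always reached, even before any baseline data exists,
  // so every clean output got scored against a zero-length
  // centroid and raised a false alarm.
}

// Fix: test the length, not the array itself.
if (centroid.length > 0) {
  // ✅ Only score outputs once a real centroid exists.
}
```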

Abhi @ai_monger
4 real bugs that nearly killed our project:
1. Weight lookup bug: kebab-case keys, camelCase lookups. Every detector silently fell back to equal weighting. Took weeks to notice.
2. Cache key bug: output.substring(0, 100) as the cache key. Appended text returned stale results.
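
Hypothetical reconstructions of bugs 1 and 2 (names and structure are my guesses, not the real code):

```typescript
import { createHash } from 'node:crypto';

// Bug 1: weights stored under kebab-case keys but looked up in
// camelCase. The lookup misses, and the fallback hides the error.
const weights: Record<string, number> = { 'goal-drift': 0.4 };
const DEFAULT_WEIGHT = 0.25;
const w = weights['goalDrift'] ?? DEFAULT_WEIGHT; // ❌ silently 0.25

// Bug 2: a 100-char prefix as the cache key. Any two outputs that
// share a prefix collide, so appended text returns a stale score.
function badCacheKey(output: string): string {
  return output.substring(0, 100); // ❌ not unique past 100 chars
}

// One possible fix: key on a hash of the full output instead.
function goodCacheKey(output: string): string {
  return createHash('sha256').update(output).digest('hex'); // ✅
}
```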

Abhi retweeted
Syrin AI @syrinlabs
You're running a multi-agent AI pipeline. An agent starts producing outputs that subtly diverge from its original goal. How do you catch it? Genuinely curious. Vote for your answer.

Abhi @ai_monger
v0.12.0 is live. Now comes the hard question: How do you know your agent is still performing well a week after deployment? We've been running an experiment on our own agents. Same queries, every day, for 30 days. The accuracy curve is surprising. Sharing more soon.

Abhi @ai_monger
@syrinlabs v0.12.0 is live. Now the hard question: How do you know your agent is still performing well a week after deployment? We've been running an experiment on our own agents. Same queries, every day, for 30 days. The accuracy curve is surprising. Sharing soon.

Abhi retweeted
Divyanshu Shekhar @dshekhar17
most interesting finding from our drift benchmark so far:
day 1: agent hedges 30% of responses ("I think", "probably", "I'm not sure")
day 15: hedges 12%
day 30: hedges 3%
the agent didn't get smarter. it got bolder. accuracy went DOWN while assertiveness went UP.
confidence calibration is the single most predictive dimension for production failure. when your agent stops saying "I think", that's when you should worry.
full data very soon.
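
A toy version of the measurement as I understand it (my illustration, not @dshekhar17's benchmark code):

```typescript
// Fraction of a day's responses containing at least one hedge phrase.
const HEDGES = [/\bi think\b/, /\bprobably\b/, /\bi'm not sure\b/];

function hedgeRate(responses: string[]): number {
  const hedged = responses.filter((r) =>
    HEDGES.some((h) => h.test(r.toLowerCase()))
  ).length;
  return hedged / responses.length;
}

// Logged daily, a falling hedge rate alongside flat or falling
// accuracy is the "bolder, not smarter" signal described above.
```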

Abhi retweeted
Divyanshu Shekhar @dshekhar17
the agent that crashes is safer than the agent that confidently lies.
a crash stops the damage. you know immediately. alerts fire. users see an error page. blast radius: one session.
a confident wrong answer? nothing happens. no alert. the user trusts it. acts on it. tells others. the wrong information propagates for days.
crash = contained failure with a $0 blast radius.
confident wrong answer = trust erosion with an unbounded blast radius.
yet every team optimizes for uptime instead of correctness.

Abhi retweeted
Syrin AI @syrinlabs
Agent Drift benchmark preview👇
We ran the same 30 queries against an agent every day for 30 days.
Day 1: 91% accuracy
Day 15: 84% accuracy
Day 30: 72% accuracy
Same model, prompt, and code. Zero changes.
The agent got worse. Nobody changed anything. And in production, nobody would have noticed without measuring.
We'll share the full methodology and data very soon.
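
A sketch of what a daily drift probe could look like (runAgent and scoreAnswer are stand-ins for your own agent call and grading logic; this is not Syrin's published methodology):

```typescript
type Probe = { query: string; expected: string };

// Run the same fixed query set and return the day's accuracy.
async function dailyAccuracy(
  probes: Probe[],
  runAgent: (q: string) => Promise<string>,
  scoreAnswer: (got: string, expected: string) => boolean
): Promise<number> {
  let correct = 0;
  for (const p of probes) {
    const answer = await runAgent(p.query); // same 30 queries, every day
    if (scoreAnswer(answer, p.expected)) correct++;
  }
  return correct / probes.length;
}

// Schedule once a day (cron, CI job) and log the series: a downward
// trend with zero code changes is drift, not a regression you shipped.
```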

Abhi @ai_monger
Preview from our drift benchmark👇 The most surprising finding wasn't that accuracy dropped. It was where it dropped. The agent maintained near-perfect accuracy on simple queries for 30 days. The degradation was concentrated in complex, multi-step tasks. Full data soon.

Abhi @ai_monger
@syrinlabs v0.12.0 is live. Now comes the hard question: how do you know your agent is still performing well a week after deployment? We've been running an experiment on our own agents. Same queries, every day, for 30 days. More data coming this week📉

Abhi @ai_monger
Not just more features. More guardrails too.
pip install syrin --upgrade (tomorrow)
syrin.ai

Abhi @ai_monger
Tomorrow we ship the version of @syrinlabs I've wanted to exist for 8 months. The pitch is simple: your agent should have a budget it can't exceed, a sandbox it can't escape, and approvals that don't vanish when the process restarts. That's v0.12.0.