Syrin AI
116 posts

Syrin AI
@syrinlabs
The mission control for your AI agents in production.
San Francisco · Joined February 2026
5 Following · 5 Followers

Drop your startup below. I read every single one. The best get featured to 45k founders in Launch Llama 👇 #buildinpublic


@syrinlabs This is the right direction. Production A/B testing beats offline opinion every time, as long as the guardrails are strong enough to keep bad variants from shipping.

We built A/B testing for AI agents.
Not offline evals. Not test datasets.
Actual production traffic split between config variants.
Prompt A vs Prompt B.
GPT-4o vs GPT-4o-mini.
Temperature 0.3 vs 0.7.
All running simultaneously on real users.
With statistical confidence before you commit.
Would you like to give it a try for free? Link in comments.
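
The split above can be sketched in a few lines: a minimal, hypothetical version that uses deterministic hash-based bucketing (so a user always sees the same variant) and a two-proportion z-test for the confidence check. Variant names, weights, and the sample counts are illustrative assumptions, not Syrin's actual implementation.

```python
import hashlib
import math

# Hypothetical variant table: config name -> share of production traffic.
VARIANTS = [("prompt_a", 0.5), ("prompt_b", 0.5)]

def assign_variant(user_id: str) -> str:
    """Deterministic bucketing: hash the user id into [0, 1) and walk the weights."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 10_000 / 10_000
    cumulative = 0.0
    for name, weight in VARIANTS:
        cumulative += weight
        if bucket < cumulative:
            return name
    return VARIANTS[-1][0]

def two_proportion_z(success_a: int, n_a: int, success_b: int, n_b: int) -> float:
    """z-statistic for 'did variant B's success rate differ from A's?'"""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# |z| > 1.96 is roughly 95% confidence before committing to a variant.
z = two_proportion_z(480, 1000, 530, 1000)
```

Hashing instead of random assignment matters in production: it keeps each user's experience stable across sessions without storing any assignment state.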

How SAGE detects internal contradictions:
1. Split output into sentences
2. Embed each sentence via @OpenAI
3. Find sentence pairs with high similarity + opposing meaning
4. Flag semantic negation patterns ("use X" + "don't use X")
Zero keywords. Pure embedding math.
The scary thing: this fires in production.
Real LLMs contradict themselves. Often in the same paragraph.
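
The four steps above can be sketched as follows. This is a hedged toy version: the real pipeline embeds via OpenAI's API, so here the vectors are passed in precomputed, and "high similarity + opposing meaning" is read as pairs that are near-identical in embedding space but project to opposite sides of a negation direction. Both the toy vectors and the negation-axis trick are illustrative assumptions.

```python
import math
import re

def split_sentences(text: str) -> list[str]:
    # Step 1: naive split on terminal punctuation.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def find_contradictions(sentences, vecs, neg_axis, sim_threshold=0.85):
    """Steps 3-4: flag pairs that are highly similar overall but point in
    opposite directions along a negation axis. No keyword lists involved."""
    flagged = []
    for i in range(len(vecs)):
        for j in range(i + 1, len(vecs)):
            same_topic = cosine(vecs[i], vecs[j]) > sim_threshold
            proj_i = sum(x * y for x, y in zip(vecs[i], neg_axis))
            proj_j = sum(x * y for x, y in zip(vecs[j], neg_axis))
            if same_topic and proj_i * proj_j < 0:
                flagged.append((sentences[i], sentences[j]))
    return flagged
```

With real embeddings, the negation axis would have to be estimated (for example, from averaged differences between affirmed and negated sentence pairs) rather than hand-picked.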

Flesch-Kincaid Grade Level is a formula from 1975. It detects AI agent drift in 2026.
0.39×(words/sentences) + 11.8×(syllables/words) - 15.59
When an agent drifts, its writing complexity often changes.
A technical coding agent suddenly writing like a press release: FKGL drops, drift fires.
Old formula. Real signal. No LLM judge required.
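
The formula above drops straight into code. A minimal sketch, assuming a crude vowel-group syllable counter (a common approximation) and a z-score drift check against the agent's baseline; the baseline numbers and threshold here are illustrative assumptions.

```python
import re

def count_syllables(word: str) -> int:
    # Rough heuristic: count vowel groups, minimum one per word.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fkgl(text: str) -> float:
    """Flesch-Kincaid Grade Level, exactly as in the 1975 formula."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * (len(words) / len(sentences)) + 11.8 * (syllables / len(words)) - 15.59

def drift_alert(text: str, baseline_mean: float, baseline_std: float) -> bool:
    # Fires when an output's grade level deviates sharply from the agent's norm.
    return abs((fkgl(text) - baseline_mean) / baseline_std) > 2.0
```

A coding agent whose outputs normally score around grade 12 suddenly producing press-release prose will land several standard deviations below its baseline, which is exactly the signal described above.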

The weirdest finding from building SAGE👇
Shannon entropy detects overconfidence in LLMs.
High-confidence hallucinations → repetitive language → low entropy.
Uncertain/confused agents → hedge-filled language → high entropy.
Both deviate from the agent's normal entropy distribution.
Z-score threshold: |z| > 2.0 = high alert.
This replaced a hardcoded confidence keyword detector and performs better on every domain.
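
The signal above reduces to a few lines. A minimal sketch, assuming Shannon entropy over a whitespace-token distribution and the |z| > 2.0 threshold from the post; the tokenization choice and baseline statistics are assumptions.

```python
import math
from collections import Counter

def shannon_entropy(text: str) -> float:
    """Entropy (bits) of the token frequency distribution of one output."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def entropy_alert(text: str, baseline_mean: float, baseline_std: float) -> bool:
    # Deviation in either direction is suspicious: too low suggests
    # repetitive overconfidence, too high suggests hedge-filled confusion.
    z = (shannon_entropy(text) - baseline_mean) / baseline_std
    return abs(z) > 2.0
```

Because the alert is two-sided and relative to each agent's own distribution, it needs no hardcoded confidence keywords, which is what lets it transfer across domains.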

Our 144-case SAGE benchmark:
3 domains × 6 agents × multiple test categories.
✅ Hallucination (obvious): 100% accuracy
✅ Hallucination (subtle): 100% (fixed from 75%)
✅ Clean outputs: 83.3% (3 false positives — corpus quality issues)
⚠️ Goal drift (macro): 61.1%
⚠️ Goal drift (subtle): ~20%
Overall precision: 96.7%
The gap on subtle goal drift is real. It shares >60% vocabulary with normal outputs.
This is an open research problem.
#buildinpublic #aiagents

200-agent stress test results:
👉 200 LLM calls, 20 workflows, 5 domains
👉 14 drift alerts (7.78% of handoffs)
👉 0% false positive rate
👉 2.8 seconds per agent
👉 0 crashes
Domain drift rates:
- Customer Support: 11.1% (highest)
- Engineering/Finance/Product: 8.3%
- Marketing: 2.8% (lowest)
10-agent pipeline = 50.7% chance of at least one drifting handoff.
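
The pipeline figure above comes from compounding the per-handoff drift rate across the chain. A sketch of that arithmetic, assuming independent handoffs at the measured 7.78% rate; the exact handoff count and rounding behind the quoted percentage are assumptions.

```python
def p_any_drift(per_handoff_rate: float, handoffs: int) -> float:
    """Probability that at least one handoff in the chain drifts."""
    return 1 - (1 - per_handoff_rate) ** handoffs

# A 10-agent chain passes work through 9 handoffs.
p = p_any_drift(0.0778, 9)
```

The takeaway is the compounding itself: a single-digit per-handoff rate turns into a coin flip once an agent chain gets long enough.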
