ZeroEval

12 posts

@ZeroEval

The self-improving layer for agents.

NYC · Joined July 2025
7 Following · 119 Followers
ZeroEval reposted
LLM Stats (@LlmStats)
A Failure-Focused Evaluation of Frontier Models
Benchmark scores tell you which model is "best on average", but not where they fail. We reproduced a set of difficult evaluations on seven frontier models to investigate two signals: consistent failures and task-specific advantages. Our findings:
→ 85.2% average failure rate on Humanity’s Last Exam across all seven models evaluated.
→ 46.2% of Humanity’s Last Exam questions were failed by all seven models under these evaluation conditions.
→ Nearly 80% of engineering problems, including structural analysis, thermodynamics, and control systems, remained unsolved by all models.
Let’s dig deeper (1/8)
3 replies · 7 reposts · 10 likes · 607 views
ZeroEval reposted
LLM Stats (@LlmStats)
How does the Veo 3 family stack up in video generation? 🎬
I ran a series of tests to understand the capabilities of this model lineup. To my surprise, despite being part of the same family, there are significant differences in how each version approaches and solves the same prompt.
Tested 4 different versions of Google's Veo to see which one handles video generation best:
✅ Veo 3.1
✅ Veo 3.1 Fast
✅ Veo 3.0
✅ Veo 3.0 Fast
2 replies · 1 repost · 7 likes · 300 views
ZeroEval reposted
LLM Stats (@LlmStats)
🟩 Nemotron 3 Nano is out:
→ Hybrid Mamba-Transformer architecture: longer context that stays fast and cheap.
→ 31.6B params but only 3.6B active per token: frontier-adjacent performance at a fraction of the compute.
→ 4x faster inference than Nemotron 2 Nano
→ Open weights available through HF
Info: llm-stats.com/models/nemotro…
Blog: llm-stats.com/blog/research/…
1 reply · 2 reposts · 3 likes · 330 views
ZeroEval reposted
seb (@sebcrossa)
what if you could teach the ai that powers your products what's good and what's bad?
after chatting with hundreds of AI co's about prompt engineering, the same thing comes up again and again: 95% of them are purely vibe prompting and hate the process.
we just built a new feature for @ZeroEval that lets you improve your prompts through human feedback, powered by @DSPyOSS.
plug into our sdk, give feedback (ui, sdk or api) and generate prompt improvements. as easy as that.
let me show you how it works
1 reply · 1 repost · 2 likes · 565 views
ZeroEval reposted
Hi (@hi_ventures_)
🚀 AI 100: Latin America’s Early AI Startups Map by Country
Following our first edition of the AI 100 Map (by sector), we’re excited to share a new perspective, this time highlighting where innovation is happening across Latin America.
This updated version showcases the country of origin for startups that:
• Are building core AI products or applying AI in transformative ways
• Are VC-backed and have raised no more than $10M
• Represent the ambition, creativity, and technical depth that we love at Hi Ventures
This geographic view gives us a glimpse into the emerging AI hubs driving the region’s tech revolution, from Mexico City to São Paulo, Buenos Aires, Bogotá, and beyond.
Find the download link in the comments. If there’s a startup we missed, tag them below or DM us; we’re always discovering new talent shaping Latin America’s AI future.
@mappa_ai, @getdarwinai, @Winclap, @territoriumlife, @JelouAI, @oimagie, @UpFluxPM, @neuralmedai, @start_carreiras, @Leadsales_io, @WeKallco, @heyyatlass, @yana_oficial, @ViewMind_, @Fintalk_ai, @ZapiaAI, @ArkhamInc, @inner_ai_, @joingaus, @Allie_Systems, @Saptiva_AI, @CedalioTech, @TimeToHire_Ai, @kapso_ai, @chambasai, @Leona_health, @pathpilotAI, @ZeroEval, @PicaioAI, @VerveMarketCo, @BircleAI, @SaludNowMX, @instacrops
2 replies · 4 reposts · 7 likes · 653 views
ZeroEval reposted
LLM Stats (@LlmStats)
LLM Stats is live on Product Hunt 🥳🎉 We're doubling down on independent benchmarking for AI models and bringing transparency and reproducibility to model performance. Are there any benchmarks you'd like to see or wish existed? Reply below. producthunt.com/products/llm-s…
3 replies · 2 reposts · 8 likes · 762 views
Ollie Forsyth (@ollieforsyth)
Meet @ycombinator's latest batch of startups (S25)
Ahead of YC's demo day tomorrow, we’ve just published the full market map of 160+ new startups building across 👇
1. B2B infrastructure & dev tools
2. B2B engineering, product & design tools
3. B2B productivity & ops tools
4. Healthcare
5. B2B sales & marketing
6. Consumer
7. Supply chain & logistics
8. FinTech
9. Industrial
10. B2B operations
11. Real estate & construction
12. B2B legal
13. B2B recruiting
14. Government
15. Education
16. B2B accounting & finance
17. B2B security
What do we know so far about the startups?
1. 30% are building for developers.
2. Notable sectors include B2B productivity, healthcare, industrials, and fintech.
3. 60% mention ‘AI’ in their tagline.
Best of luck tomorrow YC Founders!
Discover the latest YC companies in full! --> neweconomies.co/p/ycstartups20…
45 replies · 150 reposts · 1.1K likes · 154.7K views
ZeroEval reposted
The AI Colony R&D (@TheAIColonyRD)
➡️ ZeroEval / @ZeroEval If you want AI agents that actually get smarter, this is it! ZeroEval builds agents that learn from their mistakes. It runs evaluations that train your models to improve over time, no retraining needed.
1 reply · 2 reposts · 6 likes · 583 views
ZeroEval (@ZeroEval)
The new GPQA Diamond Ranking: GPT-5 is now the leader.
0 replies · 2 reposts · 1 like · 655 views