ZeroEval

12 posts

@ZeroEval

The self-improving layer for agents.

NYC · Joined July 2025

7 Following · 119 Followers

ZeroEval reposted

LLM Stats @LlmStats · Feb 4

A Failure-Focused Evaluation of Frontier Models

Benchmark scores tell you which model is "best on average", but not where they fail. We reproduced a set of difficult evaluations on seven frontier models to investigate two signals: consistent failures and task-specific advantages.

Our findings:
→ 85.2% average failure rate on Humanity’s Last Exam across all seven models evaluated.
→ 46.2% of Humanity’s Last Exam questions were failed by all seven models under these evaluation conditions.
→ Nearly 80% of engineering problems, including structural analysis, thermodynamics, and control systems, remained unsolved by all models.

Let’s dig deeper (1/8)

ZeroEval reposted

LLM Stats @LlmStats · Jan 13

How does the Veo 3 family stack up in video generation? 🎬

I ran a series of tests to understand the capabilities of this model lineup. To my surprise, despite being part of the same family, there are significant differences in how each version approaches and solves the same prompt.

Tested 4 different versions of Google's Veo to see which one handles video generation best:
✅ Veo 3.1
✅ Veo 3.1 Fast
✅ Veo 3.0
✅ Veo 3.0 Fast

ZeroEval reposted

LLM Stats @LlmStats · Dec 17

🟩 Nemotron 3 Nano is out:
→ Hybrid Mamba-Transformer architecture: longer context that stays fast and cheap.
→ 31.6B params but only 3.6B active per token: frontier-adjacent performance at a fraction of the compute.
→ 4x faster inference than Nemotron 2 Nano.
→ Open weights available through HF.

Info: llm-stats.com/models/nemotro…
Blog: llm-stats.com/blog/research/…

ZeroEval reposted

seb @sebcrossa · Nov 26

what if you could teach the ai that powers your products what's good and what's bad?

after chatting with hundreds of AI co's about prompt engineering, the same thing comes up again and again: 95% of them are purely vibe prompting and hate the process.

we just built a new feature for @ZeroEval that lets you improve your prompts through human feedback, powered by @DSPyOSS.

plug into our sdk, give feedback (ui, sdk or api) and generate prompt improvements. as easy as that.

let me show you how it works

ZeroEval reposted

Hi @hi_ventures_ · Oct 29

🚀 AI 100: Latin America’s Early AI Startups Map by Country

Following our first edition of the AI 100 Map (by sector), we’re excited to share a new perspective, this time highlighting where innovation is happening across Latin America.

This updated version showcases the country of origin for startups that:
• Are building core AI products or applying AI in transformative ways
• Are VC-backed and have raised no more than $10M
• Represent the ambition, creativity, and technical depth that we love at Hi Ventures

This geographic view gives us a glimpse into the emerging AI hubs driving the region’s tech revolution, from Mexico City to São Paulo, Buenos Aires, Bogotá, and beyond.

Find the download link in the comments. If there’s a startup we missed, tag them below or DM us. We’re always discovering new talent shaping Latin America’s AI future.

@mappa_ai, @getdarwinai, @Winclap, @territoriumlife, @JelouAI, @oimagie, @UpFluxPM, @neuralmedai, @start_carreiras, @Leadsales_io, @WeKallco, @heyyatlass, @yana_oficial, @ViewMind_, @Fintalk_ai, @ZapiaAI, @ArkhamInc, @inner_ai_, @joingaus, @Allie_Systems, @Saptiva_AI, @CedalioTech, @TimeToHire_Ai, @kapso_ai, @chambasai, @Leona_health, @pathpilotAI, @ZeroEval, @PicaioAI, @VerveMarketCo, @BircleAI, @SaludNowMX, @instacrops

ZeroEval reposted

LLM Stats @LlmStats · Oct 28

LLM Stats is live on Product Hunt 🥳🎉 We're doubling down on independent benchmarking for AI models and bringing transparency and reproducibility to model performance. Are there any benchmarks you'd like to see or wish existed? Reply below. producthunt.com/products/llm-s…

ZeroEval @ZeroEval · Sep 8

@ollieforsyth @agentmail @dedaluslabs @DeepAwareAI @deepgrove_ai @Jerr_Wu @onkernel @luminal_ai @manufact @modelence @nuntiusai @LilacML @agenthublabs @vibeflowai @try_channel3 @monarcha_ai @nottecore Thank you, Ollie <3

Ollie Forsyth @ollieforsyth · Sep 8

@agentmail @dedaluslabs @DeepAwareAI @deepgrove_ai @Jerr_Wu @onkernel @luminal_ai @manufact @modelence @nuntiusai @LilacML @ZeroEval @agenthublabs @vibeflowai @try_channel3 @monarcha_ai @nottecore Best of luck tomorrow everyone!!

Ollie Forsyth @ollieforsyth · Sep 8

Meet @ycombinator's latest batch of startups (S25)

Ahead of YC's demo day tomorrow, we’ve just published the full market map of 160+ new startups building across 👇

1. B2B infrastructure & dev tools
2. B2B engineering, product & design tools
3. B2B productivity & ops tools
4. Healthcare
5. B2B sales & marketing
6. Consumer
7. Supply chain & logistics
8. FinTech
9. Industrial
10. B2B operations
11. Real estate & construction
12. B2B legal
13. B2B recruiting
14. Government
15. Education
16. B2B accounting & finance
17. B2B security

What do we know so far about the startups?

1. 30% are building for developers.
2. Notable sectors include B2B productivity, healthcare, industrials, and fintech.
3. 60% mention ‘AI’ in their tagline.

Best of luck tomorrow YC Founders!

Discover the latest YC companies in full! --> neweconomies.co/p/ycstartups20…

ZeroEval reposted

The AI Colony R&D @TheAIColonyRD · Aug 25

➡️ ZeroEval / @ZeroEval If you want AI agents that actually get smarter, this is it! ZeroEval builds agents that learn from their mistakes. It runs evaluations that train your models to improve over time, no retraining needed.

ZeroEval @ZeroEval · Aug 19

Video by @AustinPeirson

ZeroEval @ZeroEval · Aug 19

ZeroEval is a tool to help you evaluate and optimize your agents with human feedback. Learn more at zeroeval.com

Y Combinator @ycombinator

𝜃 @ZeroEval helps you build reliable AI agents through evaluations that learn from their mistakes and get better over time. ycombinator.com/launches/OEC-z… Congrats on the launch, @sebcrossa and @pirchavez!

ZeroEval @ZeroEval · Aug 7

The new GPQA Diamond Ranking: GPT-5 is now the leader.
