Katrina Drozdov (Evtimova)

367 posts

@stochasticdoggo

Research Scientist @ValsAI | PhD from @NYUDataScience | Bulgarian yogurt, prime numbers, and dogs bring me joy | she/her

San Francisco, CA · Joined September 2017

379 Following · 395 Followers

Katrina Drozdov (Evtimova) reposted

Hao Wang @MogicianTony · Apr 9

SWE-bench Verified and Terminal-Bench—two of the most cited AI benchmarks—can be reward-hacked with simple exploits. Our agent scored 100% on both. It solved 0 tasks. Evaluate the benchmark before it evaluates your agent. If you’re picking models by leaderboard score alone, you’re optimizing for the wrong thing. 🧵

Katrina Drozdov (Evtimova) reposted

Ben James @BenJames_____ · Apr 6

I made a USB-Clawd who gets my attention when Claude Code finishes a response

Katrina Drozdov (Evtimova) reposted

Vals AI @ValsAI · Mar 28

Costs per million tokens don’t necessarily reflect the true cost of completing tasks; we see this in our evaluations as well. That’s why our benchmarks report cost per test in addition to cost per million tokens, providing a more complete picture of model performance relative to actual cost.

lingjiao chen @ChenLingjiao

🚨 Are lower-priced AI models really cheaper? Beware of the "Price Reversal" phenomenon in Reasoning Language Models (RLMs)! 💸 We evaluated frontier RLMs and found sth shocking: a model with lower API pricing can actually cost more! 🧵👇
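The cost-per-task point in the two posts above can be made concrete with a small sketch. The prices and token counts below are hypothetical, chosen only to illustrate the effect, and the function name is an assumption rather than anything from Vals AI or the quoted paper. It also simplifies by charging a single per-million-token price for all tokens a task consumes.

```python
def cost_per_task(price_per_million_tokens: float, tokens_per_task: float) -> float:
    """Dollar cost of one task, given a per-million-token price and tokens used."""
    return price_per_million_tokens * tokens_per_task / 1_000_000


# Hypothetical numbers: model A has the lower listed price but uses
# six times as many tokens per task as model B.
model_a = cost_per_task(price_per_million_tokens=2.0, tokens_per_task=60_000)
model_b = cost_per_task(price_per_million_tokens=8.0, tokens_per_task=10_000)

# Under these assumed numbers, the model with the lower listed price
# ends up costing more per task.
lower_listed_price_costs_more = model_a > model_b
```

With these assumed inputs, the model priced at a quarter of the per-token rate is still the more expensive one per completed task, which is why a per-test cost is a useful complement to the listed price.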

Katrina Drozdov (Evtimova) reposted

lingjiao chen @ChenLingjiao · Mar 27

Katrina Drozdov (Evtimova) reposted

Vals AI @ValsAI · Mar 18

Initial results are in for Minimax 2.7, and it comes in at #12 overall on the Vals Index. If the weights are released, it will be #2 on the open-weight index (only 0.5% behind #1).

Katrina Drozdov (Evtimova) reposted

Vals AI @ValsAI · Mar 16

Picking the right model for your work is overwhelming. Hundreds of models, endless benchmarks, constant releases. This is why we’re introducing Vals Model Guide. Fill out a short survey, query real models, and pick the one that works best for you!

Katrina Drozdov (Evtimova) reposted

Krista Opsahl-Ong @kristahopsalong · Mar 10

Most AI benchmarks test reasoning in isolation. Real enterprise tasks require grounded reasoning: 1️⃣ Find the right documents 2️⃣ Extract the right values 3️⃣ Perform analyses OfficeQA Pro evaluates this end-to-end. Frontier agents still score <50%. 🧵Paper & details below!

Katrina Drozdov (Evtimova) reposted

Vals AI @ValsAI · Mar 5

GPT 5.4 is #1 on Vibe Code Bench at 67.4%, +5.7% higher than the previous SOTA. This is our benchmark that measures a model’s ability to produce an entire working application from a short text specification.

Katrina Drozdov (Evtimova) reposted

Vals AI @ValsAI · Mar 4

Benchmark scores without uncertainty are fundamentally incomplete. Vals AI now reports standard error on all our results.
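As a rough illustration of what reporting uncertainty can look like, here is a minimal sketch that computes the standard error of a pass rate. It assumes each test is an independent pass/fail trial and uses the binomial formula sqrt(p(1-p)/n). The post does not say how Vals AI computes its standard errors, so the function name and method here are assumptions.

```python
import math


def accuracy_with_standard_error(passed: int, total: int) -> tuple[float, float]:
    """Return (accuracy, standard error), treating tests as independent pass/fail trials."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= passed <= total:
        raise ValueError("passed must be between 0 and total")
    accuracy = passed / total
    # Binomial standard error of a proportion (an assumed choice, see lead-in).
    standard_error = math.sqrt(accuracy * (1.0 - accuracy) / total)
    return accuracy, standard_error


# Hypothetical example: 80 of 100 tests passed.
acc, se = accuracy_with_standard_error(passed=80, total=100)
```

Under this assumption, 80 passes out of 100 gives an accuracy of 0.8 with a standard error of about 0.04, so two models a point or two apart on a 100-test benchmark are not clearly distinguishable.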

Katrina Drozdov (Evtimova) reposted

Vals AI @ValsAI · Feb 19

The gap is revealing: AI can write coherent clinical narratives but struggles with structured, rule-based reasoning. As healthcare rapidly adopts AI, these benchmarks provide the first real-world evaluation of whether systems are actually ready.

Katrina Drozdov (Evtimova) @stochasticdoggo · Feb 10

What an insightful retrospective, thank you for sharing @bneyshabur. Excited to follow your progress and wishing you lots of continued success in this new chapter!

Behnam Neyshabur @bneyshabur

Working at Anthropic was a wonderful experience. Extremely high talent density, amazing culture, mission-driven, zero politics, leadership with real technical depth. Over the past year, I’ve learned so much about what made Anthropic successful and developed great respect for the founders and the team. I'm grateful to have been part of such an extraordinary organization. Reflecting on the last 20 years, I see three phases—and I'm now entering a fourth:

Katrina Drozdov (Evtimova) reposted

Behnam Neyshabur @bneyshabur · Feb 5

Katrina Drozdov (Evtimova) reposted

Vals AI @ValsAI · Feb 5

Claude Opus 4.6 is #1 on the Vals Index 🏆 It sets a new state-of-the-art on FinanceAgent, ProofBench, TaxEval, and SWE-Bench. (1/n)

Katrina Drozdov (Evtimova) reposted

Harmonic @HarmonicMath · Jan 31

BREAKING: Aristotle achieves top ranking on ProofBench from @ValsAI, nearly 2x the performance of the #2 model, Claude Opus 4.5. Aristotle is currently generally available free of charge. Give it a try!

Vals AI @ValsAI

We’re releasing ProofBench, a challenging benchmark that measures models’ ability to write formally verifiable graduate-level proofs!

Katrina Drozdov (Evtimova) reposted

Vals AI @ValsAI · Jan 31

We’re releasing ProofBench, a challenging benchmark that measures models’ ability to write formally verifiable graduate-level proofs!

Katrina Drozdov (Evtimova) reposted

Vals AI @ValsAI · Jan 27

Kimi 2.5 Thinking is the new #1 open-weight model, taking the top spot on our index (both multimodal and text-only)🥇 The model even compares favorably to leading closed-source providers: it places in the top 10 among all models on both indices 🚀 Congrats @Kimi_Moonshot

Katrina Drozdov (Evtimova) reposted

Vals AI @ValsAI · Jan 24

We've upgraded our Terminal-Bench leaderboard to version 2. The new benchmark features more, better, and more relevant tasks.

Katrina Drozdov (Evtimova) @stochasticdoggo · Jan 9

Paper of the day: Recursive Language Models. Instead of stuffing everything into context, treat the prompt as an external environment the model can query, decompose, and recurse over. Scales to 10M+ tokens with strong performance. #AIPaperADay

Katrina Drozdov (Evtimova) @stochasticdoggo · Jan 8

Paper of the day: From Entropy to Epiplexity. Not all information is equally useful for learning. Epiplexity tries to formalize the structure models can actually extract under compute limits. #AIPaperADay

Katrina Drozdov (Evtimova) @stochasticdoggo · Dec 12

Katrina Drozdov (Evtimova) @stochasticdoggo · Dec 12

Link to the paper: arxiv.org/abs/2508.04183…

Katrina Drozdov (Evtimova) @stochasticdoggo · Dec 12

Paper for Dec 11: Characterizing Deep Research. Deep research = search + reasoning, not just long outputs. LiveDRBench offers a concrete way to measure effectiveness and evaluate next-gen research agents. #aiadventcalendar
