Katrina Drozdov (Evtimova)

367 posts


@stochasticdoggo

Research Scientist @ValsAI | PhD from @NYUDataScience | Bulgarian yogurt, prime numbers, and dogs bring me joy | she/her

San Francisco, CA · Joined September 2017
379 Following · 395 Followers
Katrina Drozdov (Evtimova) retweeted
Hao Wang @MogicianTony
SWE-bench Verified and Terminal-Bench—two of the most cited AI benchmarks—can be reward-hacked with simple exploits. Our agent scored 100% on both. It solved 0 tasks. Evaluate the benchmark before it evaluates your agent. If you’re picking models by leaderboard score alone, you’re optimizing for the wrong thing. 🧵
22
90
667
801.6K
Katrina Drozdov (Evtimova) retweeted
Ben James @BenJames_____
I made a USB-Clawd who gets my attention when Claude Code finishes a response
421
1.3K
19.8K
1.3M
Katrina Drozdov (Evtimova) retweeted
Vals AI @ValsAI
Costs per million tokens don’t necessarily reflect the true cost of completing tasks; we see this in our evaluations as well. That’s why our benchmarks report cost per test in addition to cost per million tokens, providing a more complete picture of model performance relative to actual cost.
lingjiao chen @ChenLingjiao

🚨 Are lower-priced AI models really cheaper? Beware of the "Price Reversal" phenomenon in Reasoning Language Models (RLMs)! 💸 We evaluated frontier RLMs and found sth shocking: a model with lower API pricing can actually cost more! 🧵👇

0
3
23
1.7K
Katrina Drozdov (Evtimova) retweeted
lingjiao chen @ChenLingjiao
🚨 Are lower-priced AI models really cheaper? Beware of the "Price Reversal" phenomenon in Reasoning Language Models (RLMs)! 💸 We evaluated frontier RLMs and found sth shocking: a model with lower API pricing can actually cost more! 🧵👇
10
27
132
27.3K
Katrina Drozdov (Evtimova) retweeted
Vals AI @ValsAI
Initial results are in for Minimax 2.7, and it comes in at #12 overall on the Vals Index. If the weights are released, it will be #2 on the open-weight index (only 0.5% behind #1).
12
23
337
32.5K
Katrina Drozdov (Evtimova) retweeted
Vals AI @ValsAI
Picking the right model for your work is overwhelming. Hundreds of models, endless benchmarks, constant releases. This is why we’re introducing Vals Model Guide. Fill out a short survey, query real models, and pick the one that works best for you!
1
1
11
1.1K
Katrina Drozdov (Evtimova) retweeted
Krista Opsahl-Ong @kristahopsalong
Most AI benchmarks test reasoning in isolation. Real enterprise tasks require grounded reasoning: 1️⃣ Find the right documents 2️⃣ Extract the right values 3️⃣ Perform analyses OfficeQA Pro evaluates this end-to-end. Frontier agents still score <50%. 🧵Paper & details below!
7
27
110
44.3K
Katrina Drozdov (Evtimova) retweeted
Vals AI @ValsAI
GPT 5.4 is #1 on Vibe Code Bench at 67.4%, +5.7% higher than the previous SOTA. This is our benchmark that measures a model’s ability to produce an entire working application from a short text specification.
30
41
555
66.9K
Katrina Drozdov (Evtimova) retweeted
Vals AI @ValsAI
Benchmark scores without uncertainty are fundamentally incomplete. Vals AI now reports standard error on all our results.
2
3
28
2K
Katrina Drozdov (Evtimova) retweeted
Vals AI @ValsAI
The gap is revealing: AI can write coherent clinical narratives but struggles with structured, rule-based reasoning. As healthcare rapidly adopts AI, these benchmarks provide the first real-world evaluation of whether systems are actually ready.
1
1
0
349
Katrina Drozdov (Evtimova) retweeted
Behnam Neyshabur @bneyshabur
Working at Anthropic was a wonderful experience. Extremely high talent density, amazing culture, mission-driven, zero politics, leadership with real technical depth. Over the past year, I’ve learned so much about what made Anthropic successful and developed great respect for the founders and the team. I'm grateful to have been part of such an extraordinary organization. Reflecting on the last 20 years, I see three phases—and I'm now entering a fourth:
2
2
252
34.3K
Katrina Drozdov (Evtimova) retweeted
Vals AI @ValsAI
Claude Opus 4.6 is #1 on the Vals Index 🏆 It sets a new state-of-the-art on FinanceAgent, ProofBench, TaxEval, and SWE-Bench. (1/n)
17
14
153
18.5K
Katrina Drozdov (Evtimova) retweeted
Vals AI @ValsAI
We’re releasing ProofBench, a challenging benchmark that measures models’ ability to write formally verifiable graduate-level proofs!
11
16
155
45.8K
Katrina Drozdov (Evtimova) retweeted
Vals AI @ValsAI
Kimi 2.5 Thinking is the new #1 open-weight model, taking the top spot on our index (both multimodal and text-only)🥇 The model even compares favorably to leading closed-source providers: it places in the top 10 among all models on both indices 🚀 Congrats @Kimi_Moonshot
3
15
182
18.9K
Katrina Drozdov (Evtimova) retweeted
Vals AI @ValsAI
We've upgraded our Terminal-Bench leaderboard to version 2. The new benchmark features more, better, and more relevant tasks.
9
9
151
9.5K
Katrina Drozdov (Evtimova) @stochasticdoggo
Paper of the day: Recursive Language Models. Instead of stuffing everything into context, treat the prompt as an external environment the model can query, decompose, and recurse over. Scales to 10M+ tokens with strong performance. #AIPaperADay
0
0
0
94
Katrina Drozdov (Evtimova) @stochasticdoggo
Paper of the day: From Entropy to Epiplexity. Not all information is equally useful for learning. Epiplexity tries to formalize the structure models can actually extract under compute limits. #AIPaperADay
0
0
1
213
Katrina Drozdov (Evtimova) @stochasticdoggo
Paper for Dec 11: Characterizing Deep Research. Deep research = search + reasoning, not just long outputs. LiveDRBench offers a concrete way to measure effectiveness and evaluate next-gen research agents. #aiadventcalendar
1
0
0
94