Victoria Graf

16 posts

@VictoriaWGraf

PhD student @uwnlp, Student Researcher @allen_ai, prev @princeton_nlp

Joined June 2024
80 Following · 208 Followers
Victoria Graf @VictoriaWGraf
Loved talking to everyone at our IFBench poster at NeurIPS! If you missed us at the poster and want to chat, reach out!
1
1
13
1.2K
Luca Soldaini 🎀 @soldni
please DO ask me for stickers, I have too many Ai2/Olmo 3/moo moo rawr swag
2
0
30
1.7K
Victoria Graf retweeted
Scott Geng @scottgeng00
🤔 How do we train AI models that surpass their teachers? 🚨 In #COLM2025: ✨Delta learning ✨makes LLM post-training cheap and easy – with only weak data, we beat open 8B SOTA 🤯 The secret? Learn from the *differences* in weak data pairs! 📜 arxiv.org/abs/2507.06187 🧵 below
7
51
165
24.2K
Victoria Graf @VictoriaWGraf
Worried about overfitting to IFEval? 🤔 Use ✨IFBench✨ our new, challenging instruction-following benchmark! Loved working w/ @valentina__py! Personal highlight: our multi-turn eval setting makes it possible to isolate constraint-following from the rest of the instruction 🔍
Valentina Pyatkin @ ICLR 🇧🇷 @valentina__py

💡Beyond math/code, instruction following with verifiable constraints is suitable to be learned with RLVR. But the set of constraints and verifier functions is limited and most models overfit on IFEval. We introduce IFBench to measure model generalization to unseen constraints.

2
14
55
10.5K
Victoria Graf retweeted
Nathan Lambert @natolambert
This new benchmark created by @valentina__py should be the new default replacing IFEval. Some of the best frontier models get <50% and it comes with separate training prompts so people don’t effectively train on test. Wild gap from o3 to Gemini 2.5 pro of like 30 points.
Ai2 @allen_ai

Introducing IFBench, a benchmark to measure how well AI models follow new, challenging, and diverse verifiable instructions. Top models like Gemini 2.5 Pro or Claude 4 Sonnet are only able to score up to 50%, presenting an open frontier for post-training. 🧵

10
22
195
22.9K
Victoria Graf retweeted
Ai2 @allen_ai
Introducing IFBench, a benchmark to measure how well AI models follow new, challenging, and diverse verifiable instructions. Top models like Gemini 2.5 Pro or Claude 4 Sonnet are only able to score up to 50%, presenting an open frontier for post-training. 🧵
4
48
313
47.5K
Victoria Graf @VictoriaWGraf
Super excited to release Tülu 3, a family of fully-open state-of-the-art post-trained models, including its data, eval, code, and training recipes in a comprehensive guide for post-training techniques! allenai.org/papers/tulu-3-…
0
1
7
256
Victoria Graf retweeted
Ai2 @allen_ai
Meet Tülu 3 -- a set of state-of-the-art instruct models with fully open data, eval code, and training algorithms. We invented new methods for fine-tuning language models with RL and built upon best practices in the community to scale synthetic instruction and preference data. Demo, GitHub, technical report, and models below 👇
14
132
526
218.2K
Victoria Graf @VictoriaWGraf
Had a wonderful time at #NAACL2024 this week! Thanks to everyone who came to my oral presentation on defending LLMs against backdoor attacks!
0
0
9
216