Vaskar Nath

18 posts

@vaskar_n

Researcher @ Scale AI

New York City · Joined July 2024
32 Following · 62 Followers
Tiffany Zhao @tiffzhao05
I left Google DeepMind, moved from SF to NYC, all within 2 weeks to join @quadrillion_ai — to build the future of automated research intelligence with the highest-slope founder and most talent-dense team.

I grew up in Silicon Valley — the old Facebook office was my second home. I'd hang out there after school, drawing with my crayons while looking around at the sea of computers with lines of code. Since a young age, I felt empowered to have an array of interests beyond tech: piano, ballet, figure skating, art. The valley embraced diversity of thought, and that's what inspired me to stay for Stanford and my career thus far.

But today, SF is one big hive mind. So I moved to NYC, away from family and friends, to build a company that doesn't need to rely on a bubble to survive. I'm meeting customers day after day in all kinds of verticals, connecting with them in different ways and seeing our product bring real value. Here, I'm able to live in diversity of thought.

I'm excited to build the future of research in the city of opportunity. Let's chat if this excites you.
Vaskar Nath retweeted
Anisha Gunjal @anisha_gunjal
🤔 How do we train LLMs on real-world tasks where it’s hard to define a single verifiable answer? Our work at @scale_AI introduces Rubrics as Rewards (RaR) — a framework for on-policy post-training that uses structured, checklist-style rubrics as interpretable reward signals. 🧵
Vaskar Nath retweeted
Matt Schlicht @MattPRD
Waking up to see this new paper from @scale_AI charting on the @yesnoerror trending feed.

Authors: @anisha_gunjal, @aytwang, Elaine Lau, @vaskar_n, @BingLiu1011, and @SeanHendryx

"Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains"

Simplified: Teaching computers with detailed checklists instead of vague thumbs-up ratings lets them learn better answers to medicine and science questions, and makes it clear why they got a reward.

Key findings:
• Implicitly aggregated rubric rewards boost the medical benchmark score by 28% relative to a Likert baseline.
• Matches or exceeds rewards based on expert reference answers despite using smaller judges.

What can it be used for:
• Fine-tuning clinical decision support chatbots with medical safety rubrics.
• Training policy-analysis or legal-reasoning models where multiple subjective factors matter.

Detailed summary: Rubrics as Rewards (RaR) is proposed as an interpretable alternative to opaque preference-based reward models when fine-tuning large language models (LLMs) with reinforcement learning. Instead of asking humans to rank whole answers, domain experts (or a strong LLM guided by expert references) write a prompt-specific checklist of 7–20 binary criteria that capture essential facts, reasoning steps, style, and common pitfalls. Each criterion is tagged Essential, Important, Optional, or Pitfall and given a weight.

During on-policy training, the policy model (Qwen-2.5-7B in the paper) samples 16 candidate answers per prompt. A separate judge LLM (GPT-4o-mini or smaller) is prompted either to score each criterion separately (explicit aggregation) or to read the full rubric and output one holistic Likert rating from 1 to 10 (implicit aggregation). The normalized score becomes the scalar reward, and the policy is updated with the GRPO algorithm.

The authors curate two 20k-example training sets, RaR-Medical-20k and RaR-Science-20k, by combining existing medical and science reasoning corpora and generating synthetic rubrics with o3-mini or GPT-4o. Evaluation on HealthBench-1k (medical reasoning) and GPQA-Diamond (graduate-level physics/chemistry/biology) shows that RaR-Implicit yields up to a 28% relative improvement over simple Likert-only rewards and matches or exceeds rewards computed by comparing to expert reference answers. Implicit aggregation consistently outperforms explicit aggregation, demonstrating that letting the judge decide how to combine criteria works better than fixed hand-tuned weights.

Rubric supervision also helps smaller judge models. When asked to rate preferred versus perturbed answers, rubric-guided judges choose the preferred answer far more reliably than equally sized Likert-only judges, narrowing the gap between a 7B evaluator and GPT-4o-mini.

Ablations reveal that prompt-specific rubrics beat generic ones, multiple criteria beat essential-only lists, and access to an expert reference while drafting rubrics materially boosts downstream performance. Human-written and high-quality synthetic rubrics perform on par, suggesting scalability.

RaR generalises Reinforcement Learning with Verifiable Rewards (RLVR): when the rubric has just one correctness check, the framework collapses to RLVR's exact-match reward. By exposing each aspect of quality explicitly, RaR is more transparent, auditable, and potentially harder to reward-hack than neural reward models. The authors discuss extensions to real-world agentic tasks, dynamic curriculum via rubric weights, and formal robustness studies.
-- Over 500,000 pages of research are published on @arXiv every month. Hidden within are breakthrough insights that could transform your work — but finding them is like searching for diamonds in an ocean of data. @yesnoerror cuts through the noise to surface the most impactful research for your projects, investments, and discoveries. // $yne
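To make the implicit-aggregation recipe concrete, here is a minimal Python sketch of a rubric reward built only from the description above; the `Criterion` class, prompt wording, and `judge` callable (e.g., a wrapper around GPT-4o-mini) are illustrative assumptions, not the authors' code.

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    text: str       # e.g. "States the first-line treatment correctly"
    category: str   # "Essential" | "Important" | "Optional" | "Pitfall"
    weight: float

def build_judge_prompt(question: str, answer: str, rubric: list[Criterion]) -> str:
    # Show the judge the full rubric and ask for one holistic Likert rating.
    items = "\n".join(f"- [{c.category}, weight {c.weight}] {c.text}" for c in rubric)
    return (
        "Judge the answer against the rubric and reply with one integer from 1 to 10.\n"
        f"Question: {question}\nAnswer: {answer}\nRubric:\n{items}"
    )

def implicit_rubric_reward(question, answer, rubric, judge) -> float:
    """judge: any text-in/text-out LLM wrapper (assumed)."""
    likert = int(judge(build_judge_prompt(question, answer, rubric)))
    return (likert - 1) / 9.0  # normalize the 1-10 rating to a [0, 1] scalar RL reward
```

Per the summary, this scalar would then be used as the reward for each of the 16 sampled candidates when updating the policy with GRPO.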
Vaskar Nath retweeted
Sean Hendryx @SeanHendryx
@karpathy a neat quality specific to language models is that you can just tell them what to do differently when they fail. And if you use importance sampling, gradients are aligned with the unguided context and it gets into the weights directly. No sleep needed x.com/SeanHendryx/st…
Quoting Sean Hendryx @SeanHendryx:

For online RL, we introduce Guide, a class of algorithms that incorporates guidance into the model's context when all rollouts fail and adjusts the importance sampling ratio to optimize the policy for contexts in which guidance is no longer present.
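A rough sketch of how that importance-sampling correction might look in code, based on my reading of the tweet rather than the paper's implementation; the `policy.log_prob(...)` helper and the clipping constant are assumptions.

```python
import torch

def guide_surrogate_loss(policy, prompt_ids, guided_prompt_ids, response_ids, advantages):
    """Sketch: responses were sampled from the *guided* context (prompt + hint),
    but we want to improve the policy for the *unguided* context, so we reweight
    the policy-gradient term by pi(y | prompt) / pi(y | prompt + guidance).
    `policy.log_prob(response, context)` is an assumed helper returning the
    summed log-probability of the response tokens under the given context."""
    logp_unguided = policy.log_prob(response_ids, context=prompt_ids)
    with torch.no_grad():  # treat the guided (behavior) term as a constant
        logp_guided = policy.log_prob(response_ids, context=guided_prompt_ids)
    ratio = torch.exp(logp_unguided - logp_guided)  # importance sampling ratio
    ratio = torch.clamp(ratio, max=10.0)            # crude variance control
    return -(ratio * advantages).mean()             # REINFORCE-style surrogate
```

This is what "gradients are aligned with the unguided context" suggests: the update pushes probability mass onto successful responses as scored under the plain prompt, so the behavior ends up in the weights without the hint.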

Vaskar Nath retweeted
Sean Hendryx @SeanHendryx
What will the learning environments of the future that train artificial superintelligence look like? In recent work at @scale_AI, we show that training systems that combine verifiable rewards with multi-agent interaction accelerate learning.
Vaskar Nath retweeted
Mohit Raghavendra (@ICLR) @mohit_r9a
Over the last year, I have worked on data curation and scaling to effectively improve performance through SFT and RLHF. Check out the blog post I wrote, detailing my findings. notion.so/mohit-raghaven… (Thank you @natolambert for the shoutout in the latest Interconnects post!)
Vaskar Nath retweeted
Alexandr Wang @alexandr_wang
GPT-4.5 Preview evals results are out on SEAL 👀
⚡ #2 in Tool Use - Chat
🏢 #3 in Tool Use - Enterprise
🥉 #3 in EnigmaEval (behind Claude 3.7 Sonnet)
📚 #4 in MultiChallenge
🎓 #5 in Humanity's Last Exam
🔍 #6 in VISTA (multimodal)
See rankings here: scale.com/leaderboard
Vaskar Nath retweeted
Summer Yue @summeryue0
GPT-4.5 Preview Just Dropped~ We put it to the test, and the results are... mixed 👀
⚡ #2 in Tool Use - Chat (trailing o1)
🏢 #3 in Tool Use - Enterprise (coming after Claude 3.7 Sonnet)
🥉 #3 in EnigmaEval (following Claude 3.7 Sonnet Thinking)
📚 #4 in MultiChallenge (behind Claude 3.5 & 3.7 Sonnet)
🎓 #5 in Humanity's Last Exam (behind Gemini 2.0 & Claude 3.7 Sonnet Thinking)
🔍 #6 in VISTA (outperformed by Claude & Gemini series)
The verdict? Impressive tool use, but not a clear leap forward elsewhere.
See the full rankings here: scale.com/leaderboard
Vaskar Nath retweeted
Sean Hendryx @SeanHendryx
If you’ve ever finetuned a pretrained language model on a reasoning task at the edge of its capabilities, you were probably skeptical of the superficial alignment hypothesis. Turns out you were right. 1/🤔
Vaskar Nath retweeted
Scale AI @scale_AI
Contrary to prior work, new research from Scale finds that LLMs continue to learn new knowledge during post-training, following a power law similar to well-known pre-training scaling laws 🧵 scl.ai/revisting-sah
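As a sketch of the claimed functional form, one could fit held-out loss versus the number of post-training examples to the standard scaling-law shape loss(n) = a·n^(-b) + c; the data points below are illustrative placeholders, not the paper's measurements.

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, b, c):
    # The standard scaling-law form: loss(n) = a * n^(-b) + c
    return a * np.power(n, -b) + c

# Illustrative placeholder points (not the paper's data):
# held-out loss versus number of post-training examples.
n_examples = np.array([1e3, 3e3, 1e4, 3e4, 1e5])
held_out_loss = np.array([2.10, 1.80, 1.60, 1.45, 1.35])

(a, b, c), _ = curve_fit(power_law, n_examples, held_out_loss, p0=(5.0, 0.3, 1.0))
print(f"loss(n) ~ {a:.2f} * n^(-{b:.2f}) + {c:.2f}")
```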
Vaskar Nath retweeted
Hugh Zhang @hughbzhang
Enabling LLMs to reason more deeply at inference time via search is one of the most exciting directions in AI right now. We introduce PlanSearch, a novel method for code generation that searches over high-level "plans" in natural language as a means of encouraging diversity.
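A toy sketch of the two-stage idea as the tweet describes it: sample diverse high-level plans in natural language first, then generate code conditioned on each plan. The `llm` callable, prompt wording, and parameter values are illustrative assumptions, not the paper's setup.

```python
def plan_search(problem: str, llm, n_plans: int = 8, codes_per_plan: int = 2) -> list[str]:
    # Step 1: sample diverse high-level plans in natural language.
    plans = [
        llm(f"Propose one distinct high-level solution plan (no code) for:\n{problem}")
        for _ in range(n_plans)
    ]
    # Step 2: condition code generation on each plan to diversify candidates.
    candidates = []
    for plan in plans:
        for _ in range(codes_per_plan):
            candidates.append(llm(f"Problem:\n{problem}\nPlan:\n{plan}\nImplement it in code:"))
    return candidates  # downstream: filter by executing against test cases
```

Searching in plan space rather than token space is what encourages diversity here: distinct plans steer generations toward genuinely different solutions instead of near-duplicate samples.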
Vaskar Nath retweeted
Sean Hendryx @SeanHendryx
Reasoning at length will be a key part of LLMs solving more challenging problems, but how can we make sure that their chain of thought stays on track? At @scale_AI, we’ve developed a method to learn token-wise expected rewards from pairwise preference labels 🧵
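A minimal sketch of one way to learn token-wise rewards from pairwise preference labels: a linear value head scores every token, token scores are aggregated into a sequence score, and a Bradley-Terry loss pushes the chosen response above the rejected one. The tensor shapes and mean aggregation are my assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def pairwise_tokenwise_loss(value_head, hidden_chosen, hidden_rejected):
    """value_head: nn.Linear(d_model, 1) mapping per-token hidden states to rewards.
    hidden_chosen / hidden_rejected: (T, d_model) hidden states of each response."""
    r_chosen = value_head(hidden_chosen).squeeze(-1)      # (T_c,) per-token rewards
    r_rejected = value_head(hidden_rejected).squeeze(-1)  # (T_r,)
    # Aggregate token-wise rewards into sequence scores, then apply a
    # Bradley-Terry loss so the preferred response scores higher.
    return -F.logsigmoid(r_chosen.mean() - r_rejected.mean())
```

Because each token carries its own expected reward, such a model can flag where a chain of thought starts to drift, rather than only scoring the finished answer.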
Vaskar Nath retweeted
Scale AI @scale_AI
Our researchers at Scale have developed a novel method to evaluate LLM output during generation instead of waiting until it’s complete — like a GPS recalculating when you go off route, before you’re at the wrong place. Learn more on the Scale blog: bit.ly/aligning-chatb…
Vaskar Nath retweeted
Vaskar Nath @vaskar_n
Excited to see the results of ToolComp drop today! It’s been incredible to be part of the work that helps advance tool-use capabilities in AI models. Amazing to see the progress across the board—congrats to everyone involved! 🚀🛠️🤖
Quoting Sean Hendryx @SeanHendryx:

We're releasing the results on ToolComp today, a Scale AI SEAL leaderboard that tests the ability of agents to plan, reason, and compose multiple, dependent tool calls together. OpenAI models lead, with Claude showing strong performance in the Chat setting. 1/🛠️🤖
