Vaskar Nath

18 posts

@vaskar_n

Researcher @ Scale AI

New York City · Joined July 2024
32 Following · 62 Followers
Tiffany Zhao @tiffzhao05
I left Google DeepMind, moved from SF to NYC, all within 2 weeks to join @quadrillion_ai — to build the future of automated research intelligence with the highest-slope founder and most talent-dense team.

I grew up in Silicon Valley — the old Facebook office was my second home. I'd hang out there after school, drawing with my crayons while looking around at the sea of computers with lines of code. Since a young age, I felt empowered to have an array of interests beyond tech: piano, ballet, figure skating, art. The valley embraced diversity of thought, and that's what inspired me to stay for Stanford and my career thus far.

But today, SF is one big hive mind. So I moved to NYC, away from family and friends, to build a company that doesn't need to rely on a bubble to survive. I'm meeting customers day after day in all kinds of verticals, connecting with them in different ways and seeing our product bring real value. Here, I'm able to live in diversity of thought.

I'm excited to build the future of research in the city of opportunity. Let's chat if this excites you.
Vaskar Nath retweeted
Anisha Gunjal @anisha_gunjal
🤔 How do we train LLMs on real-world tasks where it’s hard to define a single verifiable answer? Our work at @scale_AI introduces Rubrics as Rewards (RaR) — a framework for on-policy post-training that uses structured, checklist-style rubrics as interpretable reward signals. 🧵
Vaskar Nath retweeted
Matt Schlicht @MattPRD
Waking up to see this new paper from @scale_AI charting on the @yesnoerror trending feed.

Authors: @anisha_gunjal, @aytwang, Elaine Lau, @vaskar_n, @BingLiu1011, and @SeanHendryx

"Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains"

Simplified: Teaching computers with detailed checklists instead of vague thumbs-up ratings lets them learn better answers to medicine and science questions, and makes it clear why they got a reward.

Key findings:
• Implicitly aggregated rubric rewards boost the medical benchmark score by 28% relative to a Likert baseline.
• Matches or exceeds rewards based on expert reference answers despite using smaller judges.

What can it be used for:
• Fine-tuning clinical decision support chatbots with medical safety rubrics.
• Training policy-analysis or legal-reasoning models where multiple subjective factors matter.

Detailed summary: Rubrics as Rewards (RaR) is proposed as an interpretable alternative to opaque preference-based reward models when fine-tuning large language models (LLMs) with reinforcement learning. Instead of asking humans to rank whole answers, domain experts (or a strong LLM guided by expert references) write a prompt-specific checklist of 7–20 binary criteria that capture essential facts, reasoning steps, style, and common pitfalls. Each criterion is tagged Essential, Important, Optional, or Pitfall and given a weight.

During on-policy training, the policy model (Qwen-2.5-7B in the paper) samples 16 candidate answers per prompt. A separate judge LLM (GPT-4o-mini or smaller) is prompted either to score each criterion separately (explicit aggregation) or to read the full rubric and output one holistic Likert rating from 1 to 10 (implicit aggregation). The normalized score becomes the scalar reward, and the policy is updated with the GRPO algorithm.

The authors curate two 20k-example training sets, RaR-Medical-20k and RaR-Science-20k, by combining existing medical and science reasoning corpora and generating synthetic rubrics with o3-mini or GPT-4o. Evaluation on HealthBench-1k (medical reasoning) and GPQA-Diamond (graduate-level physics/chemistry/biology) shows that RaR-Implicit yields up to a 28% relative improvement over simple Likert-only rewards and matches or exceeds rewards computed by comparing to expert reference answers. Implicit aggregation consistently outperforms explicit aggregation, demonstrating that letting the judge decide how to combine criteria works better than fixed hand-tuned weights.

Rubric supervision also helps smaller judge models. When asked to rate preferred versus perturbed answers, rubric-guided judges choose the preferred answer far more reliably than equally sized Likert-only judges, narrowing the gap between a 7B evaluator and GPT-4o-mini.

Ablations reveal that prompt-specific rubrics beat generic ones, multiple criteria beat essential-only lists, and access to an expert reference while drafting rubrics materially boosts downstream performance. Human-written and high-quality synthetic rubrics perform on par, suggesting scalability.

RaR generalises Reinforcement Learning with Verifiable Rewards (RLVR): when the rubric has just one correctness check, the framework collapses to RLVR's exact-match reward. By exposing each aspect of quality explicitly, RaR is more transparent, auditable, and potentially harder to reward-hack than neural reward models. The authors discuss extensions to real-world agentic tasks, dynamic curriculum via rubric weights, and formal robustness studies.
-- Over 500,000 pages of research are published on @arXiv every month. Hidden within are breakthrough insights that could transform your work — but finding them is like searching for diamonds in an ocean of data. @yesnoerror cuts through the noise to surface the most impactful research for your projects, investments, and discoveries. // $yne
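To make the implicit-aggregation recipe concrete, here is a minimal Python sketch of a rubric reward built only from the description above; the `Criterion` class, prompt wording, and `judge` callable (e.g., a wrapper around GPT-4o-mini) are illustrative assumptions, not the authors' code.

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    text: str       # e.g. "States the first-line treatment correctly"
    category: str   # "Essential" | "Important" | "Optional" | "Pitfall"
    weight: float

def build_judge_prompt(question: str, answer: str, rubric: list[Criterion]) -> str:
    # Show the judge the full rubric and ask for one holistic Likert rating.
    items = "\n".join(f"- [{c.category}, weight {c.weight}] {c.text}" for c in rubric)
    return (
        "Judge the answer against the rubric and reply with one integer from 1 to 10.\n"
        f"Question: {question}\nAnswer: {answer}\nRubric:\n{items}"
    )

def implicit_rubric_reward(question, answer, rubric, judge) -> float:
    """judge: any text-in/text-out LLM wrapper (assumed)."""
    likert = int(judge(build_judge_prompt(question, answer, rubric)))
    return (likert - 1) / 9.0  # normalize the 1-10 rating to a [0, 1] scalar RL reward
```

Per the summary, this scalar would then be used as the reward for each of the 16 sampled candidates when updating the policy with GRPO.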
Vaskar Nath retweeted
Sean Hendryx @SeanHendryx
@karpathy a neat quality specific to language models is that you can just tell them what to do differently when they fail. And if you use importance sampling, gradients are aligned with the unguided context and it gets into the weights directly. No sleep needed x.com/SeanHendryx/st…
Quoting Sean Hendryx @SeanHendryx:

For online RL, we introduce Guide, a class of algorithms that incorporates guidance into the model's context when all rollouts fail and adjusts the importance sampling ratio to optimize the policy for contexts in which guidance is no longer present.
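A rough sketch of how that importance-sampling correction might look in code, based on my reading of the tweet rather than the paper's implementation; the `policy.log_prob(...)` helper and the clipping constant are assumptions.

```python
import torch

def guide_surrogate_loss(policy, prompt_ids, guided_prompt_ids, response_ids, advantages):
    """Sketch: responses were sampled from the *guided* context (prompt + hint),
    but we want to improve the policy for the *unguided* context, so we reweight
    the policy-gradient term by pi(y | prompt) / pi(y | prompt + guidance).
    `policy.log_prob(response, context)` is an assumed helper returning the
    summed log-probability of the response tokens under the given context."""
    logp_unguided = policy.log_prob(response_ids, context=prompt_ids)
    with torch.no_grad():  # treat the guided (behavior) term as a constant
        logp_guided = policy.log_prob(response_ids, context=guided_prompt_ids)
    ratio = torch.exp(logp_unguided - logp_guided)  # importance sampling ratio
    ratio = torch.clamp(ratio, max=10.0)            # crude variance control
    return -(ratio * advantages).mean()             # REINFORCE-style surrogate
```

This is what "gradients are aligned with the unguided context" suggests: the update pushes probability mass onto successful responses as scored under the plain prompt, so the behavior ends up in the weights without the hint.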

Vaskar Nath retweeted
Sean Hendryx @SeanHendryx
What will the learning environments of the future that train artificial superintelligence look like? In recent work at @scale_AI, we show that training systems that combine verifiable rewards with multi-agent interaction accelerate learning.
Vaskar Nath retweeted
Mohit Raghavendra (@ICLR) @mohit_r9a
Over the last year, I have worked on data curation and scaling to effectively improve performance through SFT and RLHF. Check out the blog post I wrote, detailing my findings. notion.so/mohit-raghaven… (Thank you @natolambert for the shoutout in the latest Interconnects post!)
Vaskar Nath retweeted
Alexandr Wang @alexandr_wang
GPT-4.5 Preview evals results are out on SEAL 👀
⚡ #2 in Tool Use - Chat
🏢 #3 in Tool Use - Enterprise
🥉 #3 in EnigmaEval (behind Claude 3.7 Sonnet)
📚 #4 in MultiChallenge
🎓 #5 in Humanity's Last Exam
🔍 #6 in VISTA (multimodal)
See rankings here: scale.com/leaderboard
Vaskar Nath retweeted
Summer Yue @summeryue0
GPT-4.5 Preview Just Dropped~ We put it to the test, and the results are... mixed 👀
⚡ #2 in Tool Use - Chat (trailing o1)
🏢 #3 in Tool Use - Enterprise (coming after Claude 3.7 Sonnet)
🥉 #3 in EnigmaEval (following Claude 3.7 Sonnet Thinking)
📚 #4 in MultiChallenge (behind Claude 3.5 & 3.7 Sonnet)
🎓 #5 in Humanity's Last Exam (behind Gemini 2.0 & Claude 3.7 Sonnet Thinking)
🔍 #6 in VISTA (outperformed by Claude & Gemini series)
The verdict? Impressive tool use, but not a clear leap forward elsewhere.
See the full rankings here: scale.com/leaderboard
Vaskar Nath retweeted
Sean Hendryx @SeanHendryx
If you’ve ever finetuned a pretrained language model on a reasoning task at the edge of its capabilities, you were probably skeptical of the superficial alignment hypothesis. Turns out you were right. 1/🤔
Vaskar Nath retweeted
Scale AI @scale_AI
Contrary to prior work, new research from Scale finds that LLMs continue to learn new knowledge during post-training, following a power law similar to well-known pre-training scaling laws 🧵 scl.ai/revisting-sah
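As a sketch of the claimed functional form, one could fit held-out loss versus the number of post-training examples to the standard scaling-law shape loss(n) = a·n^(-b) + c; the data points below are illustrative placeholders, not the paper's measurements.

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, b, c):
    # The standard scaling-law form: loss(n) = a * n^(-b) + c
    return a * np.power(n, -b) + c

# Illustrative placeholder points (not the paper's data):
# held-out loss versus number of post-training examples.
n_examples = np.array([1e3, 3e3, 1e4, 3e4, 1e5])
held_out_loss = np.array([2.10, 1.80, 1.60, 1.45, 1.35])

(a, b, c), _ = curve_fit(power_law, n_examples, held_out_loss, p0=(5.0, 0.3, 1.0))
print(f"loss(n) ~ {a:.2f} * n^(-{b:.2f}) + {c:.2f}")
```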
Vaskar Nath retweeted
Hugh Zhang @hughbzhang
Enabling LLMs to reason more deeply at inference time via search is one of the most exciting directions in AI right now. We introduce PlanSearch, a novel method for code generation that searches over high-level "plans" in natural language as a means of encouraging diversity.
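A toy sketch of the two-stage idea as the tweet describes it: sample diverse high-level plans in natural language first, then generate code conditioned on each plan. The `llm` callable, prompt wording, and parameter values are illustrative assumptions, not the paper's setup.

```python
def plan_search(problem: str, llm, n_plans: int = 8, codes_per_plan: int = 2) -> list[str]:
    # Step 1: sample diverse high-level plans in natural language.
    plans = [
        llm(f"Propose one distinct high-level solution plan (no code) for:\n{problem}")
        for _ in range(n_plans)
    ]
    # Step 2: condition code generation on each plan to diversify candidates.
    candidates = []
    for plan in plans:
        for _ in range(codes_per_plan):
            candidates.append(llm(f"Problem:\n{problem}\nPlan:\n{plan}\nImplement it in code:"))
    return candidates  # downstream: filter by executing against test cases
```

Searching in plan space rather than token space is what encourages diversity here: distinct plans steer generations toward genuinely different solutions instead of near-duplicate samples.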
Vaskar Nath retweeted
Sean Hendryx @SeanHendryx
Reasoning at length will be a key part of LLMs solving more challenging problems, but how can we make sure that their chain of thought stays on track? At @scale_AI, we’ve developed a method to learn token-wise expected rewards from pairwise preference labels 🧵
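A minimal sketch of one way to learn token-wise rewards from pairwise preference labels: a linear value head scores every token, token scores are aggregated into a sequence score, and a Bradley-Terry loss pushes the chosen response above the rejected one. The tensor shapes and mean aggregation are my assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def pairwise_tokenwise_loss(value_head, hidden_chosen, hidden_rejected):
    """value_head: nn.Linear(d_model, 1) mapping per-token hidden states to rewards.
    hidden_chosen / hidden_rejected: (T, d_model) hidden states of each response."""
    r_chosen = value_head(hidden_chosen).squeeze(-1)      # (T_c,) per-token rewards
    r_rejected = value_head(hidden_rejected).squeeze(-1)  # (T_r,)
    # Aggregate token-wise rewards into sequence scores, then apply a
    # Bradley-Terry loss so the preferred response scores higher.
    return -F.logsigmoid(r_chosen.mean() - r_rejected.mean())
```

Because each token carries its own expected reward, such a model can flag where a chain of thought starts to drift, rather than only scoring the finished answer.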
Vaskar Nath retweeted
Scale AI @scale_AI
Our researchers at Scale have developed a novel method to evaluate LLM output during generation instead of waiting until it’s complete — like a GPS recalculating when you go off route, before you’re at the wrong place. Learn more on the Scale blog: bit.ly/aligning-chatb…
Vaskar Nath retweeted
Vaskar Nath @vaskar_n
Excited to see the results of ToolComp drop today! It’s been incredible to be part of the work that helps advance tool-use capabilities in AI models. Amazing to see the progress across the board—congrats to everyone involved! 🚀🛠️🤖
Quoting Sean Hendryx @SeanHendryx:

We're releasing the results on ToolComp today, a Scale AI SEAL leaderboard that tests the ability of agents to plan, reason, and compose multiple, dependent tool calls together. OpenAI models lead, with Claude showing strong performance in the Chat setting. 1/🛠️🤖
