
🚨 New paper 🚨 J1: Incentivizing Thinking in LLM-as-a-Judge via RL

- Converts the judgment task into a verifiable one for both verifiable and non-verifiable prompts, using only synthetic pairwise data
- Optimizes thoughts, scores, and judgments using GRPO
- Outperforms all baselines at 8B & 70B scale, beats o1-mini, and on some benchmarks even R1
- We find J1 uses varied thought strategies: outlining evaluation criteria, comparing against self-generated reference answers, and re-evaluating correctness

📝: arxiv.org/abs/2505.10320
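A rough sketch of the core idea (my own illustrative code, not the paper's implementation): with synthetic pairs where the better response is known, the judge's verdict yields a verifiable 0/1 reward, and GRPO then computes advantages by normalizing rewards within a group of sampled judgments, so no learned value model is needed. Function and variable names here are assumptions for illustration.

```python
# Illustrative sketch (not the paper's code): a verifiable pairwise
# reward for an LLM judge, plus GRPO-style group-normalized advantages.
import statistics

def pairwise_reward(verdict: str, gold_better: str) -> float:
    """Reward 1.0 if the judge's verdict ('A' or 'B') matches the
    known-better response in a synthetic pair, else 0.0."""
    return 1.0 if verdict == gold_better else 0.0

def grpo_advantages(rewards: list[float]) -> list[float]:
    """GRPO: normalize each sampled judgment's reward by the group's
    mean and standard deviation."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero std
    return [(r - mean) / std for r in rewards]

# Example: 4 sampled judgments for one pair where 'A' is known better.
verdicts = ["A", "B", "A", "A"]
rewards = [pairwise_reward(v, "A") for v in verdicts]
advs = grpo_advantages(rewards)
```

Correct verdicts get positive advantages and incorrect ones negative, which is the signal that pushes the judge's thinking, scores, and final verdicts toward accuracy.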
