Swarnadeep Saha

646 posts

@swarnaNLP

Research Scientist @AIatMeta (FAIR) working on co-improvement. Past: @Google PhD fellow @uncnlp. Gooner.

Seattle, Washington · Joined May 2014
824 Following · 1.6K Followers
Pinned Tweet
Jaemin Cho@jmin__cho·
🥳 I am incredibly honored and grateful to receive the 2026 @UNC Distinguished Dissertation Award! This award recognizes four recipients across the whole university, and I’m humbled to represent the Mathematics, Physical Sciences, and Engineering category this year. Many thanks to my advisor @mohitban47, our MURGe-Lab family, and the @unccs @unc_ai_group for their constant support! 🙏 This is a great reminder of all the good memories from my PhD journey before I start my faculty career at The Johns Hopkins University 😊
10 replies · 17 reposts · 81 likes · 6.3K views
Swarnadeep Saha reposted
Jason Weston@jaseweston·
🧮 Reasoning over Mathematical Objects 🧮
Our 70-page(!) paper is out on arXiv, as covered by several of our recent blog posts. We study how to improve reasoning on hard tasks (e.g., math expressions) via:
• better training data (& new evals)
• better reward models (on-policy trained)
• better inference methods (on-policy trained)
📝: arxiv.org/pdf/2603.18886
3 replies · 36 reposts · 194 likes · 13K views
Swarnadeep Saha reposted
Jason Weston@jaseweston·
🔗Learning to Aggregate through Online RL🎯
ParaGator🔀🐊: strong parallel reasoning aggregation
Core claim: aggregation works best when training both stages together:
- the LLM generator should produce diverse candidates
- the LLM aggregator should synthesize them into a final answer
ParaGator trains candidate generation with pass@k and aggregation with pass@1, on-policy and end-to-end. This stops mode collapse and off-policy mismatch. Improves math & scientific reasoning. 🚀🏆
Read more in the blog post: facebookresearch.github.io/RAM/blogs/para…
3 replies · 24 reposts · 121 likes · 10.3K views
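The pass@k / pass@1 reward split described in the post above reads naturally as code. A minimal sketch, assuming hypothetical stand-ins `sample_candidates` and `aggregate` for the two LLM stages and an exact-match verifier; this illustrates the reward design, not ParaGator's implementation:

```python
# Minimal sketch of the pass@k / pass@1 reward split, not ParaGator's code.
# `sample_candidates` and `aggregate` are hypothetical stand-ins for the
# generator and aggregator LLMs.

def is_correct(answer: str, gold: str) -> bool:
    # Placeholder verifier: exact match stands in for a real checker.
    return answer.strip() == gold.strip()

def rollout_rewards(problem: str, gold: str, sample_candidates, aggregate, k: int = 8):
    candidates = sample_candidates(problem, k)    # stage 1: k diverse solutions
    final = aggregate(problem, candidates)        # stage 2: one synthesized answer
    gen_reward = 1.0 if any(is_correct(c, gold) for c in candidates) else 0.0  # pass@k
    agg_reward = 1.0 if is_correct(final, gold) else 0.0                       # pass@1
    return gen_reward, agg_reward

# Toy usage with trivial stand-ins:
gen = lambda p, k: ["42"] * 5 + ["17"] * (k - 5)         # 5 of 8 candidates correct
agg = lambda p, cands: max(set(cands), key=cands.count)  # majority-vote aggregator
print(rollout_rewards("problem", "42", gen, agg))        # (1.0, 1.0)
```

Rewarding the generator only through pass@k is what keeps the candidate pool diverse: one correct candidate already earns full credit, so collapsing all k samples onto one mode buys nothing.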
Swarnadeep Saha reposted
Seungone Kim@seungonekim·
🧮New work from @AIatMeta & @LTIatCMU! LM reasoning benchmarks mostly use simple answers like numbers (AIME) or multiple-choice options (GPQA). But for complex mathematical objects, performance drops sharply. We propose a set of solutions: arxiv.org/abs/2603.18886
1 reply · 22 reposts · 91 likes · 8.8K views
Swarnadeep Saha reposted
Jason Weston@jaseweston·
🧮 Principia: Training LLMs to Reason over Mathematical Objects 📐
We release:
- PrincipiaBench, a new eval for *mathematical objects* (not just numerical values or MCQ)
- Principia Collection: training data that improves reasoning across the board.
For models to help with scientific and mathematical work, you need to train on such data & test whether they can derive things like equations, sets, matrices, intervals, and piecewise functions. We show that this ends up improving the overall reasoning ability of your model for all tasks.
Read more in the blog post: facebookresearch.github.io/RAM/blogs/prin…
0 replies · 35 reposts · 127 likes · 12.2K views
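The evaluation difficulty these posts point at, namely that answers are objects like expressions rather than single numbers, can be made concrete with a symbolic-equivalence check. A small illustration using sympy; the benchmark's actual grader is not described in the posts, so treat this purely as a sketch of the problem:

```python
# Why grading mathematical objects is harder than grading numbers:
# syntactically different expressions can denote the same object. This
# uses sympy for symbolic equivalence; it is an illustration of the
# evaluation problem, not PrincipiaBench's actual grader.
import sympy

def expressions_match(predicted: str, gold: str) -> bool:
    """True if the two expression strings simplify to the same object."""
    diff = sympy.simplify(sympy.sympify(predicted) - sympy.sympify(gold))
    return diff == 0

print(expressions_match("sin(x)**2 + cos(x)**2", "1"))    # True
print(expressions_match("(x + 1)**2", "x**2 + 2*x + 1"))  # True
print(expressions_match("x**2", "x**3"))                  # False
```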
Swarnadeep Saha reposted
Jason Weston@jaseweston·
Sign up for the Meta Networking Mixer at ICLR 2026: events.atmeta.com/iclrnetworking…
Members of my team in FAIR co-authored 7 papers accepted to ICLR:
1/ The Alignment Waltz: Jointly Training Agents to Collaborate for Safety arxiv.org/abs/2510.08240
- Makes LLM safety a positive-sum game between a conversation & feedback agent
- At inference, feedback is adaptive, used when needed
- Improves safety & reduces overrefusals without degrading capabilities.
2/ J1: Incentivizing Thinking in LLM-as-a-Judge via Reinforcement Learning arxiv.org/abs/2505.10320
- Converts the judgment task into a verifiable one for both verifiable and non-verifiable prompts. Uses only synthetic pairwise data
- Optimizes thoughts, scores, and judgments using GRPO
- Outperforms all baselines at 8B & 70B scale, o1-mini, and on some benchmarks, even R1
- We find J1 uses various thought strategies: outlines evaluation criteria, compares against self-generated reference answers, and re-evaluates correctness
3/ Scaling Agent Learning via Experience Synthesis arxiv.org/abs/2511.03773
- Scales training environments for RL by simulating them with reasoning LLMs!
- Environment models + replay buffer + new tasks = cheap RL for any environment!
- Strong improvements over non-RL-ready environments and multiple model families!
- Works better in sim-2-real RL settings → warm-start for high-cost environments
4/ OptimalThinkingBench: Evaluating Over and Underthinking in LLMs arxiv.org/abs/2508.13141
- Thinking LLMs use a lot of tokens & overthink; non-thinking LLMs underthink & underperform.
- We introduce a benchmark which scores models in the quest to find the best mix.
- We evaluate 33 different SOTA models & find improvements are needed!
5/ RESTRAIN: From Spurious Votes to Signals -- Self-Driven RL with Self-Penalization arxiv.org/abs/2510.02172
- RESTRAIN turns spurious votes → self-improving signals. No labels needed
- Does this by self-penalizing unreliable reasoning paths: uses all rollouts, offsets the advantage of low-consistency rollouts, and down-weights low-consensus prompts (a toy sketch of this self-penalization follows this post)
- Results: beats existing techniques on both training-time (label-free) and test-time scaling — all without labels.
6/ LLM Pretraining with Continuous Concepts arxiv.org/abs/2502.08524
- An LLM pretraining framework that predicts concepts and mixes them into its hidden state to improve next token prediction.
- More sample-efficient: outperforms next token prediction, knowledge distillation, and inserting pause tokens.
- Boosts interpretability & steerability by analyzing and modifying predicted concepts.
7/ Hybrid Reinforcement: When Reward Is Sparse, It's Better to Be Dense arxiv.org/abs/2510.07242
- HERO bridges 0–1 verifiable rewards and dense reward models into one 'hybrid' RL method
- Tackles brittleness of 0-1 signals & the noise of pure reward models -> better results!
- Results: +11.7 pts vs RM-only, +9.2 pts vs verifier-only on hard-to-verify reasoning tasks
2 replies · 15 reposts · 110 likes · 17.5K views
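A toy, label-free sketch of the self-penalization idea in paper 5 (RESTRAIN) above. The formulas here are my illustrative choices, not the paper's: each rollout's pseudo-reward is the vote share of its final answer, advantages are mean-centered so low-consistency paths go negative, and the whole prompt is down-weighted when consensus is weak:

```python
# Illustrative sketch of self-penalization over a group of rollouts.
from collections import Counter

def self_penalized_advantages(answers: list[str]) -> tuple[list[float], float]:
    votes = Counter(answers)
    n = len(answers)
    rewards = [votes[a] / n for a in answers]      # vote-share pseudo-reward
    mean_r = sum(rewards) / n
    advantages = [r - mean_r for r in rewards]     # low-consistency rollouts go negative
    prompt_weight = max(votes.values()) / n        # down-weight low-consensus prompts
    return advantages, prompt_weight

advs, w = self_penalized_advantages(["42", "42", "42", "17", "9"])
print(advs, w)  # majority rollouts positive, outliers negative; weight 0.6
```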
Swarnadeep Saha reposted
Archiki Prasad@ArchikiPrasad·
🚨 I’m on the 2026 Research Scientist Job Market!
I am a PhD student at UNC Chapel Hill (advised by @mohitban47) and a recipient of the Apple Scholars in AI/ML PhD Fellowship. My research centers around:
🔸Reasoning & RL/Post-Training: Evaluating and interpreting the reasoning process, and improving post-training and alignment through self-generated and reward-based signals (Intrinsic Dim., ReCEVAL, ScPO, LASeR).
🔸Agents & Planning: Designing adaptive agent frameworks that use extra test-time compute & reasoning upon failure (ADaPT, System-1.x, PRInTS).
🔸Reward & Skill Discovery in Code: Leveraging execution signals to build reliable rewards, automate debugging, and discover abstractions in code (UTGen, ReGAL).
Prev (Research Intern): Google DeepMind, Meta FAIR, Allen Institute for AI (AI2), and Adobe Research.
Feel free to reach out via DM or email if you’re interested, have leads, or would like to connect!
🌐 archiki.github.io
📧 archiki@cs.unc.edu
#NLP #AI #JobSearch
15 replies · 59 reposts · 343 likes · 55.4K views
Mohit Bansal@mohitban47·
Deeply happy and honored to be elected as an ACL Fellow -- and to be a part of the respected cohort of this+past years' fellows (congrats everyone)! 🙏 All the credit (and sincere gratitude) to all my amazing students, postdocs, collaborators, mentors, and family! 🤗💙
25 replies · 34 reposts · 177 likes · 42.3K views
Swarnadeep Saha reposted
Jason Weston@jaseweston·
🤝 New Position Paper !!👤🔄🤖 @j_foerst and I wrote a position piece on what we think is the path to safer superintelligence: co-improvement. Everyone is focused on self-improving AI, but (1) we don't know how to do it yet, and (2) it might be misaligned with humans. Co-improvement: instead, build AI that collaborates *with us* to solve AI faster, and to help fix the alignment problem together. More details in the paper! Read it here: 📝:github.com/facebookresear…
26 replies · 95 reposts · 509 likes · 85.2K views
Swarnadeep Saha reposted
Jason Weston@jaseweston·
🌶️SPICE: Self-Play in Corpus Environments🌶️
📝: arxiv.org/abs/2510.24684
- Challenger creates tasks based on *corpora*
- Reasoner solves them
- Both trained together ⚔️ -> automatic curriculum! 🔥
Outperforms standard (ungrounded) self-play. Grounding fixes hallucination & lack of diversity. 🧵1/6
8 replies · 55 reposts · 333 likes · 79.9K views
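The loop in the post above can be sketched end to end. A toy version, assuming hypothetical `challenger` and `reasoner` callables and my own illustrative challenger reward (peaked at a target solve rate); the actual SPICE objective is in the paper:

```python
# Toy sketch of corpus-grounded self-play: a Challenger drafts a
# question+answer from a sampled document, the Reasoner attempts it
# several times, and the Challenger is rewarded for tasks near the
# Reasoner's frontier, yielding an automatic curriculum.
import random

def self_play_round(corpus, challenger, reasoner, k: int = 8, target: float = 0.5):
    doc = random.choice(corpus)                     # grounding document
    question, answer = challenger(doc)              # task anchored in the corpus
    outcomes = [reasoner(doc, question) == answer for _ in range(k)]
    reasoner_rewards = [1.0 if ok else 0.0 for ok in outcomes]   # verifiable 0/1
    solve_rate = sum(reasoner_rewards) / k
    challenger_reward = 1.0 - abs(target - solve_rate)  # peaks at mid-difficulty tasks
    return reasoner_rewards, challenger_reward
```

Peaking the Challenger's reward at a mid-range solve rate is one simple way to get the "automatic curriculum" effect: trivially easy and impossibly hard tasks both score poorly.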
Swarnadeep Saha reposted
Mimansa Jaiswal@MimansaJ·
I was impacted by Meta layoffs today. As a Research Scientist working on LLM post-training (reward models, DPO/GRPO) & automated evaluation pipelines, I’ve focused on understanding why/where models fail & how to make them better. I’m looking for opportunities; please reach out!
Susan Zhang@suchenzang

👀

121 replies · 230 reposts · 3.1K likes · 840.9K views
Swarnadeep Saha reposted
Jason Weston@jaseweston·
Hybrid Reinforcement (HERO): When Reward Is Sparse, It’s Better to Be Dense 🦸‍♂️ 💪
📝: arxiv.org/abs/2510.07242
- HERO bridges 0–1 verifiable rewards and dense reward models into one 'hybrid' RL method
- Tackles the brittleness of binary signals and the noise of pure reward models -> better results!
✔️ Stratified normalization anchors dense scores within verifier groups
✔️ Variance-aware weighting emphasizes harder, high-variance prompts
✔️ Stable + informative rewards, no drift
📈 Results:
🔥 +11.7 pts vs RM-only, +9.2 pts vs verifier-only on hard-to-verify reasoning tasks
🔥 Generalizes across Qwen and OctoThinker models
🔥 Works well when training with easy-to-verify/hard-to-verify/mixed samples.
Hybrid reward → stable, dense, reliable supervision, advancing reasoning RL 🧵(1/5)
4 replies · 56 reposts · 350 likes · 68K views
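The "stratified normalization" bullet above can be illustrated directly: normalize reward-model scores within each verifier group, then map the groups into disjoint bands so any verified-correct rollout outranks any incorrect one. The band edges below are my illustrative choices, not HERO's:

```python
# Illustrative stratified normalization: dense RM scores are normalized
# *within* each verifier group (correct vs. incorrect), then mapped into
# disjoint bands so the verifier's ordering is never overturned.

def stratified_rewards(verifier: list[int], rm_scores: list[float]) -> list[float]:
    def norm_within(group: list[float]) -> list[float]:
        lo, hi = min(group), max(group)
        span = (hi - lo) or 1.0                   # avoid divide-by-zero
        return [(s - lo) / span for s in group]

    out = [0.0] * len(verifier)
    for flag, (low_band, high_band) in ((0, (0.0, 0.4)), (1, (0.6, 1.0))):
        idx = [i for i, v in enumerate(verifier) if v == flag]
        if not idx:
            continue
        normed = norm_within([rm_scores[i] for i in idx])
        for i, s in zip(idx, normed):
            out[i] = low_band + s * (high_band - low_band)
    return out

print(stratified_rewards([1, 0, 1, 0], [0.9, 0.8, 0.2, 0.1]))
# [1.0, 0.4, 0.6, 0.0]: correct rollouts land in [0.6, 1.0], incorrect in [0.0, 0.4]
```

The point of the bands is ordinal: the dense score only breaks ties inside a verifier group, so a noisy reward model can never override the verifier.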
Swarnadeep Saha reposted
Mohit Bansal@mohitban47·
🚨 "Think the right amount" for improving both reasoning accuracy and efficiency! --> Large reasoning models under-adapt = underthink on hard problems and overthink on easy ones --> ✨TRAAC✨ is an online RL, difficulty-adaptive, attention-based compression method that prunes redundant reasoning steps using the model’s self-attention, and adaptively decides the “fine-grained thinking budget” based on problem difficulty --> Improves both reasoning accuracy (+8.4%) and efficiency (+36.8%)! 🧵👇
Joykirat@joykiratsingh

🚨 Excited to announce TRAAC, an online difficulty-adaptive, attention-based method that handles the tradeoff of under- & overthinking in reasoning models to improve both accuracy and efficiency.
Underthinking ❌: Models terminate reasoning too early on harder problems, leading to errors.
Overthinking 🔁: Models think excessively for simpler tasks, generating unnecessary steps and inflating test-time computation.
1️⃣: TRAAC dynamically allocates the “thinking budget” via Difficulty Calibration (adapt to problem hardness) and Attention-Based Compression (prune redundant steps).
2️⃣: Across a variety of tasks (AIME, AMC, GPQA-D, BBEH), TRAAC (Qwen3-4B) achieves an average absolute accuracy gain of 8.4% with a relative reduction in reasoning length of 36.8% compared to the base model.
3️⃣: Analysis shows task-difficulty calibration and attention-based compression are effective for both accuracy and efficiency gains.

1 reply · 18 reposts · 77 likes · 9.2K views
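The two ingredients named in these posts, a difficulty-calibrated thinking budget plus attention-based pruning, can be sketched as follows. Per-step attention mass arrives precomputed here, and the budget schedule is my own illustrative choice rather than TRAAC's:

```python
# Illustrative sketch: keep only the reasoning steps that receive the
# most attention mass, under a budget that grows with problem difficulty.
# The real method reads attention from the model's own self-attention.

def compress_reasoning(steps: list[str], attn_mass: list[float],
                       difficulty: float) -> list[str]:
    """difficulty in [0, 1]: harder problems keep a larger share of steps."""
    keep_frac = 0.3 + 0.7 * difficulty             # calibrated budget (illustrative)
    k = max(1, round(keep_frac * len(steps)))
    ranked = sorted(range(len(steps)), key=lambda i: attn_mass[i], reverse=True)
    kept = sorted(ranked[:k])                      # top-k steps, original order
    return [steps[i] for i in kept]

steps = ["restate problem", "key lemma", "redundant check", "final derivation"]
print(compress_reasoning(steps, [0.1, 0.4, 0.05, 0.45], difficulty=0.2))
# easy problem -> small budget -> only the highest-attention steps survive
```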
Swarnadeep Saha reposted
Gabriel Synnaeve@syhw·
(🧵) Today, we release Meta Code World Model (CWM), a 32-billion-parameter dense LLM that enables novel research on improving code generation through agentic reasoning and planning with world models. ai.meta.com/research/publi…
60 replies · 312 reposts · 1.8K likes · 918.7K views
Wenting Zhao@wzhao_nlp·
I’ve recently joined @Alibaba_Qwen! We’re building the next generation of frontier models through careful science and world-class engineering, and we are making rapid progress. Excited for what’s ahead 💜
61 replies · 23 reposts · 905 likes · 115.9K views
Swarnadeep Saha@swarnaNLP·
Turns out we can use RLVR to teach a model to aggregate multiple solutions. Check out our new work on parallel test-time scaling!👇
Jason Weston@jaseweston

🌀New Test-time scaling method 🌀
📝: arxiv.org/abs/2509.06870
- Use RL to train an LLM solution aggregator – reasons, reviews, reconciles, and synthesizes a final solution -> much better than existing techniques!
- Simple new method. Strong results across 4 math benchmarks. 🧵1/5

0 replies · 0 reposts · 19 likes · 1.9K views
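The idea in the quoted post, training an aggregator with a verifiable reward (RLVR) over groups of rollouts, can be sketched in a GRPO-style form. All names below (`sample_solutions`, `aggregator`) are hypothetical placeholders, not the paper's API:

```python
# Sketch of RLVR for a solution aggregator: sample parallel candidate
# solutions, let the aggregator synthesize a final answer several times,
# and mean-center the verifiable 0/1 rewards into group advantages.

def aggregation_advantages(problem: str, gold: str, sample_solutions, aggregator,
                           k: int = 4, group: int = 8) -> list[float]:
    solutions = sample_solutions(problem, k)     # parallel candidate solutions
    rewards = []
    for _ in range(group):                       # group of aggregator rollouts
        final = aggregator(problem, solutions)   # reason, review, reconcile, synthesize
        rewards.append(1.0 if final.strip() == gold.strip() else 0.0)
    mean_r = sum(rewards) / group
    return [r - mean_r for r in rewards]         # advantages for the RL update
```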
Swarnadeep Saha@swarnaNLP·
Post-training with RL causes diversity collapse!! We found a way to directly incorporate semantic diversity as an additional reward that improves both quality and diversity of outputs. 👇
Jason Weston@jaseweston

🌀Diversity Aware RL (DARLING)🌀
📝: arxiv.org/abs/2509.02534
- Jointly optimizes for quality & diversity using a learned partition function
- Outperforms standard RL in quality AND diversity metrics, e.g. higher pass@1/p@k
- Works for both non-verifiable & verifiable tasks 🧵1/5

0 replies · 0 reposts · 10 likes · 1.1K views
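The "semantic diversity as an additional reward" idea can be sketched with a placeholder partition function standing in for the learned one described in the quoted post; the multiplicative combination below is my illustrative choice, not DARLING's exact objective:

```python
# Illustrative diversity-aware reward: each rollout's quality reward is
# scaled up by how many sibling rollouts fall into a *different* semantic
# partition cell. `partition` maps a response to a cluster id; the real
# method learns this partition function.

def diversity_aware_rewards(responses: list[str], quality: list[float],
                            partition) -> list[float]:
    labels = [partition(r) for r in responses]
    n = len(responses)
    rewards = []
    for i in range(n):
        # Share of *other* rollouts landing in a different partition cell.
        distinct = sum(labels[i] != labels[j] for j in range(n) if j != i)
        diversity = distinct / (n - 1)
        rewards.append(quality[i] * (1.0 + diversity))  # multiplicative combo
    return rewards

# Toy partition: bucket responses by their second word.
rs = ["use induction on n", "use induction on n", "try a counterexample"]
print(diversity_aware_rewards(rs, [1.0, 1.0, 1.0], lambda s: s.split()[1]))
# [1.5, 1.5, 2.0]: the distinct strategy earns the larger reward
```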