Sarthak Mittal

316 posts

@sarthmit

Graduate Student at @Mila_Quebec and Student Researcher at @GoogleResearch. Previously interned at @Meta @Apple @MorganStanley @NVIDIAAI and @YorkUniversity

Montréal, Québec · Joined February 2019
831 Following · 1K Followers
Pinned Tweet
Sarthak Mittal @sarthmit
Meta on meta: thrilled to share our work on Meta-learning… at Meta! 🔥🧠 We make two major contributions: 1️⃣ Unified framework revealing insights into various amortizations 🧠 2️⃣ Greedy belief-state updates to handle long context-lengths 🚀
[image]
5 replies · 31 reposts · 225 likes · 45.8K views
Sarthak Mittal retweeted
Volodymyr Kuleshov 🇺🇦 @volokuleshov
@StefanoErmon 🚨Hot off the presses: the official Artificial Analysis benchmarking results are in! 🚀Mercury models set a new frontier of speed and agentic quality
[image]
4 replies · 8 reposts · 119 likes · 55.1K views
Nicolas Zucchet @NicolasZucchet
Thrilled to announce I’m joining @Stanford as a postdoc in @scott_linderman’s lab, generously supported by a @snsf_ch Postdoc Mobility fellowship! Excited for what’s coming next! 🚀
[image]
8 replies · 5 reposts · 180 likes · 10.7K views
Sarthak Mittal @sarthmit
We identify subtle pitfalls in RL fine-tuning for LLMs: widely used frameworks can produce biased gradients (e.g., GRPO + K3), hurting both performance and training stability. We revisit first principles to systematically evaluate such biased estimators. Tl;dr: go for unbiased!
Vedant Shah @veds_12

LOTs of discourse lately about the correctness of the KL-regularization term used in RLVR fine-tuning of LLMs. Which estimator to use? Whether to add it to the reward or loss? What’s even the difference? 🤔 In our new preprint, we evaluate these choices empirically. 🧵 1/n

0 replies · 1 repost · 6 likes · 854 views
Sarthak Mittal @sarthmit
@YouJiacheng Isn’t that exactly what the paper says: that the way K3 is used in GRPO (e.g.) leads to an incorrect gradient estimator? The issue is in the implementation, i.e., the form on which autograd is applied, which can lead to this bias.
0 replies · 0 reposts · 1 like · 211 views
You Jiacheng @YouJiacheng
the math here is SIMPLE. if an estimator is correct at both θ and θ+δ, and you say the gradient is wrong, then the only conclusion is that your derivation of the gradient is WRONG. δ→0 implies dot(df/dθ, δ)→[f(θ+δ) - f(θ)]
You Jiacheng @YouJiacheng

😅arxiv.org/abs/2512.21852 who said that "using k3 in loss = using path-wise grad"??? the correct way to use k3 in loss is to use the FULL grad. og GRPO used k3 without IS-correction (= path-wise grad), which is wrong. but it's not k3's fault!!!

1 reply · 2 reposts · 62 likes · 11.2K views
Sarthak Mittal retweeted
Vedant Shah @veds_12
Hi @YouJiacheng. Thanks for going through our paper. Just to clarify, when we refer to "k3-in-loss" (or any estimator in loss) in the paper, we mean using it in the manner GRPO did, i.e. without the importance sampling correction, and we are just pointing out that it is biased. We mention this in Section 3. We are not saying that the estimator is wrong and agree that the correct way to use it is with the importance sampling ratio as pointed out in @yifan_zhang_'s paper.
[image]
3 replies · 8 reposts · 34 likes · 16.8K views
Sarthak Mittal retweeted
Rosinality @rosinality
One more study (arxiv.org/abs/2510.01555) on KL penalty with K1, K3 estimators as a reward or a loss.
[two images]
5 replies · 25 reposts · 194 likes · 38.7K views
Sarthak Mittal retweeted
Xidulu @xidulu
It's unbelievable that in 2025 we would need a paper telling people to use calculus properly and not move the gradient operator inside the expectation arbitrarily (still glad to see the unbiased estimator is the best)
Rosinality @rosinality

One more study (arxiv.org/abs/2510.01555) on KL penalty with K1, K3 estimators as a reward or a loss.

4 replies · 19 reposts · 257 likes · 29.5K views
Sarthak Mittal retweeted
will brown @willccbb
@severinhacker the point of a PhD is not to get a PhD, it’s to do a PhD
24 replies · 93 reposts · 1.8K likes · 106.8K views
Sarthak Mittal retweeted
Machine Learning Street Talk @MLStreetTalk
"Superintelligence vs. extinction: those are your two options." Professor Michael I. Jordan, a pioneer in the field, says the AI discourse in 2025 is really hurting young researchers. He argues that it is demoralising, that bright futures are being snuffed out, and that there is "zero" economic thinking behind it.
3 replies · 13 reposts · 120 likes · 24.3K views
Sarthak Mittal retweeted
Siddarth Venkatraman @siddarthv66
We’ll be presenting this at the FoRLM workshop tomorrow, between 10:15 and 11:30am in room 33! Drop by if you’d like to chat about this paper, or about RL for LLMs in general (I have some juicy new insights)
Siddarth Venkatraman @siddarthv66

NO verifiers. NO Tools. Qwen3-4B-Instruct can match DeepSeek-R1 and o3-mini (high) with ONLY test-time scaling. Presenting Recursive Self-Aggregation (RSA) — the strongest test-time scaling method I know of! Then we use aggregation-aware RL to push further!! 📈📈 🧵below!

3 replies · 7 reposts · 30 likes · 11.6K views