Sarthak Mittal

316 posts

@sarthmit

Graduate Student at @Mila_Quebec and Student Researcher at @GoogleResearch. Previously interned at @Meta @Apple @MorganStanley @NVIDIAAI and @YorkUniversity

Montréal, Québec · Joined February 2019
831 Following · 1K Followers
Pinned Tweet
Sarthak Mittal @sarthmit
Meta on meta: thrilled to share our work on Meta-learning… at Meta! 🔥🧠 We make two major contributions: 1️⃣ Unified framework revealing insights into various amortizations 🧠 2️⃣ Greedy belief-state updates to handle long context-lengths 🚀
[image]
5 replies · 31 reposts · 225 likes · 45.8K views
Sarthak Mittal retweeted
Volodymyr Kuleshov 🇺🇦 @volokuleshov
@StefanoErmon 🚨Hot off the presses: the official Artificial Analysis benchmarking results are in! 🚀Mercury models set a new frontier of speed and agentic quality
[image]
4 replies · 8 reposts · 119 likes · 55.1K views
Nicolas Zucchet @NicolasZucchet
Thrilled to announce I’m joining @Stanford as a postdoc in @scott_linderman’s lab, generously supported by a @snsf_ch Postdoc Mobility fellowship! Excited for what’s coming next! 🚀
[image]
8 replies · 5 reposts · 180 likes · 10.7K views
Sarthak Mittal @sarthmit
We identify subtle pitfalls in RL fine-tuning for LLMs: widely used frameworks can produce biased gradients (e.g., GRPO + K3), hurting both performance and training stability. We revisit first principles to systematically evaluate such biased estimators. Tl;dr: go for unbiased!
Vedant Shah @veds_12

LOTs of discourse lately about the correctness of the KL-regularization term used in RLVR fine-tuning of LLMs. Which estimator to use? Whether to add it to the reward or loss? What’s even the difference? 🤔 In our new preprint, we evaluate these choices empirically. 🧵 1/n

0 replies · 1 repost · 6 likes · 854 views
Sarthak Mittal @sarthmit
@YouJiacheng Isn’t that exactly what the paper says: that the way K3 is used in GRPO (e.g.) leads to an incorrect gradient estimator? The issue is in the implementation, i.e., the form on which autograd is applied, which can lead to this bias.
0 replies · 0 reposts · 1 like · 211 views
You Jiacheng @YouJiacheng
the math here is SIMPLE. if an estimator is correct at both θ and θ+δ, and you say the gradient is wrong, then the only conclusion is that your derivation of the gradient is WRONG. δ→0 implies dot(df/dθ, δ)→[f(θ+δ) - f(θ)]
You Jiacheng @YouJiacheng

😅arxiv.org/abs/2512.21852 who said that "using k3 in loss = using path-wise grad"??? the correct way to use k3 in loss is to use the FULL grad. og GRPO used k3 without IS-correction (= path-wise grad), which is wrong. but it's not k3's fault!!!

1 reply · 2 reposts · 62 likes · 11.2K views
Sarthak Mittal retweeted
Vedant Shah @veds_12
Hi @YouJiacheng. Thanks for going through our paper. Just to clarify, when we refer to "k3-in-loss" (or any estimator in loss) in the paper, we mean using it in the manner GRPO did, i.e. without the importance sampling correction, and we are just pointing out that it is biased. We mention this in Section 3. We are not saying that the estimator is wrong and agree that the correct way to use it is with the importance sampling ratio as pointed out in @yifan_zhang_'s paper.
[image]
3 replies · 8 reposts · 34 likes · 16.8K views
Sarthak Mittal retweeted
Rosinality @rosinality
One more study (arxiv.org/abs/2510.01555) on KL penalty with K1, K3 estimators as a reward or a loss.
[two images]
5 replies · 25 reposts · 194 likes · 38.7K views
Sarthak Mittal retweeted
Xidulu @xidulu
It's unbelievable that in 2025 we would need a paper telling people to use calculus properly and not move the gradient operator inside the expectation arbitrarily (still glad to see the unbiased estimator is the best)
Rosinality @rosinality

One more study (arxiv.org/abs/2510.01555) on KL penalty with K1, K3 estimators as a reward or a loss.

4 replies · 19 reposts · 257 likes · 29.5K views
Sarthak Mittal retweeted
will brown @willccbb
@severinhacker the point of a PhD is not to get a PhD, it’s to do a PhD
24 replies · 93 reposts · 1.8K likes · 106.8K views
Sarthak Mittal retweeted
Machine Learning Street Talk @MLStreetTalk
"Superintelligence vs. extinction: those are your two options." Professor Michael I. Jordan, a pioneer in the field, says the AI discourse in 2025 is really hurting young researchers. He argues that it is demoralising, that bright futures are being snuffed out, and that there is "zero" economic thinking behind it.
3 replies · 13 reposts · 120 likes · 24.3K views
Sarthak Mittal retweeted
Siddarth Venkatraman @siddarthv66
We’ll be presenting this at the FoRLM workshop tomorrow, between 10:15 and 11:30am in room 33! Drop by if you’d like to chat about this paper, or about RL for LLMs in general (I have some juicy new insights)
Siddarth Venkatraman @siddarthv66

NO verifiers. NO Tools. Qwen3-4B-Instruct can match DeepSeek-R1 and o3-mini (high) with ONLY test-time scaling. Presenting Recursive Self-Aggregation (RSA) — the strongest test-time scaling method I know of! Then we use aggregation-aware RL to push further!! 📈📈 🧵below!

3 replies · 7 reposts · 30 likes · 11.6K views