Piyush Tiwary

@backpropogator
Ph.D. at @iiscbangalore, SR at @GoogleDeepMind | Previously @AdobeResearch @IITPAT | Tweets may seem random individually but show patterns at scale.




Another work that confirms Self-Distillation's effectiveness, this time from @AIatMeta's MSL, by Zhao et al.

In "Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models", the authors perform self-distillation (SD) much like the other SD setups:
-> Student generates its own output
-> Teacher is the same model, but outputs with access to the ground truth + CoT
-> Model is trained on the KL divergence between Student and Teacher

Worth noting:
-> Setup is applied to maths here
-> Teacher = the initial student: no EMA, so the teacher is never trained
-> Trained on the full vocabulary or on sampled tokens only; just like in the other studies, full vocabulary gives better results

Findings:
-> Performs similarly to GRPO under a comparable parameter setup
-> Much, much more token efficient than GRPO, to an actually absurd degree

The fact that so many works are converging on the strength of SD means we're only at the beginning =)
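To make the "full vocabulary vs sampled tokens" distinction concrete, here is a rough pure-Python sketch of the two per-token losses. It is not the paper's code. The KL direction (student || teacher), the function names, and the plain averaging over positions are all assumptions for illustration; the post only says the model is trained on the KL between Student and Teacher.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def full_vocab_kl(student_logits, teacher_logits):
    """KL(student || teacher) at one position, summed over the whole vocabulary.
    The direction is an assumption; the post does not specify it."""
    s = log_softmax(student_logits)
    t = log_softmax(teacher_logits)
    return sum(math.exp(ls) * (ls - lt) for ls, lt in zip(s, t))

def sampled_token_term(student_logits, teacher_logits, token_id):
    """Sampled-token variant: the log-ratio at the single token the student
    actually generated. Averaged over student samples it estimates the same KL."""
    s = log_softmax(student_logits)
    t = log_softmax(teacher_logits)
    return s[token_id] - t[token_id]

def sequence_loss(student_steps, teacher_steps, sampled_ids=None):
    """Mean per-position loss over one student-generated sequence.
    student_steps: per-position logits of the student (no ground truth in context).
    teacher_steps: per-position logits of the same initial model, scored on the
                   same tokens but with ground truth + CoT in its context.
    sampled_ids:   None -> full-vocabulary KL; otherwise the generated token ids."""
    if sampled_ids is None:
        terms = [full_vocab_kl(s, t) for s, t in zip(student_steps, teacher_steps)]
    else:
        terms = [sampled_token_term(s, t, i)
                 for s, t, i in zip(student_steps, teacher_steps, sampled_ids)]
    return sum(terms) / len(terms)

# Toy logits over a 3-token vocabulary, two positions (made-up numbers).
student = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]]
teacher = [[2.5, 0.0, -1.0], [0.0, 0.0, 1.0]]
full_loss = sequence_loss(student, teacher)
sampled_loss = sequence_loss(student, teacher, sampled_ids=[0, 2])
```

In a real training loop only the student would receive gradients, since the teacher is the untrained initial model. The full-vocabulary version uses the whole teacher distribution at every position, while the sampled version only sees one log-ratio per position, which is one plausible reading of why full vocabulary does better.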














