

Dheeraj Mishra
133 posts

@mishra945
MTech (CSP & ML), EECS @ IIT Bombay | Founder @ EECS Academy & POSTGATE - EdTech startup for GATE test prep online platform



Personal update: I've joined Anthropic. I think the next few years at the frontier of LLMs will be especially formative. I am very excited to join the team here and get back to R&D. I remain deeply passionate about education and plan to resume my work on it in time.

Reinforcement Learning (RL) is quickly becoming the most important skill for AI researchers. Here are the best resources for learning RL for LLMs…

TL;DR: RL is more important now than it has ever been, but (probably due to its complexity) there aren’t a ton of great resources for learning it online. I’ve been doing a lot of reading / learning on RL recently, so I wanted to share the best resources I’ve found. Links to all resources are provided in the image below.

(1) RLHF book. Nathan Lambert is a long-time RL researcher and an expert on LLM alignment / post-training. He decided to write an entire book on (LLM-focused) RL techniques and has been slowly expanding / iterating on it over the last year. This is the most comprehensive RL resource currently available, and it’s especially useful for those who are unfamiliar with RL and still need to learn the basics.

(2) Spinning Up. The Spinning Up in Deep RL course from OpenAI, despite being created in ~2018, has stood the test of time and is one of the best tutorials for learning RL. The course builds up to PPO, which is one of the most widely used algorithms for RL with LLMs. Plus, understanding the related algorithms (policy gradients, TRPO, etc.) helps a lot when learning newer RL algorithms like GRPO.

(3) PPO / GRPO blog. Jimmy Shi (DeepMind) recently wrote a great blog explaining both PPO (the RL algo traditionally used for RLHF) and GRPO (the RL algo used for reasoning models). It’s written in a way that is understandable for non-RL people.

(4) HuggingFace RL. HuggingFace has also published numerous useful blogs on RL. Most recently, they published a blog that explains GRPO and PPO from the ground up (i.e., not assuming any background knowledge of RL). These blogs are inspired by the recent initiative from HuggingFace to create a fully open replication of DeepSeek-R1.

















