
Skander Moalla

@SkanderMoalla
RS Intern @ Meta FAIR | PhD @ EPFL, Caglar Gulcehre Lab for AI Research (CLAIRE) | Reinforcement learning, Large Language Model post-training

🚀 Big time! We can finally do LLM RL fine-tuning with rewards and leverage offline/off-policy data!
❌ You want rewards, but GRPO only works online?
❌ You want offline, but DPO is limited to preferences?
✅ QRPO can do both! 🧵 Here's how we do it:
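
The tweet doesn't spell out the mechanism, so here is a minimal sketch of one way a reward-based offline objective of this kind can be set up, assuming a quantile-transformed reward whose KL-regularized optimum has a closed-form partition function: map raw rewards to their quantile under reference-policy samples, then regress the policy/reference log-ratio toward the resulting target on any offline or off-policy data. All function names, signatures, and the beta value below are my own illustration, not taken from the thread or the authors' code.

```python
import math

import torch


def quantile_reward(rewards: torch.Tensor, ref_rewards: torch.Tensor) -> torch.Tensor:
    # Empirical quantile of each training reward under N completions
    # sampled from the reference policy for the same prompt.
    # rewards: (B,), ref_rewards: (N,). The output lies in [0, 1] and is
    # approximately Uniform(0, 1), whatever the raw reward scale.
    return (ref_rewards.unsqueeze(0) <= rewards.unsqueeze(1)).float().mean(dim=1)


def qrpo_loss(logp_policy, logp_ref, q_rewards, beta: float = 0.1):
    # For a Uniform(0, 1) reward, the partition function of the
    # KL-regularized optimum pi* ∝ pi_ref * exp(r / beta) has a closed form:
    #   Z = E_{u ~ U(0,1)}[exp(u / beta)] = beta * (exp(1 / beta) - 1),
    # computed here in log-space for numerical stability.
    log_z = math.log(beta) + 1.0 / beta + math.log1p(-math.exp(-1.0 / beta))
    # Regress the policy/reference log-ratio toward the reward-derived
    # target. No on-policy sampling is needed, so any offline or
    # off-policy completions with rewards can be used.
    target = q_rewards / beta - log_z
    return ((logp_policy - logp_ref) - target).pow(2).mean()
```

In training, logp_policy and logp_ref would be the sequence log-probabilities of the same completion under the current and reference models, and the per-prompt reference samples used for the quantile transform can be precomputed once offline.
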




A new milestone in automatic formalization: We translated an entire graduate math textbook into Lean using 30K LLM agents. Open-source, large-scale multi-agent inference that actually works.
> Blueprint+Lean: faabian.github.io/algebraic-comb…
> Codebase+preprint: github.com/facebookresear…
1/7
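
For readers unfamiliar with the target format, here is a tiny, purely illustrative Lean 4 + Mathlib example of what formalizing a textbook-style combinatorics statement looks like; it is not taken from the project's blueprint.

```lean
import Mathlib

/-- Textbook statement: the entries of row `n` of Pascal's triangle
sum to `2 ^ n`. -/
theorem row_sum_pow_two (n : ℕ) :
    ∑ k ∈ Finset.range (n + 1), n.choose k = 2 ^ n :=
  Nat.sum_range_choose n
```
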





📢 « Partition Generative Modeling (PGM): Masked Modeling without Masks » is out!
🚯 Masked diffusion models waste FLOPs processing countless mask tokens that carry no real information.
⚡ We show how partitioning can replace masking, boosting throughput by >5.3x on text and up to 7.5x on VQ-ImageNet!
📄 Paper: arxiv.org/abs/2505.18883
💻 Code: github.com/jdeschena/pgm
🤗 Models: huggingface.co/jdeschena/pgm
1/9 🧵
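
The throughput claim follows from an input-layout argument, and a toy sketch makes it concrete. This is my own illustration of the contrast the tweet describes, not the paper's code: masked modeling runs the model over the full-length sequence with placeholder MASK tokens, while a partition-style model only encodes the visible partition of positions, so encoder compute scales with the visible subset.

```python
import torch


def masked_modeling_inputs(tokens, mask, mask_id):
    # Standard masked modeling: the model still processes a full-length
    # sequence in which masked positions are replaced by a MASK token,
    # so FLOPs grow with the full sequence length L.
    inputs = tokens.clone()
    inputs[mask] = mask_id
    return inputs  # shape (B, L); mask positions carry no content


def partition_inputs(tokens, visible_idx):
    # Partition-style modeling (toy version): only the visible partition
    # is gathered and fed to the model, which then predicts the hidden
    # partition, so compute scales with the number of visible tokens.
    return tokens.gather(1, visible_idx)  # shape (B, |visible|)
```

If, say, 80% of positions would otherwise be mask tokens, the partitioned input is roughly 5x shorter, which is the kind of headroom behind speedups like those reported above.
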



New Anthropic Fellows research: How does misalignment scale with model intelligence and task complexity? When advanced AI fails, will it do so by pursuing the wrong goals? Or will it fail unpredictably and incoherently, like a "hot mess"?
Read more: alignment.anthropic.com/2026/hot-mess-…



🚨 Postdoc available in my lab @EPFL on safe AI, alignment, LLMs, NLP, mech interp
🎯 Safe AI that truly cares about humans! No "lipstick-on-a-pig alignment" 🐖💄 after pretraining; let's "raise" models aligned from token 1 onward 🍼
👉 Info & app: go.epfl.ch/safe-ai-postdoc






