Matt Jones retweeted

Mini-R1: Reproduce the @deepseek_ai R1 "aha moment", an RL tutorial! Recreate the "aha moment" using Group Relative Policy Optimization (GRPO): train an open model with reinforcement learning so that it learns self-verification and search abilities on its own to solve the Countdown Game.
TL;DR:
🤯 DeepSeek R1's "aha moment" demonstrates RL's potential for self-improvement in LLMs.
2️⃣ Using 2 reward functions, 1x for format (<think>, <answer>) and 1x for correctness
🤖 Qwen2.5-3B-Instruct model learns self-verification and search abilities.
⚙️ Uses @MSFTDeepSpeed and @vllm_project for efficient, distributed online RL training with @huggingface TRL
🤟 Includes training observations and hyperparameter improvements
🧮 Uses Countdown Game (arithmetic puzzles) to teach models self-correction via <think> and <answer> tags
📊 Achieves 50% success rate after 450 training steps on 4x H100 GPUs
⚡ Training takes ~6 hours on 4x H100 GPUs for 450 steps
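
For a feel of what the two reward functions could look like, here is a minimal sketch in plain Python. It is not the tutorial's actual code: the names `format_reward` and `correctness_reward` are hypothetical, and it assumes the completion contains the full `<think>...</think>` and `<answer>...</answer>` text, that every given number must be used exactly once, and that only `+`, `-`, `*`, `/` and parentheses are allowed.

```python
import re


def format_reward(completion: str) -> float:
    """Return 1.0 if the text is <think>...</think> followed by <answer>...</answer>."""
    pattern = r"<think>.+?</think>\s*<answer>.+?</answer>"
    ok = re.fullmatch(pattern, completion.strip(), re.DOTALL)
    return 1.0 if ok else 0.0


def correctness_reward(completion: str, numbers: list[int], target: int) -> float:
    """Return 1.0 if the equation in <answer> uses each number once and hits the target."""
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match is None:
        return 0.0
    equation = match.group(1).strip()

    # Allow only digits, the four basic operators, parentheses and whitespace.
    if not re.fullmatch(r"[\d+\-*/()\s]+", equation):
        return 0.0
    # Reject power and floor-division, which the character check alone lets through.
    if "**" in equation or "//" in equation:
        return 0.0

    # Assumed rule: every provided number appears exactly once.
    used = sorted(int(n) for n in re.findall(r"\d+", equation))
    if used != sorted(numbers):
        return 0.0

    try:
        # The equation was restricted to arithmetic characters above,
        # and builtins are disabled as an extra precaution.
        value = eval(equation, {"__builtins__": {}}, {})
        return 1.0 if abs(value - target) < 1e-5 else 0.0
    except Exception:
        # Malformed expressions, division by zero, etc. earn no reward.
        return 0.0


sample = "<think>55 + 36 = 91, 91 - 7 = 84, 84 - 19 = 65</think>\n<answer>55 + 36 - 7 - 19</answer>"
print(format_reward(sample), correctness_reward(sample, [19, 36, 55, 7], 65))
```

Keeping the two signals separate means a completion can earn credit for following the format before it ever gets the arithmetic right.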
