Today we're releasing prime-rl v0.6.0, enabling efficient RL at trillion-parameter MoE scale on agentic workloads.
We've relentlessly optimized our RL infra.
The result: RL training of GLM-5 on agentic SWE tasks at 131k context with sub-5-minute step times.
Over a long run the trainer and inference policies slowly drift apart, and that mismatch can kill your training.
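One way to watch this drift is to compare the log-probabilities the two sides assign to the same sampled tokens. The sketch below is illustrative only and not taken from prime-rl: the function name `mismatch_kl` is hypothetical, and it assumes you already have per-token logprobs from both the inference engine and the trainer. It uses the common low-variance "k3" estimator of KL(inference || trainer).

```python
import math

def mismatch_kl(inference_logprobs, trainer_logprobs):
    """Estimate KL(inference || trainer) from tokens sampled by the inference engine.

    Both arguments are lists of log-probabilities for the same sampled tokens.
    Uses the k3 estimator: (r - 1) - log(r), with r = p_trainer / p_inference.
    Each per-token term is non-negative and is zero when the two agree.
    """
    total = 0.0
    for lp_inf, lp_train in zip(inference_logprobs, trainer_logprobs):
        log_ratio = lp_train - lp_inf
        total += math.exp(log_ratio) - 1.0 - log_ratio
    return total / len(inference_logprobs)

# Hypothetical logprobs for three sampled tokens
same = mismatch_kl([-0.5, -1.2, -0.1], [-0.5, -1.2, -0.1])
drifted = mismatch_kl([-0.5, -1.2, -0.1], [-0.7, -1.0, -0.4])
```

Logging a number like this every step makes a slow drift visible long before it shows up in the reward curve.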
R3 (router replay) captures the routing decisions from the inference engine and replays them on the trainer. KL mismatch drops ~10x.
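The idea can be sketched for a single token in plain Python. This is a toy illustration, not the prime-rl implementation: the `route` function and the logits are made up, and it assumes the replayed expert indices replace the trainer's own top-k choice while the gate weights are still computed from the trainer's logits.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route(logits, k, replay_indices=None):
    """Pick k experts for one token and return (indices, gate_weights).

    Without replay, experts are the trainer's own top-k.
    With replay, the expert choice comes from replay_indices (the decision
    recorded on the inference engine); gate weights are still derived from
    the logits passed in, renormalized over the selected experts.
    """
    probs = softmax(logits)
    if replay_indices is None:
        indices = sorted(range(len(logits)), key=lambda i: -logits[i])[:k]
    else:
        indices = list(replay_indices)
    selected = [probs[i] for i in indices]
    norm = sum(selected)
    return indices, [p / norm for p in selected]

# Tiny numerical differences between the two engines flip the second expert.
inference_logits = [1.00, 0.99, 0.98, -1.0]
trainer_logits = [1.00, 0.98, 0.99, -1.0]

recorded, _ = route(inference_logits, k=2)            # captured at rollout time
own_choice, _ = route(trainer_logits, k=2)            # trainer without replay
replayed, weights = route(trainer_logits, k=2, replay_indices=recorded)
```

Here the trainer on its own would pick a different second expert than the inference engine did, while the replayed call is forced onto the same experts the rollout actually used.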
The trainer is 3D-parallel, combining FSDP2, context parallelism (CP), and expert parallelism (EP), and is built on TorchTitan.
FSDP2 shards params, grads, and optimizer state. EP keeps experts sharded and routes tokens with all-to-all instead of all-gathering ~80GB per layer. CP handles the 131k context and GLM-5's DSA attention.
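To make the EP token routing concrete, here is a toy sketch of the bookkeeping that precedes an all-to-all: each rank groups its local tokens by the rank that owns the chosen expert, so only tokens move and expert weights stay put. It is an assumption-laden illustration rather than prime-rl or TorchTitan code: `build_dispatch` is a hypothetical helper, experts are assumed to be placed contiguously across ranks, and routing is top-1 for brevity.

```python
def build_dispatch(token_experts, num_experts, ep_size):
    """Group local token ids by the EP rank that owns their assigned expert.

    Assumes experts are laid out contiguously: rank r owns experts
    [r * experts_per_rank, (r + 1) * experts_per_rank).
    The returned list has one send bucket per destination rank.
    """
    experts_per_rank = num_experts // ep_size
    send = [[] for _ in range(ep_size)]
    for token_id, expert in enumerate(token_experts):
        send[expert // experts_per_rank].append(token_id)
    return send

# Expert chosen for each of six local tokens (hypothetical values)
token_experts = [0, 5, 2, 7, 5, 1]
send = build_dispatch(token_experts, num_experts=8, ep_size=4)
# send[r] holds the token ids this rank would ship to rank r in the all-to-all
```

In a real trainer these buckets would become the split sizes for a collective all-to-all, followed by a second all-to-all that returns expert outputs to the ranks that own the tokens.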
Huge thanks to the @vllm_project team, and @robertshaw21 in particular, for all the help along the way.
Also to the llm-d and Dynamo teams for the collaboration on routing and inference.