Today we're releasing prime-rl v0.6.0 — enabling RL at trillion-parameter MoE scale on agentic workloads at the highest efficiency.
We've relentlessly optimized our RL infra.
The result: GLM-5 on agentic SWE tasks at 131k context and sub-5-minute step time.
In RL, inference is the bottleneck — we optimize for throughput, not latency.
High concurrency, FP8 precision, and wide expert parallelism over 32+ GPUs. Every GPU holds its own slice of experts and acts as its own endpoint.
One Mooncake store pools KV cache across all nodes, so any worker can reuse any prefix.
The router picks workers by a score over load, queue depth, KV usage and prefix overlap. You get cross-replica cache hits with balanced routing across the whole deployment.