Prime Intellect: "Today we're releasing prime-rl v0.6.0 — enabling RL at trillion-parameter MoE sc"

Post

Today we're releasing prime-rl v0.6.0 — enabling RL at trillion-parameter MoE scale on agentic workloads at the highest efficiency. We've relentlessly optimized our RL infra. The result: GLM-5 on agentic SWE tasks at 131k context and sub-5-minute step time.

English

921

267.4K

Prime Intellect@PrimeIntellect·1d

In RL, inference is the bottleneck — we optimize for throughput, not latency. High concurrency, FP8 precision, and wide expert parallelism over 32+ GPUs. Every GPU holds its own slice of experts and acts as its own endpoint.

English

5.8K

Prime Intellect@PrimeIntellect·1d

We disaggregate prefill and decode onto separate workers. A long prefill used to stall decode for everyone. Now it doesn't.

English

4.3K

Prime Intellect@PrimeIntellect·1d

One Mooncake store pools KV cache across all nodes, so any worker can reuse any prefix. The router picks workers by a score over load, queue depth, KV usage and prefix overlap. You get cross-replica cache hits with balanced routing across the whole deployment.

English

3.6K

Paylaş