Prime Intellect (@PrimeIntellect)
Today we're releasing prime-rl v0.6.0 — enabling RL at trillion-parameter MoE scale on agentic workloads at the highest efficiency. We've relentlessly optimized our RL infra. The result: GLM-5 on agentic SWE tasks at 131k context and sub-5-minute step time.
In RL, inference is the bottleneck — we optimize for throughput, not latency. High concurrency, FP8 precision, and wide expert parallelism over 32+ GPUs. Every GPU holds its own slice of experts and acts as its own endpoint.
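
As a rough illustration of "every GPU holds its own slice of experts", here is a minimal sketch of contiguous expert placement and token dispatch. The function names, the expert count of 256, and the contiguous layout are assumptions made for the example, not details of prime-rl or GLM-5.

```python
def expert_slices(num_experts, num_gpus):
    """Split expert ids into contiguous, near-equal slices, one per GPU."""
    base, extra = divmod(num_experts, num_gpus)
    slices, start = [], 0
    for rank in range(num_gpus):
        size = base + (1 if rank < extra else 0)  # spread any remainder over the first ranks
        slices.append(range(start, start + size))
        start += size
    return slices


def owner_of(expert_id, slices):
    """Return the rank of the GPU whose slice contains expert_id."""
    for rank, expert_range in enumerate(slices):
        if expert_id in expert_range:
            return rank
    raise ValueError(f"expert {expert_id} is not assigned to any GPU")


def dispatch(token_to_experts, slices):
    """Group (token, expert) pairs by the GPU that owns the expert."""
    per_gpu = {rank: [] for rank in range(len(slices))}
    for token, experts in token_to_experts.items():
        for expert_id in experts:
            per_gpu[owner_of(expert_id, slices)].append((token, expert_id))
    return per_gpu


# Hypothetical sizes: 256 experts spread over 32 GPUs gives 8 experts per GPU.
slices = expert_slices(256, 32)
routed = dispatch({0: [0, 9], 1: [255]}, slices)
```

In this toy layout, token 0 is sent to GPUs 0 and 1 and token 1 to GPU 31; a real deployment would do the exchange with an all-to-all collective rather than Python dictionaries.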
We disaggregate prefill and decode onto separate workers. A long prefill used to stall decode for everyone. Now it doesn't.
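
The stall described here can be shown with a toy tick-based model. This is only a sketch under simplifying assumptions: one in-flight decode request, one token per tick, prefill always taking priority on a shared worker, and no cost for moving KV cache between workers. The function name and parameters are hypothetical.

```python
def simulate(decode_steps_left, prefill_arrivals, colocated):
    """Return the tick at which an in-flight decode request emits its last token.

    decode_steps_left: tokens the request still has to generate (one per tick).
    prefill_arrivals: {tick: cost_in_ticks} for long prompts arriving at that tick.
    colocated: if True, prefill and decode share one worker and prefill runs first.
    """
    tick = 0
    pending_prefill = 0
    while decode_steps_left > 0:
        pending_prefill += prefill_arrivals.get(tick, 0)
        if colocated and pending_prefill > 0:
            pending_prefill -= 1    # this tick is spent on prefill, so decode stalls
        else:
            decode_steps_left -= 1  # decode makes progress (prefill runs elsewhere)
        tick += 1
    return tick


# A 50-tick prefill arrives at tick 2 while a request still has 10 tokens to decode.
shared = simulate(10, {2: 50}, colocated=True)
split = simulate(10, {2: 50}, colocated=False)
```

With these inputs the shared worker finishes the decode at tick 60, while the disaggregated setup finishes at tick 10, because the prefill no longer occupies the decode worker.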
One Mooncake store pools KV cache across all nodes, so any worker can reuse any prefix. The router picks workers by a score over load, queue depth, KV usage and prefix overlap. You get cross-replica cache hits with balanced routing across the whole deployment.
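
A minimal sketch of what a routing score over those four signals could look like. The thread does not give the actual formula, so the linear form, the weights, the `max_queue` normalization, and the `Worker` fields are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Worker:
    name: str
    load: float                     # fraction of compute in use, 0..1
    queue_depth: int                # requests waiting on this worker
    kv_usage: float                 # fraction of KV cache memory in use, 0..1
    cached_prefix: Tuple[int, ...]  # toy stand-in for the token prefix cached locally


def prefix_overlap(prompt, cached):
    """Count how many leading tokens of prompt match the cached prefix."""
    matched = 0
    for a, b in zip(prompt, cached):
        if a != b:
            break
        matched += 1
    return matched


def score(worker, prompt, max_queue=16, weights=(1.0, 1.0, 1.0, 2.0)):
    """Higher is better: reward prefix reuse, penalize load, queueing and KV pressure."""
    w_load, w_queue, w_kv, w_prefix = weights  # hypothetical weights
    overlap = prefix_overlap(prompt, worker.cached_prefix) / max(len(prompt), 1)
    queue = min(worker.queue_depth / max_queue, 1.0)
    return (w_prefix * overlap
            - w_load * worker.load
            - w_queue * queue
            - w_kv * worker.kv_usage)


def pick_worker(workers, prompt):
    """Route the request to the worker with the highest score."""
    return max(workers, key=lambda w: score(w, prompt))


prompt = (1, 2, 3, 4, 5, 6, 7, 8)
busy = Worker("busy", load=0.9, queue_depth=12, kv_usage=0.9, cached_prefix=(1, 2, 3, 4))
idle = Worker("idle", load=0.2, queue_depth=1, kv_usage=0.3, cached_prefix=(1, 2))
chosen = pick_worker([busy, idle], prompt)
```

Here the lightly loaded worker wins even though the busy one has more of the prompt cached, which is the balance the score is meant to strike. With a pooled KV store the overlap term could arguably be weighted lower, since a prefix miss on the local worker may still be served from the shared store.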