Erfan Miahi @erfan_mhi
Pretty wild to see our work on PULSE show up in a real 1T-scale post-training run done by @cursor_ai.
Cursor built Composer 2 in collaboration with Fireworks and trained it across multiple datacenters, getting huge savings by syncing only the weights that actually changed between RL checkpoints. Fireworks reports that more than 98% of BF16 weights can stay bit-identical from one checkpoint to the next, and they cited our paper on this, too.
That is exactly the sparsity pattern we showed in our paper, where we introduced PULSE, a lossless method that makes weight-sync communication for RL training over 100x more efficient. Their system is very close to this idea in practice: exploiting the fact that only a tiny fraction of weights actually change between RL steps.
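To make the savings concrete, here is a toy sketch of the general delta-sync idea (this is my illustration, not PULSE's or Fireworks' actual wire format): treat each checkpoint as an array of BF16 bit patterns, and ship only the indices and new values of the entries that changed.

```python
import numpy as np

def delta_pack(prev, curr):
    """Pack only the weights whose BF16 bits changed since the last checkpoint.

    Illustrative sketch: weights are held as uint16 BF16 bit patterns;
    we ship changed indices (uint32) plus their new values (uint16).
    """
    changed = np.nonzero(prev != curr)[0].astype(np.uint32)
    return changed, curr[changed]

def delta_apply(prev, changed, values):
    """Reconstruct the new checkpoint from the old one plus the delta."""
    out = prev.copy()
    out[changed] = values
    return out

# Toy example: 2% of 100k weights change between checkpoints.
rng = np.random.default_rng(0)
prev = rng.integers(0, 2**16, size=100_000, dtype=np.uint16)
curr = prev.copy()
idx = rng.choice(prev.size, size=2_000, replace=False)
curr[idx] = rng.integers(0, 2**16, size=idx.size, dtype=np.uint16)

changed, values = delta_pack(prev, curr)
full_bytes = curr.nbytes                      # 200,000 bytes for a full sync
delta_bytes = changed.nbytes + values.nbytes  # ~12,000 bytes for the delta
restored = delta_apply(prev, changed, values)
assert np.array_equal(restored, curr)         # lossless reconstruction
```

At 98%+ unchanged weights, even this naive index-plus-value encoding cuts sync traffic by an order of magnitude; smarter encodings of the changed set push it much further.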
The deeper reason for this is not that RL gradients are sparse. They are not; the gradients are still dense. What becomes sparse is the realized weight update. In RL, learning rates are tiny, and Adam's normalized updates keep the step size bounded at roughly the learning rate. Then BF16 adds a hard threshold: with only 7 mantissa bits, an update smaller than about 2^-8 of the weight's magnitude (roughly 0.4%) just rounds away, and the stored weight does not change at all. So from one checkpoint to the next, most of the model literally stays identical.
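You can see the rounding threshold directly. A minimal sketch (my own BF16 emulation via bit masking in NumPy, not any library's official cast): a tiny Adam-scale update vanishes entirely when the weight is re-rounded to BF16, while a larger one survives.

```python
import numpy as np

def to_bf16(x):
    """Round float32 values to BF16 (round-to-nearest-even), kept in a float32 container."""
    u = np.atleast_1d(np.asarray(x, dtype=np.float32)).view(np.uint32)
    lsb = (u >> np.uint32(16)) & np.uint32(1)               # tie-breaking bit for ties-to-even
    rounded = (u + np.uint32(0x7FFF) + lsb) & np.uint32(0xFFFF0000)  # keep top 16 bits
    return rounded.view(np.float32)

w = to_bf16(0.1)                        # a weight as stored in BF16
w_new = to_bf16(w - np.float32(1e-5))   # tiny, Adam-scale update: below half an ULP
w_big = to_bf16(w - np.float32(1e-3))   # larger update: above the rounding threshold

print(bool((w == w_new).all()))  # True  -> the tiny update rounded away, bits unchanged
print(bool((w == w_big).all()))  # False -> this update actually lands
```

A BF16 ULP near 0.1 is about 5e-4, so the 1e-5 step is far below the half-ULP rounding boundary and the stored bits never move; that absorbed-update effect, summed over a trillion parameters, is where the 98% bit-identical figure comes from.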
That is why this is such a useful systems idea. Lower precision like BF16 does not just save compute. It can also save communication: more tiny updates get absorbed, so fewer weights need to be shipped. At that point, compute efficiency and comms efficiency stop being a tradeoff. They start reinforcing each other.
If you want the deeper story on why RL updates get this sparse, the theory behind it, and how to push weight-sync bandwidth down by 100x+, take a look at our paper:
arxiv.org/pdf/2602.03839
The Fireworks blog on Composer 2 that cited our work:
fireworks.ai/blog/frontier-…
The animation is taken from Fireworks!