
A fun fact is that the baseline here is fully asynchronous RL with a tuned train/inference layout!
Applied Compute (@appliedcompute)
Not all RL rollouts are equally informative. On a problem with a 10% success rate, each success is 81x more valuable than each failure. We built an algorithm to exploit this, training only on the most informative samples. The result was an improvement in compute efficiency, with held-out evaluation metrics increasing faster over time.
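The post does not spell out how informativeness is scored, but one reading consistent with the 81x figure is to score each rollout by its squared advantage: with success rate p = 0.1, a success has advantage (1 - p) = 0.9 and a failure has advantage -p = -0.1, so the squared-advantage ratio is (0.9/0.1)^2 = 81. A minimal sketch under that assumption (the function names `informativeness` and `select_top` and the `frac` parameter are hypothetical, not from the post):

```python
import numpy as np

def informativeness(rewards, p_success):
    # Advantage of each binary-reward rollout relative to the
    # success rate: +(1 - p) for a success, -p for a failure.
    adv = np.asarray(rewards, dtype=float) - p_success
    # Assumed score: squared advantage. At p = 0.1 this makes a
    # success (0.9 / 0.1)**2 = 81x more informative than a failure.
    return adv ** 2

def select_top(rollouts, rewards, p_success, frac=0.25):
    # Keep only the top `frac` of rollouts by informativeness,
    # so training compute is spent on the most informative samples.
    scores = informativeness(rewards, p_success)
    k = max(1, int(len(rollouts) * frac))
    top = np.argsort(scores)[::-1][:k]
    return [rollouts[i] for i in top]

# One success among ten rollouts at p = 0.1:
scores = informativeness([1, 0, 0, 0, 0, 0, 0, 0, 0, 0], 0.1)
print(scores[0] / scores[1])  # ratio of success score to failure score, ~81
```

This is only an illustration of advantage-based sample selection; the actual algorithm, scoring rule, and selection fraction used by Applied Compute are not described in the post.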