

Brian Bartoldson
277 posts




We have been pushing the limits of test-time scaling with RSA for single-turn reasoning problems in science and math. Check out our blog post with new results on ARC-AGI-2, ArXivMath, and FrontierScience! A lot of gains with just test-time scaling! rsa-llm.github.io/blog




Impressed to see that Sonnet 4.6 meets or exceeds Opus 4.5 capabilities across Bio tasks!




Does LLM RL post-training need to be on-policy?

⏳ Traditional RL runs slow because on-policy training has searcher and trainer processes waiting for each other. TBA decouples these processes to go fast:
- Multiple searchers generate LLM outputs constantly
- A trainer learns asynchronously from the generated off-policy data
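The decoupled searcher/trainer pattern above can be sketched with a shared queue. This is a minimal toy, not TBA's implementation: the "policy" is a version counter rather than an LLM, and names like `replay_buffer` and `NUM_SEARCHERS` are illustrative assumptions.

```python
import queue
import random
import threading

# Toy sketch of decoupled async RL: searchers produce data continuously,
# a trainer consumes it without the two ever waiting in lockstep.
replay_buffer = queue.Queue(maxsize=100)  # holds (sample, behavior_policy_version)
NUM_SEARCHERS = 4
TOTAL_TRAIN_STEPS = 50
policy_version = 0
stop = threading.Event()

def searcher():
    # Generates outputs with whatever policy snapshot it currently has;
    # it never blocks on the trainer, so its data may be off-policy.
    while not stop.is_set():
        sample = random.random()
        try:
            replay_buffer.put((sample, policy_version), timeout=0.1)
        except queue.Full:
            pass

def trainer():
    # Learns asynchronously from whatever the searchers have produced.
    global policy_version
    for _ in range(TOTAL_TRAIN_STEPS):
        sample, behavior_version = replay_buffer.get()
        # Off-policy gap: policy_version may be ahead of behavior_version.
        policy_version += 1
    stop.set()

searchers = [threading.Thread(target=searcher) for _ in range(NUM_SEARCHERS)]
t = threading.Thread(target=trainer)
for th in searchers:
    th.start()
t.start()
t.join()
for th in searchers:
    th.join()
print("final policy version:", policy_version)
```

The trainer only ever waits when the buffer is momentarily empty; the searchers only wait when it is full, so neither process idles for a full synchronous round-trip.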

🔬 Ablation 1: KL reference policy
Why are both TBA and Kimi K2 performant on highly off-policy data? Both use KL regularization, and both compute KL against a moving reference policy: TBA resets it every ρ=50 steps (flexible), while Kimi K2 uses the inference policy as the reference.


Since our March paper, methods like Kimi K2’s, CISPO, Dr. GRPO, and IcePop have pushed LLM RL forward. We compared async LLM RL performance and found TBA is still at the top (though it now has company). Two small changes get you TBA, starting from Dr. GRPO: (1) add the KL estimate to the reward, and (2) reset the reference policy periodically. The math is below; note that our experiments hold IS clipping and loss normalization strategies constant.
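The two changes can be sketched in a few lines. This is a hedged illustration in pure Python over per-token log-probs, not the paper's code: the function names, the β value, and the choice of the k3 estimator here are my assumptions.

```python
import math

def k3_kl_estimate(logp_policy, logp_ref):
    # Per-token k3 estimator of KL(policy || ref): exp(r) - r - 1
    # with r = logp_ref - logp_policy; nonnegative and low-variance.
    total = 0.0
    for lp, lr in zip(logp_policy, logp_ref):
        r = lr - lp
        total += math.exp(r) - r - 1.0
    return total / len(logp_policy)

def regularized_reward(task_reward, logp_policy, logp_ref, beta=0.1):
    # Change (1): fold the KL estimate into the scalar reward,
    # rather than adding a separate KL term to the loss.
    return task_reward - beta * k3_kl_estimate(logp_policy, logp_ref)

def maybe_reset_reference(step, policy_params, ref_params, rho=50):
    # Change (2): every rho steps, snapshot the current policy as the
    # new KL reference (a moving reference rather than a frozen one).
    if step % rho == 0:
        return dict(policy_params)
    return ref_params
```

With a moving reference, the KL penalty anchors the trainer to a recent policy instead of the initial one, which is what lets learning remain stable on highly off-policy data.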







LOTs of discourse lately about the correctness of the KL-regularization term used in RLVR fine-tuning of LLMs. Which estimator to use? Whether to add it to the reward or loss? What’s even the difference? 🤔 In our new preprint, we evaluate these choices empirically. 🧵 1/n
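The estimator question can be made concrete with a toy Monte-Carlo check. The k1/k2/k3 naming follows John Schulman's "Approximating KL Divergence" note; the Gaussians below are toy assumptions, not the paper's experimental setup.

```python
import math
import random

def kl_estimators(logq, logp):
    # Single-sample estimators of KL(q || p), with x drawn from q.
    d = logp - logq          # log r, where r = p(x)/q(x)
    r = math.exp(d)
    k1 = -d                  # unbiased, high variance, can go negative
    k2 = 0.5 * d * d         # biased, low variance, always >= 0
    k3 = (r - 1.0) - d       # unbiased, low variance, always >= 0
    return k1, k2, k3

# Toy check: q = N(0, 1), p = N(0.1, 1); true KL(q||p) = 0.1**2 / 2 = 0.005.
def logpdf(x, mu):
    return -0.5 * (x - mu) ** 2 - 0.5 * math.log(2 * math.pi)

random.seed(0)
n = 200_000
sums = [0.0, 0.0, 0.0]
for _ in range(n):
    x = random.gauss(0.0, 1.0)
    for i, k in enumerate(kl_estimators(logpdf(x, 0.0), logpdf(x, 0.1))):
        sums[i] += k
print([s / n for s in sums])  # each mean should land near 0.005
```

All three means converge near the true KL here, but they differ in bias, variance, and sign, which is exactly why the reward-vs-loss placement and the estimator choice interact in practice.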




NO verifiers. NO tools. Qwen3-4B-Instruct can match DeepSeek-R1 and o3-mini (high) with ONLY test-time scaling. Presenting Recursive Self-Aggregation (RSA) — the strongest test-time scaling method I know of! Then we use aggregation-aware RL to push further!! 📈📈 🧵below!
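The aggregate-and-recurse idea can be sketched as a population loop. This is a toy stand-in, not the RSA implementation: real RSA would call an LLM in `generate` and `aggregate`, and the population and subset sizes here are illustrative assumptions.

```python
import random

random.seed(0)

def generate(problem, n):
    # Stand-in for sampling n independent LLM solutions to `problem`;
    # here a solution is just a score in [0, 1].
    return [random.random() for _ in range(n)]

def aggregate(problem, subset):
    # Stand-in for an aggregation prompt that reads several candidate
    # solutions and writes one improved solution; here: keep the best.
    return max(subset)

def rsa(problem, population_size=8, subset_size=4, rounds=3):
    # Recursive Self-Aggregation: each round, every new candidate
    # aggregates a random subset of the previous population.
    population = generate(problem, population_size)
    for _ in range(rounds):
        population = [
            aggregate(problem, random.sample(population, subset_size))
            for _ in range(population_size)
        ]
    return max(population)

best = rsa("toy problem")
print(best)
```

No verifier appears anywhere in the loop: quality improves only because each round conditions new candidates on several old ones, which is the test-time-scaling knob (more rounds, larger populations) the thread describes.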