sid bharthulwar
@bharthulwar

Here are some research directions I enjoyed at #NeurIPS (will compile some more soon!)

Bootstrapping long-horizon reasoning: Recent work [1, 2] shows we can train LLMs on short-step problems and curriculum-train them into much longer chains. By composing simple problems into multi-step tasks and using outcome-only rewards, models learned to solve much harder problems. This suggests an efficient path to scaling deep reasoning; I'd love to see it extend beyond verifiable domains.

Reward shaping and PRMs: To get better reasoning, we need rewards beyond basic task completion. Posterior-GRPO uses process-based rewards in code generation, outperforming ORM-based RL [3]; RL-Tango co-trains an LLM PRM with the generator to achieve SOTA on maths benchmarks [4]; ToolRL focuses on PRMs for tool usage [5].

RL on non-verifiable tasks: I saw a really nice transition from verifiable tasks (maths/code) to more open-ended objectives (dialogue, automation, etc.). One interesting trend here is using offline RL for non-verifiable rewards and online RL for verifiable rewards [6]. I would have loved to see more work on online RL for non-verifiable rewards [7].

Science behind RL: There are a lot of interesting questions about what capabilities RL is eliciting in LLMs. [8] questions whether RL adds any reasoning capacity beyond the base model; [9] examines mechanisms to actively elicit meta-cognition to overcome these limitations. I'd love to see more critical examination of the science behind RL.
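The ORM-vs-PRM distinction above can be made concrete with a toy GRPO-style advantage computation. This is purely an illustrative sketch, not the method from [3] or [4]: the function name, the `beta` blend, and the inputs are all my own assumptions.

```python
import numpy as np

def grpo_advantages(outcome_rewards, process_scores, beta=0.5):
    """Toy GRPO-style advantages for a group of sampled completions,
    blending a single outcome reward (ORM view) with the mean of
    per-step process scores (PRM view). Illustrative only."""
    outcome = np.asarray(outcome_rewards, dtype=float)
    process = np.array([np.mean(s) for s in process_scores])
    blended = (1.0 - beta) * outcome + beta * process
    # Group-relative normalization: each completion is compared to the
    # group mean instead of a learned value baseline.
    return (blended - blended.mean()) / (blended.std() + 1e-8)
```

With outcome-only rewards (`beta=0`), two completions that both fail get identical advantages; a process signal can still separate them by how sound their intermediate steps were.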
[1] H1 by @sumeetrm, @philiptorr, @riashatislam, @sytelus, @casdewitt, @CharlieLondon02
[2] Reasoning Curriculum by @bo_pang0, @silviocinguetta, @CaimingXiong, @yingbozhou_ai
[3] Posterior-GRPO by @MouxiangC, @Zhongxin_Liu
[4] RL-Tango by @KaiwenZha, @ZhengqiGao, @maohaos2, @ZhangWeiHong9, @dina_katabi
[5] ToolRL by @emrecanacikgoz, @qiancheng1231, @dilekhakkanitur, @tur_gokhan, @hengjinlp
[6] Writing Zero (not in NeurIPS) by @YunyiYang2
[7] JEPO by @robinphysics, @sidawxyz, @louvishh
[8] Does RL Incentivize Reasoning by @YangYue_THU, @RayLu_THU, @_AndrewZhao
[9] ReMA by @raywzy1, @MarkSchmidtUBC, @seawan, @linyi_yang

GPU-parallelized envs have accelerated RL, but most implementations exhibit critical instability when running on-policy RL with short rollouts. We present Staggered Environment Resets. A few lines of code are all you need! Presenting today, 4:30PM, poster 310 #NeurIPS2025 🧵(1/8)
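The poster has the actual method; as a minimal sketch of the general idea, assuming a hypothetical gym-style batched-env API (`DummyEnv` and the wrapper are my own illustrative names): give each env in the batch a random offset into its episode, so time-limit resets spread out across rollout steps instead of all firing on the same step.

```python
import numpy as np

class DummyEnv:
    """Minimal stand-in for a gym-style env (hypothetical API)."""
    def reset(self):
        return 0  # observation
    def step(self, action):
        return 0, 0.0, False, {}  # obs, reward, done, info

class StaggeredResetWrapper:
    """Wrap a batch of envs and give each a random remaining-step
    budget, so resets are desynchronized across the batch (a sketch
    of the idea, not the paper's implementation)."""
    def __init__(self, envs, max_episode_steps, seed=0):
        self.envs = envs
        self.max_episode_steps = max_episode_steps
        rng = np.random.default_rng(seed)
        # Random initial budgets stagger the first wave of resets.
        self.steps_left = rng.integers(1, max_episode_steps + 1, size=len(envs))

    def reset_all(self):
        return [env.reset() for env in self.envs]

    def step(self, actions):
        obs = []
        for i, (env, a) in enumerate(zip(self.envs, actions)):
            o, _, done, _ = env.step(a)
            self.steps_left[i] -= 1
            if done or self.steps_left[i] <= 0:
                # Reset this env alone; the others keep running, so
                # the batch never resets in lockstep.
                o = env.reset()
                self.steps_left[i] = self.max_episode_steps
            obs.append(o)
        return obs
```

Without the stagger, short on-policy rollouts from a freshly reset batch all sample the same early-episode states, which is one plausible source of the instability the thread describes.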



















