

Vignesh Baskaran

@tweetvbaskaran
Accelerating Super Intelligence | Building AI that builds AI | CTO and Cofounder Hexo AI

🚨How do we improve long-horizon reasoning capabilities by scaling RL with only existing data? Introducing our new paper: "h1: Bootstrapping LLMs to Reason over Longer Horizons via Reinforcement Learning"🫡

> RL on existing datasets saturates very quickly
> Reasoning over complex interdependent problems is incredibly important, but we currently lack enough long-horizon reasoning data
> Long-horizon problems are hard, which means training signal is sparse. We’d need a way to provide dense supervision

Our solution composes existing short-horizon data to form a synthetic curriculum that keeps growing in complexity! This allows us to scale RL on the same dataset while avoiding saturation, with the curriculum acting as dense rewards.

At a small scale, we see massive in-domain long-horizon improvements, which transfer to significantly harder benchmarks. Training on composed 6th grade math problems leads to strong gains on AIME!

1/N🤿🧵
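As a rough illustration of the composition idea (not the paper's actual pipeline — function names and the chaining scheme here are my own toy assumptions), one plausible way to build long-horizon problems from short-horizon ones is to chain them so each sub-problem's answer becomes the next sub-problem's input, then grow the chain length as a curriculum:

```python
import random

def make_subproblem(rng):
    """Toy stand-in for an existing short-horizon problem:
    'add a small number'. Returns a solvable one-step task."""
    delta = rng.randint(1, 9)
    return lambda x: x + delta

def compose_chain(horizon, seed=0):
    """Compose `horizon` short-horizon tasks into one long-horizon
    problem: each step's answer feeds the next step, so solving the
    composed problem requires getting every intermediate step right."""
    rng = random.Random(seed)
    steps = [make_subproblem(rng) for _ in range(horizon)]

    def long_problem(x0):
        x = x0
        for step in steps:
            x = step(x)
        return x

    return long_problem

# Curriculum sketch: reuse the same short-horizon pool while the
# composed horizon keeps growing, so RL training data never saturates.
curriculum = [compose_chain(h, seed=h) for h in (1, 2, 4, 8)]
```

Because every composed problem is built from sub-problems with known answers, intermediate steps can in principle be checked too, which is one way a curriculum like this could act as a denser reward signal than a single sparse end-of-task check.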







