
Today we're launching Horizon: the first long-horizon learning benchmark made from real agent logs.
Read more below ⬇️⬇️
Bryan (@bryan_houlton)
Introducing Horizon from @0rinlabs: the first long-horizon learning benchmark made from real agent logs
- SOTA is 21% on the hardest section
- 7-35M tokens of real agent history per task
- Models are hardly getting better on the hardest tasks
- Humans can score 100%
(1/7)