

Antaripa Saha (@doesdatmaksense):
this diagram took me a good 1.5 hours😭





What if understanding a video was more like navigating a map?🤔 And what if that made compute scale logarithmically (not linearly) with video length?! New preprint🎉: 🗺️VideoAtlas: Navigating Long-Form Video in Logarithmic Compute
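A hedged sketch of where logarithmic scaling can come from, as a generic hierarchical-navigation illustration, not necessarily VideoAtlas's actual method (the preprint's details aren't in this post, and `Node` and `choose_branch()` are hypothetical names): if clips are pre-summarized into a balanced binary tree, answering a query can walk a single root-to-leaf path, one model call per level, i.e. O(log n) calls for n clips.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    summary: str                      # text summary of the video span covered
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def choose_branch(query: str, left: Node, right: Node) -> Node:
    """Placeholder for one model call that picks the more relevant half."""
    raise NotImplementedError

def navigate(root: Node, query: str) -> tuple[str, int]:
    node, calls = root, 0
    while node.left is not None and node.right is not None:
        node = choose_branch(query, node.left, node.right)  # one call per level
        calls += 1
    # calls == tree depth ≈ log2(#leaf clips), instead of one pass per clip
    return node.summary, calls
```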

Demis Hassabis at YC today: "We're only one or two technical breakthroughs away from AGI. But all the other parts are already in place."


Recursive Language Models (RLMs) have been floating around for a couple of months, but in the last two weeks the discussion has picked up fast, especially alongside ideas like GEPA.

The issue they're trying to address isn't new. When you build agents, the context window becomes a bottleneck pretty quickly, and packing more into the prompt leads to context rot; we know it well. RLMs take a different angle. Instead of treating the input as a fixed blob of text, the model treats it more like an environment it can explore. You give the root model something like a REPL, and now it doesn't have to read everything upfront: it can decide what's worth inspecting and make recursive sub-calls when needed (see the sketch below). So instead of one big forward pass, you get structured computation.

The paper shows RLMs handling inputs up to two orders of magnitude beyond the context window. But on simple retrieval, i.e. "needle in a haystack", there's basically no difference from standard models; the gap only appears once the context gets large (around 16K tokens and beyond), which is expected. Things change with tasks like OOLONG, where the model has to aggregate information across many entries (linear complexity). Vanilla models degrade steadily as the input grows, while RLMs hold up much better: on OOLONG at 132K tokens, base GPT-5 scores 44% while the RLM scores 56.5%, a ~28% relative improvement. Another breakpoint shows up with OOLONG-Pairs, which requires pairwise comparisons (quadratic complexity). Standard models are essentially at zero; RLMs get to ~58% F1. This isn't surprising, since these tasks can't be done in a single forward pass and attention isn't designed for that. On deeper research-style tasks (like browsing large document sets), RLMs also show strong gains, both in accuracy and token efficiency.

One of the more interesting side effects is what people are calling "small-model inversion": with the right recursive setup, smaller models can outperform larger ones on long-context reasoning. There are cases where a GPT-5-mini-based RLM beats GPT-5 on harder splits, and where smaller fine-tuned models outperform much larger ones on million-token tasks. That suggests the bottleneck isn't just model size.

The main thing to keep in mind is that RLMs aren't universally better. On short, simple tasks they don't really add value, but as context length and reasoning complexity increase, the advantage becomes hard to ignore. The OOLONG-Pairs result (~58% vs <0.1% F1) is probably the clearest signal: once a task requires structured computation rather than just pattern matching, giving the model the ability to act over the context changes what it can do.
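To make the REPL idea concrete, here is a minimal sketch. The paper's actual interface is a Python environment over the context; `llm()` is a placeholder for any chat-completion client, and the PEEK/GREP/RECURSE command set is made up for illustration, not the paper's API.

```python
import re

def llm(prompt: str) -> str:
    """Placeholder for a real chat-completion call; swap in any client."""
    raise NotImplementedError

def run_rlm(query: str, context: str, depth: int = 0, max_depth: int = 2) -> str:
    # The root model only ever sees the context's size plus whatever it
    # chooses to inspect, never the whole thing in one prompt.
    state = f"(context is {len(context)} chars; nothing inspected yet)"
    for _ in range(8):  # cap the number of exploration steps
        action = llm(
            f"Query: {query}\nObservation: {state}\n"
            "Respond with exactly one command:\n"
            "PEEK <start> <end>    view a slice of the context\n"
            "GREP <pattern>        list match offsets in the context\n"
            "RECURSE <start> <end> answer the query over a slice via a sub-call\n"
            "ANSWER <text>         give the final answer"
        )
        if action.startswith("PEEK"):
            _, a, b = action.split()
            state = context[int(a):int(b)][:2000]  # cap what re-enters the prompt
        elif action.startswith("GREP"):
            pattern = action[len("GREP "):]
            state = str([m.start() for m in re.finditer(pattern, context)][:20])
        elif action.startswith("RECURSE") and depth < max_depth:
            _, a, b = action.split()
            # recursive sub-call over a slice: structured computation
            # instead of one giant forward pass over the whole input
            state = run_rlm(query, context[int(a):int(b)], depth + 1, max_depth)
        elif action.startswith("ANSWER"):
            return action.removeprefix("ANSWER").strip()
        else:
            state = "(unrecognized command or recursion depth exceeded)"
    return state
```

The point of the sketch is that cost tracks what the model chooses to read, which is how a quadratic task like OOLONG-Pairs can be decomposed into many small sub-calls instead of one attention pass.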













For agentic tasks, Oracle-level performance is the maximum performance a system can achieve, assuming it retrieves all relevant documents perfectly, every time. We're proud to show that Mixedbread Search approaches the Oracle on multiple knowledge-intensive benchmarks.
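A sketch of what the Oracle baseline means, assuming a QA-style benchmark (`retriever`, `answer`, and `score` are hypothetical callables, not Mixedbread's API): run the same reader once with retrieved documents and once with the gold documents, and the gold-document run upper-bounds whatever better retrieval could buy.

```python
def oracle_gap(examples, retriever, answer, score):
    """Compare a retrieval pipeline against its oracle on QA examples."""
    retrieved_acc = oracle_acc = 0.0
    for ex in examples:
        q, gold = ex["question"], ex["gold_answer"]
        retrieved_acc += score(answer(q, retriever(q)), gold)
        oracle_acc += score(answer(q, ex["gold_docs"]), gold)  # perfect retrieval
    n = len(examples)
    # "approaching the Oracle" means the first number nears the second:
    # retrieval has stopped being the bottleneck
    return retrieved_acc / n, oracle_acc / n
```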



Behind the scenes for our memory benchmark (we knew)


we need a new memory benchmark. 3 teams now all reporting 95%+ on LongMemEval
