
Think streaming video AI needs heavy memory tricks? Think again.
SimpleStream—a dead-simple sliding window feeding just the last 4 frames to an off-the-shelf vision-language model—just beat 13 major streaming models on two public leaderboards.
Results:
— 67.7% on OVO-Bench (+8.5 pts over HERMES, the prior SOTA)
— 80.6% on StreamingBench
— Lower GPU memory than prior systems (≤18 GB), with sub-40 ms response latency.
Adding more frames or memory? Often *worse* for real-time perception. The big insight: optimal window size depends on model backbone, not just scale. And more history trades off recall for present-scene accuracy.
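The whole idea fits in a few lines. A minimal sketch of that sliding window (names and the fake model interface are illustrative, not from the paper):

```python
from collections import deque

WINDOW = 4  # the reported window size; the optimal value depends on the backbone

class SlidingWindowStreamer:
    """No memory bank, no compression: just the last N frames."""

    def __init__(self, vlm, window=WINDOW):
        self.vlm = vlm                   # any off-the-shelf vision-language model
        self.buf = deque(maxlen=window)  # older frames are evicted automatically

    def ingest(self, frame):
        self.buf.append(frame)

    def query(self, question):
        # Every query sees only the current window, so latency and
        # memory stay flat no matter how long the stream runs.
        return self.vlm(list(self.buf), question)
```

That fixed-size deque is the entire "memory system" — which is exactly why the result is surprising.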
The bar for “progress” just got higher: new systems must beat this minimalist baseline under identical conditions, and benchmarks should separate perception from memory.
Get the full analysis here: yesnoerror.com/abs/2604.02317
// alpha identified
// $YNE