Today we release Lighthouse Attention, a selection-based hierarchical attention for long-context pre-training that delivers a 1.4-1.7× wall-clock speedup at 98K context.
It runs the same forward+backward pass ~17× faster than standard attention at 512K context on a single B200, without a custom sparse attention kernel, a straight-through estimator, or an auxiliary loss.
During training, queries, keys, and values are pooled symmetrically into a multi-resolution pyramid. We score the pyramid at every level, a top-k cascade selects a small hierarchical dense sub-sequence, and after a sorting pass that enforces causality, standard attention performs the token mixing. A brief full-attention resume at the end converts the checkpoint back into a competent dense-attention model.
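The pipeline above can be sketched for a single query as follows. This is a minimal, hypothetical NumPy illustration of the select-then-attend idea only (pool into a pyramid, score blocks, top-k select, sort, dense attention); block sizes, the pooling operator, the cascade details, and all names are assumptions, not the released implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def lighthouse_sketch(q, k, v, levels=2, top_k=2, block=4):
    """Hypothetical sketch: q is (d,), k and v are (T, d).

    1. Mean-pool keys into a multi-resolution pyramid of block summaries.
    2. Score the query against each block; keep top_k blocks per level.
    3. Gather the selected blocks' tokens, sort indices to restore causal
       order, then run standard dense attention on that sub-sequence.
    (Assumes q is the last position, so all gathered tokens are causal.)
    """
    T, d = k.shape
    selected = set()
    for lvl in range(levels):
        b = block * (2 ** lvl)                 # coarser blocks at higher levels
        n_blocks = T // b
        pooled = k[: n_blocks * b].reshape(n_blocks, b, d).mean(axis=1)
        scores = pooled @ q                    # one relevance score per block
        for blk in np.argsort(scores)[-top_k:]:
            selected.update(range(blk * b, (blk + 1) * b))
    idx = np.array(sorted(selected))           # sorting pass: causal order
    att = softmax((k[idx] @ q) / np.sqrt(d))   # standard attention on the
    return att @ v[idx], idx                   # small dense sub-sequence
```

Because the final mixing step is plain dense attention over a short sorted sub-sequence, no custom sparse kernel is needed, consistent with the claim above.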
We validated this with 530M-parameter Llama-3 models trained on 50B tokens, with benchmarks up to 1M tokens run across 32 B200s under context parallelism.
The work on Lighthouse Attention was led by @bloc97_, @SubhoGhosh02, and @theemozilla.
