

Massively Parallel Procrastinator
@SHELLEYBLEND
Shelley the blender (∂ + m) ψ = 0 Quantum Entanglement






This is Alex Finn. He's cost so many people their hard-earned money with his Mac mini grift. Now he'll reuse the same script for the DGX Spark. He doesn't know how to highlight any hardware strengths or weaknesses. Zero substance or knowledge of local AI. Go grift something else.



Do you know which agent skills are useless in your setup? The ones you installed just in case and have probably forgotten about by now?

A Lighthouse layer replaces standard scaled dot-product attention with four stages that surround, but do not modify, the attention kernel. Q, K, and V are average-pooled into an L-level pyramid with pooling factor p. Per-head norms score every pyramid entry, and a coarse-to-fine top-k cascade selects survivors at each level. The chosen entries are gathered into a contiguous, causally sorted sub-sequence on which standard attention runs, and the outputs are scattered back to their base positions. Because the gathered sub-sequence is dense and topologically causal, the standard lower-triangular mask works as-is, the forward and backward pass of the attention itself remains unchanged, and every upstream attention improvement is inherited for free.

The trained model has to remain a competent dense-attention model after sparse training, so the recipe is two-stage. For the majority of the budget, the model trains with Lighthouse selection enabled. For a brief tail, selection is disabled and training continues under standard attention, with the same optimizer state and the dataloader continuing where it left off. We treat this as the load-bearing claim of the work: sparse training does not compromise the model's ability to use full attention at inference.
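As a rough illustration of the four stages (pool, score, cascade, gather-attend-scatter), here is a minimal PyTorch sketch. It is not the authors' code or kernel: it averages the per-head scores into one shared ranking, resolves survivors all the way down to base-level token indices instead of keeping multi-resolution pyramid entries, and lets unselected positions fall back to their own value vectors. The name `lighthouse_attention`, the `k_per_level` budget, and those simplifications are assumptions for illustration only.

```python
# Minimal sketch of the selection pipeline described above, not the authors'
# implementation. Simplifying assumptions: selection is shared across queries
# and heads, survivors are resolved to base-level token indices, and
# unselected positions keep their own value vector.
import torch
import torch.nn.functional as F


def lighthouse_attention(q, k, v, p=4, L=3, k_per_level=64):
    B, H, S, D = q.shape
    device = q.device
    assert S % (p ** L) == 0, "sketch assumes the sequence divides the coarsest pool"

    # 1. Average-pool K into an L-level pyramid (the full method pools Q and V
    #    the same way; only K norms are needed for scoring in this sketch).
    pyramid = [k]
    for _ in range(L):
        prev = pyramid[-1]
        pyramid.append(prev.reshape(B, H, prev.shape[2] // p, p, D).mean(dim=3))

    # 2./3. Coarse-to-fine top-k cascade on key norms (per-head norms averaged
    #       into one shared ranking here). Start with every entry at the
    #       coarsest level, keep the top-k, expand each survivor into its p
    #       children one level down, and repeat until the base level.
    survivors = torch.arange(pyramid[-1].shape[2], device=device)
    for level in range(L, 0, -1):
        scores = pyramid[level].norm(dim=-1).mean(dim=(0, 1))   # (entries,)
        keep = min(k_per_level, survivors.numel())
        kept = survivors[scores[survivors].topk(keep).indices]
        # expand survivors to their children at the next finer level
        survivors = (kept.unsqueeze(1) * p + torch.arange(p, device=device)).reshape(-1)

    # final prune at the base level
    base_scores = pyramid[0].norm(dim=-1).mean(dim=(0, 1))
    keep = min(k_per_level, survivors.numel())
    survivors = survivors[base_scores[survivors].topk(keep).indices]

    # 4. Gather a contiguous, causally sorted sub-sequence.
    idx = survivors.sort().values
    q_s, k_s, v_s = q[:, :, idx], k[:, :, idx], v[:, :, idx]

    # 5. Standard causal attention on the dense sub-sequence; the usual
    #    lower-triangular mask is valid because positions are sorted.
    out_s = F.scaled_dot_product_attention(q_s, k_s, v_s, is_causal=True)

    # 6. Scatter outputs back to their base positions; unselected positions
    #    fall back to their own value vector in this sketch.
    out = v.clone()
    out[:, :, idx] = out_s
    return out


if __name__ == "__main__":
    q, k, v = (torch.randn(2, 4, 1024, 64) for _ in range(3))
    print(lighthouse_attention(q, k, v).shape)   # torch.Size([2, 4, 1024, 64])
```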

AI teams shouldn’t have to choose between expensive object storage and painful git workflows.

@huggingface Storage is built for model weights, datasets, checkpoints, and artifacts:
- simple per-TB pricing
- built-in CDN
- Xet deduplication
- private by default when needed

Store your AI data where your AI work already happens: huggingface.co/storage







Today we release Lighthouse Attention, a selection-based hierarchical attention for long-context pre-training that delivers a 1.4-1.7× wall-clock speedup at 98K context. It runs the same forward+backward pass ~17× faster than standard attention at 512K context on a single B200, without a custom sparse attention kernel, a straight-through estimator, or an auxiliary loss. During training, queries, keys, and values are pooled symmetrically into a multi-resolution pyramid. We then score every pyramid entry with per-head norms, a top-k cascade selects a small hierarchical dense sub-sequence, and after a sorting pass that enforces causality, we use standard attention for token mixing. A brief full-attention resume at the end converts the checkpoint back into a competent dense-attention model. We validated this using 530M-parameter Llama-3 models trained on 50B tokens, with benchmarks at up to 1M-token context across 32 B200s under context parallelism. The work on Lighthouse Attention was led by @bloc97_, @SubhoGhosh02, and @theemozilla.
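Since the full-attention resume is what keeps the checkpoint usable as a dense-attention model, here is a toy sketch of that two-stage schedule, not the released training code. The `ToyBlock` module, the 95/5 split, the norm-based top-k stand-in for Lighthouse selection, and the synthetic loss are all assumptions for illustration; the posts only say selection is on for most of the budget and switched off for a brief tail while the optimizer state and dataloader carry on unchanged.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyBlock(nn.Module):
    """Toy causal attention block with a switchable selection path."""

    def __init__(self, dim=64, heads=4, keep=32):
        super().__init__()
        self.h, self.keep = heads, keep
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, sparse=True):
        B, S, D = x.shape
        q, k, v = self.qkv(x).view(B, S, 3, self.h, D // self.h).permute(2, 0, 3, 1, 4)
        if sparse:
            # crude stand-in for Lighthouse selection: keep the positions with
            # the largest mean key norm, in ascending (causal) order
            idx = k.norm(dim=-1).mean(dim=(0, 1)).topk(self.keep).indices.sort().values
            out_sel = F.scaled_dot_product_attention(
                q[:, :, idx], k[:, :, idx], v[:, :, idx], is_causal=True
            )
            out = torch.zeros_like(q)
            out[:, :, idx] = out_sel          # scatter back to base positions
        else:
            # stage 2: plain dense causal attention with the same weights
            out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.proj(out.transpose(1, 2).reshape(B, S, D))


model = ToyBlock()
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
total_steps, sparse_fraction = 200, 0.95      # assumed split for the "brief tail"

for step in range(total_steps):
    x = torch.randn(4, 128, 64)
    # one optimizer state and one data stream throughout; only the attention
    # path flips from selection to full attention for the final steps
    out = model(x, sparse=step < sparse_fraction * total_steps)
    loss = out.pow(2).mean()                  # synthetic objective for the toy
    loss.backward()
    opt.step()
    opt.zero_grad()
```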






One of my favorite submissions from our hackathon. I especially loved the part of the intro video between 00:20 and 00:30; it made me laugh out loud. It seems like the quality of submissions just goes up each time we do one of these ^^.













