

Eric Alcaide

@eric_alcaide
Design is not finished. Common prosperity. LLMaxxing @poolsideai





Today we release Token Superposition Training (TST), a modification to the standard LLM pretraining loop that produces a 2-3× wall-clock speedup at matched FLOPs without changing the model architecture, optimizer, tokenizer, or training data. During the first third of training, the model reads and predicts contiguous bags of tokens, averaging their embeddings on the input side and predicting the next bag with a modified cross-entropy on the output side. For the remainder of the run, it trains normally on next-token prediction. The inference-time model is identical to one produced by conventional pretraining. Validated at 270M, 600M, and 3B dense scales, and at 10B-A1B MoE. The work on TST was led by @bloc97_, @gigant_theo, and @theemozilla.
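A rough sketch of the two-phase idea described above, in plain Python. The post does not spell out the "modified cross-entropy", so the loss here is an assumption: cross-entropy against a uniform soft target over the tokens of the next bag. The helper names (`bag_inputs`, `bag_cross_entropy`, `use_bags`) and the bag size are hypothetical and are not taken from the TST release.

```python
import math

def bag_inputs(token_ids, embedding, bag_size):
    """Split tokens into contiguous bags and average each bag's embeddings.

    `embedding` is a list of per-token vectors (a toy embedding table).
    A trailing partial bag is dropped for simplicity (assumption).
    """
    dim = len(embedding[0])
    last_start = len(token_ids) - bag_size + 1
    bags = [token_ids[i:i + bag_size] for i in range(0, last_start, bag_size)]
    averaged = [
        [sum(embedding[t][d] for t in bag) / bag_size for d in range(dim)]
        for bag in bags
    ]
    return bags, averaged

def bag_cross_entropy(logits, next_bag):
    """Cross-entropy against a uniform target over the next bag's tokens.

    This loss form is an assumption, not the published TST loss.
    With a bag of size 1 it reduces to ordinary next-token cross-entropy.
    """
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return -sum(logits[t] - log_z for t in next_bag) / len(next_bag)

def use_bags(step, bag_steps):
    """True while training is still in the initial bagged phase."""
    return step < bag_steps

# Toy usage: vocabulary of 4 tokens, 2-dimensional embeddings, bags of 2.
embedding = [[0.0, 0.0], [2.0, 4.0], [4.0, 0.0], [6.0, 2.0]]
bags, averaged = bag_inputs([1, 2, 3, 0, 1], embedding, bag_size=2)
loss = bag_cross_entropy([0.0, 0.0, 0.0, 0.0], bags[1])

total_steps = 9
bag_steps = total_steps // 3  # "first third of training"
schedule = [use_bags(s, bag_steps) for s in range(total_steps)]
```

Because bagging touches only the inputs and the loss, the model weights have the same shapes in both phases, which is consistent with the claim that the inference-time model is unchanged.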









PFlash now runs @poolsideai's Laguna-XS.2 (33B-A3B MoE) on a single RTX 3090.
- 111 tok/s decode @ short ctx
- 128K TTFT in 15.91s, 5.4x faster prefill vs llama.cpp
- NIAH passes every (ctx, keep) point up to 131K
- first MoE target supported by PFlash
- hand-rolled CUDA, ggml only, no libllama

great collab w/ @eisokant, @eric_alcaide, and the rest of the @poolsideai team. looking forward to working more on their great coding models. repo + GGUF in first comment.

🌊 SGLang now supports @poolsideai's Laguna-XS.2, a 33.4B-A3B hybrid SWA + MoE model purpose-built for agentic coding and long-horizon SWE work
☑️ SWE-bench Verified 68.2%; Multilingual 62.4%; Pro 44.5%; Terminal-Bench 2.0 30.1%
☑️ 131K-token context for long agent traces
☑️ Native poolside_v1 reasoning + tool-call parsers (OpenAI-compatible)
☑️ BF16, FP8, and NVFP4 quantizations
👉 Cookbook: docs.sglang.io/cookbook/autor…

Did they just reinvent Patched Training for LLMs? 🤔 They even have a very similar plot 👀











Qwen3.6-35B-A3B (TQ3_4S ~4bpw) on RTX 3060 (12GB) via llama.cpp-tq3 (TurboQuant):
• ~619 t/s prompt (4K ctx)
• ~60 t/s generation (128K ctx)
• fits in ~12.4GB VRAM

128K context with usable decode speed on a single 3060 is kind of wild