

Boffins detail new algorithms to losslessly boost AI perf by up to 2.8x dlvr.it/TLyjVv
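
Speedups like this typically come from speculative (draft-and-verify) decoding: a small draft model guesses several tokens cheaply, the large target model checks them all in one forward pass, and only tokens the target would have produced itself are kept, so output quality is unchanged. Below is a toy greedy-only sketch of that loop; the gpt2/distilgpt2 pairing is an illustrative assumption (they share a vocabulary), not the models from the article, and real implementations also reuse KV caches and handle sampling via rejection sampling.

```python
# Toy greedy speculative decoding: draft proposes k tokens, target verifies
# them in a single forward pass, longest agreeing prefix is accepted.
# Output is identical to plain greedy decoding with the target model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
target = AutoModelForCausalLM.from_pretrained("gpt2")
draft = AutoModelForCausalLM.from_pretrained("distilgpt2")  # shares gpt2's vocab

@torch.no_grad()
def speculative_generate(prompt, max_new_tokens=40, k=4):
    ids = tok(prompt, return_tensors="pt").input_ids
    produced = 0
    while produced < max_new_tokens:
        # 1) Draft: the cheap model greedily proposes k tokens.
        draft_ids = ids
        for _ in range(k):
            logits = draft(draft_ids).logits[:, -1, :]
            draft_ids = torch.cat([draft_ids, logits.argmax(-1, keepdim=True)], dim=-1)
        guesses = draft_ids[:, ids.shape[1]:]
        # 2) Verify: one target forward pass scores all k drafted positions.
        tgt_logits = target(draft_ids).logits
        verify = tgt_logits[:, ids.shape[1] - 1:-1, :].argmax(-1)
        # 3) Accept the longest prefix where draft and target agree.
        match = int((guesses == verify).long().cumprod(-1).sum())
        if match < k:
            bonus = verify[:, match:match + 1]  # target's own correction token
        else:
            bonus = tgt_logits[:, -1, :].argmax(-1, keepdim=True)  # free extra token
        ids = torch.cat([ids, guesses[:, :match], bonus], dim=-1)
        produced += match + 1
    return tok.decode(ids[0])

print(speculative_generate("Speculative decoding works because"))
```

The win comes from the verify step: the target model scores up to k positions per forward pass instead of one, so when the draft's acceptance rate is high, wall-clock time drops without changing what the target would have generated.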

Speculative decoding has shown a lot of promise, though broader adoption has taken time due to the complexity of building production-ready tooling and high-quality draft models. We’re releasing SpecBundle, a collection of large-scale EAGLE-3 draft models trained with SpecForge v0.2. This release brings major system improvements, including refactored training pipelines, multi-backend support with SGLang and @huggingface, and better usability at scale. We also built a performance dashboard to make real end-to-end speedups visible across models and settings. See the dashboard and blog in the thread 👇
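
As a rough sketch of what plugging an EAGLE-3 draft into SGLang looks like via its offline engine: the draft-model path below is a placeholder, and the flag values are illustrative defaults, not tuned settings; check the SpecForge/SGLang docs for your version.

```python
# Minimal sketch: serve a verifier with an EAGLE-3 draft model in SGLang.
# Argument names mirror SGLang's speculative-decoding server args.
import sglang as sgl

llm = sgl.Engine(
    model_path="meta-llama/Llama-3.1-8B-Instruct",  # target (verifier) model
    speculative_algorithm="EAGLE3",                 # EAGLE-3 drafting
    speculative_draft_model_path="<path-to-specbundle-eagle3-draft>",  # placeholder
    speculative_num_steps=5,         # autoregressive draft steps per round
    speculative_eagle_topk=8,        # candidate branches expanded per step
    speculative_num_draft_tokens=32, # total draft tokens verified per round
)

out = llm.generate("Speculative decoding is", {"temperature": 0, "max_new_tokens": 64})
print(out["text"])
llm.shutdown()
```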


Speculative decoding is a powerful way to improve inference performance, but in practice it has been hard to adopt. Training a unique draft model per LLM is time-consuming, and production-ready training utilities that work cleanly with vLLM have been limited. Speculators v0.3.0 closes this gap with end-to-end training support for Eagle3 draft models that run seamlessly with vLLM. The release adds offline data generation using vLLM and training support for single- and multi-layer draft models, across both MoE and non-MoE verifiers. Here's a 🧵 on speculative decoding and how to get started today in @vllm_project (1/8):
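
To make "get started today" concrete, here is a minimal sketch of loading an Eagle3 draft in vLLM. The draft checkpoint path is a placeholder standing in for a Speculators-trained model, and the speculative_config keys follow recent vLLM releases; verify against your version's docs.

```python
# Minimal sketch: run a verifier with an Eagle3 draft model in vLLM.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # verifier
    speculative_config={
        "method": "eagle3",
        "model": "<path-to-speculators-eagle3-draft>",  # placeholder
        "num_speculative_tokens": 4,  # draft tokens proposed per step
    },
)

outputs = llm.generate(
    ["Why does speculative decoding preserve output quality?"],
    SamplingParams(temperature=0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```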

inference is perhaps the most valuable emerging software category. as models get smarter and more economically valuable, compute will increasingly be spent drawing samples from the models. if you'd like to work on inference at openai, reach out — gdb@openai.com. include a description of an exceptional team you've been a part of, and your contribution towards that team's goals. also indicate any experience in inference, large-scale system optimization, or other areas where you've built up domain expertise. lots of exciting problems to work on, ranging from deeply understanding the model forward pass (including simulating/finding creative opportunities for optimization); to system-level efficiencies such as speculative decoding or kv offloading or workload-aware load balancing; to managing and making observable a massive fleet at scale.


Today’s LLMs are painfully slow and expensive. They are autoregressive and spit out words sequentially. One. At. A. Time. Our dLLMs generate text in parallel, delivering answers up to 10X faster. Now we’ve raised $50M to scale them. Full story from @russellbrandom in @TechCrunch. techcrunch.com/2025/11/06/inc…
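
A toy cost model (not an implementation of any real dLLM) makes the sequential-vs-parallel contrast concrete: autoregressive decoding needs one forward pass per token, each waiting on the last, while diffusion-style decoding refines all positions together over a fixed number of passes. The step count of 20 below is an illustrative assumption.

```python
# Toy comparison of sequential model calls needed per generated sequence.
def autoregressive_passes(num_tokens: int) -> int:
    # One model call per token, strictly sequential.
    return num_tokens

def diffusion_passes(num_tokens: int, denoise_steps: int = 20) -> int:
    # A fixed number of parallel refinement passes, independent of length.
    return denoise_steps

for n in (128, 512, 2048):
    print(f"{n} tokens: autoregressive={autoregressive_passes(n)} passes, "
          f"dLLM={diffusion_passes(n)} passes")
```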

NYC open-source AI infra contributors — we’ve launched a community research hub above Grand Central where GPUs go brrr 🔥🗽 A place to hack, benchmark, and collaborate — vLLM, SGLang, kernels, inference optimizations all welcome. Open space. Open source. Weekends too. Huge thanks to @Company for supporting this initiative 🙌 𝐋𝐢𝐦𝐢𝐭𝐞𝐝 𝐬𝐞𝐚𝐭𝐬. 𝐃𝐫𝐨𝐩 𝐲𝐨𝐮𝐫 𝐏𝐑𝐬 𝐢𝐧 𝐭𝐡𝐞 𝐜𝐨𝐦𝐦𝐞𝐧𝐭𝐬 𝐭𝐨 𝐣𝐨𝐢𝐧 𝐭𝐡𝐞 𝐧𝐞𝐱𝐭 𝐬𝐩𝐫𝐢𝐧𝐭!