Inferact

81 posts

@inferact

Silicon Joined December 2025
3 Following · 4.1K Followers
Inferact (@inferact)
That's a wrap on #MLSys2026 in Bellevue! 🚢

It was great meeting so many of you this past week: researchers, contributors, and friends of @vllm_project. The energy around inference systems right now is something else, and the conversations reminded us why this community matters.

A few highlights from our team:

🎤 @rogerw0108 (co-founder, vLLM core maintainer) gave an invited talk, "Rethinking Open Source Contribution in the Age of AI Agents": a maintainer's-eye view of how AI-generated PRs are reshaping the economics of open source, with concrete examples from vLLM.

🎤 @yifandotqiao gave a Lightning Talk, "Rethink LLM Inference Abstractions: New Trends and Challenges in LLM Serving", on the combinatorial explosion across models, hardware, and workloads, and why serving at scale is increasingly a distributed systems problem.

And of course, congrats to everyone who played 20 Questions with vLLM at our booth 🎯

Thanks to the MLSys organizers for putting on such a great week. If we missed you in Bellevue, our DMs are open; we're always happy to talk inference, vLLM, and what we're building.

On to the next one. 🛠️
1
3
45
2.6K
Inferact (@inferact)
Great cohosting this luncheon with @a16z and Mirendil at MLSys 2026 yesterday! 🙌 We brought together top researchers and AI systems engineers for an afternoon of rich conversations on @vllm_project, the frontier of inference, and where AI systems are headed next. Huge thanks to everyone who joined — the energy in the room was something else. This is exactly the kind of cross-pollination between labs, infra teams, and industry that pushes the whole stack forward. More to come. 👀 #MLSys2026 #vLLM
2
7
25
6.3K
Inferact (@inferact)
Shoutout to our co-founder @KaichaoYou for making this fix and writing up the full story. From a 2024 hackathon bug → in-tree workarounds in vLLM → PyTorch Foundation TAC → fix landed in PyTorch 2.11.0. This kind of unglamorous, multi-org debugging makes the whole stack better. 👇
PyTorch (@PyTorch)

vLLM and PyTorch worked together to fix a long-standing aarch64 install headache: as of PyTorch 2.11.0, pip install torch on GB200 / GB300 / GH200 just works. 🎉

What changed: PyTorch 2.11.0 now publishes CUDA-enabled aarch64 wheels to the default PyPI index. No more custom --index-url flags. No more transitive dependencies silently swapping your GPU build for the CPU wheel. New users on Grace Hopper and Grace Blackwell systems can follow the standard install instructions and have vLLM work the first time.

In our latest blog, @KaichaoYou (co-founder @inferact, Lead Maintainer @vllm_project) shares the full story:
🐛 A 2024 hackathon bug bringing up vLLM on GH200
🔧 vLLM's in-tree workarounds (use_existing_torch.py and [tool.uv] build-isolation passthrough)
🤝 From GitHub issue to PyTorch Foundation TAC discussion
🚀 The fix landing in PyTorch 2.11.0, driven by NVIDIA and PyTorch core.

A great example of cross-project collaboration under the PyTorch Foundation umbrella, and a reminder that boring infrastructure wins compound.

Read the full story: pytorch.org/blog/vllm-and-…

✍️: Piotr Bialecki (@nvidia, @ptrblck_de), Alban Desmaison (@Meta), Andrey Talman (@Meta), Nikita Shulga (@Meta)
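A quick way to confirm that an install did not silently end up with a non-CUDA wheel is to look at `torch.version.cuda`, which is `None` when PyTorch was built without CUDA. The sketch below is only an illustration of that check, not something from the blog post; the helper names `classify_torch_build` and `check_installed_torch` are hypothetical.

```python
import platform
from typing import Optional


def classify_torch_build(cuda_version: Optional[str]) -> str:
    """Classify a PyTorch build from the value of torch.version.cuda.

    torch.version.cuda is None for builds without CUDA support and a
    version string such as "12.8" for CUDA-enabled builds.
    """
    if cuda_version is None:
        return "no-cuda"
    return f"cuda-{cuda_version}"


def check_installed_torch() -> str:
    """Hypothetical helper: report which kind of torch build is installed."""
    try:
        import torch  # imported lazily so the sketch runs without torch
    except ImportError:
        return "torch-not-installed"
    return classify_torch_build(torch.version.cuda)


if __name__ == "__main__":
    # On a Grace Hopper or Grace Blackwell host, platform.machine()
    # reports "aarch64".
    print(platform.machine(), check_installed_torch())
```

A result of `no-cuda` on a GPU machine suggests the wrong wheel was resolved, which is the failure mode the post describes.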

1
4
42
3.4K
Inferact (@inferact)
We’re at MLSys 2026 in Bellevue this week! ⛴️ Come find the Inferact team at Booth #2 in the Evergreen Ballroom.

Talks:
• @rogerw0108 (co-founder at Inferact), “Rethinking Open Source Contribution in the Age of AI Agents”, Mon 5/18, 11:36 AM
• @yifandotqiao (vLLM core contributor), YPS Sponsor Lightning Talk, Mon 5/18, 11:36 AM

At the booth:
• 20 Questions with vLLM: a game with vLLM running on DGX Spark, with prizes 🎯
• vLLM + Inferact swag 🧢
• Inferact team members, happy to talk inference and vLLM

If you’re attending, come say hi, chat about inference, or learn what we’re building!
1
2
27
2.1K
Inferact (@inferact)
We're onto Inferact's second office this year! Yesterday, we finally broke it in with an office warming.

It's amazing to see how far we've come. The vLLM ecosystem has been growing at lightning pace, and we've been lucky to scale alongside it: helping teams serve inference faster, cheaper, and at scale.

Thank you to everyone who made it out yesterday: customers, partners, friends, and the whole Inferact team. It meant a lot to celebrate this milestone together.

We're hiring across all teams. If you want to join one of the fastest-growing AI infra companies and power the next generation of AI, check out our careers page or DM us.

Excited for many more office warmings to come!
11
10
115
16.5K
Inferact (@inferact)
Proud of what the team has shipped here! And prouder that all this work is in vLLM main or heading upstream 🚀
vLLM (@vllm_project)

vLLM tops the Artificial Analysis leaderboard 🎉

vLLM tops @ArtificialAnlys on DeepSeek V3.2 and ranks among the top deployments of MiniMax-M2.5 and Qwen 3.5 397B. The leading deployments of these models are now open source.

How each result was built:
🔹 DeepSeek V3.2: Aggressive op fusion across the attention path collapsed ~33 per-layer kernels down toward ~10.
🔹 MiniMax-M2.5: Custom EAGLE3 draft trained against the target's own token distribution via TorchSpec, plus a custom QK-norm fusion for MiniMax's TP-aware attention.
🔹 Qwen 3.5 397B: Targeted fusions plus a QK-norm fix for Qwen's linear-attention path.

Every optimization is in vLLM main or on its way upstream.

Huge thank you to @inferact, @digitalocean, @nvidia, @RedHat_AI, and the vLLM community 🙏

Full breakdown 👇 vllm.ai/blog/vllm-tops…

0
0
14
1.9K
Inferact reposted
vLLM (@vllm_project)
vLLM tops the Artificial Analysis leaderboard 🎉

vLLM tops @ArtificialAnlys on DeepSeek V3.2 and ranks among the top deployments of MiniMax-M2.5 and Qwen 3.5 397B. The leading deployments of these models are now open source.

How each result was built:
🔹 DeepSeek V3.2: Aggressive op fusion across the attention path collapsed ~33 per-layer kernels down toward ~10.
🔹 MiniMax-M2.5: Custom EAGLE3 draft trained against the target's own token distribution via TorchSpec, plus a custom QK-norm fusion for MiniMax's TP-aware attention.
🔹 Qwen 3.5 397B: Targeted fusions plus a QK-norm fix for Qwen's linear-attention path.

Every optimization is in vLLM main or on its way upstream.

Huge thank you to @inferact, @digitalocean, @nvidia, @RedHat_AI, and the vLLM community 🙏

Full breakdown 👇 vllm.ai/blog/vllm-tops…
2
29
149
22.3K
Inferact reposted
DigitalOcean (@digitalocean)
Among the fastest DeepSeek V3.2, MiniMax-M2.5, and Qwen 3.5 397B inference in the market, per Artificial Analysis benchmarks (April 2026). ⚡️🤖 Sub-1-second TTFT. 230 tokens per second. Co-designed every layer of the stack with @Inferact, performance optimized @vllm_project, all on @NVIDIA HGX B300. Live on DigitalOcean Serverless Inference now. Full breakdown in the comments. ⬇️
1
8
33
38.9K
Inferact reposted
Roger Wang (@rogerw0108)
🚀🚀🚀 github.com/vllm-project/v… just got merged to main! Huge shoutout to the entire team @inferact that worked on day-0 support of DeepSeek V4 and our partner @NVIDIAAI for the collaboration on day-0 large scale serving enablement! More optimizations coming soon - stay tuned!
0
7
61
4.2K
Inferact reposted
Roger Wang (@rogerw0108)
Big thanks to @NVIDIAAI and @SemiAnalysis_ for shipping this together with @inferact @vllm_project! Day-0 large-scale serving for a new architecture is a huge milestone. DeepSeekV4 is a beast, but more optimizations are coming to @vllm_project to make it faster and cheaper!
SemiAnalysis (@SemiAnalysis_)

DAVIS, APRIL 25, 2026: InferenceX has added DeepSeekv4, with @vllm_project's day-0 support for GB200 disagg! Great work by @flowpow123 @rogerw0108 @NVIDIAAIDev @inferact on the fast support and engineering!

3
5
33
3.9K
Inferact reposted
SemiAnalysis (@SemiAnalysis_)
DeepSeekv4 Pro 1.6T is supported on InferenceX on Day 0! We already have H200 vLLM working, and we are also working on @vllm_project & @sgl_project MI355, B200, B300, and GB200/300 disaggregated DeepSeekv4 day-0 performance benchmarking to track improvement over time. Thank you to @NVIDIAAIDev & @simon_mo_ & @rogerw0108 from @inferact for helping pull an all-nighter to debug dsv4 day 0.
5
10
164
18.7K
Inferact reposted
vLLM (@vllm_project)
🎉 Day-0 support for @deepseek_ai V4 Pro and Flash on vLLM: a new generation of DeepSeek model, purpose-built for tasks up to 1M tokens. Alongside the release, we're publishing a first-principles walkthrough of the new long-context attention and how we implemented it in vLLM.

The new attention mechanism, in four moves:
• Shared K/V + inverse RoPE → 2× memory savings
• c4a / c128a KV compression → 4×–128× savings
• DeepSeek Sparse Attention over compressed tokens
• Short sliding window for locality across compression boundaries

At 1M context, per-layer KV state is ~8.7× smaller than a DeepSeek V3.2-style 61-layer stack (9.62 GiB vs 83.9 GiB, bf16). fp8 attention cache + fp4 indexer cache shrink it further.

vLLM side:
• Unified hybrid KV cache: single logical block size (256 native positions) across all compression rates; compressor state folded into the SWA KV cache spec so prefix caching, disagg prefill, CUDA graphs and MTP reuse the same abstraction
• Three page-size buckets for the full 5-way cache stack → no cross-kind fragmentation
• Fused kernels: compressor + RMSNorm + RoPE + cache insert (1.4–3×), inverse RoPE + fp8 quant (2–3×), Q-norm + KV RoPE + K insert (10–20×)
• Multi-stream overlap of indexer vs main-KV compression vs SWA insertion

Disaggregated serving is supported out of the box and strongly recommended for best performance. Follow our recipes site for verified commands for @nvidia Blackwell (B200, B300, GB200, GB300) and Hopper (H100/H200/H20) systems.

Thanks to the @deepseek_ai team for open-sourcing DeepSeek V4, and to @inferact for landing day-0 support 🤝

📝 Blog: vllm.ai/blog/deepseek-…
📖 Recipes: recipes.vllm.ai/deepseek-ai/De…
🤗 huggingface.co/deepseek-ai/De…
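The headline numbers in the post above can be sanity-checked with a few lines of arithmetic. This is a rough sketch and not vLLM code: the two GiB figures, the 256-position block size, and the 4× and 128× compression rates come from the post, while the helper names and the assumption that a block holds `native_positions / rate` compressed entries are illustrative only.

```python
# KV state at 1M context in bf16, as quoted in the post.
V4_KV_GIB = 9.62
V32_STYLE_KV_GIB = 83.9

# Logical block size quoted in the post (native token positions per block).
BLOCK_NATIVE_POSITIONS = 256


def reduction_factor(baseline_gib: float, new_gib: float) -> float:
    """How many times smaller the new KV state is than the baseline."""
    return baseline_gib / new_gib


def compressed_entries_per_block(native_positions: int, rate: int) -> int:
    """Entries stored per block under compression `rate`.

    Assumption (not stated in the post): one compressed entry summarizes
    `rate` consecutive native positions, so a block holds
    native_positions / rate entries.
    """
    if native_positions % rate != 0:
        raise ValueError("block size must be a multiple of the compression rate")
    return native_positions // rate


ratio = reduction_factor(V32_STYLE_KV_GIB, V4_KV_GIB)
entries = {
    rate: compressed_entries_per_block(BLOCK_NATIVE_POSITIONS, rate)
    for rate in (1, 4, 128)  # uncompressed, c4a, c128a
}

print(f"KV state reduction: ~{ratio:.1f}x")
print(f"entries per 256-position block by rate: {entries}")
```

Under that assumption, a single block size of 256 native positions divides evenly at both quoted compression rates, which is consistent with the post's point about using one logical block size across all rates.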
DeepSeek (@deepseek_ai)

🚀 DeepSeek-V4 Preview is officially live & open-sourced! Welcome to the era of cost-effective 1M context length.

🔹 DeepSeek-V4-Pro: 1.6T total / 49B active params. Performance rivaling the world's top closed-source models.
🔹 DeepSeek-V4-Flash: 284B total / 13B active params. Your fast, efficient, and economical choice.

Try it now at chat.deepseek.com via Expert Mode / Instant Mode. API is updated & available today!

📄 Tech Report: huggingface.co/deepseek-ai/De…
🤗 Open Weights: huggingface.co/collections/de…

1/n

17
91
573
123.9K
Inferact reposted
Woosuk Kwon (@woosuk_k)
Going from Ampere to Hopper and to Blackwell, we always find new ways to leverage the architectural innovations to accelerate inference performance. Excited to collaborate with @nvidia to advance @inferact’s mission to grow vLLM!
Inferact (@inferact)

We are thrilled to announce that @nvidia is the latest investor in @inferact. We look forward to continuing the momentum driven by our deep collaboration:
(1) Engineering velocity: a significant uptick in @nvidia pull requests to the @vllm_project repo.
(2) Product synergy: close integration with NVIDIA Dynamo, ModelOpt, Nemotron, and more products!

It’s an exciting time for the growth and development of vLLM, the world's AI inference engine!

1
3
94
12.9K