Simon Mo

244 posts

@simon_mo_

building @inferact for @vllm_project

Joined July 2018
355 Following · 3K Followers
Pinned Tweet
Simon Mo@simon_mo_·
vLLM has grown to a scale of 2,000+ contributors, with a diverse community of models, hardware, and applications. I see @vllm_project on the path to becoming the world's inference engine, and @inferact accelerating AI progress. We could not be more excited about the road ahead.
Woosuk Kwon@woosuk_k

Today, we're proud to announce @inferact, a startup founded by creators and core maintainers of @vllm_project, the most popular open-source LLM inference engine. Our mission is to grow vLLM as the world's AI inference engine and accelerate AI progress by making inference cheaper and faster.

The Challenge

Inference is not solved. It's getting harder. Models grow larger. New architectures proliferate: mixture-of-experts, multimodal, agentic. Every breakthrough demands new infrastructure. Meanwhile, hardware fragments: more accelerators, more programming models, and more combinations to optimize. The capability gap between models and the systems that serve them is widening.

Left this way, the most capable models remain bottlenecked and with full scope of their capabilities accessible only to those who can build custom infrastructure. Close the gap, and we unlock new possibilities.

And the problem is growing. Inference is shifting from a fraction of compute to the majority: test-time compute, RL training loops, synthetic data.

We see a future where serving AI becomes effortless. Today, deploying a frontier model at scale requires a dedicated infrastructure team. Tomorrow, it should be as simple as spinning up a serverless database. The complexity doesn't disappear; it gets absorbed into the infrastructure we're building.

Why Us

vLLM sits at the intersection of models and hardware: a position that took years to build. When model vendors ship new architectures, they work with us to ensure day-zero support. When hardware vendors develop new silicon, they integrate with vLLM. When teams deploy at scale, they run vLLM, from frontier labs to hyperscalers to startups serving millions of users.

Today, vLLM supports 500+ model architectures, runs on 200+ accelerator types, and powers inference at global scale. This ecosystem, built with 2,000+ contributors, is our foundation.

We've been stewards of this engine since its first commit. We know it inside out. We deployed it at frontier scale, in research and in production.

Open Source

vLLM was built in the open. That's not changing. Inferact exists to supercharge vLLM adoption. The optimizations we develop flow back to the community. We plan to push vLLM's performance further, deepen support for emerging model architectures, and expand coverage across frontier hardware. The AI industry needs inference infrastructure that isn't locked behind proprietary walls.

Join Us

Through the open source community, we are fortunate to work with some of the best people we know. For @inferact, we're hiring engineers and researchers to work at the frontier of inference, where models meet hardware at scale. Come build with us.

We're fortunate to be supported by investors who share our vision, including @a16z and @lightspeedvp who led our $150M seed, as well as @sequoia, @AltimeterCap, @Redpoint, @ZhenFund, The House Fund, @strikervp, @LaudeVentures, and @databricks.

- @woosuk_k, @simon_mo_, @KaichaoYou, @rogerw0108, @istoica05 and the rest of the founding team

Simon Mo@simon_mo_·
a small version bump of EAGLE -> 2x interactivity boost!
vLLM@vllm_project

🎉 Thrilled to collaborate with the EAGLE team (@hongyangzh) and TorchSpec (@lightseekorg) on EAGLE 3.1 - a robustness upgrade to speculative decoding!
💡 The EAGLE team traced deeper-step acceptance-length collapse to "attention drift". EAGLE 3.1 fixes it with FC normalization + post-norm hidden-state feedback into the next step.
✨ What's new:
- More performant vs EAGLE 3 specifically on long-context workloads
- Config-driven extension of EAGLE 3 in vLLM with backward compatibility
- New eagle3.1 draft model checkpoint trained by TorchSpec with vLLM backend
🔗 Blog: vllm.ai/blog/2026-05-2…

Simon Mo retweeted
aaron@aarnphm·
@simon_mo_ Wait can i grab one of the hats :))

Simon Mo retweeted
vLLM@vllm_project·
Great work at @baseten running vLLM-Omni in production — open-source, production-grade, cost-efficient omni-modal serving 🎙️ Multi-stage audio, streaming multi-modal, real-time TTS — workloads where closed-source APIs have been the default. → github.com/vllm-project/v…
Baseten@baseten

We serve Qwen3-TTS on vLLM-Omni at $3 per 1M characters. That's 90% lower in cost than comparable closed-source TTS APIs. Our engineers optimized a single-replica serving stack to get there. Details on the optimized stack and cost per concurrent stream here.

Simon Mo retweeted
Inferact@inferact·
We're onto Inferact's second office this year! Yesterday, we finally broke it in with an office warming. It's amazing to see how far we've come. The vLLM ecosystem has been growing at lightning pace, and we've been lucky to scale alongside it: helping teams serve inference faster, cheaper, and at scale. Thank you to everyone who made it out yesterday — customers, partners, friends, and the whole Inferact team. It meant a lot to celebrate this milestone together. We're hiring across all teams. If you want to join one of the fastest-growing AI infra companies and power the next generation of AI, check out our careers page or DM us. Excited for many more office warmings to come!

Simon Mo retweeted
SemiAnalysis@SemiAnalysis_·
THE MORE U BUY, THE MORE U SAVE: By ganging up multiple B200 8-GPU machines together over RoCEv2 CX-7 ethernet with Tomahawk switches with an inference optimization called PD disaggregation, the per GPU token throughput increases up to 7x. By increasing per GPU token throughput by up to 7x, this decreases cost per million tokens by up to 7x also. Great work to @inferact & @vllm_project for building this amazing OSS engine & for @NVIDIADC @KranenKyle for building dynamo inference orchestrator. More improvements to disagg b200 perf to come!
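
The throughput-to-cost claim above can be checked with a few lines of arithmetic. This is a minimal sketch, assuming the hourly GPU price stays the same in both setups and ignoring any extra networking cost; the price and throughput figures are hypothetical placeholders, not numbers from the post.

```python
def cost_per_million_tokens(gpu_hour_usd: float, tokens_per_sec_per_gpu: float) -> float:
    """Cost in USD to generate one million tokens on a single GPU."""
    tokens_per_hour = tokens_per_sec_per_gpu * 3600
    return gpu_hour_usd / tokens_per_hour * 1_000_000


# Hypothetical inputs chosen only to illustrate the ratio.
GPU_HOUR_USD = 4.0            # assumed rental price per GPU-hour
BASELINE_TOKENS_PER_SEC = 1000.0   # assumed per-GPU throughput without disaggregation
SPEEDUP = 7.0                 # the "up to 7x" upper bound quoted in the post

baseline_cost = cost_per_million_tokens(GPU_HOUR_USD, BASELINE_TOKENS_PER_SEC)
disagg_cost = cost_per_million_tokens(GPU_HOUR_USD, BASELINE_TOKENS_PER_SEC * SPEEDUP)

# With a fixed GPU price, cost scales inversely with per-GPU throughput.
cost_ratio = baseline_cost / disagg_cost

print(f"baseline: ${baseline_cost:.4f} per 1M tokens")
print(f"disaggregated: ${disagg_cost:.4f} per 1M tokens")
print(f"cost reduction: {cost_ratio:.1f}x")
```

Because the GPU price cancels out of the ratio, the cost reduction equals the throughput gain whatever price is plugged in, as long as the price per GPU-hour does not change between the two setups.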

Simon Mo retweeted
vLLM@vllm_project·
vLLM tops the Artificial Analysis leaderboard 🎉

vLLM tops @ArtificialAnlys on DeepSeek V3.2 and ranks among the top deployments of MiniMax-M2.5 and Qwen 3.5 397B. The leading deployments of these models are now open source.

How each result was built:
🔹 DeepSeek V3.2: Aggressive op fusion across the attention path collapsed ~33 per-layer kernels down toward ~10.
🔹 MiniMax-M2.5: Custom EAGLE3 draft trained against the target's own token distribution via TorchSpec, plus a custom QK-norm fusion for MiniMax's TP-aware attention.
🔹 Qwen 3.5 397B: Targeted fusions plus a QK-norm fix for Qwen's linear-attention path.

Every optimization is in vLLM main or on its way upstream. Huge thank you to @inferact, @digitalocean, @nvidia, @RedHat_AI, and the vLLM community 🙏

Full breakdown 👇 vllm.ai/blog/vllm-tops…
Alex Krentsel@AlexKrentsel·
Congrats to Simon, this is incredible work
Joey Gonzalez@profjoeyg

Today I’m excited to congratulate @simon_mo_ on an outstanding PhD thesis defense on his work exploring the design of Inference Serving Systems. 🎉 Simon has been working on inference systems with me for nearly a decade -- long before most people even considered inference serving a research problem worth studying. Over that time, he helped drive inference systems projects spanning Clipper, @raydistributed Serve, and now @vllm_project. Together, these systems helped define the modern inference serving stack that powers today’s AI applications. Beyond being an exceptional researcher, Simon has also been a remarkable team and community builder, especially through his leadership on vLLM and the open-source ecosystem around it. Along with my colleagues @istoica05 and @koushik77, I am excited to see Simon leading @inferact as CEO and helping shape the future of inference systems and AI infrastructure. Congratulations, Simon!

Simon Mo retweeted
Roger Wang@rogerw0108·
Not just PR merge but stable release and reliability patch release too!😎
SemiAnalysis@SemiAnalysis_

POV of @vllm_project maintainers optimizing DeepSeekv4 performance on day 0 and merging their initial model support PR over the weekend. SPEED IS THE MOAT

Jason Cui@JasonSCui·
Congrats @simon_mo_ !! What a legendary run that's only just beginning. Thanks for helping make inference easier and more accessible alongside the entire @inferact team and @vllm_project community.
Simon Mo@simon_mo_·
@yifandotqiao and team evaluated Mooncake's distributed CPU KV store and were quite happy about the throughput increase AND E2E latency savings! Let's go 🥮!
vLLM@vllm_project

🚀 New on the @vllm_project blog: Serving Agentic Workloads at Scale with vLLM x Mooncake.
Agentic traces grow to 80K+ tokens with 94%+ reusable prefixes, but local KV caches evict them and cross-instance routing misses them. By integrating Mooncake Store as a distributed KV cache pool, vLLM gets:
🚀 3.8x higher throughput
⚡ 46x lower P50 TTFT
⏱️ 8.6x lower E2E latency
📈 Cache hit rate 1.7% -> 92.2%
🌐 Scales near-linearly to 60 GB200 GPUs at >95% hit rate
🔥 Powered by a deep collaboration between @Inferact and @KT_Project_AI
📖 Read more: vllm.ai/blog/mooncake-… 🧵👇

Simon Mo@simon_mo_·
Thank you @profjoeyg! Grateful for the last decade of inference research — excited for the next chapter at @inferact and @vllm_project
Simon Mo@simon_mo_·
@mitsuhiko can you help us 🥺 what are the terrible decisions from your pov?
Armin Ronacher ⇌@mitsuhiko·
Terrible decisions upstream lead to terrible consequences downstream. I hate everything.
Mario Zechner@badlogicgames

today was the day @mitsuhiko had his first look into the source code of a bunch of inference engines (not going to name names) and like me a few months ago, he now has stared into the abyss. we will never be the same.
