Simon Mo

A vLLM MoE deployment's DP/EP topology used to be locked in at launch: scaling or swapping config meant a full restart, in-flight traffic dropped. Elastic Expert Parallelism changes that. One API call resizes a live deployment:

curl -X POST localhost:8000/scale_elastic_ep \
  -d '{"new_data_parallel_size": 16}'

Under the hood: standby comm groups span the target topology, EPLB redistributes experts across the new EP group, and weights are transferred directly between GPUs over NVIDIA NVLink/RDMA.

The same runtime reconfiguration path is what fault-tolerant serving needs: evict failed ranks, redistribute their experts, bring replacements back, no restart.

Thanks to @NVIDIAAI, Sky Computing, @anyscalecompute, @RedHat_AI, and the community. 📖 vllm.ai/blog/2026-05-1…
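For readers who would rather script the resize than type the curl command, here is a minimal Python sketch of the same request. Only the `/scale_elastic_ep` path, the `localhost:8000` address, and the `new_data_parallel_size` field come from the post. The helper name `build_scale_request`, the JSON `Content-Type` header, and the input check are assumptions added for illustration, not documented vLLM behavior.

```python
import json
import urllib.request


def build_scale_request(
    new_dp_size: int, base_url: str = "http://localhost:8000"
) -> urllib.request.Request:
    """Build (but do not send) the POST issued by the curl command in the post.

    Hypothetical helper: only the endpoint path and the JSON field name are
    taken from the post itself.
    """
    if new_dp_size < 1:
        raise ValueError("new_data_parallel_size must be a positive integer")
    body = json.dumps({"new_data_parallel_size": new_dp_size}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/scale_elastic_ep",
        data=body,
        # Assumption: the post's curl call sets no header; JSON is our choice.
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_scale_request(16)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))

# To send it against a running server (not executed here):
# with urllib.request.urlopen(req, timeout=600) as resp:
#     print(resp.status)
```

Building the request separately from sending it makes the payload easy to inspect before touching a live deployment.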

Simon Mo retweeted

Vikram@msharmavikram·4d

Elastic parallelism for wideEP deployments is critical for operating large-scale inference systems efficiently. What started as an idea at a Dynamo after-party nearly a year ago is now finally available to everyone. Congratulations @nvidia dynamo nixl and @vllm_project team!

vLLM@vllm_project

Simon Mo retweeted

Kaichao You@KaichaoYou·16 May

vLLM has become the common language of LLM inference🥰

Baseten@baseten

We serve Qwen3-TTS on vLLM-Omni at $3 per 1M characters. That's 90% lower in cost than comparable closed-source TTS APIs. Our engineers optimized a single-replica serving stack to get there. Details on the optimized stack and cost per concurrent stream here.

Simon Mo retweeted

Roger Wang@rogerw0108·18 May

Giving a talk on behalf of @vllm_project about open source at #MLSys 2026 tomorrow and will be around in Bellevue May 18-21. mlsys.org/virtual/2026/i… The @inferact crew will be here too with a booth! Come say hi!🤗

Simon Mo@simon_mo_·16 May

@aarnphm ofc!

aaron@aarnphm·15 May

@simon_mo_ Wait can i grab one of the hats :))

Simon Mo@simon_mo_·15 May

Very importantly! We have a beautiful patio on Market Street.

Inferact@inferact

We're onto Inferact's second office this year! Yesterday, we finally broke it in with an office warming. It's amazing to see how far we've come. The vLLM ecosystem has been growing at lightning pace, and we've been lucky to scale alongside it: helping teams serve inference faster, cheaper, and at scale. Thank you to everyone who made it out yesterday — customers, partners, friends, and the whole Inferact team. It meant a lot to celebrate this milestone together. We're hiring across all teams. If you want to join one of the fastest-growing AI infra companies and power the next generation of AI, check out our careers page or DM us. Excited for many more office warmings to come!

Simon Mo retweeted

vLLM@vllm_project·15 May

Great work at @baseten running vLLM-Omni in production — open-source, production-grade, cost-efficient omni-modal serving 🎙️ Multi-stage audio, streaming multi-modal, real-time TTS — workloads where closed-source APIs have been the default. → github.com/vllm-project/v…

Simon Mo retweeted

Inferact@inferact·15 May

We're onto Inferact's second office this year! Yesterday, we finally broke it in with an office warming. It's amazing to see how far we've come. The vLLM ecosystem has been growing at lightning pace, and we've been lucky to scale alongside it: helping teams serve inference faster, cheaper, and at scale. Thank you to everyone who made it out yesterday — customers, partners, friends, and the whole Inferact team. It meant a lot to celebrate this milestone together. We're hiring across all teams. If you want to join one of the fastest-growing AI infra companies and power the next generation of AI, check out our careers page or DM us. Excited for many more office warmings to come!

Simon Mo@simon_mo_·15 May

Super cool use of @vllm_project at @baseten. Open source is the way!

Simon Mo retweeted

SemiAnalysis@SemiAnalysis_·12 May

THE MORE U BUY, THE MORE U SAVE: By ganging multiple B200 8-GPU machines together over RoCEv2 CX-7 Ethernet with Tomahawk switches and applying an inference optimization called PD disaggregation, per-GPU token throughput increases by up to 7x. At a fixed GPU price, that cuts cost per million tokens by up to 7x as well. Great work to @inferact & @vllm_project for building this amazing OSS engine & to @NVIDIADC @KranenKyle for building the Dynamo inference orchestrator. More improvements to disagg B200 perf to come!
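The link between throughput and cost in the post above is plain arithmetic, sketched below in Python. The GPU-hour price and the baseline tokens-per-second figure are hypothetical placeholders, not numbers from the post. Only the "up to 7x" multiplier is taken from it, and the sketch assumes the per-GPU hourly price stays the same in both setups.

```python
def cost_per_million_tokens(gpu_hour_usd: float, tokens_per_gpu_per_sec: float) -> float:
    """Cost in USD to generate one million tokens on a single GPU."""
    tokens_per_hour = tokens_per_gpu_per_sec * 3600
    return gpu_hour_usd / tokens_per_hour * 1_000_000


GPU_HOUR_USD = 3.0      # hypothetical rental price per GPU-hour
BASELINE_TPS = 1000.0   # hypothetical per-GPU tokens/sec without disaggregation
SPEEDUP = 7.0           # "up to 7x" per-GPU throughput gain cited in the post

baseline = cost_per_million_tokens(GPU_HOUR_USD, BASELINE_TPS)
disagg = cost_per_million_tokens(GPU_HOUR_USD, BASELINE_TPS * SPEEDUP)
ratio = baseline / disagg

print(f"baseline: ${baseline:.3f} per 1M tokens")
print(f"disaggregated: ${disagg:.3f} per 1M tokens")
print(f"cost reduction: {ratio:.1f}x")
```

Because cost per token is inversely proportional to per-GPU throughput, the reduction factor equals the speedup whatever placeholder price is used.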

Simon Mo retweeted

vLLM@vllm_project·12 May

vLLM tops the Artificial Analysis leaderboard 🎉

vLLM tops @ArtificialAnlys on DeepSeek V3.2 and ranks among the top deployments of MiniMax-M2.5 and Qwen 3.5 397B. The leading deployments of these models are now open source. How each result was built:

🔹 DeepSeek V3.2: Aggressive op fusion across the attention path collapsed ~33 per-layer kernels down toward ~10.
🔹 MiniMax-M2.5: Custom EAGLE3 draft trained against the target's own token distribution via TorchSpec, plus a custom QK-norm fusion for MiniMax's TP-aware attention.
🔹 Qwen 3.5 397B: Targeted fusions plus a QK-norm fix for Qwen's linear-attention path.

Every optimization is in vLLM main or on its way upstream. Huge thank you to @inferact, @digitalocean, @nvidia, @RedHat_AI, and the vLLM community 🙏 Full breakdown 👇 vllm.ai/blog/vllm-tops…

Simon Mo@simon_mo_·9 May

@AlexKrentsel Thank you Alex!!!

Alex Krentsel@AlexKrentsel·8 May

Congrats to Simon, this is incredible work

Joey Gonzalez@profjoeyg

Today I’m excited to congratulate @simon_mo_ on an outstanding PhD thesis defense on his work exploring the design of Inference Serving Systems. 🎉 Simon has been working on inference systems with me for nearly a decade -- long before most people even considered inference serving a research problem worth studying. Over that time, he helped drive inference systems projects spanning Clipper, @raydistributed Serve, and now @vllm_project. Together, these systems helped define the modern inference serving stack that powers today’s AI applications. Beyond being an exceptional researcher, Simon has also been a remarkable team and community builder, especially through his leadership on vLLM and the open-source ecosystem around it. Along with my colleagues @istoica05 and @koushik77, I am excited to see Simon leading @inferact as CEO and helping shape the future of inference systems and AI infrastructure. Congratulations, Simon!

Simon Mo retweeted

Roger Wang@rogerw0108·8 May

Not just PR merge but stable release and reliability patch release too!😎

SemiAnalysis@SemiAnalysis_

POV of @vllm_project maintainers optimizing DeepSeekv4 performance on day 0 and merging their initial model support PR over the weekend. SPEED IS THE MOAT

Simon Mo retweeted

SemiAnalysis@SemiAnalysis_·8 May

POV of @vllm_project maintainers optimizing DeepSeekv4 performance on day 0 and merging their initial model support PR over the weekend. SPEED IS THE MOAT

Simon Mo@simon_mo_·7 May

@JasonSCui @inferact @vllm_project Thank you, Jason!

Jason Cui@JasonSCui·7 May

Congrats @simon_mo_ !! What a legendary run that's only just beginning. Thanks for helping make inference easier and more accessible alongside the entire @inferact team and @vllm_project community.

Simon Mo@simon_mo_·7 May

@JamesAlcorn94 @profjoeyg @istoica05 Thank you, James!

James Alcorn@JamesAlcorn94·7 May

@profjoeyg @simon_mo_ Congrats @simon_mo_ and sorry I wasn't in the room barracking for you mate! Another skylab grad on their way to make a fat dent in the universe - kudos, yet again @istoica05 @profjoeyg. I am so grateful for what you do for your students.

Joey Gonzalez@profjoeyg·7 May

Today I’m excited to congratulate @simon_mo_ on an outstanding PhD thesis defense on his work exploring the design of Inference Serving Systems. 🎉 Simon has been working on inference systems with me for nearly a decade -- long before most people even considered inference serving a research problem worth studying. Over that time, he helped drive inference systems projects spanning Clipper, @raydistributed Serve, and now @vllm_project. Together, these systems helped define the modern inference serving stack that powers today’s AI applications. Beyond being an exceptional researcher, Simon has also been a remarkable team and community builder, especially through his leadership on vLLM and the open-source ecosystem around it. Along with my colleagues @istoica05 and @koushik77, I am excited to see Simon leading @inferact as CEO and helping shape the future of inference systems and AI infrastructure. Congratulations, Simon!

Simon Mo@simon_mo_·7 May

@yifandotqiao and team evaluated Mooncake's distributed cpu kv store and were quite happy about the throughput increase AND e2e latency savings! Let's go 🥮!

vLLM@vllm_project

🚀 New on the @vllm_project blog: Serving Agentic Workloads at Scale with vLLM x Mooncake.

Agentic traces grow to 80K+ tokens with 94%+ reusable prefixes, but local KV caches evict them and cross-instance routing misses them. By integrating Mooncake Store as a distributed KV cache pool, vLLM gets:

🚀 3.8x higher throughput
⚡ 46x lower P50 TTFT
⏱️ 8.6x lower E2E latency
📈 Cache hit rate 1.7% -> 92.2%
🌐 Scales near-linearly to 60 GB200 GPUs at >95% hit rate

🔥 Powered by a deep collaboration between @Inferact and @KT_Project_AI 📖 Read more: vllm.ai/blog/mooncake-… 🧵👇

Simon Mo@simon_mo_·7 May

Thank you @profjoeyg! Grateful for the last decade of inference research — excited for the next chapter at @inferact and @vllm_project ⚡

Simon Mo@simon_mo_·7 May

@mitsuhiko can you help us 🥺 what are the terrible decisions from your pov?

Armin Ronacher ⇌@mitsuhiko·6 May

Terrible decisions upstream lead to terrible consequences downstream. I hate everything.

Mario Zechner@badlogicgames

today was the day @mitsuhiko had his first look into the source code of a bunch of inference engines (not going to name names) and like me a few months ago, he now has stared into the abyss. we will never be the same.

Simon Mo@simon_mo_·7 May

@badlogicgames @mitsuhiko we tryin to make it cleaner and fast 🙇
