vLLM
@vllm_project

A high-throughput and memory-efficient inference and serving engine for LLMs. Join https://t.co/lxJ0SfX5pJ to discuss together with the community!

Joined March 2024
32 Following · 33.6K Followers
vLLM @vllm_project
📊 @RunPod's State of AI report — real production data from 500K developers: "vLLM has become the de facto standard for LLM serving, with half of text-only endpoints running vLLM variants." Thanks to everyone building with vLLM in production 🙏 Full report 👇
Runpod @runpod

The AI market looks nothing like the narrative. We have the production data to prove it. The State of AI: what 500K developers are actually running in 2026. Read the full report now 👇

vLLM @vllm_project
Boston, we’re coming. 🎉 Join the vLLM meetup on March 31 for an evening of deep technical sessions, live demos, and real conversations on LLM inference at scale. We’ll cover vLLM updates, model compression, speculative decoding, agentic AI, and distributed inference with llm-d + Kubernetes. Thanks to @RedHat, @IBM, @NVIDIAAI, The Open Accelerator, and @MITIBMLab for the support. Register: luma.com/4rmkrrb7
Red Hat AI @RedHat_AI

vLLM meetup is coming to Boston on March 31! Workshop + evening sessions covering:
- @vllm_project update
- Model compression and speculative decoding
- Agentic AI with vLLM
- Distributed inference at scale with @_llm_d_ and Kubernetes
Pre-event workshop at 3:30 PM: Deploy Llama 3.1 8B and benchmark llm-d's cache-aware routing live.
Shoutout to our sponsors: @RedHat, @IBM, @NVIDIAAI, The Open Accelerator, and @MITIBMLab!
Register here 👇 luma.com/4rmkrrb7

vLLM @vllm_project
🎉 Congrats to @Baidu_Inc on Qianfan-OCR! Runs in vLLM:
`vllm serve baidu/Qianfan-OCR --trust-remote-code`
A 4B end-to-end document intelligence model topping OmniDocBench v1.5 (93.12), OlmOCR Bench, and KIE benchmarks — tables, LaTeX formulas, handwriting, 192 languages, with Layout-as-Thought for complex layouts.
Baidu Inc. @Baidu_Inc

🚀 Introducing Qianfan-OCR: a 4B-parameter end-to-end model for document intelligence. One model. No pipeline. Table extraction, formula recognition, chart understanding, and key information extraction, all in a single pass.
Paper: arxiv.org/abs/2603.13398
Models: huggingface.co/collections/ba…
🧵 Key results ↓

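A hedged sketch of a first client call against that command, assuming the default port (8000) and vLLM's OpenAI-compatible image input; the image URL and prompt are placeholders:

```bash
# Server started as in the tweet:
#   vllm serve baidu/Qianfan-OCR --trust-remote-code

# Send one page image through the OpenAI-compatible chat endpoint.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "baidu/Qianfan-OCR",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "image_url", "image_url": {"url": "https://example.com/page.png"}},
        {"type": "text", "text": "Extract this page as Markdown, preserving tables and LaTeX formulas."}
      ]
    }]
  }'
```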
vLLM @vllm_project
Great to see @AMD select vLLM as one of the designated inference frameworks for the GPU MODE Hackathon. 🎉
The challenge: push Kimi K2.5 1T FP4 end-to-end inference performance on 8× AMD Instinct MI355X — using vLLM or AMD ATOM. Grand prize: $650,000.
What makes this different: winning optimizations must be mergeable into AMD ATOM or vLLM upstream. Improvements that land in vLLM benefit the whole community.
Phase 1 (kernel optimization) runs through April 6. More details ⬇️
AMD @AMD

Join the GPU MODE Hackathon, sponsored by AMD, and push the boundaries of LLM inference performance on leading open models, optimized for AMD Instinct MI355X GPUs. Finalists will compete for the $1.1M total cash prize pool across two independent tracks, each focused on a specific model and inference stack. Learn more and get registered here: luma.com/cqq4mojz

vLLM reposted
Roger Wang @rogerw0108
So excited to see more research and breakthroughs in omni-modality! This is exactly why we are building vLLM-Omni github.com/vllm-project/v… for the next generation of intelligence! 🚀
Fuli Luo @_LuoFuli

MiMo-V2-Pro & Omni & TTS is out. Our first full-stack model family built truly for the Agent era. I call this a quiet ambush — not because we planned it, but because the shift from Chat to Agent paradigm happened so fast, even we barely believed it. Somewhere in between was a process that was thrilling, painful, and fascinating all at once.

The 1T base model started training months ago. The original goal was long-context reasoning efficiency. Hybrid Attention carries real innovation, without overreaching — and it turns out to be exactly the right foundation for the Agent era. 1M context window. MTP inference for ultra-low latency and cost. These architectural decisions weren't trendy. They were a structural advantage we built before we needed it.

What changed everything was experiencing a complex agentic scaffold — what I'd call orchestrated Context — for the first time. I was shocked on day one. I tried to convince the team to use it. That didn't work. So I gave a hard mandate: anyone on MiMo Team with fewer than 100 conversations tomorrow can quit. It worked. Once the team's imagination was ignited by what agentic systems could do, that imagination converted directly into research velocity.

People ask why we move so fast. I saw it firsthand building DeepSeek R1. My honest summary:
— Backbone and Infra research has long cycles. You need strategic conviction a year before it pays off.
— Posttrain agility is a different muscle: product intuition driving evaluation, iteration cycles compressed, paradigm shifts caught early.
— And the constant: curiosity, sharp technical instinct, decisive execution, full commitment — and something that's easy to underestimate: a genuine love for the world you're building for.

We will open-source — when the models are stable enough to deserve it. From Beijing, very late, not quite awake.

vLLM @vllm_project
Thanks @OpenBMB! MiniCPM-o 4.5 — a 9B omnimodal model with real-time vision, speech, and text — now runs natively in vLLM v0.17.0. 🎉
OpenBMB @OpenBMB

✨vLLM v0.17.0 is out with support for #MiniCPM-o 4.5! 🚀 Now you can serve the latest 9B #omnimodal model with vLLM’s high-throughput serving engine. For developers, this means scaling real-time, full-duplex conversations (vision, speech, and text) is now production-ready. @vllm_project 👏Check the release: github.com/vllm-project/v… #LLM #vLLM #MiniCPM #OpenSource

vLLM @vllm_project
🎉 Congrats to @MistralAI on releasing Mistral Small 4 — a 119B MoE model (6.5B active per token) that unifies instruct, reasoning, and coding in one checkpoint. Multimodal, 256K context. Day-0 support in vLLM — MLA attention backend, tool calling, and configurable reasoning mode, verified on @nvidia GPUs. 🔗 huggingface.co/mistralai/Mist…
Mistral AI for Developers @MistralDevs

🔥 Meet Mistral Small 4: One model to do it all.
⚡ 128 experts, 119B total parameters, 256k context window
⚡ Configurable Reasoning
⚡ Apache 2.0
⚡ 40% faster, 3x more throughput
Our first model to unify the capabilities of our flagship models into a single, versatile model.

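The tweet mentions a configurable reasoning mode. A minimal sketch of toggling it per request, assuming a Qwen3-style `chat_template_kwargs` switch (the checkpoint id and the exact kwarg name for this model are assumptions; vLLM does accept `chat_template_kwargs` in chat requests):

```bash
# Serving sketch; TP sizing for a 119B MoE is an assumption:
#   vllm serve mistralai/Mistral-Small-4 --tensor-parallel-size 4

# Toggle reasoning per request. "enable_thinking" mirrors the Qwen3-style
# chat-template kwarg; the kwarg this checkpoint actually uses may differ.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-Small-4",
    "messages": [{"role": "user", "content": "Plan a 3-step refactor of a parser."}],
    "chat_template_kwargs": {"enable_thinking": true}
  }'
```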
vLLM @vllm_project
Great to see @NVIDIA Dynamo 1.0 ship with native vLLM support! Disaggregated serving, agentic-aware routing, and topology-aware K8s scaling — exciting building blocks for production distributed inference. 🚀 Thanks to the @NVIDIAAIDev Dynamo team!
NVIDIA AI Developer @NVIDIAAIDev

Reasoning models are growing fast, and running them efficiently requires distributing workloads across multiple GPU nodes. NVIDIA Dynamo 1.0 delivers low-latency, high-throughput distributed inference for production AI deployments—while boosting NVIDIA Blackwell inference performance by up to 7x, lowering token cost, and expanding opportunities with free #OSS.
Built for production. Key features:
🔹 Disaggregated serving
🔹 Agentic-aware routing
🔹 Multimodal inference
🔹 Topology-aware Kubernetes scaling
🔹 SLA-ready quick deployments
Available now with native support for SGLang (@lmsysorg), TensorRT-LLM, and @vllm_project
👉 nvda.ws/47oyMMf

vLLM @vllm_project
P-EAGLE from @AmazonScience and @NVIDIAAIDev removes the sequential bottleneck in speculative decoding — all K draft tokens generated in a single forward pass.
📈 Up to 1.69x speedup over vanilla EAGLE-3 on NVIDIA B200, with 5-25% gains sustained at high concurrency (c=64).
How it works: EAGLE drafts tokens autoregressively (K tokens = K forward passes). P-EAGLE replaces this with parallel generation using learned mask tokens and shared hidden states — one pass, K tokens.
Pre-trained P-EAGLE heads available on @huggingface for GPT-OSS 120B, GPT-OSS 20B, and Qwen3-Coder 30B. Enable with one config change:
`--speculative-config '{"method": "eagle3", "model": "amazon/GPT-OSS-20B-P-EAGLE", "num_speculative_tokens": 7, "parallel_drafting": true}'`
Integrated in vLLM since v0.16.0. Blog: vllm.ai/blog/p-eagle
🤝 Thanks to the @amazon and @NVIDIAAI teams for the contribution!
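
Assembled into a full launch command, as a sketch (the speculative-config JSON is verbatim from the tweet; the target model id and single-node setup are assumptions):

```bash
# "parallel_drafting": true is the switch from K sequential draft passes
# to one parallel draft pass with learned mask tokens.
vllm serve openai/gpt-oss-20b \
  --speculative-config '{"method": "eagle3",
                         "model": "amazon/GPT-OSS-20B-P-EAGLE",
                         "num_speculative_tokens": 7,
                         "parallel_drafting": true}'
```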
Yushi Bai @realYushiBai
🧵 1/4 Still waiting for DeepSeek-V4? We (@Zai_org) made DSA 1.8× faster with minimal code change — and it's ready to deliver real inference gains on GLM-5. IndexCache removes 50% of indexer computations in DeepSeek Sparse Attention with virtually zero quality loss. On GLM-5 (744B), we get ~1.2× E2E speedup while matching the original across both long-context and reasoning tasks. On our experimental-sized 30B model, removing 75% of indexers gives 1.82× prefill and 1.48× decode speedup at 200K context. How? 🧵👇 #DeepSeek #GLM5 #Deepseekv4 #LLM #Inference #Efficiency #LongContext #MLSys #SparseAttention
vLLM @vllm_project
vLLM Production Stack now has an end-to-end deployment guide on @OracleCloud OKE 🚀 Self-hosted LLM inference on OCI bare metal GPUs (A10, A100, H100) — from provisioning to first request. OCI deployment scripts are contributed and maintained in the official production-stack repo. Great option for teams that need full control over GPU drivers, CUDA versions, and model configs while keeping cloud elasticity. Thanks @OracleDevs!
Oracle Developers @OracleDevs

This tutorial walks you through deploying the vLLM Production Stack on OKE—from infrastructure provisioning to running your first inference request. social.ora.cl/6012hNgEp

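The production stack is Helm-based; a minimal sketch of the flow the tutorial automates, assuming the repo and chart names from the production-stack README (the OCI scripts wrap node provisioning around this):

```bash
# Assumes an OKE cluster with GPU nodes exists and kubectl points at it.
helm repo add vllm https://vllm-project.github.io/production-stack
# values.yaml selects models, GPU counts, and routing for your deployment.
helm install vllm vllm/vllm-stack -f values.yaml
kubectl get pods   # router + serving-engine pods should come up
```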
vLLM @vllm_project
Want to run @openclaw with your own model using vLLM? 🦞 It is surprisingly easy and fast:
1️⃣ Deploy the model with vLLM
2️⃣ Expose the OpenAI-compatible API
3️⃣ Point OpenClaw to the endpoint
Tool calling works out of the box, so it plugs nicely into OpenClaw agent workflows.
vLLM setup guide (using Kimi K2.5 from @Kimi_Moonshot as an example): docs.vllm.ai/projects/recip…
Quick Demo 👇
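
A sketch of steps 1️⃣–3️⃣ on a single node, with hedges: the Kimi K2.5 checkpoint id, the GPU count, and the K2.5 tool-call parser name (borrowed from the existing `kimi_k2` parser) are all assumptions; the linked recipe has the verified settings.

```bash
# 1️⃣ + 2️⃣ Deploy with vLLM; the OpenAI-compatible API listens on :8000 by default.
vllm serve moonshotai/Kimi-K2.5 \
  --tensor-parallel-size 8 \
  --enable-auto-tool-choice \
  --tool-call-parser kimi_k2   # parser name assumed from the K2 integration

# 3️⃣ Point OpenClaw (or any OpenAI-style client) at the endpoint:
export OPENAI_BASE_URL=http://localhost:8000/v1
export OPENAI_API_KEY=dummy   # vLLM needs no real key unless --api-key is set
```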
vLLM @vllm_project
🎉 Congrats to @nvidia on the release of Nemotron 3 Super — day-0 support in vLLM v0.17.1! Verified on NVIDIA GPUs.
120B hybrid MoE, only 12B active at inference. Big upgrades over the previous Nemotron Super:
- 5x higher throughput
- 2x higher accuracy on Artificial Analysis Intelligence Index
- Multi-Token Prediction (MTP) for faster long-form generation
- Configurable thinking budget — dial accuracy vs token cost per task
- 1M token context window
Supports BF16, FP8, and NVFP4. Fully open: weights, datasets, recipes.
Blog: vllm.ai/blog/nemotron-…
🤝 Thanks @NVIDIAAIDev Nemotron team and vLLM community contributors!
NVIDIA AI Developer @NVIDIAAIDev

Introducing NVIDIA Nemotron 3 Super 🎉
Open 120B-parameter (12B active) hybrid Mamba-Transformer MoE model
Native 1M-token context
Built for compute-efficient, high-accuracy multi-agent applications
Plus, fully open weights, datasets and recipes for easy customization and deployment. 🧵

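A serving sketch for reference; the Hugging Face id is a placeholder (see the blog for the real one), and the TP sizing is an assumption for a 120B-total / 12B-active MoE:

```bash
# Hypothetical checkpoint id; the FP8/NVFP4 variants would just swap it.
vllm serve nvidia/Nemotron-3-Super-120B \
  --tensor-parallel-size 8 \
  --max-model-len 1000000   # the advertised 1M context; trim to fit KV-cache memory
```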
vLLM @vllm_project
Great to see vLLM powering a fully local AI assistant on @nvidia Jetson 🦞 The OpenClaw tutorial shows how to serve MoE models like Nemotron 3 Nano 30B with vLLM on Jetson AGX — everything runs on-device, zero cloud APIs. Thanks to the @NVIDIARobotics Jetson team for putting this together!
NVIDIA Robotics @NVIDIARobotics

🦞 Want an always‑on personal assistant on your NVIDIA Jetson? Follow our step‑by‑step OpenClaw tutorial to run it fully local on your Jetson with zero cloud APIs. 👉 jetson-ai-lab.com/tutorials/open…

vLLM @vllm_project
🤖 New models: Qwen3.5, COLQwen3, ColModernVBERT, Ring 2.5, Ovis 2.6, Nemotron embed/rerank VL
🎙️ ASR: FunASR, FireRedASR2, Qwen3-ASR realtime streaming
📦 PyTorch 2.10 upgrade (breaking change for env deps)
🔗 Transformers v5 compatibility
Speculative decoding: Nemotron-H MTP, Eagle3 + disaggregated serving, Sparse MLA + MTP with full CUDA graphs, DP+EP support
⚠️ Known issue: CUDA 12.9+ users may hit CUBLAS_STATUS_INVALID_VALUE — see release notes for workarounds.
🔗 github.com/vllm-project/v…
vLLM @vllm_project
🔥 Kernel upgrades:
- FlashInfer Sparse MLA backend
- Triton-based top-k/top-p sampler kernels
- TRTLLM DSV3 Router GEMM: 6% batch-1 speedup
- Helion kernel framework with autotuning
🖥️ Hardware:
- NVIDIA SM100/SM120 optimizations (MXFP8, FP8 GEMM)
- AMD ROCm: AITER fused RoPE+KVCache, bitsandbytes, MXFP4 MoE on gfx950
- Intel XPU: CUDA graph support, GPUDirect RDMA via NIXL
- CPU: ARM BF16, KleidiAI INT8_W4A8, AVX2+AVX512 in one release
📡 API: Anthropic thinking blocks, count_tokens, Responses API structured outputs
vLLM @vllm_project
🚀 vLLM v0.17.0 is here! 699 commits from 272 contributors (48 new!) This is a big one. Highlights:
⚡ FlashAttention 4 integration
🧠 Qwen3.5 model family with GDN (Gated Delta Networks)
🏗️ Model Runner V2 maturation: Pipeline Parallel, Decode Context Parallel, Eagle3 + CUDA graphs
🎛️ New --performance-mode flag: balanced / interactivity / throughput
💾 Weight Offloading V2 with prefetching
🔀 Elastic Expert Parallelism Milestone 2
🔧 Quantized LoRA adapters (QLoRA) now loadable directly
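
The new flag in action, as a sketch; the three mode values come from the release highlights above, and the model id is a placeholder:

```bash
# Bias scheduling toward low per-request latency (chat-style workloads);
# "throughput" favors batch/offline serving, "balanced" sits in between.
vllm serve meta-llama/Llama-3.1-8B-Instruct --performance-mode interactivity
```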