EmbeddedLLM

485 posts


@EmbeddedLLM

Your open-source AI ally. We specialize in integrating LLMs into your business.

Joined October 2023
1.3K Following · 1K Followers
EmbeddedLLM retweeted
vLLM
vLLM@vllm_project·
Up to 18x interactivity improvement on Kimi K2.5 1T MXFP4 on AMD GPUs. All fixes and GEMM tuning upstreamed into vLLM 0.18.0. And more is coming. AMD's GPU MODE Hackathon has a $650K track dedicated to pushing Kimi K2.5 inference on MI355X with vLLM. Thanks to the @AMD team and the community for the collaboration 🙌
SemiAnalysis@SemiAnalysis_

18x IMPROVEMENT ALERT🚀 In under 30 days, AMD was able to improve Kimi K2.5 1T MXFP4 interactivity by up to 18x at iso-throughput. The main changes are in PR number 35850: AMD fixed their vLLM AITER integration to support the Kimi K2.5 MLA, which uses num_head=8 for TP8 and num_head=16 for TP4, along with general GEMM tuning. All of these bug fixes & perf tuning are upstreamed & already in the vLLM 0.18 release. Great work to Chuan Li & @AnushElangovan. Speed is the Moat 🔥

2 replies · 13 reposts · 99 likes · 8.8K views
EmbeddedLLM retweeted
vLLM
vLLM@vllm_project·
Great systematic study on speculative decoding in vLLM 🔬 Thorough and practical. A useful reference if you're choosing an SD strategy for your deployment. 👇
Jiaxiang Yu@jiaxiangyuu

Speculative decoding (SD) is widely used to speed up LLM inference, but how well does it actually work in production settings? We present, to the best of our knowledge, the first systematic evaluation of speculative decoding on a production-grade, widely deployed inference engine, @vllm_project . We benchmarked 5 speculative decoding methods, 4 models, 6 workloads, and batch sizes 1-128 on vLLM. Here's what we found. Paper: arxiv.org/abs/2601.11580 Blog: specdecode-bench.github.io Code: github.com/orgs/SpecDecod…

1 reply · 16 reposts · 133 likes · 17.4K views
EmbeddedLLM retweeted
vLLM
vLLM@vllm_project·
🎉 Congrats to @Cohere on releasing Cohere Transcribe, a 2B speech recognition model (Apache 2.0, 14 languages). Day-0 support in vLLM. Cohere contributed encoder-decoder serving optimizations to vLLM: variable-length encoder batching and packed attention for the decoder. Up to 2x throughput improvement for speech workloads, and these gains carry over to all encoder-decoder models on vLLM. Thanks to the @Cohere team for the contribution! PR 🔗 github.com/vllm-project/v… Blog 🔗 huggingface.co/blog/CohereLab…
Cohere@cohere

Introducing: Cohere Transcribe – a new state-of-the-art in open source speech recognition.

2 replies · 21 reposts · 201 likes · 15.6K views
EmbeddedLLM retweeted
vLLM
vLLM@vllm_project·
🎉 Congrats to @MistralAI on launching Voxtral 4B TTS — enterprise-grade TTS built for production voice agents. Day-0 support in vLLM Omni. 🌍 9 languages with natural prosody and emotional range 🎙️ 20 preset voices with easy adaptation to new ones ⚡ Ultra-low latency streaming, 24 kHz output in WAV/MP3/FLAC/AAC/Opus Blog 🔗 mistral.ai/news/voxtral-t… Model 🔗 huggingface.co/mistralai/Voxt…
Mistral AI@MistralAI

🔊Introducing Voxtral TTS: our new frontier open-weight model for natural, expressive, and ultra-fast text-to-speech 🎭Realistic, emotionally expressive speech. 🌍Supports 9 languages and accurately captures diverse dialects. ⚡Very low latency for time-to-first-audio. 🔄Easily adaptable to new voices

4 replies · 37 reposts · 323 likes · 22.5K views
EmbeddedLLM retweeted
vLLM
vLLM@vllm_project·
We rebuilt vLLM's execution core from the ground up — more efficient, more modular. Introducing Model Runner V2! 🔧 Modular design with cleaner abstractions ⚡️GPU-native input preparation 🔄 Async-first with zero CPU–GPU sync 🔋 New Triton-native sampler Already seeing notable gains in high-throughput and speculative decoding scenarios. No API changes — try it today: `export VLLM_USE_V2_MODEL_RUNNER=1`. 🔗 vllm.ai/blog/mrv2
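Opting in is a one-line change, per the announcement. A minimal sketch (the env var name comes from the post above; the model id is a placeholder, not from the source):

```shell
# Opt in to Model Runner V2 -- no API changes, per the announcement
export VLLM_USE_V2_MODEL_RUNNER=1

# Then serve as usual (model id is a placeholder for illustration)
vllm serve meta-llama/Llama-3.1-8B-Instruct
```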
12 replies · 52 reposts · 425 likes · 35.8K views
EmbeddedLLM retweeted
vLLM
vLLM@vllm_project·
Missed our live talk at #GTC2026? Here's what you need to know. 👇 vLLM in 2026: Architectural Challenges and Performance Optimizations, by @woosuk_k - Model Runner V2 (MRV2): GPU-native Triton kernels replace CPU PyTorch ops - Hybrid Memory Allocator: 0–12% memory waste across OSS models - Encoder Prefill Disaggregation: up to 2.5x P99 throughput for multimodal workloads - ModularKernel for MoE: mix-and-match GEMM + all-to-all kernels - Case study: Kimi K2.5 (NVFP4) on GB200 🔗 Slides: docs.google.com/presentation/d… #vLLM #GTC2026 #LLMInference #NVIDIA
2 replies · 15 reposts · 93 likes · 6.6K views
EmbeddedLLM retweeted
vLLM
vLLM@vllm_project·
vLLM v0.18.0 is out! 445 commits from 213 contributors (61 new). 🎉 What's new: gRPC serving, GPU-less multimodal render, NGram spec decode on GPU, Elastic EP Milestone 2, FlashInfer 0.6.6, Responses API streaming tool calls. Thread 👇
11 replies · 42 reposts · 402 likes · 23.7K views
EmbeddedLLM
EmbeddedLLM@EmbeddedLLM·
Incredible validation from Runpod's State of AI report today: vLLM is the de facto standard for LLM serving. Huge congrats to the entire vLLM community.
vLLM@vllm_project

📊 @RunPod's State of AI report — real production data from 500K developers: "vLLM has become the de facto standard for LLM serving, with half of text-only endpoints running vLLM variants." Thanks to everyone building with vLLM in production 🙏 Full report 👇

0 replies · 0 reposts · 1 like · 113 views
EmbeddedLLM retweeted
vLLM
vLLM@vllm_project·
Great to see @AMD select vLLM as one of the designated inference frameworks for the GPU MODE Hackathon. 🎉 The challenge: push Kimi K2.5 1T FP4 end-to-end inference performance on 8× AMD Instinct MI355X — using vLLM or AMD ATOM. Grand prize: $650,000. What makes this different: winning optimizations must be mergeable into AMD ATOM or vLLM upstream. Improvements that land in vLLM benefit the whole community. Phase 1 (kernel optimization) runs through April 6. More details ⬇️
AMD@AMD

Join the GPU MODE Hackathon, sponsored by AMD, and push the boundaries of LLM inference performance on leading open models, optimized for AMD Instinct MI355X GPUs. Finalists will compete for the $1.1M total cash prize pool across two independent tracks, each focused on a specific model and inference stack. Learn more and get registered here: luma.com/cqq4mojz

3 replies · 19 reposts · 127 likes · 16.3K views
EmbeddedLLM
EmbeddedLLM@EmbeddedLLM·
Ever wonder why your LLM responses slow down if you send a request exactly at 8:00 or 9:00? 🤔 Blame the 🦞. OpenClaw demand is completely breaking standard inference ops: everyone's cron jobs fire at the exact same minute, creating brutal traffic spikes. Just an 8% uncached QPS burst explodes into a massive 26% more prefill load hitting the cluster all at once.
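The amplification comes from uncached requests paying for the full prompt at prefill time, while cache hits only pay for the uncached suffix. A toy sketch of the mechanism (all numbers here are illustrative assumptions, not the workload behind the 8% and 26% figures in the post):

```python
# Why a small uncached-QPS burst amplifies prefill load disproportionately.
# Assumed, illustrative workload parameters:
PROMPT_TOKENS = 4000         # full prompt length per request
CACHED_PREFIX_TOKENS = 3000  # tokens skipped on a prefix-cache hit

def prefill_tokens_per_sec(qps_cached: float, qps_uncached: float) -> float:
    """Prefill work: uncached requests pay for the whole prompt,
    cached ones only for the uncached suffix."""
    cached_cost = PROMPT_TOKENS - CACHED_PREFIX_TOKENS
    return qps_cached * cached_cost + qps_uncached * PROMPT_TOKENS

# Steady state: 100 QPS, 10% of it uncached.
baseline = prefill_tokens_per_sec(qps_cached=90, qps_uncached=10)

# Cron-style spike: +8 QPS (an 8% burst), all of it cold/uncached prompts.
spiked = prefill_tokens_per_sec(qps_cached=90, qps_uncached=10 + 8)

print(f"prefill load increase: {spiked / baseline - 1:.0%}")
```

Under these assumed numbers an 8% QPS burst translates into roughly a quarter more prefill work, because every added request is a cache miss carrying the full prompt.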
0 replies · 2 reposts · 5 likes · 186 views
EmbeddedLLM retweeted
the tiny corp
the tiny corp@__tinygrad__·
AMD open sourced rocprof-trace-decoder! This was one of the last pieces of closed source code on the CPU side -- the definitions of the hardware SQTT traces are now public. AMD's tracing infrastructure is better than NVIDIA's, it can trace the timing of every instruction.
12 replies · 54 reposts · 1.2K likes · 51.7K views
EmbeddedLLM retweeted
vLLM
vLLM@vllm_project·
@AIatAMD and @EmbeddedLLM built 7 attention backends for vLLM on ROCm — and animated the internals. Shuffled KV cache layouts. Batch reordering. Log-sum-exp merging across chunks. This is how ROCM_AITER_FA gets 4.4x decode throughput on AMD GPUs 👇
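Selecting among vLLM's attention backends is done via an environment variable override. A minimal sketch (the backend name ROCM_AITER_FA comes from the post; VLLM_ATTENTION_BACKEND is vLLM's standard backend-override variable; the model id is a placeholder):

```shell
# Force the AITER FlashAttention backend on ROCm
export VLLM_ATTENTION_BACKEND=ROCM_AITER_FA

# Serve any supported model (placeholder id for illustration)
vllm serve Qwen/Qwen2.5-7B-Instruct
```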
1 reply · 5 reposts · 58 likes · 3.7K views
EmbeddedLLM retweeted
vLLM
vLLM@vllm_project·
NVIDIA published a tutorial for deploying Cosmos Reason 2B on Jetson using vLLM — covering AGX Thor, AGX Orin, and Orin Super Nano. FP8 quantized VLM with chain-of-thought reasoning, served via `vllm serve` and connected to a real-time webcam UI for interactive vision analysis. Great to see vLLM powering edge inference on Jetson. 🙏 Thanks to the @NVIDIARobotics Jetson team! 🔗 huggingface.co/blog/nvidia/co…
NVIDIA Robotics@NVIDIARobotics

Want to bring open-source vision language models to the edge? 💻 Check out our @huggingface article on deploying NVIDIA Cosmos Reasoning 2B across the NVIDIA Jetson family with vLLM and a Live VLM WebUI. 📖 nvda.ws/3P5tLS4

6 replies · 25 reposts · 196 likes · 21K views
EmbeddedLLM retweeted
GPU MODE
GPU MODE@GPU_MODE·
PTX is all you need
13 replies · 8 reposts · 183 likes · 17.9K views
EmbeddedLLM
EmbeddedLLM@EmbeddedLLM·
BREAKING: The CUDA moat has just shrunk again! @SemiAnalysis_ called out NVIDIA's cuTile + Python CuTeDSL landing in PyTorch Inductor as another moat expansion. @AMD's answer? FlyDSL: a full CuTe-style Python DSL with explicit tile/layout algebra, hierarchical MFMA control, and MLIR speed. Seamless NVIDIA kernel ports. Peak roofline on MI300X/MI350. @AIatAMD Check out the blog: rocm.blogs.amd.com/software-tools…
SemiAnalysis@SemiAnalysis_

BREAKING: The CUDA moat has just expanded again! PyTorch Compile/Inductor can now target NVIDIA Python CuTeDSL in addition to Triton. This enables 2x faster FlexAttention compared to Triton implementations. We explain below 👇 As we explained in our April 2025 AMD 2.0 piece, Python DSLs for kernel authoring represent the future, not C++ templates. NVIDIA has been massively supporting their closed-source Python CuTeDSL, cuTile, and TileIR ecosystem. By having Python CuTeDSL/cuTile/TileIR, NVIDIA regains closed-source compiler optimization passes, whereas in Triton, the middle-level IR optimization passes are open source. Furthermore, Triton currently lacks strong Blackwell performance as it doesn't yet support Cluster Launch Control and 2SM MMA with TMA multicast. Triton IR will be able to target TileIR too. While Gluon attempts to address this, it remains a work in progress. Google has also integrated an experimental backend for TorchInductor to target Pallas for codegen. It's unclear when AMD will release/integrate Wave DSL or ComposableKernel Python DSL into Torch Inductor as a codegen target.

0 replies · 2 reposts · 9 likes · 533 views
EmbeddedLLM retweeted
vLLM
vLLM@vllm_project·
🔥Congrats to @Zai_org on launching GLM-5 — 744B parameters (40B active), trained on 28.5T tokens, integrating DeepSeek Sparse Attention to keep deployment cost manageable while preserving long-context capacity. vLLM has day-0 support for GLM-5-FP8 with: 📖 DeepSeek Sparse Attention for efficient long-context serving ⚡️ MTP speculative decoding ⚙️ Tool calling + thinking mode Recipe with serving configs and benchmarks: 🔗 docs.vllm.ai/projects/recip…
14 replies · 25 reposts · 379 likes · 42.3K views
EmbeddedLLM retweeted
vLLM
vLLM@vllm_project·
🚀 vLLM just hit 70K GitHub stars! 🎉 The engine has kept evolving fast since the last milestone. We've been pushing hard on large-scale serving — production-grade multi-node support on NVIDIA Blackwell with WideEP and expert parallelism, making it practical to serve the biggest models at scale. More models, more hardware, async scheduling for higher throughput, real-time streaming for speech and audio, and a growing multimodal story across text, vision, video, and voice. Huge thanks to our sponsors, our 2,100+ contributors, friends at @PyTorch, @huggingface Transformers, and the model labs we work closely with to bring day-0 support — @deepseek_ai, @Alibaba_Qwen, @MiniMax_AI, @Kimi_Moonshot, @MistralAI, and partners @NVIDIAAIDev, @RedHat_AI, @AIatAMD, @AIatMeta, and many more we can't fit here — all helping bring vLLM to more platforms and more people. You make this ecosystem what it is. 💛💙 Also during this time, @inferact was founded by the creators and core maintainers of vLLM, dedicated to growing vLLM and making inference cheaper and faster. On to the next chapter — together. Easy, fast, and cheap LLM serving for everyone. 🌍
12 replies · 12 reposts · 114 likes · 8.1K views
EmbeddedLLM retweeted
vLLM
vLLM@vllm_project·
🎉🎉🎉 Congrats to @StepFun_ai on releasing Step 3.5 Flash, and day-0 support is ready in vLLM! A 196B MoE that activates only 11B params per token, giving you frontier reasoning with exceptional efficiency. Highlights: • 74.4% SWE-bench Verified, 51.0% Terminal-Bench 2.0 • 256K context with 3:1 Sliding Window Attention for cost-efficient long context • Built for coding agents and long-horizon agentic tasks Check out our detailed deployment recipe below 👇 🔗docs.vllm.ai/projects/recip…
StepFun@StepFun_ai

⚡️ Step 3.5 Flash is coming: Fast Enough to Think. Reliable Enough to Act! We're dropping our most capable open-source foundation model yet. Frontier reasoning meets extreme efficiency. It leverages a sparse Mixture of Experts (MoE) architecture, 196B total → 11B active. Key Capabilities: ✅ Reasoning at Speed: MTP-3 powered throughput at 100–300 tok/s (350 tok/s peak for single-stream coding tasks). ✅ Agentic Power: ⚡️ 74.4% SWE-bench Verified ⚡️ 51.0% Terminal-Bench 2.0. Proven stability for complex, long-horizon tasks. ✅ 256K Efficient Context: 3:1 SWA ratio + Full Attention. Massive datasets and long codebases supported with minimal overhead. Consistent performance, hybrid efficiency. ✅ Local-First Deployment: Optimized for Mac Studio M4 Max, NVIDIA DGX Spark. Secure, private, and frontier-capable. Your data, your hardware, your agent. You can try Step 3.5 Flash right now: 👉 OpenRouter: openrouter.ai/stepfun/step-3… 👉 GitHub: github.com/stepfun-ai/Ste… 👉 HuggingFace: huggingface.co/stepfun-ai/Ste… 👉 Blog: static.stepfun.com/blog/step-3.5-… 👉 ModelScope: modelscope.cn/models/stepfun… 🌌 The Next: Step 4 training is officially LIVE! We're calling on the world's boldest builders to co-create Step 4 right now. Let's define the Agentic Era together! Join our Discord: discord.gg/RcMJhNVAQc

5 replies · 22 reposts · 263 likes · 16.5K views
EmbeddedLLM retweeted
vLLM
vLLM@vllm_project·
🚀 vLLM v0.15.0 is here! 335 commits from 158 contributors (39 new!) Highlights: ⚡ Async scheduling + Pipeline Parallelism 🧠 Mamba prefix caching (~2x speedup) ⚫ Blackwell FP4 65% faster 🟥 AMD RDNA3/RDNA4 consumer GPU support 🤖 Kimi-K2.5, Molmo2, Eagle2.5-8B VLM, EAGLE3 speculative decoding More updates: 🔗 github.com/vllm-project/v…
12 replies · 53 reposts · 496 likes · 38.7K views