EmbeddedLLM

485 posts


@EmbeddedLLM

Your open-source AI ally. We specialize in integrating LLMs into your business.

Joined October 2023
1.3K Following · 1K Followers
EmbeddedLLM retweeted
vLLM
vLLM@vllm_project·
Up to 18x interactivity improvement on Kimi K2.5 1T MXFP4 on AMD GPUs. All fixes and GEMM tuning upstreamed into vLLM 0.18.0. And more is coming. AMD's GPU MODE Hackathon has a $650K track dedicated to pushing Kimi K2.5 inference on MI355X with vLLM. Thanks to the @AMD team and the community for the collaboration 🙌
SemiAnalysis@SemiAnalysis_

18x IMPROVEMENT ALERT🚀 In under 30 days, AMD was able to improve Kimi K2.5 1T MXFP4 interactivity by up to 18x at iso-throughput. The main changes are in PR number 35850: AMD fixed their vLLM AITER integration to support the Kimi K2.5 MLA, which uses num_head=8 for TP8 and num_head=16 for TP4, along with general GEMM tuning. All of these bug fixes & perf tuning are upstreamed & already in the vLLM 0.18 release. Great work to Chuan Li & @AnushElangovan. Speed is the Moat 🔥

2 replies · 13 reposts · 99 likes · 8.8K views
EmbeddedLLM retweeted
vLLM
vLLM@vllm_project·
Great systematic study on speculative decoding in vLLM 🔬 Thorough and practical. A useful reference if you're choosing an SD strategy for your deployment. 👇
Jiaxiang Yu@jiaxiangyuu

Speculative decoding (SD) is widely used to speed up LLM inference, but how well does it actually work in production settings? We present, to the best of our knowledge, the first systematic evaluation of speculative decoding on a production-grade, widely deployed inference engine, @vllm_project . We benchmarked 5 speculative decoding methods, 4 models, 6 workloads, and batch sizes 1-128 on vLLM. Here's what we found. Paper: arxiv.org/abs/2601.11580 Blog: specdecode-bench.github.io Code: github.com/orgs/SpecDecod…

1 reply · 16 reposts · 133 likes · 17.4K views
EmbeddedLLM retweeted
vLLM
vLLM@vllm_project·
🎉 Congrats to @Cohere on releasing Cohere Transcribe, a 2B speech recognition model (Apache 2.0, 14 languages). Day-0 support in vLLM. Cohere contributed encoder-decoder serving optimizations to vLLM: variable-length encoder batching and packed attention for the decoder. Up to 2x throughput improvement for speech workloads, and these gains carry over to all encoder-decoder models on vLLM. Thanks to the @Cohere team for the contribution! PR 🔗 github.com/vllm-project/v… Blog 🔗 huggingface.co/blog/CohereLab…
Cohere@cohere

Introducing: Cohere Transcribe – a new state-of-the-art in open source speech recognition.

2 replies · 21 reposts · 201 likes · 15.6K views
EmbeddedLLM retweeted
vLLM
vLLM@vllm_project·
🎉 Congrats to @MistralAI on launching Voxtral 4B TTS — enterprise-grade TTS built for production voice agents. Day-0 support in vLLM Omni. 🌍 9 languages with natural prosody and emotional range 🎙️ 20 preset voices with easy adaptation to new ones ⚡ Ultra-low latency streaming, 24 kHz output in WAV/MP3/FLAC/AAC/Opus Blog 🔗 mistral.ai/news/voxtral-t… Model 🔗 huggingface.co/mistralai/Voxt…
Mistral AI@MistralAI

🔊Introducing Voxtral TTS: our new frontier open-weight model for natural, expressive, and ultra-fast text-to-speech 🎭Realistic, emotionally expressive speech. 🌍Supports 9 languages and accurately captures diverse dialects. ⚡Very low latency for time-to-first-audio. 🔄Easily adaptable to new voices

4 replies · 37 reposts · 323 likes · 22.5K views
EmbeddedLLM retweeted
vLLM
vLLM@vllm_project·
We rebuilt vLLM's execution core from the ground up — more efficient, more modular. Introducing Model Runner V2! 🔧 Modular design with cleaner abstractions ⚡️GPU-native input preparation 🔄 Async-first with zero CPU–GPU sync 🔋 New Triton-native sampler Already seeing notable gains in high-throughput and speculative decoding scenarios. No API changes — try it today: `export VLLM_USE_V2_MODEL_RUNNER=1`. 🔗 vllm.ai/blog/mrv2
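Opting in is a one-line change, per the announcement. A minimal sketch (the env var name comes from the post above; the model id is a placeholder, not from the source):

```shell
# Opt in to Model Runner V2 -- no API changes, per the announcement
export VLLM_USE_V2_MODEL_RUNNER=1

# Then serve as usual (model id is a placeholder for illustration)
vllm serve meta-llama/Llama-3.1-8B-Instruct
```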
12 replies · 52 reposts · 425 likes · 35.8K views
EmbeddedLLM retweeted
vLLM
vLLM@vllm_project·
Missed our live talk at #GTC2026? Here's what you need to know. 👇 vLLM in 2026: Architectural Challenges and Performance Optimizations, by @woosuk_k - Model Runner V2 (MRV2): GPU-native Triton kernels replace CPU PyTorch ops - Hybrid Memory Allocator: 0–12% memory waste across OSS models - Encoder Prefill Disaggregation: up to 2.5x P99 throughput for multimodal workloads - ModularKernel for MoE: mix-and-match GEMM + all-to-all kernels - Case study: Kimi K2.5 (NVFP4) on GB200 🔗 Slides: docs.google.com/presentation/d… #vLLM #GTC2026 #LLMInference #NVIDIA
2 replies · 15 reposts · 93 likes · 6.6K views
EmbeddedLLM retweeted
vLLM
vLLM@vllm_project·
vLLM v0.18.0 is out! 445 commits from 213 contributors (61 new). 🎉 What's new: gRPC serving, GPU-less multimodal render, NGram spec decode on GPU, Elastic EP Milestone 2, FlashInfer 0.6.6, Responses API streaming tool calls. Thread 👇
11 replies · 42 reposts · 402 likes · 23.7K views
EmbeddedLLM
EmbeddedLLM@EmbeddedLLM·
Incredible validation from Runpod's State of AI report today: vLLM is the de facto standard for LLM serving. Huge congrats to the entire vLLM community.
vLLM@vllm_project

📊 @RunPod's State of AI report — real production data from 500K developers: "vLLM has become the de facto standard for LLM serving, with half of text-only endpoints running vLLM variants." Thanks to everyone building with vLLM in production 🙏 Full report 👇

0 replies · 0 reposts · 1 like · 113 views
EmbeddedLLM retweeted
vLLM
vLLM@vllm_project·
Great to see @AMD select vLLM as one of the designated inference frameworks for the GPU MODE Hackathon. 🎉 The challenge: push Kimi K2.5 1T FP4 end-to-end inference performance on 8× AMD Instinct MI355X — using vLLM or AMD ATOM. Grand prize: $650,000. What makes this different: winning optimizations must be mergeable into AMD ATOM or vLLM upstream. Improvements that land in vLLM benefit the whole community. Phase 1 (kernel optimization) runs through April 6. More details ⬇️
AMD@AMD

Join the GPU MODE Hackathon, sponsored by AMD, and push the boundaries of LLM inference performance on leading open models, optimized for AMD Instinct MI355X GPUs. Finalists will compete for the $1.1M total cash prize pool across two independent tracks, each focused on a specific model and inference stack. Learn more and get registered here: luma.com/cqq4mojz

3 replies · 19 reposts · 127 likes · 16.3K views
EmbeddedLLM
EmbeddedLLM@EmbeddedLLM·
Ever wonder why your LLM responses slow down if you send a request exactly at 8:00 or 9:00? 🤔 Blame the 🦞. OpenClaw demand is completely breaking standard inference ops: everyone's cron jobs fire at the exact same minute, creating brutal traffic spikes. Just an 8% uncached QPS burst explodes into a massive 26% more prefill load hitting the cluster all at once.
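The amplification comes from uncached requests paying for the full prompt at prefill time, while cache hits only pay for the uncached suffix. A toy sketch of the mechanism (all numbers here are illustrative assumptions, not the workload behind the 8% and 26% figures in the post):

```python
# Why a small uncached-QPS burst amplifies prefill load disproportionately.
# Assumed, illustrative workload parameters:
PROMPT_TOKENS = 4000         # full prompt length per request
CACHED_PREFIX_TOKENS = 3000  # tokens skipped on a prefix-cache hit

def prefill_tokens_per_sec(qps_cached: float, qps_uncached: float) -> float:
    """Prefill work: uncached requests pay for the whole prompt,
    cached ones only for the uncached suffix."""
    cached_cost = PROMPT_TOKENS - CACHED_PREFIX_TOKENS
    return qps_cached * cached_cost + qps_uncached * PROMPT_TOKENS

# Steady state: 100 QPS, 10% of it uncached.
baseline = prefill_tokens_per_sec(qps_cached=90, qps_uncached=10)

# Cron-style spike: +8 QPS (an 8% burst), all of it cold/uncached prompts.
spiked = prefill_tokens_per_sec(qps_cached=90, qps_uncached=10 + 8)

print(f"prefill load increase: {spiked / baseline - 1:.0%}")
```

Under these assumed numbers an 8% QPS burst translates into roughly a quarter more prefill work, because every added request is a cache miss carrying the full prompt.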
0 replies · 2 reposts · 5 likes · 186 views
EmbeddedLLM retweeted
the tiny corp
the tiny corp@__tinygrad__·
AMD open sourced rocprof-trace-decoder! This was one of the last pieces of closed source code on the CPU side -- the definitions of the hardware SQTT traces are now public. AMD's tracing infrastructure is better than NVIDIA's, it can trace the timing of every instruction.
12 replies · 54 reposts · 1.2K likes · 51.7K views
EmbeddedLLM retweeted
vLLM
vLLM@vllm_project·
@AIatAMD and @EmbeddedLLM built 7 attention backends for vLLM on ROCm — and animated the internals. Shuffled KV cache layouts. Batch reordering. Log-sum-exp merging across chunks. This is how ROCM_AITER_FA gets 4.4x decode throughput on AMD GPUs 👇
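Selecting among vLLM's attention backends is done via an environment variable override. A minimal sketch (the backend name ROCM_AITER_FA comes from the post; VLLM_ATTENTION_BACKEND is vLLM's standard backend-override variable; the model id is a placeholder):

```shell
# Force the AITER FlashAttention backend on ROCm
export VLLM_ATTENTION_BACKEND=ROCM_AITER_FA

# Serve any supported model (placeholder id for illustration)
vllm serve Qwen/Qwen2.5-7B-Instruct
```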
1 reply · 5 reposts · 58 likes · 3.7K views
EmbeddedLLM retweeted
vLLM
vLLM@vllm_project·
NVIDIA published a tutorial for deploying Cosmos Reason 2B on Jetson using vLLM — covering AGX Thor, AGX Orin, and Orin Super Nano. FP8 quantized VLM with chain-of-thought reasoning, served via `vllm serve` and connected to a real-time webcam UI for interactive vision analysis. Great to see vLLM powering edge inference on Jetson. 🙏 Thanks to the @NVIDIARobotics Jetson team! 🔗 huggingface.co/blog/nvidia/co…
NVIDIA Robotics@NVIDIARobotics

Want to bring open-source vision language models to the edge? 💻 Check out our @huggingface article on deploying NVIDIA Cosmos Reasoning 2B across the NVIDIA Jetson family with vLLM and a Live VLM WebUI. 📖 nvda.ws/3P5tLS4

6 replies · 25 reposts · 196 likes · 21K views
EmbeddedLLM retweeted
GPU MODE
GPU MODE@GPU_MODE·
PTX is all you need
13 replies · 8 reposts · 183 likes · 17.9K views
EmbeddedLLM
EmbeddedLLM@EmbeddedLLM·
BREAKING: The CUDA moat has just shrunk again! @SemiAnalysis_ called out NVIDIA's cuTile + Python CuTeDSL landing in PyTorch Inductor as another moat expansion. @AMD's answer? FlyDSL: a full CuTe-style Python DSL with explicit tile/layout algebra, hierarchical MFMA control, and MLIR speed. Seamless NVIDIA kernel ports. Peak roofline on MI300X/MI350. @AIatAMD Check out the blog: rocm.blogs.amd.com/software-tools…
SemiAnalysis@SemiAnalysis_

BREAKING: The CUDA moat has just expanded again! PyTorch Compile/Inductor can now target NVIDIA Python CuTeDSL in addition to Triton. This enables 2x faster FlexAttention compared to Triton implementations. We explain below 👇 As we explained in our April 2025 AMD 2.0 piece, Python DSLs for kernel authoring represent the future, not C++ templates. NVIDIA has been massively supporting their closed-source Python CuTeDSL, cuTile, and TileIR ecosystem. By having Python CuTeDSL/cuTile/TileIR, NVIDIA regains closed-source compiler optimization passes, whereas in Triton, the middle-level IR optimization passes are open source. Furthermore, Triton currently lacks strong Blackwell performance as it doesn't yet support Cluster Launch Control and 2SM MMA with TMA multicast. Triton IR will be able to target TileIR too. While Gluon attempts to address this, it remains a work in progress. Google has also integrated an experimental backend for TorchInductor to target Pallas for codegen. It's unclear when AMD will release/integrate Wave DSL or ComposableKernel Python DSL into Torch Inductor as a codegen target.

0 replies · 2 reposts · 9 likes · 533 views
EmbeddedLLM retweeted
vLLM
vLLM@vllm_project·
🔥Congrats to @Zai_org on launching GLM-5 — 744B parameters (40B active), trained on 28.5T tokens, integrating DeepSeek Sparse Attention to keep deployment cost manageable while preserving long-context capacity. vLLM has day-0 support for GLM-5-FP8 with: 📖 DeepSeek Sparse Attention for efficient long-context serving ⚡️ MTP speculative decoding ⚙️ Tool calling + thinking mode Recipe with serving configs and benchmarks: 🔗 docs.vllm.ai/projects/recip…
14 replies · 25 reposts · 379 likes · 42.3K views
EmbeddedLLM retweeted
vLLM
vLLM@vllm_project·
🚀 vLLM just hit 70K GitHub stars! 🎉 The engine has kept evolving fast since the last milestone. We've been pushing hard on large-scale serving — production-grade multi-node support on NVIDIA Blackwell with WideEP and expert parallelism, making it practical to serve the biggest models at scale. More models, more hardware, async scheduling for higher throughput, real-time streaming for speech and audio, and a growing multimodal story across text, vision, video, and voice. Huge thanks to our sponsors, our 2,100+ contributors, friends at @PyTorch, @huggingface Transformers, and the model labs we work closely with to bring day-0 support — @deepseek_ai, @Alibaba_Qwen, @MiniMax_AI, @Kimi_Moonshot, @MistralAI, and partners @NVIDIAAIDev, @RedHat_AI, @AIatAMD, @AIatMeta, and many more we can't fit here — all helping bring vLLM to more platforms and more people. You make this ecosystem what it is. 💛💙 Also during this time, @inferact was founded by the creators and core maintainers of vLLM, dedicated to growing vLLM and making inference cheaper and faster. On to the next chapter — together. Easy, fast, and cheap LLM serving for everyone. 🌍
12 replies · 12 reposts · 114 likes · 8.1K views
EmbeddedLLM retweeted
vLLM
vLLM@vllm_project·
🎉🎉🎉 Congrats to @StepFun_ai on releasing Step 3.5 Flash, and day-0 support is ready in vLLM! A 196B MoE that activates only 11B params per token, giving you frontier reasoning with exceptional efficiency. Highlights: • 74.4% SWE-bench Verified, 51.0% Terminal-Bench 2.0 • 256K context with 3:1 Sliding Window Attention for cost-efficient long context • Built for coding agents and long-horizon agentic tasks Check out our detailed deployment recipe below 👇 🔗docs.vllm.ai/projects/recip…
StepFun@StepFun_ai

⚡️ Step 3.5 Flash is coming: Fast Enough to Think. Reliable Enough to Act! We're dropping our most capable open-source foundation model yet. Frontier reasoning meets extreme efficiency. It leverages a sparse Mixture of Experts (MoE) architecture, 196B total → 11B active. Key Capabilities: ✅ Reasoning at Speed: MTP-3 powered throughput at 100–300 tok/s (350 tok/s peak for single-stream coding tasks). ✅ Agentic Power: ⚡️ 74.4% SWE-bench Verified ⚡️ 51.0% Terminal-Bench 2.0. Proven stability for complex, long-horizon tasks. ✅ 256K Efficient Context: 3:1 SWA ratio + Full Attention. Massive datasets and long codebases supported with minimal overhead. Consistent performance, hybrid efficiency. ✅ Local-First Deployment: Optimized for Mac Studio M4 Max, NVIDIA DGX Spark. Secure, private, and frontier-capable. Your data, your hardware, your agent. You can try Step 3.5 Flash right now: 👉 OpenRouter: openrouter.ai/stepfun/step-3… 👉 GitHub: github.com/stepfun-ai/Ste… 👉 HuggingFace: huggingface.co/stepfun-ai/Ste… 👉 Blog: static.stepfun.com/blog/step-3.5-… 👉 ModelScope: modelscope.cn/models/stepfun… 🌌 The Next: Step 4 training is officially LIVE! We're calling on the world's boldest builders to co-create Step 4 right now. Let's define the Agentic Era together! Join our Discord: discord.gg/RcMJhNVAQc

5 replies · 22 reposts · 263 likes · 16.5K views
EmbeddedLLM retweeted
vLLM
vLLM@vllm_project·
🚀 vLLM v0.15.0 is here! 335 commits from 158 contributors (39 new!) Highlights: ⚡ Async scheduling + Pipeline Parallelism 🧠 Mamba prefix caching (~2x speedup) ⚫ Blackwell FP4 65% faster 🟥 AMD RDNA3/RDNA4 consumer GPU support 🤖 Kimi-K2.5, Molmo2, Eagle2.5-8B VLM, EAGLE3 speculative decoding More updates: 🔗 github.com/vllm-project/v…
12 replies · 53 reposts · 496 likes · 38.7K views