EmbeddedLLM

546 posts

EmbeddedLLM

@EmbeddedLLM

Your open-source AI ally. We are committed to making production-grade AI inference as accessible and reliable as electricity, powered by vLLM.

Katılım Ekim 2023

1.4K Takip Edilen1.1K Takipçiler

EmbeddedLLM@EmbeddedLLM·11h

Open models running agents end to end. No proprietary API in the loop. 🔥 An open /responses API for vLLM: stateful execution, server-side tool calls, works with Codex CLI out of the box. Watch @RedHat_AI demo the @vllm_project Agentic API, co-built by @EmbeddedLLM engineers with the teams at @RedHat_AI and @awscloud. Next: we close the loop. Agent traces become training signal. RL post-training with vLLM. Serve. Observe. Improve. Your infrastructure.

Red Hat AI@RedHat_AI

Codex CLI, running entirely on open models. No OpenAI API. Web search working. Multi-turn state intact. @franciscojarceo shows how vLLM Agentic API bridges the gap: Codex connects to the Agentic API, which forwards prompts and tools to vLLM, executes tool calls server-side (web search with zero data retention), and streams the final answer back. State rehydration via previous response ID means each follow-up only sends the new prompt, not the full context, which matters for prefix caching as conversations grow. The demo makes the point cleanly. Without web search, the model answers from its 2025 training cutoff and has no answer on GPT-5. With web search enabled, it finds July 2026 results with citations. The project is starting with Codex and the Responses API and expanding tool calling for broader agentic harness support. If you want @vllm_project to be the inference layer for your agents, this is the video to check out!

English

EmbeddedLLM retweetledi

vLLM@vllm_project·23h

🎉 Great to see the @AMD team for bringing ROCm support to vime, the vLLM ecosystem's RL post-training framework. End-to-end RL post-training now runs natively on AMD Instinct MI355X GPUs. vime uses vLLM as its rollout backend, so on ROCm it inherits the full vLLM rollout stack with no separate code path. The @AIatAMD team validated the pipeline end-to-end, upstreamed the ROCm-specific fixes, and shipped a prebuilt container so you can skip building from source. What works today: - GRPO training - Colocated and async (non-colocated) train/rollout - Megatron-LM training + vLLM rollout backends - Qwen3 dense and MoE models On MI355X, Qwen3-8B sustains ~4,100 tokens/gpu/s, and the train-rollout logprob diff holds low and stable. 🔗 vllm.ai/blog/2026-07-1…

English

8.6K

EmbeddedLLM retweetledi

vLLM@vllm_project·2d

vLLM v0.25.0 is out! 558 commits from 232 contributors (64 new). 🎉 Highlights: Model Runner V2 is now the default for all dense models, the legacy PagedAttention implementation is retired, the Transformers backend now runs as fast as native vLLM, a new unified Streaming Parser Engine, universal speculative decoding across heterogeneous vocabularies (TLI) plus new DSpark and DFlash drafters, and new models including Hy3 and Unlimited OCR. Thread 👇

English

621

87.5K

EmbeddedLLM retweetledi

vLLM@vllm_project·6d

Big news from @hmellor_ + @huggingface team🙌! In v0.25.0 the Transformers modeling backend hits parity with hand-written vLLM models. Now 450+ transformers architectures run in vLLM at native speed with zero porting. Integrate once with transformers to get vLLM's fused kernels, torch.compile, and CUDA graphs for free. Read about the changes below 👇

Harry Mellor@hmellor_

I have HUGE news about the Transformers modelling backend for @vllm_project v0.25.0 🚀 It has reached performance parity with native vLLM model implementations 🤯 The Transformers modelling backend has just become a zero-effort, zero-compromise way to deploy to vLLM!

English

270

29.2K

EmbeddedLLM retweetledi

Harry Mellor@hmellor_·6d

English

31.3K

EmbeddedLLM retweetledi

Red Hat AI@RedHat_AI·6d

Quantized checkpoints for GLM-5.2 have been created by the Red Hat AI team! huggingface.co/RedHatAI/GLM-5… This model was calibrated quantized using DDP + disk offloading in under 2 hours. The full precision model requires 1.6T of VRAM, but NVFP4 quantization of MoE layers and FP8 quantization of attention layers reduces the model size by >70% while maintaining state-of-the-art accuracy recovery on GPQA. Pair it with the new DSpark speculator for additional throughput: huggingface.co/RedHatAI/GLM-5…

English

392

34K

EmbeddedLLM retweetledi

vLLM@vllm_project·5d

🎉 Congrats to the @MosiAI_Official team on MOSS-Transcribe-Diarize-0.9B, an open, end-to-end model for multi-speaker long-audio transcription, with day-0 support in vLLM. Most setups chain ASR + diarization + alignment (WhisperX-style). This one does all three in a single generative pass. It transcribes the speech, tags who is speaking, and emits timestamps together: [0.11][S01] Good morning![1.03] [1.11][S02] Morning, guys![1.34] A Whisper-style audio encoder feeds a Qwen3-style causal decoder, so a recording up to ~90 minutes goes in as one shot, no chunking or stitching. Keyword biasing lets you prime names, product codes, and domain terms so proper nouns come out right. Useful for meeting notes, interviews, call-center QA, and podcast transcription. 🔗 recipes.vllm.ai/OpenMOSS-Team/…

MOSI@MosiAI_Official

🤗 MOSS-Transcribe-Diarize-0.9B is now open source on @huggingface. Built with an end-to-end audio-to-structured-transcript paradigm: >0.9B open-source ASR model >Apache license 2.0 >128k long-context transcription >Up to ~90-min audio input >Speaker labels + timestamps in one generation >Multi-speaker diarization for meetings, interruptions, and overlapping voices >Hotword biasing for names, terms, and domain-specific vocabulary >~100 token/s on NVIDIA RTX 4090, RTF ~0.017 Thank you @sgl_project @vllm_project @Prince_Canuma @lllucas for day-0 support! 🚀 Github: github.com/OpenMOSS/MOSS-… Huggingface: huggingface.co/spaces/OpenMOS… API: shorturl.at/DWwe3 Live demo: shorturl.at/wRZ3j Technical Report：arxiv.org/abs/2601.01554 HF Space: huggingface.co/spaces/OpenMOS… AtomGit：ai.atomgit.com/OpenMOSS/MOSS-… SGLang-Omni: github.com/sgl-project/sg… vLLM: github.com/vllm-project/v… MLX-audio: github.com/Blaizzy/mlx-au… Discord:discord.gg/SmVQHGffZU

English

216

24K

EmbeddedLLM retweetledi

vLLM@vllm_project·6 Tem

Spin it up now! 🚀

Tencent Hy@TencentHunyuan

🚀Hy3 is here. 295B MoE. Best in its size class. Rivals trillion-scale flagships. Reliable and affordable for most agentic usecases. Apache 2.0. Friendly for commercial use. FREE API for 2 weeks → openrouter.ai/tencent/hy3:fr… 🤗 huggingface.co/tencent/Hy3 📖 hy.tencent.com/research/hy3

English

198

32.9K

EmbeddedLLM retweetledi

vLLM@vllm_project·6 Tem

Congrats to @MistralAI on Leanstral 1.5! 🎉 An Apache-2.0 Lean 4 proof agent that punches way above its size: 🧩 MoE: 119B total, just 6B active 📐 100% on miniF2F 🎓 New SOTA on FATE-H (87%) & FATE-X (34%) ⚡ 587/672 on PutnamBench at ~$4/problem Read the blog below or serve it on vLLM today! huggingface.co/mistralai/Lean…

Mert Ünsal@mertunsal2020

Today, we are releasing Le Chaton L∃∀N, aka Leanstral 1.5. It achieves SOTA performance on graduate algebra benchmarks FATE-H and FATE-X and improves Pareto Frontier on PutnamBench, solving 587/672 problems with a x10 cheaper budget. 🧵

English

214

25.5K

EmbeddedLLM retweetledi

vLLM@vllm_project·1 Tem

vLLM v0.24.0 is out! 571 commits from 256 contributors (77 new). 🎉 Highlights: MiniMax-M3 support (FP8/MXFP4 + broad AMD tuning), DeepSeek-V4 keeps maturing (FlashInfer sparse index cache, prefill chunk-planning, now on SM120), Model Runner V2 now handles quantized models by default, a new unified Streaming Parser Engine for tool-calls + reasoning, DiffusionGemma, DeepEP v2 for wide expert parallelism, and a maturing Rust frontend. Thread 👇

English

519

52.3K

EmbeddedLLM retweetledi

vLLM@vllm_project·28 Haz

🎉 Unlimited-OCR from @Baidu_Inc now runs in vLLM. One-shot parsing of entire books with constant KV cache, powered by Reference Sliding Window Attention (R-SWA). 🧠 R-SWA keeps KV cache fixed throughout decoding — no memory blowup, no slowdown, no matter how long the output gets. 📄 Transcribe 40+ pages in a single forward pass under a 32K context budget, with remarkably low edit distance even at scale. 🪶 35% faster than DeepSeek-OCR at 6K output tokens, with fully constant TPS and GPU memory. 🔗 Recipe: recipes.vllm.ai/baidu/Unlimite… 🤗 Weights: huggingface.co/baidu/Unlimite… 🙏 Thanks to the @BaiduAI_News team for the collaboration.

English

877

81.9K

EmbeddedLLM retweetledi

vLLM@vllm_project·18 Haz

🎉 Congrats to @poolsideai on Laguna M.1, a new open-weights agentic coding model. Day-0 support landed in vLLM v0.21.0. 🧠 70-layer sparse MoE: 225B total params, 23B active per token, 256K context 🔀 256 experts with top-k=16 routing, built for long-horizon agentic coding 🛠️ Native interleaved reasoning between tool calls, toggleable per request, Apache 2.0 Recipe 🔗 recipes.vllm.ai/poolside/Lagun…

Poolside@poolsideai

Today we’re releasing the weights for Laguna M.1, our most capable model to date, with a 256K context length. Both base and post-trained checkpoints are now available on Hugging Face under Apache 2.0.

English

176

16.6K

EmbeddedLLM retweetledi

vLLM@vllm_project·18 Haz

Your coding agent can run on open models you host yourself, not just a hosted API. vLLM serves them fast and cost-efficiently on your own GPUs, with broad hardware support across @NVIDIA, @AMD, and more. It speaks the same OpenAI Responses API that Codex uses, so any compatible agent points right at your server and any tool-calling model is a drop-in replacement. Spin up the latest GLM 5.2 (@Zai_org), Kimi K2.7 Code (@Kimi_Moonshot), or MiniMax M3 (@MiniMax_AI) model, or whatever open model fits your needs, and start coding. 🚀 Guide 🔗 docs.vllm.ai/en/latest/serv… Serving Recipe: recipes.vllm.ai

Tibo@thsottiaux

Reminder that you can use the Codex App, CLI and SDK with any open source model, not just with OpenAI models. #oss-mode-local-providers" target="_blank" rel="nofollow noopener">developers.openai.com/codex/config-a…

English

117

11.6K

EmbeddedLLM retweetledi

vLLM@vllm_project·18 Haz

Thanks for the kind words! Day 0 @MiniMax_AI M3 support came together thanks to this collaboration in the open. Big kudos to @rogerw0108 and @mgoin_ for the ongoing push, review, and mentorship. More improvements landing soon. 🙌 vllm.ai/blog/2026-06-1…

SemiAnalysis@SemiAnalysis_

Great work to @vllm_project team and @NVIDIA on smooth, out-of-the-box day 0 @MiniMax_AI M3 experience with @inferact EAGLE3 spec decode. Here are the details of ongoing M3 workstream: NVIDIA, Inferact and SemiAnalysis are working hard on enabling disaggregated inferencing (PR 45879), and the Inferact team is working on enabling FlashInfer M3 MoE kernels (PR 45723). Performance should be much better once those PRs land. Huge shoutout to @rogerw0108 & @mgoin_ and the maintainers for the rapid review and mentorship here!

English

5.7K

EmbeddedLLM retweetledi

vLLM@vllm_project·16 Haz

Great write-up from the @anyscalecompute team on PD disaggregation with Ray Serve + vLLM! PD Disagg is one of the most difficult techniques to get right in serving; the wins are real, but only in the right settings. Great to see it pressure-tested on AMD MI325X with Ray Serve + vLLM!

kourosh hakhamaneshi@CyrusHakha

One pattern we keep seeing with customers serving LLMs at scale: Prefill-decode disaggregation is often treated like a magic wand. But the reality is more nuanced. So we wrote down the core insights for when PD helps, when it does not, and validated them on AMD + vLLM — where the PD path has been much less paved. 🧵

English

107

12.8K

EmbeddedLLM retweetledi

vLLM@vllm_project·15 Haz

vLLM v0.23.0 is out! 408 commits from 200 contributors (63 new). 🎉 Highlights: DeepSeek-V4 matures across backends (TRTLLM-gen attention kernel, sparse MLA decoupled from V3.2, EPLB for the Mega-MoE), Model Runner V2 now default for Llama + Mistral dense models, Gemma 4 Unified (encoder-free) + MTP, a maturing Rust frontend, multi-tier KV cache offloading with an object-store tier, and a unified reasoning + tool-call parser. Thread 👇

English

457

39K

EmbeddedLLM@EmbeddedLLM·12 Haz

Singapore has come a long way. 🇸🇬 From AI adoption to AI infrastructure, the local ecosystem is now contributing to the layers production AI depends on: @PyTorch, @vllm_project, inference, sovereign AI, and open-source infra. Proud to see @RedHat_AI, @inferact, and @EmbeddedLLM building alongside APAC AI community.

PyTorch@PyTorch

The inaugural PyTorch Meetup Singapore brought together engineers, researchers, and community builders to talk about everything from vLLM project updates to the broader question of sovereign intelligence. Read the full technical recap and find presentation slides in our latest blog: bit.ly/4vdcPJU

English

581

EmbeddedLLM retweetledi

Kaichao You@KaichaoYou·12 Haz

Incredible collaboration from the team! Beyond basic inference support, we also have day-0 speculator and RL support🔥

vLLM@vllm_project

🎉 Congrats to @MiniMax_AI on releasing MiniMax M3! Frontier coding and agentic capabilities, native image and video input, computer use, and a 1M-token context window, all in a single open model. At the heart of M3 is MSA, a new sparse attention architecture: instead of attending densely over the full KV cache, each query scores 128-token KV blocks and runs attention only over the top blocks. That is what makes 1M-token context practical to serve. M3 runs in vLLM with day-0 support, verified on NVIDIA and AMD hardware: ✨ MSA sparse attention with dedicated prefill and decode kernels ✨ 1M-token context serving with prefix caching and chunked prefill ✨ BF16 and MXFP8 checkpoints, with MoE backends for both Hopper and Blackwell ✨ Native multimodal input (image + video) ✨ Tool calling, reasoning parsing, and thinking-mode control for agent workloads Day-0 support like this is a true team effort. Grateful to the teams at @MiniMax_AI, @NVIDIAAI, @AIatAMD, and @inferact, and to the vLLM community for making it happen. 🙏 Deep dive into the implementation, kernel work, and deployment recipes: 🔗 vllm.ai/blog/2026-06-1…

English

2.2K

EmbeddedLLM retweetledi

vLLM@vllm_project·12 Haz

MiniMax (official)@MiniMax_AI

MiniMax M3, Open-Weight, Now On Hugging Face , with only ~428B parameters and ~23B activated parameters Weights: huggingface.co/MiniMaxAI/Mini… MiniMax Sparse Attention: huggingface.co/papers/2606.13…

English

297

40.3K

EmbeddedLLM retweetledi

Simon Mo@simon_mo_·9 Haz

vime is a reference implementation for one reason only: make @vllm_project the best rollout engine for RL. This helps us better optimize vLLM for the whole ecosystem like @NovaSkyAI SkyRL, @PrimeIntellect Prime-RL, @nvidia NeMo-RL, @verl_project, and more! A wise man in leather jacket said: "We don't build PowerPoint slides and ship the chips. We build a whole data center. And until we get the whole data center built up, how do you know the software works? how do you know your fabric works?" - @NoPriorsPod

vLLM@vllm_project

Today we're excited to introduce vime — a simple, stable, and efficient RL framework for LLM post-training in the vLLM ecosystem. Built on slime's proven training design and powered by vLLM inference, vime brings another strong option to the growing vLLM post-training ecosystem. Our goal isn't a one-size-fits-all framework. We want users with different needs to find the right vLLM-ecosystem choice for their workflows—whether that's vime, NeMo RL, OpenRLHF, verl, or others. More choice. More interoperability. More innovation. Learn more: vllm.ai/blog/2026-06-0… #LLM #RLHF #PostTraining #vLLM

English

7.2K

Keşfet

@RedHat_AI @vllm_project @awscloud @AMD @AIatAMD @hmellor_ @huggingface @MosiAI_Official