

SGLang

@sgl_project
SGLang project https://t.co/2wrCfYIlBz. This is an alias account for SGLang; please follow @lmsysorg




We’re open-sourcing MiniMax M2: Agent & Code Native, at 8% of Claude Sonnet’s price and ~2x faster ⚡ FREE globally for a limited time via MiniMax Agent & API.
- Advanced Coding Capability: Engineered for end-to-end developer workflows, with strong capability across a wide range of applications (Claude Code, Cursor, Cline, Kilo Code, Droid, etc.)
- High Agentic Performance: Robust handling of long-horizon toolchains (MCP, shell, browser, retrieval, code)
- Smarter, Faster, Cheaper: Efficient parameter activation



⚡ Zero-overhead scheduler for speculative decoding ⚡ When your GPUs are running LLM inference, unoptimized software wastes a huge amount of time on CPU overhead, such as kernel launches and metadata bookkeeping. SGLang has been pioneering a zero-overhead CPU runtime for LLM inference since last year. Now we have also carefully tuned the scheduler for speculative decoding and are seeing a 10%–20% speedup across the board. This improvement has been tested by the @googlecloud Vertex AI team, and we welcome more people to join our development. See the roadmap below ⬇️
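The core idea behind hiding CPU overhead is overlap: prepare the metadata and launch arguments for step k+1 on the CPU while the GPU is still executing step k. Here is a toy sketch of that overlap pattern using a background thread and a bounded queue. This is an illustration only, not SGLang's actual scheduler; `cpu_prepare` and the loop body are hypothetical stand-ins.

```python
import queue
import threading

def cpu_prepare(step):
    # Stand-in for per-step CPU work: kernel-launch args, metadata bookkeeping
    return {"step": step, "args": f"launch-args-{step}"}

def run_overlapped(num_steps):
    """Prepare step k+1 on the CPU while the 'GPU' consumes step k."""
    prepared = queue.Queue(maxsize=1)  # lets the CPU run one step ahead
    results = []

    def scheduler():
        for step in range(num_steps):
            prepared.put(cpu_prepare(step))  # overlaps with GPU execution
        prepared.put(None)  # sentinel: no more work

    threading.Thread(target=scheduler, daemon=True).start()
    while (batch := prepared.get()) is not None:
        # Stand-in for the GPU forward pass; real code enqueues CUDA work here
        results.append(batch["step"])
    return results
```

With this structure the consumer never waits on per-step CPU bookkeeping once the pipeline is primed, which is the effect the zero-overhead scheduler aims for.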


🚨 DeepSeek just did something wild. They built an OCR system that compresses long text into vision tokens, literally turning paragraphs into pixels. Their model, DeepSeek-OCR, achieves 97% decoding precision at 10× compression and still manages 60% accuracy even at 20×. That means one image can represent entire documents using a fraction of the tokens an LLM would need. Even crazier? It beats GOT-OCR2.0 and MinerU2.0 while using up to 60× fewer tokens, and it can process 200K+ pages/day on a single A100. This could solve one of AI’s biggest problems: long-context inefficiency. Instead of paying more for longer sequences, models might soon see text instead of reading it. The future of context compression might not be textual at all. It might be optical 👁️ github.com/deepseek-ai/DeepSeek-OCR
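To make the compression claim concrete, here is the token arithmetic: at N× compression, one vision token stands in for N text tokens. The helper below is just illustrative arithmetic (the 5,000-token document is a made-up example), pairing the post's reported ratios with their accuracy figures.

```python
def vision_token_budget(text_tokens, compression):
    """Vision tokens needed to represent `text_tokens` at N x compression."""
    # Ceiling division: a partial vision token still costs a whole one
    return -(-text_tokens // compression)

# Hypothetical 5,000-token document at the reported ratios:
tokens_10x = vision_token_budget(5000, 10)  # 500 vision tokens, ~97% precision
tokens_20x = vision_token_budget(5000, 20)  # 250 vision tokens, ~60% accuracy
```

The trade-off the post describes is exactly this curve: push compression from 10× to 20× and the token budget halves again, but decoding accuracy drops from ~97% to ~60%.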




The Reinforcement Learning track at #PyTorchCon highlights new directions for RL with #PyTorch. Hear Chenyang Zhao (UCLA) on optimizing long-tail and MoE challenges in RL with SGLang, and Daniel Han (Unsloth) on maximizing luck in reinforcement learning. 🔗 Explore sessions: hubs.la/Q03NCZZS0







Exciting updates on DGX Spark: now you can run gpt-oss-20b at 70 tokens/s with SGLang! This is 1.4x faster than what we reported in our blog last week. We worked with the @NVIDIAAIDev team to fix a bunch of Triton and quantization issues. Cannot wait to see how much performance we can get from this tiny computer. Usage: download the lmsysorg/sglang:spark Docker image and launch with python3 -m sglang.launch_server --model openai/gpt-oss-20b
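The usage line above, expanded into a shell sketch. The image name and the launch command come from the post; the `docker run` GPU and interactivity flags are typical defaults, not verified against the SGLang docs, so adjust to your setup.

```shell
# Pull the image named in the post
docker pull lmsysorg/sglang:spark

# Launch the server inside the container
# (--gpus all / -it are assumed flags, not from the post)
docker run --gpus all -it lmsysorg/sglang:spark \
  python3 -m sglang.launch_server --model openai/gpt-oss-20b
```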



We're excited to announce the collaboration between KTransformers and SGLang! KTransformers has been a killer for local AI inference with its system-algorithm co-design, often showing 5x–10x speedups. This integration brings KTransformers’ inference strategies and optimized kernels, tuned specifically for MoE models, into SGLang. Combined with SGLang’s native multi-GPU scaling, the solution can be seamlessly extended to serve much larger workloads. ⬇️ Learn more in our tech blog below

SGLang just crossed 10k PRs! From a few hands to a whole community — onwards and upwards! 🌟👇

Kimi K2-0905 update 🚀 - Enhanced coding capabilities, esp. front-end & tool-calling - Context length extended to 256k tokens - Improved integration with various agent scaffolds (Claude Code, Roo Code, etc.) 🔗 Weights & code: huggingface.co/moonshotai/Kim… 💬 Chat with the new Kimi K2 on: kimi.com ⚡️ For 60–100 TPS + guaranteed 100% tool-call accuracy, try our turbo API: platform.moonshot.ai
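For readers trying the API mentioned above: a minimal sketch of what a chat request payload for an OpenAI-compatible endpoint looks like. The model name and field shapes here are assumptions for illustration; check platform.moonshot.ai for the real endpoint, model identifiers, and authentication.

```python
def build_chat_request(prompt, model="kimi-k2-0905-preview", max_tokens=256):
    """Build a chat-completion payload in the common OpenAI-compatible shape.

    The model name is an assumption; consult the provider's docs.
    """
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

req = build_chat_request("Write a Fibonacci function in Python")
```

You would POST this JSON body to the provider's chat-completions endpoint with your API key in the Authorization header.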





🚨SGLang Summer Fest Bonus Drop🚨 Proud to share a joint effort from Mooncake by @Kimi_Moonshot, @Oracle , and SGLang: Kimi K2 trillion-scale deployment—running on 128 H200 GPUs sponsored by @NVIDIAAIDev DGX Cloud. OME + SGLang = MoE inference at production scale.👇

