Fuli Luo

19 posts

@_LuoFuli

Now building @XiaomiMiMo. Previously @deepseek_ai

Joined November 2023
157 Following · 66.9K Followers
Pinned Tweet
Fuli Luo (@_LuoFuli)
Inference Optimizations Behind the MiMo-V2.5 Series API Price Reductions
Read the full technical blog: mimo.xiaomi.com/blog/mimo-v2-5…
The V2.5 model family, including MiMo-V2.5 and MiMo-V2.5-Pro, is built on a Hybrid Sliding Window Attention (Hybrid SWA) architecture, which compresses KVCache storage to roughly 1/7 that of Full Attention. However, architectural advantages rarely translate directly into measurable gains in production serving.
To realize these gains, we redesigned KVCache management, tiered caching, and the prefix-cache tree; addressed key challenges in SWA KVCache handling; and optimized scheduling as well as the Prefill/Decode pipeline.
Validated on real production traffic, these optimizations have increased effective KVCache capacity by nearly 5x, with server-side cache hit rates averaging 93%–95% across mainstream harness frameworks. Together with MoE configuration tuning and multimodal inference optimizations, they enable more efficient long-context inference and form part of what makes the recent API price cuts possible.
Replies 48 · Reposts 94 · Likes 878 · Views 112.3K
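To make the "roughly 1/7" KVCache figure concrete, here is a minimal back-of-the-envelope sketch. It rests on assumptions that the post does not state: a 70-layer stack split into 10 Full Attention and 60 SWA layers (inferred from the "70-layer ... roughly equals a 10-layer GQA model" remark elsewhere in this feed), a 128-token window (the value reported for MiMo-V2-Flash, not confirmed for V2.5), and identical per-token KV size in every layer. The function names are hypothetical.

```python
def kv_cache_slots(context_len, full_layers, swa_layers, window):
    """Token slots held in the KV cache for one sequence, summed over layers.

    Full Attention layers keep every token; SWA layers keep at most `window`.
    """
    full = full_layers * context_len
    swa = swa_layers * min(context_len, window)
    return full + swa


def ratio_vs_full_attention(context_len, full_layers=10, swa_layers=60, window=128):
    """Hybrid KV cache size relative to an all-Full-Attention stack.

    The default layer split and window are assumptions, not published specs.
    """
    hybrid = kv_cache_slots(context_len, full_layers, swa_layers, window)
    dense = (full_layers + swa_layers) * context_len
    return hybrid / dense


if __name__ == "__main__":
    for n in (128, 8_192, 1_000_000):
        print(n, round(ratio_vs_full_attention(n), 4))
```

Under these assumptions the ratio approaches 10/70 = 1/7 as the context grows, because the SWA layers contribute a fixed number of slots regardless of length. For contexts no longer than the window there is no saving at all.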
Fuli Luo (@_LuoFuli)
Behind the MiMo API Price Reduction:
The deepest price cut, up to 99%, is for Input (Cache Hit). The core reason is that our inference framework now supports hierarchical KV cache optimization for SWA. Production inference engine tests show this optimization increases cached token capacity by 5x, equivalent to an 80% reduction in caching costs. Combined with Cache Read Overlap among multiple Full Attention modules in the Hybrid model, actual costs are further reduced.
Prices for Input (Cache Miss) and Output are also reduced by 60%-80%. This mainly benefits from the extreme 1:7 Full:SWA sparsity ratio brought by the model architecture (the prefill compute of the 70-layer MiMo-V2.5-Pro roughly equals that of a 10-layer GQA model). This kept our original inference costs well below the industry average, naturally leaving a 2x-3x profit margin in pricing. This price adjustment simply reflects our decision to pass these structural cost efficiencies directly to developers.
Operating at these newly reduced API prices, our production inference engine is running at near full capacity, and we can still essentially break even. We previously advised LLM companies not to "blindly cut prices" precisely because very few model architectures and inference optimizations can keep API serving from running at a loss.
If more architectures that save compute and KV cache emerge, along with better inference Infra to drive down API costs, this will form an excellent virtuous cycle in the industry.
More crucially, affordable, high-performance model APIs will drive real, sustained, and at-scale inference demand. This upstream demand pulls forward the development of the entire AI infrastructure chain (chips, servers, optical transceivers, PCBs, liquid cooling, power, energy storage, and data centers), serving as a strategic fulcrum for a systemic revaluation of AI hardware.
In the long run, this injects more affordable and accessible compute into both training and inference pipelines, accelerating the parallel evolution of global AGI across multiple regions and technical routes. For more technical details, we will release a detailed Blog post later.
Replies 152 · Reposts 186 · Likes 1.7K · Views 182.7K
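The link between "5x cached token capacity" and "80% reduction in caching costs" is simple arithmetic: if the same hardware holds five times as many cached tokens, the cost per cached token falls to one fifth. The sketch below shows that calculation, plus a blended input price as a function of cache hit rate. The function names and the price values are placeholders chosen for illustration, not MiMo's actual price list; the 94% hit rate is the midpoint of the 93%–95% range quoted in the pinned post.

```python
def caching_cost_reduction(capacity_multiplier):
    """Fractional drop in cost per cached token when capacity grows by the
    given factor on the same hardware (assumes cost scales as 1 / capacity)."""
    return 1.0 - 1.0 / capacity_multiplier


def blended_input_price(hit_rate, hit_price, miss_price):
    """Average price per input token given a cache hit rate.

    Prices are in arbitrary units; the values used below are hypothetical.
    """
    return hit_rate * hit_price + (1.0 - hit_rate) * miss_price


if __name__ == "__main__":
    print(caching_cost_reduction(5))                 # the 5x -> 80% relation
    print(blended_input_price(0.94, 0.01, 1.0))      # placeholder prices
```

With a high hit rate, the blended price is dominated by the few cache misses, which is why harness behavior that breaks the cache matters so much for real cost.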
Fuli Luo reposted
Xiaomi MiMo (@XiaomiMiMo)
🚀 Better inference efficiency, lower costs, broader access.
MiMo-V2.5 Series API pricing is now permanently reduced, by up to 99% compared to previous pricing.
✨ Unified pricing across all context lengths.
MiMo Token Plans have also been upgraded:
• 5–8× more usable tokens at the same price
• Simpler and more transparent billing rules
🎁 As a thank-you to current users, all current Token Plan credits will be fully reset.
🎧 MiMo-V2.5-TTS remains free for a limited time.
⏰ Effective May 26 at 6:00 PM PDT.
These improvements are powered by continued inference optimization and serving efficiency upgrades across the MiMo stack.
🛠️ We’ll also publish a detailed technical blog on the inference optimizations later. Stay tuned.
Replies 297 · Reposts 514 · Likes 4.2K · Views 1M
Fuli Luo reposted
Lei Li (@_TobiasLee)
Big week for model releases, and Claw-Eval is updating too. MiMo V2.5 Pro now ranks 3rd, and MiMo V2.5 ranks 5th. Next up: DeepSeek V4? 👉🏻 claw-eval.github.io
Replies 2 · Reposts 6 · Likes 77 · Views 22.1K
Fuli Luo reposted
Xiaomi MiMo (@XiaomiMiMo)
Xiaomi MiMo-V2.5 Series: Pushing Open-Source Agents Forward
🔸 MiMo-V2.5-Pro, our strongest model yet. A major leap from MiMo-V2-Pro in general agentic capabilities, complex software engineering, and long-horizon tasks, now matching frontier models like Claude Opus 4.6 and GPT-5.4 across most benchmarks (SWE-bench Pro 57.2, Claw-Eval 63.8, τ3-Bench 72.9). It can autonomously complete professional tasks involving 1,000+ tool calls, work that would take human experts days. Tech Blog: mimo.xiaomi.com/blog/mimo-v2.5…
🔸 MiMo-V2.5, native omnimodal with strong agentic capabilities. Pro-level agent performance at roughly half the cost. Improved multimodal perception across image and video understanding, native 1M-token context window, and significantly more efficient inference. Tech Blog: mimo.xiaomi.com/blog/mimo-v2.5
🔗 API & Token Plan: platform.xiaomimimo.com/token-plan
Replies 136 · Reposts 274 · Likes 2.5K · Views 372.9K
Fuli Luo (@_LuoFuli)
A bigger problem: many third-party harnesses compress tool responses every 3 steps when approaching the context limit, leading to very low cache hit rates.
Replies 6 · Reposts 0 · Likes 108 · Views 31.1K
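Why periodic compression hurts cache hit rates can be illustrated with a toy model of prefix caching: a request only reuses cached work for the longest prefix it shares with the previous request, so rewriting earlier tool responses invalidates everything after the system prompt. The sketch below is a simplified simulation under stated assumptions (a single conversation, the cache always holds the previous request, every rewritten token is new). The function names and the token counts are hypothetical, and compression here fires on a fixed schedule rather than near the context limit.

```python
import itertools


def common_prefix_len(a, b):
    """Number of leading tokens shared by two token lists."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def simulate_hit_rate(steps, compress_every=None, system_len=1000,
                      tool_len=500, summary_len=100):
    """Fraction of input tokens served from the prefix cache over an agent run.

    Each step appends one tool response. If `compress_every` is set, the
    harness replaces the whole history with a fresh summary on those steps.
    """
    fresh = itertools.count()  # unique token ids, so rewritten text never matches

    def new_tokens(n):
        return [next(fresh) for _ in range(n)]

    system = new_tokens(system_len)
    history = []          # tokens that follow the system prompt
    prev_request = []
    hit = total = 0
    for step in range(1, steps + 1):
        if compress_every and step % compress_every == 0:
            history = new_tokens(summary_len)   # rewrite: old prefix is lost
        history = history + new_tokens(tool_len)
        request = system + history
        hit += common_prefix_len(prev_request, request)
        total += len(request)
        prev_request = request
    return hit / total


if __name__ == "__main__":
    print("append-only:    ", round(simulate_hit_rate(12), 3))
    print("compress every 3:", round(simulate_hit_rate(12, compress_every=3), 3))
```

In this toy setup the compressed run always scores lower, because every compression step falls back to reusing only the system prompt. The gap widens as the history grows relative to the system prompt, which is the regime of long agentic contexts.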
Fuli Luo (@_LuoFuli)
Two days ago, Anthropic cut off third-party harnesses from using Claude subscriptions. Not surprising. Three days ago, MiMo launched its Token Plan, a design I spent real time on, and what I believe is a serious attempt at getting compute allocation and agent harness development right. Putting these two things together, some thoughts:
1. Claude Code's subscription is a beautifully designed system for balanced compute allocation. My guess: it doesn't make money, possibly bleeds it, unless their API margins are 10-20x, which I doubt. I can't rigorously calculate the losses from third-party harnesses plugging in, but I've looked at OpenClaw's context management up close, and it's bad. Within a single user query, it fires off rounds of low-value tool calls as separate API requests, each carrying a long context window (often >100K tokens). That is wasteful even with cache hits, and in extreme cases it drives up cache miss rates for other queries. The actual request count per query ends up several times higher than with Claude Code's own framework. Translated to API pricing, the real cost is probably tens of times the subscription price. That's not a gap; that's a crater.
2. Third-party harnesses like OpenClaw/OpenCode can still call Claude via API; they just can't ride on subscriptions anymore. Short term, these agent users will feel the pain, with costs easily jumping tens of times. But that pressure is exactly what pushes these harnesses to improve context management, maximize prompt cache hit rates to reuse processed context, and cut wasteful token burn. Pain eventually converts to engineering discipline.
3. I'd urge LLM companies not to blindly race to the bottom on pricing before figuring out how to price a coding plan without hemorrhaging money. Selling tokens dirt cheap while leaving the door wide open to third-party harnesses looks nice to users, but it's a trap, the same trap Anthropic just walked out of. The deeper problem: if users burn their attention on low-quality agent harnesses, highly unstable and slow inference services, and models downgraded to cut costs, only to find they still can't get anything done, that's not a healthy cycle for user experience or retention.
4. On MiMo Token Plan: it supports third-party harnesses, billed by token quota, the same logic as Claude's newly launched extra usage packages. What we're going for is long-term stable delivery of high-quality models and services, not getting you to impulse-pay and then abandon ship.
The bigger picture: global compute capacity can't keep up with the token demand agents are creating. The real way forward isn't cheaper tokens; it's co-evolution: "more token-efficient agent harnesses" × "more powerful and efficient models." Anthropic's move, whether they intended it or not, is pushing the entire ecosystem, open source and closed source alike, in that direction. That's probably a good thing.
The Agent era doesn't belong to whoever burns the most compute. It belongs to whoever uses it wisely.
Replies 174 · Reposts 231 · Likes 1.8K · Views 766K
Fuli Luo (@_LuoFuli)
MiMo-V2-Pro & Omni & TTS are out. Our first full-stack model family built truly for the Agent era.
I call this a quiet ambush, not because we planned it, but because the shift from the Chat to the Agent paradigm happened so fast that even we barely believed it. Somewhere in between was a process that was thrilling, painful, and fascinating all at once.
The 1T base model started training months ago. The original goal was long-context reasoning efficiency. Hybrid Attention carries real innovation, without overreaching, and it turns out to be exactly the right foundation for the Agent era. 1M context window. MTP inference for ultra-low latency and cost. These architectural decisions weren't trendy. They were a structural advantage we built before we needed it.
What changed everything was experiencing a complex agentic scaffold (what I'd call orchestrated Context) for the first time. I was shocked on day one. I tried to convince the team to use it. That didn't work. So I gave a hard mandate: anyone on MiMo Team with fewer than 100 conversations tomorrow can quit. It worked. Once the team's imagination was ignited by what agentic systems could do, that imagination converted directly into research velocity.
People ask why we move so fast. I saw it firsthand building DeepSeek R1. My honest summary:
- Backbone and Infra research has long cycles. You need strategic conviction a year before it pays off.
- Posttrain agility is a different muscle: product intuition driving evaluation, iteration cycles compressed, paradigm shifts caught early.
- And the constant: curiosity, sharp technical instinct, decisive execution, full commitment, and something that's easy to underestimate: a genuine love for the world you're building for.
We will open-source when the models are stable enough to deserve it.
From Beijing, very late, not quite awake.
Replies 343 · Reposts 628 · Likes 7K · Views 2.4M
Fuli Luo (@_LuoFuli)
Imagination is the ceiling of productivity in the new era. Inspiring imagination is the core of management in the age of Claw.
Replies 24 · Reposts 42 · Likes 428 · Views 71K
Fuli Luo reposted
LMSYS Org (@lmsysorg)
SGLang + Miles: Rollout Routing Replay (R3) is Now Live! 🎉
We're excited to announce that SGLang and Miles now support Rollout Routing Replay (R3) for stable reinforcement learning training on MoE models!
Training MoE models with RL has been notoriously unstable, often leading to catastrophic collapse. The problem? Routing inconsistency between inference and training engines. R3 fixes this by recording expert routing decisions during inference and replaying them during training.
The impact is significant: reusing inference routing decisions dramatically reduces training-inference discrepancy and prevents training collapse.
R3 has full distributed training support with DataParallel Attention and all parallelism strategies. Supported models include Qwen3-30B-A3B, deepseek_v2, etc.
Try it out and let us know your results! 🚀
Replies 7 · Reposts 24 · Likes 231 · Views 123.1K
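The record-and-replay idea can be shown for a single token with a top-k router. This is a minimal illustrative sketch, not the SGLang or Miles implementation: `top_k` and `route` are hypothetical helpers, and the logits are made-up numbers chosen so that a small numerical difference between engines flips the expert choice. It assumes that only the selected expert indices are replayed, while gate weights are still computed from the training engine's own logits.

```python
import math


def top_k(logits, k):
    """Indices of the k largest router logits, returned in ascending order."""
    ranked = sorted(range(len(logits)), key=lambda e: -logits[e])
    return sorted(ranked[:k])


def route(logits, k, replay=None):
    """Return (expert_indices, gate_weights) for one token.

    With `replay`, the expert set recorded at inference time is reused and
    only the softmax gate weights come from the current logits.
    """
    experts = top_k(logits, k) if replay is None else list(replay)
    m = max(logits[e] for e in experts)
    exps = [math.exp(logits[e] - m) for e in experts]
    z = sum(exps)
    return experts, [v / z for v in exps]


if __name__ == "__main__":
    inference_logits = [2.0, 1.00, 0.99, -1.0]   # rollout engine
    training_logits = [2.0, 0.98, 1.00, -1.0]    # same token, tiny numeric drift

    recorded, _ = route(inference_logits, k=2)
    drifted, _ = route(training_logits, k=2)
    replayed, gates = route(training_logits, k=2, replay=recorded)
    print(recorded, drifted, replayed, gates)
```

Without replay, the training pass would route this token to a different expert than the one that actually produced the rollout; with replay, both passes use the same experts.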
Fuli Luo reposted
Artificial Analysis (@ArtificialAnlys)
Xiaomi has just launched MiMo-V2-Flash, a 309B open weights reasoning model that scores 66 on the Artificial Analysis Intelligence Index. This release places Xiaomi alongside other leading AI model labs.
Key benchmarking takeaways:
➤ Strengths in Agentic Tool Use and Competition Math: MiMo-V2-Flash scores 95% on τ²-Bench Telecom and 96% on AIME 2025, demonstrating strong performance on agentic tool-use workflows and competition-style mathematical reasoning. MiMo-V2-Flash currently leads the τ²-Bench Telecom category among evaluated models.
➤ Cost competitive: The full Artificial Analysis evaluation suite cost just $53 to run. This is supported by MiMo-V2-Flash’s highly competitive pricing of $0.10 per million input tokens and $0.30 per million output tokens, making it particularly attractive for cost-sensitive deployments and large-scale production workloads. This is similar to DeepSeek V3.2 ($54 total cost to run), and well below GPT-5.2 ($1,294 total cost to run).
➤ High token usage: MiMo-V2-Flash demonstrates high verbosity and token usage relative to other models in the same intelligence tier, using ~150M reasoning tokens across the Artificial Analysis Intelligence suite.
➤ Open weights: MiMo-V2-Flash is open weights, with 309B parameters and 15B active at inference time. Weights are released under an MIT license, continuing the trend of Chinese AI model labs open sourcing their frontier models.
See below for further analysis:
Replies 21 · Reposts 68 · Likes 586 · Views 228K
Fuli Luo (@_LuoFuli)
MiMo-V2-Flash is live. It’s just step 2 on our AGI roadmap, but I wanted to dump some notes on the engineering choices that actually moved the needle.
Architecture: We settled on a Hybrid SWA. It’s simple, elegant, and in our internal benchmarks, it outperformed other Linear Attention variants on long context reasoning. Plus, a fixed KV cache just plays way nicer with current infra. Note: Window size 128 turned out to be the magic number (512 actually degraded performance). Also, sink values are non-negotiable, so don't skip them.
MTP (Multi-Token Prediction): This is underrated for efficient RL. Aside from the first layer, it needs surprisingly little fine-tuning to hit high accept length. With a 3-layer MTP, we're seeing >3 accept length and ~2.5x speedup in coding tasks. It effectively solves the GPU idle time from long-tail samples in small-batch On-Policy RL. We didn't get to squeeze it into the RL loop this time due to deadlines, but it’s a perfect fit. We open-sourced the 3-layer MTPs so you can develop with it.
Posttrain with MOPD: We adopted On-Policy-Distillation from Thinking Machines to merge multiple RL models, and the efficiency gains were wild. We matched the teacher model's performance using less than 1/50th the compute of a standard SFT+RL pipeline. There’s a clear path here for a self-reinforcing loop where the student evolves into a stronger teacher.
Huge props to my team. They sculpted these ideas from scratch into production in just a few months. Full breakdown is in the tech report. If this kind of pragmatic engineering resonates with you, we should talk.
Replies 79 · Reposts 113 · Likes 1.2K · Views 405.6K
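The sliding-window and sink remarks can be sketched for a single query position. This is an illustration under assumptions, not the MiMo implementation: it treats "sink values" as one extra logit that joins the softmax denominator without contributing a value vector, which is one common way attention sinks are realized, and `swa_visible` and `attention_weights` are hypothetical names. The demo uses a window of 4 instead of 128 to keep the output readable.

```python
import math


def swa_visible(query_pos, key_pos, window):
    """Causal sliding window: the key is visible if it is one of the last
    `window` positions up to and including the query position."""
    return 0 <= query_pos - key_pos < window


def attention_weights(scores, query_pos, window, sink_logit=None):
    """Softmax over the keys visible to `query_pos`.

    If `sink_logit` is given, it is added to the denominator only, so the
    visible weights sum to less than 1 (the sink absorbs the remainder).
    """
    visible = [j for j in range(len(scores)) if swa_visible(query_pos, j, window)]
    logits = [scores[j] for j in visible]
    if sink_logit is not None:
        logits.append(sink_logit)
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    weights = [0.0] * len(scores)
    for j, e in zip(visible, exps):   # zip stops before the sink entry
        weights[j] = e / z
    return weights


if __name__ == "__main__":
    scores = [0.0] * 10
    print(attention_weights(scores, query_pos=9, window=4))
    print(attention_weights(scores, query_pos=9, window=4, sink_logit=0.0))
```

The sink gives the softmax somewhere to put probability mass when nothing inside the short window is relevant, which may be why the post calls it non-negotiable for small windows.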
Fuli Luo reposted
Xiaomi MiMo (@XiaomiMiMo)
⚡ Faster than Fast. Designed for Agentic AI.
Introducing Xiaomi MiMo-V2-Flash, our new open-source MoE model: 309B total params, 15B active. Blazing speed meets frontier performance.
🔥 Highlights:
🏗️ Hybrid Attention: 5:1 interleaved 128-window SWA + Global | 256K context
📈 Performance:
⚔️ Matches DeepSeek-V3.2 on general benchmarks, at a fraction of the latency
🏆 SWE-Bench Verified: 73.4% | SWE-Bench Multilingual: 71.7% (new SOTA for open-source models)
🚀 Speed: 150 output tokens/s with Day-0 support from @lmsysorg 🤝
🤗 Model: hf.co/XiaomiMiMo/MiM…
📝 Blog Post: mimo.xiaomi.com/blog/mimo-v2-f…
📄 Technical Report: github.com/XiaomiMiMo/MiM…
🎨 AI Studio: aistudio.xiaomimimo.com
Replies 88 · Reposts 298 · Likes 1.9K · Views 562.8K
Fuli Luo (@_LuoFuli)
Intelligence will inevitably evolve from language to the physical world, unlocking spatial intelligence for multi-modal perception, reasoning, generation, and action—essential for true AGI. I'm working on building this at @XiaomiMiMo, spearheading a creative and talented team!
Replies 30 · Reposts 26 · Likes 387 · Views 146.1K