Fuli Luo

19 posts

Fuli Luo

@_LuoFuli

Now building @XiaomiMiMo. Previously @deepseek_ai

Katılım Kasım 2023

157 Takip Edilen66.9K Takipçiler

Sabitlenmiş Tweet

Fuli Luo@_LuoFuli·23h

Inference Optimizations Behind the MiMo-V2.5 Series API Price Reductions Read the full technical blog: mimo.xiaomi.com/blog/mimo-v2-5… The V2.5 model family, including MiMo-V2.5 and MiMo-V2.5-Pro, is built on a Hybrid Sliding Window Attention (Hybrid SWA) architecture, which compresses KVCache storage to roughly 1/7 that of Full Attention. However, architectural advantages rarely translate directly into measurable gains in production serving. To realize these gains, we redesigned KVCache management, tiered caching, and the prefix-cache tree; addressed key challenges in SWA KVCache handling; and optimized scheduling as well as the Prefill/Decode pipeline. Validated on real production traffic, these optimizations have increased effective KVCache capacity by nearly 5x, with server-side cache hit rates averaging 93%–95% across mainstream harness frameworks. Together with MoE configuration tuning and multimodal inference optimizations, they enable more efficient long-context inference and form part of what makes the recent API price cuts possible.

English

878

112.3K

Fuli Luo@_LuoFuli·3d

@leonliuzx yes

8.2K

leonliuzx@leonliuzx·3d

@_LuoFuli Got it, super insightful! 🔥 Quick question — will the detailed blog post be released on mimo.xiaomi.com ?

English

Fuli Luo@_LuoFuli·3d

Behind the MiMo API Price Reduction: The deepest price cut, up to 99%, is for Input (Cache Hit). The core reason is our inference framework now supports hierarchical KV cache optimization for SWA. Production inference engine tests show this optimization increases cached token capacity by 5x, equivalent to an 80% reduction in caching costs. Combined with Cache Read Overlap among multiple Full Attention modules in the Hybrid model, actual costs are further reduced. Prices for Input (Cache Miss) and Output are also reduced by 60%-80%. This mainly benefits from the extreme 1:7 Full:SWA sparsity ratio brought by the model architecture (the prefill compute of the 70-layer MiMo-V2.5-Pro roughly equals a 10-layer GQA model). This kept our original inference costs well below the industry average, naturally leaving a 2x-3x profit margin in pricing. This price adjustment simply reflects our decision to pass these structural cost efficiencies directly to developers. Operating at these newly reduced API prices, our production inference engine is running at near full capacity, and we can still essentially break even. We previously advised LLM companies not to "blindly cut prices" precisely because very few model architectures and inference optimizations can keep API costs from running at a loss. If more architectures that save compute and KV cache emerge, along with better inference Infra to drive down API costs, this will form an excellent virtuous cycle in the industry. More crucially, affordable, high-performance model APIs will drive real, sustained, and at-scale inference demand. This upstream demand pulls forward the development of the entire AI infrastructure chain—including chips, servers, optical transceivers, PCBs, liquid cooling, power, energy storage, and data centers—serving as a strategic fulcrum for a systemic revaluation of AI hardware. In the long run, this injects more affordable and accessible compute into both training and inference pipelines, accelerating the parallel evolution of global AGI across multiple regions and technical routes. For more technical details, we will release a detailed Blog post later.

English

152

186

1.7K

182.7K

Fuli Luo retweetledi

Xiaomi MiMo@XiaomiMiMo·4d

🚀 Better inference efficiency, lower costs, broader access. MiMo-V2.5 Series API pricing is now permanently reduced — by up to 99% compared to previous pricing. ✨ Unified pricing across all context lengths. MiMo Token Plans have also been upgraded: • 5–8× more usable tokens at the same price • Simpler and more transparent billing rules 🎁 As a thank-you to current users, all current Token Plan credits will be fully reset. 🎧 MiMo-V2.5-TTS remains free for a limited time. ⏰ Effective May 26 at 6:00 PM PDT. These improvements are powered by continued inference optimization and serving efficiency upgrades across the MiMo stack. 🛠️ We’ll also publish a detailed technical blog on the inference optimizations later — stay tuned.

English

297

514

4.2K

Fuli Luo@_LuoFuli·27 Nis

Just dropped two open-source models: MiMo-V2.5-Pro (Code Agent, 1T total) and MiMo-V2.5 (Multimodal Agent, 310B total). Oh and one more thing — we're giving devs & creators 100T tokens on us. Go build something cool 🛠️ 🎁 100T Free Token Grant for Builders 100t.xiaomimimo.com

Xiaomi MiMo@XiaomiMiMo

Xiaomi MiMo-V2.5 is now officially open-sourced！ MIT License, supporting commercial deployment, continued training, and fine-tuning - no additional authorization required. Two models, both supporting a 1M-token context window : • MiMo-V2.5-Pro: built for complex agent and coding tasks, ranking No.1 among open-source models on GDPVal-AA and ClawEval • MiMo-V2.5: a native omni-modal model with strong agent capabilities A model's value isn't measured by rankings alone — it's measured by the problems it solves. Let's build with MiMo now! 🤗 Weights: huggingface.co/collections/Xi… 📄 Blog: #blog" target="_blank" rel="nofollow noopener">mimo.xiaomi.com/index#blog

English

212

306

3.1K

672.1K

Fuli Luo retweetledi

Lei Li@_TobiasLee·23 Nis

Big week for model releases, and Claw-Eval is updating too. MiMo V2.5 Pro now ranks 3rd, and MiMo V2.5 ranks 5th. Next up: DeepSeek V4? 👉🏻 claw-eval.github.io

English

22.1K

Fuli Luo retweetledi

Xiaomi MiMo@XiaomiMiMo·22 Nis

Xiaomi MiMo-V2.5 Series: Pushing Open-Source Agents Forward 🔸 MiMo-V2.5-Pro, our strongest model yet. A major leap from MiMo-V2-Pro in general agentic capabilities, complex software engineering, and long-horizon tasks, now matching frontier models like Claude Opus 4.6 and GPT-5.4 across most benchmarks (SWE-bench Pro 57.2, Claw-Eval 63.8, τ3-Bench 72.9). It can autonomously complete professional tasks involving 1,000+ tool calls, work that would take human experts days. Tech Blog: mimo.xiaomi.com/blog/mimo-v2.5… 🔸 MiMo-V2.5, native omnimodal with strong agentic capabilities. Pro-level agent performance at roughly half the cost. Improved multimodal perception across image and video understanding, native 1M-token context window, and significantly more efficient inference. Tech Blog: mimo.xiaomi.com/blog/mimo-v2.5 🔗 API & Token Plan: platform.xiaomimimo.com/token-plan

English

136

274

2.5K

372.9K

Fuli Luo@_LuoFuli·6 Nis

A bigger problem: many third-party harnesses compress tool responses every 3 steps when approaching the context limit, leading to very low cache hit rates.

English

108

31.1K

Fuli Luo@_LuoFuli·5 Nis

Two days ago, Anthropic cut off third-party harnesses from using Claude subscriptions — not surprising. Three days ago, MiMo launched its Token Plan — a design I spent real time on, and what I believe is a serious attempt at getting compute allocation and agent harness development right. Putting these two things together, some thoughts: 1. Claude Code's subscription is a beautifully designed system for balanced compute allocation. My guess — it doesn't make money, possibly bleeds it, unless their API margins are 10-20x, which I doubt. I can't rigorously calculate the losses from third-party harnesses plugging in, but I've looked at OpenClaw's context management up close — it's bad. Within a single user query, it fires off rounds of low-value tool calls as separate API requests, each carrying a long context window (often >100K tokens) — wasteful even with cache hits, and in extreme cases driving up cache miss rates for other queries. The actual request count per query ends up several times higher than Claude Code's own framework. Translated to API pricing, the real cost is probably tens of times the subscription price. That's not a gap — that's a crater. 2. Third-party harnesses like OpenClaw/OpenCode can still call Claude via API — they just can't ride on subscriptions anymore. Short term, these agent users will feel the pain, costs jumping easily tens of times. But that pressure is exactly what pushes these harnesses to improve context management, maximize prompt cache hit rates to reuse processed context, cut wasteful token burn. Pain eventually converts to engineering discipline. 3. I'd urge LLM companies not to blindly race to the bottom on pricing before figuring out how to price a coding plan without hemorrhaging money. Selling tokens dirt cheap while leaving the door wide open to third-party harnesses looks nice to users, but it's a trap — the same trap Anthropic just walked out of. The deeper problem: if users burn their attention on low-quality agent harnesses, highly unstable and slow inference services, and models downgraded to cut costs, only to find they still can't get anything done — that's not a healthy cycle for user experience or retention. 4. On MiMo Token Plan — it supports third-party harnesses, billed by token quota, same logic as Claude's newly launched extra usage packages. Because what we're going for is long-term stable delivery of high-quality models and services — not getting you to impulse-pay and then abandon ship. The bigger picture: global compute capacity can't keep up with the token demand agents are creating. The real way forward isn't cheaper tokens — it's co-evolution. "More token-efficient agent harnesses" × "more powerful and efficient models." Anthropic's move, whether they intended it or not, is pushing the entire ecosystem — open source and closed source alike — in that direction. That's probably a good thing. The Agent era doesn't belong to whoever burns the most compute. It belongs to whoever uses it wisely.

English

174

231

1.8K

766K

Fuli Luo retweetledi

Xiaomi MiMo@XiaomiMiMo·18 Mar

Introducing MiMo-V2-Pro & Omni & TTS mimo.xiaomi.com

Indonesia

163

1.3K

302.7K

Fuli Luo@_LuoFuli·19 Mar

MiMo-V2-Pro & Omni & TTS is out. Our first full-stack model family built truly for the Agent era. I call this a quiet ambush — not because we planned it, but because the shift from Chat to Agent paradigm happened so fast, even we barely believed it. Somewhere in between was a process that was thrilling, painful, and fascinating all at once. The 1T base model started training months ago. The original goal was long-context reasoning efficiency. Hybrid Attention carries real innovation, without overreaching — and it turns out to be exactly the right foundation for the Agent era. 1M context window. MTP inference for ultra-low latency and cost. These architectural decisions weren't trendy. They were a structural advantage we built before we needed it. What changed everything was experiencing a complex agentic scaffold — what I'd call orchestrated Context — for the first time. I was shocked on day one. I tried to convince the team to use it. That didn't work. So I gave a hard mandate: anyone on MiMo Team with fewer than 100 conversations tomorrow can quit. It worked. Once the team's imagination was ignited by what agentic systems could do, that imagination converted directly into research velocity. People ask why we move so fast. I saw it firsthand building DeepSeek R1. My honest summary: — Backbone and Infra research has long cycles. You need strategic conviction a year before it pays off. — Posttrain agility is a different muscle: product intuition driving evaluation, iteration cycles compressed, paradigm shifts caught early. — And the constant: curiosity, sharp technical instinct, decisive execution, full commitment — and something that's easy to underestimate: a genuine love for the world you're building for. We will open-source — when the models are stable enough to deserve it. From Beijing, very late, not quite awake.

English

343

628

2.4M

Fuli Luo@_LuoFuli·1 Mar

Imagination is the ceiling of productivity in the new era. Inspiring imagination is the core of management in the age of Claw.

English

428

71K

Fuli Luo retweetledi

LMSYS Org@lmsysorg·23 Ara

SGLang + Miles: Rollout Routing Replay (R3) is Now Live! 🎉 We're excited to announce that SGLang and Miles now support Rollout Routing Replay (R3) for stable reinforcement learning training on MoE models! Training MoE models with RL has been notoriously unstable, often leading to catastrophic collapse. The problem? Routing inconsistency between inference and training engines. R3 fixes this by recording expert routing decisions during inference and replaying them during training. The impact is significant: dramatically reduced training-inference discrepancy by reusing inference routing decisions, preventing training collapse. R3 has full distributed training support with DataParallel Attention and all parallelism strategies, supported models include Qwen3-30B-A3B, deepseek_v2, etc. Try it out and let us know your results! 🚀

English

231

123.1K

Fuli Luo retweetledi

Artificial Analysis@ArtificialAnlys·20 Ara

Xiaomi has just launched MiMo-V2-Flash, a 309B open weights reasoning model that scores 66 on the Artificial Analysis Intelligence Index. This release elevates Xiaomi to alongside other leading AI model labs. Key benchmarking takeaways: ➤ Strengths in Agentic Tool Use and Competition Math: MiMo-V2-Flash scores 95% on τ²-Bench Telecom and 96% on AIME 2025, demonstrating strong performance on agentic tool-use workflows and competition-style mathematical reasoning. MiMo-V2-Flash currently leads the τ²-Bench Telecom category among evaluated models ➤ Cost competitive: The full Artificial Analysis evaluation suite cost just $53 to run. This is supported by MiMo-V2-Flash’s highly competitive pricing of $0.10 per million input and $0.30 per million output, making it particularly attractive for cost-sensitive deployments and large-scale production workloads. This is similar to DeepSeek V3.2 ($54 total cost to run), and well below GPT-5.2 ($1,294 total cost to run) ➤ High token usage: MiMo-V2-Flash is demonstrates high verbosity and token usage relative to other models in the same intelligence tier, using ~150M reasoning tokens across the Artificial Analysis Intelligence suite ➤ Open weights: MiMo-V2-Flash is open weights and is 309B parameters with 15B active at inference time. Weights are released under a MIT license, continuing the trend of Chinese AI model labs open sourcing their frontier models See below for further analysis:

English

586

228K

Fuli Luo@_LuoFuli·16 Ara

MiMo-V2-Flash is live. It’s just step 2 on our AGI roadmap, but I wanted to dump some notes on the engineering choices that actually moved the needle. Architecture: We settled on a Hybrid SWA. It’s simple, elegant, and in our internal benchmarks, it outperformed other Linear Attention variants on long context reasoning. Plus, a fixed KV cache just plays way nicer with current infra. Note: Window size 128 turned out to be the magic number (512 actually degraded performance). Also, sink values are non-negotiable—don't skip them. MTP (Multi-Token Prediction): This is underrated for efficient RL. Aside from the first layer, it needs surprisingly little fine-tuning to hit high accept length. With a 3-layer MTP, we're seeing >3 accept length and ~2.5x speedup in coding tasks. It effectively solves the GPU idle time from long-tail samples in small-batch On-Policy RL. We didn't get to squeeze it into the RL loop this time due to deadlines, but it’s a perfect fit. We open-sourced the 3-layer MTPs so you can develop with it. Posttrain with MOPD: We adopted On-Policy-Distillation from Thinking Machine to merge multiple RL models, and the efficiency gains were wild. We matched the teacher model's performance using less than 1/50th the compute of a standard SFT+RL pipeline. There’s a clear path here for a self-reinforcing loop where the student evolves into a stronger teacher. Huge props to my team. They sculpted these ideas from scratch into production in just a few months. Full breakdown is in the tech report. If this kind of pragmatic engineering resonates with you, we should talk.

English

113

1.2K

405.6K

Fuli Luo retweetledi

Xiaomi MiMo@XiaomiMiMo·16 Ara

⚡ Faster than Fast. Designed for Agentic AI. Introducing Xiaomi MiMo-V2-Flash — our new open-source MoE model: 309B total params, 15B active. Blazing speed meets frontier performance. 🔥 Highlights: 🏗️ Hybrid Attention: 5:1 interleaved 128-window SWA + Global | 256K context 📈 Performance: ⚔️ Matches DeepSeek-V3.2 on general benchmarks — at a fraction of the latency 🏆 SWE-Bench Verified: 73.4% | SWE-Bench Multilingual: 71.7% — new SOTA for open-source models 🚀 Speed: 150 output tokens/s with Day-0 support from @lmsysorg🤝 🤗 Model: hf.co/XiaomiMiMo/MiM… 📝 Blog Post: mimo.xiaomi.com/blog/mimo-v2-f… 📄 Technical Report: github.com/XiaomiMiMo/MiM… 🎨 AI Studio: aistudio.xiaomimimo.com

English

298

1.9K

562.8K

Fuli Luo@_LuoFuli·11 Kas

Intelligence will inevitably evolve from language to the physical world, unlocking spatial intelligence for multi-modal perception, reasoning, generation, and action—essential for true AGI. I'm working on building this at @XiaomiMiMo, spearheading a creative and talented team!

English

387

146.1K

Fuli Luo@_LuoFuli·2 Ağu

@deepseek_ai

QAM

2.5K

Keşfet

@leonliuzx @lmsysorg @XiaomiMiMo @deepseek_ai @elonmusk @BarackObama @taylorswift13 @cristiano