Zhihui Xie

218 posts

Zhihui Xie banner
Zhihui Xie

Zhihui Xie

@_zhihuixie

PhD student @hkunlp2020 | prev. intern @AIatMeta @sjtu1896

Katılım Temmuz 2019
638 Takip Edilen426 Takipçiler
Sabitlenmiş Tweet
Zhihui Xie
Zhihui Xie@_zhihuixie·
🚀 Thrilled to announce Dream-Coder 7B — the most powerful open diffusion code  LLM to date.
Zhihui Xie tweet media
English
3
36
128
16.3K
Zhihui Xie retweetledi
Jack Jingyu Zhang
Jack Jingyu Zhang@jackjingyuzhang·
Real-world agents juggle instructions from skill files, tools, other agents, ... each with different trust levels. When these conflict, can models reliably prioritize the most trusted one? Our ManyIH-Bench🪜 finds that even frontier models like GPT-5.4 only get ~40% accuracy! 👇
Jack Jingyu Zhang tweet media
English
1
31
116
9K
Zhihui Xie retweetledi
Lei Li
Lei Li@_TobiasLee·
Claw-Eval v1.1 is out, with multimodal tasks and multi-turn dialogue. Now we have: 300 human-verified tasks | 2,159 rubrics | 9 categories | 14 models from 7 families tested. Agents are graded on Completion, Safety, and Robustness through full-trajectory auditing. Shoutout to Qwen @Alibaba_Qwen , GLM @Zai_org , and MiniMax @MiniMax_AI for integrating Claw-Eval into their model evaluations! Paper: arxiv.org/abs/2604.06132… Leaderboard: claw-eval.github.io Code: github.com/claw-eval/claw… 🤗Data: hf.co/datasets/claw-… 🧵 Here are our findings:
Lei Li tweet media
English
1
8
35
3.5K
Zhihui Xie retweetledi
Fuli Luo
Fuli Luo@_LuoFuli·
Two days ago, Anthropic cut off third-party harnesses from using Claude subscriptions — not surprising. Three days ago, MiMo launched its Token Plan — a design I spent real time on, and what I believe is a serious attempt at getting compute allocation and agent harness development right. Putting these two things together, some thoughts: 1. Claude Code's subscription is a beautifully designed system for balanced compute allocation. My guess — it doesn't make money, possibly bleeds it, unless their API margins are 10-20x, which I doubt. I can't rigorously calculate the losses from third-party harnesses plugging in, but I've looked at OpenClaw's context management up close — it's bad. Within a single user query, it fires off rounds of low-value tool calls as separate API requests, each carrying a long context window (often >100K tokens) — wasteful even with cache hits, and in extreme cases driving up cache miss rates for other queries. The actual request count per query ends up several times higher than Claude Code's own framework. Translated to API pricing, the real cost is probably tens of times the subscription price. That's not a gap — that's a crater. 2. Third-party harnesses like OpenClaw/OpenCode can still call Claude via API — they just can't ride on subscriptions anymore. Short term, these agent users will feel the pain, costs jumping easily tens of times. But that pressure is exactly what pushes these harnesses to improve context management, maximize prompt cache hit rates to reuse processed context, cut wasteful token burn. Pain eventually converts to engineering discipline. 3. I'd urge LLM companies not to blindly race to the bottom on pricing before figuring out how to price a coding plan without hemorrhaging money. Selling tokens dirt cheap while leaving the door wide open to third-party harnesses looks nice to users, but it's a trap — the same trap Anthropic just walked out of. The deeper problem: if users burn their attention on low-quality agent harnesses, highly unstable and slow inference services, and models downgraded to cut costs, only to find they still can't get anything done — that's not a healthy cycle for user experience or retention. 4. On MiMo Token Plan — it supports third-party harnesses, billed by token quota, same logic as Claude's newly launched extra usage packages. Because what we're going for is long-term stable delivery of high-quality models and services — not getting you to impulse-pay and then abandon ship. The bigger picture: global compute capacity can't keep up with the token demand agents are creating. The real way forward isn't cheaper tokens — it's co-evolution. "More token-efficient agent harnesses" × "more powerful and efficient models." Anthropic's move, whether they intended it or not, is pushing the entire ecosystem — open source and closed source alike — in that direction. That's probably a good thing. The Agent era doesn't belong to whoever burns the most compute. It belongs to whoever uses it wisely.
English
171
224
1.8K
739.3K
Zhihui Xie retweetledi
Shenzhi Wang🌟
Shenzhi Wang🌟@ShenzhiWang_THU·
When training Qwen3.5, we kept asking ourselves: 🧐What kind of multimodal RLVR data actually leads to generalizable gains? 💡We believe the answer may not lie only in data tightly tailored to specific benchmarks, but also in OOD proxy tasks that train the foundational abilities behind long-chain visual reasoning. The motivation is simple: VLMs are still unreliable in long-CoT settings. Small mistakes in perception, reasoning, knowledge use, or grounding can compound across intermediate steps and eventually lead to much larger final errors. However, much of today’s RLVR data still does not require complex reasoning chains grounded in visual evidence throughout, meaning these failure modes are often not sufficiently stressed during training. 🚀Excited to share our new work from Qwen and Tsinghua LeapLab: HopChain: Multi-Hop Data Synthesis for Generalizable Vision-Language Reasoning This is also one of the training task sources used in Qwen3.5 VL RLVR. To study this question, we propose HopChain, a scalable framework for synthesizing multi-hop vision-language reasoning data for RLVR training. The key idea is to build each query as a chain of logically dependent hops: earlier hops establish the instances, sets, or conditions needed for later hops, while the model must repeatedly return to the image for fresh visual grounding along the way. At the same time, each query ends with a specific, unambiguous numerical answer, making it naturally suitable for verifiable rewards. Concretely, HopChain combines two complementary structures: perception-level hops and instance-chain hops. We require each synthesized example to involve both, so the model cannot simply continue reasoning from language inertia. Instead, it is forced to keep grounding intermediate steps in the image, maintain cross-step dependencies, and control error accumulation across long reasoning trajectories. Our goal is not to mimic any specific downstream benchmark, but to strengthen the more fundamental abilities that long-CoT vision-language reasoning depends on. We add HopChain-synthesized data into RLVR training for Qwen3.5-35B-A3B and Qwen3.5-397B-A17B, and evaluate on 24 benchmarks spanning diverse domains. Despite not being designed for any particular benchmark, HopChain improves 20 out of 24 benchmarks on both models, indicating broad and generalizable gains. We also find that full chained multi-hop queries are crucial: replacing them with half-multi-hop or single-hop variants reduces performance substantially. Most notably, the gains are especially strong on long-CoT and ultra-long-CoT vision-language reasoning, peaking at more than 50 accuracy points in the ultra-long-CoT regime. Our main takeaway is simple: beyond benchmark-aligned data, OOD proxy tasks that systematically train the core mechanics of long-chain visual reasoning can be a powerful and scalable source of RLVR supervision for VLMs — and can lead to more generalizableimprovements. 🔗 huggingface.co/papers/2603.17…
Shenzhi Wang🌟 tweet mediaShenzhi Wang🌟 tweet mediaShenzhi Wang🌟 tweet media
English
2
55
432
58.2K
Zhihui Xie retweetledi
Jason Weston
Jason Weston@jaseweston·
🧮 Principia: Training LLMs to Reason over Mathematical Objects 📐 We release: - PrincipiaBench, a new eval for *mathematical objects* (not just numerical values or MCQ) - Principia Collection: training data that improves reasoning across the board. For models to help with scientific and mathematical work, you need to train on such data & test whether they can derive things like equations, sets, matrices, intervals, and piecewise functions. We show that this ends up improving the overall reasoning ability of your model for all tasks. Read more in the blog post: facebookresearch.github.io/RAM/blogs/prin…
Jason Weston tweet media
English
0
34
127
12.3K
Zhihui Xie retweetledi
Teng Xiao
Teng Xiao@TengX6·
🚀 New work: Meta-Reinforcement Learning with Self-Reflection LLM agents shouldn't just solve problems. They should learn from their own attempts. Most current RL methods optimize single independent trajectories. Each attempt starts from scratch, with no mechanism to improve across attempts. But intelligent systems should get better after trying once. This raises a fundamental question: How do we train models to learn from their own attempts? We believe Meta-Reinforcement Learning may be a key paradigm for training future LLM agents, enabling models to adapt and improve across attempts and environments. In this work we introduce MR-Search, a training paradigm built around: 🧠 In-Context Meta-Reinforcement Learning 🪞 Self-Reflection 🔁 Learning to learn at test time 📄 Paper: arxiv.org/abs/2603.11327 💻 Code: github.com/tengxiao1/MR-S…
English
11
47
298
49.4K
Zhihui Xie retweetledi
Lin Zheng
Lin Zheng@linzhengisme·
Introducing proxy compression for end-to-end language modeling: train on compressed (e.g., tokenized) data for efficiency, but run inference entirely on raw bytes without a tokenizer. No architectural changes required. At scale, proxy-trained byte models match or surpass tokenizer baselines at 7B and 14B. 📄 Paper: arxiv.org/abs/2602.04289 💻 Code: github.com/LZhengisme/pro… [1/9] 🧵👇
Lin Zheng tweet media
English
2
15
98
20.2K
Zhihui Xie retweetledi
Lei Li
Lei Li@_TobiasLee·
Agents are doing real work, but existing benchmarks still test them in isolation. Today we’re releasing Claw-Eval 🦞: an open-source, transparent evaluation framework for AI agents. We feature 104 tasks spanning daily assistants, Office QA, deep finance research, and terminal usage. We test completion, robustness, and safety across real and mock services with configurable error injection. Fully traceable and human-verified. First leaderboard results: Claude Opus 4.6 @AnthropicAI tops pass rate (68.3%), but Gemini 3.1 @GeminiApp Pro edges it on avg score (0.764 vs 0.759). Agents have a long way to go.🤨 Check it out: claw-eval.github.io @steipete @openclaw
Lei Li tweet media
English
10
27
155
41K
Zhihui Xie retweetledi
renjie pi
renjie pi@RenjiePi·
Introducing Nemotron-Terminal: a systematic data engineering pipeline for scaling LLM Terminal Agents. We bridge the gap between open models and proprietary models with a fully open synthetic-to-real trajectory pipeline. 🤯The payoff: SFT on our Nemotron-Terminal-Corpus boosts Qwen3-32B from 3.4% → 27.4% on Terminal-Bench 2.0 (+24.0), rivaling models multiple its size. What makes it work? 🌟Terminal-Task-Gen: A lightweight data curation pipeline that seamlessly combines the adaptation of existing datasets with robust synthetic task construction. 🌟Nemotron-Terminal-Corpus: A massive, open-source dataset covering diverse terminal interactions, which contains explicit planning and execution traces for complex long-horizon tasks. And we’re releasing everything: 📦 Nemotron-Terminal-Corpus (Large-scale dataset) 🤖 Nemotron-Terminal models (8B, 14B, 32B) Paper: arxiv.org/abs/2602.21193 HF Daily: huggingface.co/papers/2602.21… Models & Data: huggingface.co/collections/nv… Our tech report just hit the #1 spot on Hugging Face Daily Papers! We're also incredibly excited to see the open-source community putting our work to the test, with the Nemotron-Terminal-Corpus dataset currently trending at over 1,800 downloads and counting. We can't wait to see what the community build with it!
renjie pi tweet mediarenjie pi tweet mediarenjie pi tweet media
English
6
28
206
17.2K
Zhihui Xie retweetledi
Ning Ding
Ning Ding@stingning·
Today I heard a line that stuck with me: "the real moat is the organizational structure."
English
4
5
51
6.6K
Zhihui Xie retweetledi
Rui Yang
Rui Yang@RuiYang70669025·
Collecting high-quality GUI trajectories for agent training is expensive. But are we fully leveraging the open-source data we already have? 🤔 ✨Introducing GUI-Libra (gui-libra.github.io): 81K high-quality, action-aligned reasoning dataset curated from open-source corpora, plus a tailored training recipe that combines action-aware SFT with step-wise RLVR-style training (⚠️partially verifiable rather than fully verifiable!). Result: stronger native GUI agents on both offline step-wise evaluation and online environments across mobile and web domains. Take away: With careful data curation + tailored post-training recipe, a small subset of open-source trajectories can still go a long way for training native GUI agents. Check out our paper (arxiv.org/abs/2602.22190) and code/dataset/model (github.com/GUI-Libra/GUI-…) for more details. #GUI #agent #VLM
Rui Yang tweet mediaRui Yang tweet media
English
1
12
58
11.7K
Zhihui Xie retweetledi
Qwen
Qwen@Alibaba_Qwen·
🚀 Qwen3.5-397B-A17B is here: The first open-weight model in the Qwen3.5 series. 🖼️Native multimodal. Trained for real-world agents. ✨Powered by hybrid linear attention + sparse MoE and large-scale RL environment scaling. ⚡8.6x–19.0x decoding throughput vs Qwen3-Max 🌍201 languages & dialects 📜Apache2.0 licensed 🔗Dive in: GitHub: github.com/QwenLM/Qwen3.5 Chat: chat.qwen.ai API:modelstudio.console.alibabacloud.com/ap-southeast-1… Qwen Code: github.com/QwenLM/qwen-co… Hugging Face: huggingface.co/collections/Qw… ModelScope: modelscope.cn/collections/Qw… blog: qwen.ai/blog?id=qwen3.5
Qwen tweet media
English
271
866
5.3K
1.3M
Zhihui Xie retweetledi
Siyan Zhao
Siyan Zhao@siyan_zhao·
Introducing 💡On-Policy Self-Distillation💡, a simple method that enables LLM to teach itself with dense per-token feedback on its own on-policy generations—achieving 4-8x more token efficiency vs. GRPO and outperforming both GRPO and SFT/Off-Policy Distillation. Key insight: like a student reviewing solutions, rationalizing them, and correcting prior mistakes, an LLM can be conditioned on privileged info (e.g., correct solution or a reasoning trace) and supervise its weaker self—the version without such access—by matching the privileged-info-induced distribution from itself. 🌐Blog: siyan-zhao.github.io/blog/2026/opsd/ 🧵👇
Siyan Zhao tweet media
English
31
157
920
132.3K
Zhihui Xie retweetledi
Zhoujun (Jorge) Cheng
Zhoujun (Jorge) Cheng@ChengZhoujun·
Pretraining has scaling laws to guide compute allocation. But for RL on LLMs, we lack a practical guide on how to spend compute wisely. We show the optimal compute allocation in LLM RL scales predictably. ↓ Key takeaways below
GIF
English
18
99
443
68.9K
Zhihui Xie retweetledi
Yao Tang
Yao Tang@tyao923·
𝗧𝗵𝗶𝗻𝗸 𝘄𝗶𝗱𝗲𝗿. 𝗧𝗵𝗶𝗻𝗸 𝘀𝗵𝗼𝗿𝘁𝗲𝗿. 🚀 𝗜𝗻𝘁𝗿𝗼𝗱𝘂𝗰𝗶𝗻𝗴 𝗠𝘂𝗹𝘁𝗶𝗽𝗹𝗲𝘅 𝗧𝗵𝗶𝗻𝗸𝗶𝗻𝗴: token-wise branch-and-merge reasoning for LLMs. 💸 Discrete CoT is costly. 🎛️ Existing continuous tokens often clash with 𝗼𝗻-𝗽𝗼𝗹𝗶𝗰𝘆 𝗥𝗟 𝗲𝘅𝗽𝗹𝗼𝗿𝗮𝘁𝗶𝗼𝗻. 🎥 𝗠𝘂𝗹𝘁𝗶𝗽𝗹𝗲𝘅 𝗧𝗵𝗶𝗻𝗸𝗶𝗻𝗴, a sampling-based continuous reasoning paradigm:
English
25
111
810
151.3K
Zhihui Xie retweetledi
Jiacheng Ye
Jiacheng Ye@JiachengYe15·
🚀Building on the success of Dream 7B, we introduce Dream-VL and Dream-VLA, open VL and VLA models that fully unlock discrete diffusion’s advantages in long-horizon planning, bidirectional reasoning, and parallel action generation for multimodal tasks.
GIF
English
1
16
58
16.8K
Zhihui Xie retweetledi
Lingpeng Kong
Lingpeng Kong@ikekong·
🚀 Introducing Dream-VL & Dream-VLA! We’re proving that dLLMs have an amazing advantage in building VLA models. The result is stunning performance: 🏆 97.2% on LIBERO ⚡ 27x speedup vs AR models 🔥 Beats OpenVLA & $\pi_0$ ✅ Fully Open Source Blog: hkunlp.github.io/blog/2025/drea…
Jiacheng Ye@JiachengYe15

🚀Building on the success of Dream 7B, we introduce Dream-VL and Dream-VLA, open VL and VLA models that fully unlock discrete diffusion’s advantages in long-horizon planning, bidirectional reasoning, and parallel action generation for multimodal tasks.

English
1
25
127
12.8K
Zhihui Xie retweetledi
Xiaomi MiMo
Xiaomi MiMo@XiaomiMiMo·
⚡ Faster than Fast. Designed for Agentic AI. Introducing Xiaomi MiMo-V2-Flash — our new open-source MoE model: 309B total params, 15B active. Blazing speed meets frontier performance. 🔥 Highlights: 🏗️ Hybrid Attention: 5:1 interleaved 128-window SWA + Global | 256K context 📈 Performance: ⚔️ Matches DeepSeek-V3.2 on general benchmarks — at a fraction of the latency 🏆 SWE-Bench Verified: 73.4% | SWE-Bench Multilingual: 71.7% — new SOTA for open-source models 🚀 Speed: 150 output tokens/s with Day-0 support from @lmsysorg🤝 🤗 Model: hf.co/XiaomiMiMo/MiM… 📝 Blog Post: mimo.xiaomi.com/blog/mimo-v2-f… 📄 Technical Report: github.com/XiaomiMiMo/MiM… 🎨 AI Studio: aistudio.xiaomimimo.com
Xiaomi MiMo tweet media
English
90
295
1.9K
559.5K
Zhihui Xie retweetledi
Hao Zhang
Hao Zhang@haozhangml·
Check out latest blog describing: 1. how to more appropriately characterizing the speed-accuracy trade-off of dLLMs (applies to any parallel decoding methods, too, such as LLMs + speculative decoding) and 2. Our ultra-fast d3llm which gives both strong speedup (5x over AR LLM and 10 over vanilla dLLM) and strong accuracy!
Hao AI Lab@haoailab

🔥 New blog: AUP: when Accuracy Meets Parallelism in Diffusion Language Models. 🔗hao-ai-lab.github.io/blogs/text-dif… Diffusion LLMs promise parallel decoding, error correction, and random-order generation. But if you look at both speed and accuracy: Are dLLMs actually better than AR + speculative decoding? Our study: not yet… Here’s why, and how we design our ultra-fast dLLM framework d3LLM 🚀 to close the gap!

English
1
6
21
4.5K
Zhihui Xie retweetledi
Shizhe Diao
Shizhe Diao@shizhediao·
🚀 Excited to share ToolOrchestra, an end-to-end RL training framework for orchestrating tools and agentic workflows. Everyone’s building agent workflows these days — connecting tools, APIs, and LLMs like LEGO. 🧩 But here are our findings: 👉 Just prompting the agent workflow won’t cut it. It’s not how you build the best agent. 👉 Without learning, workflows plateau fast. It’s time to bring RL fine-tuning 🔥back into agent development. (1/n)
Shizhe Diao tweet media
English
29
70
348
67.6K