Heming Xia

200 posts

Heming Xia
@hemingkx

Ph.D. student @HongKongPolyU | Prev MEng & BSc @PKU1898 | Prev Intern @MSFTResearch (MSRA) | NLP | Language Modeling

Hong Kong · Joined July 2020
2.2K Following · 1.3K Followers
Pinned Tweet
Heming Xia@hemingkx·
🎉Excited to share that TokenSkip has been accepted to the main conference of EMNLP 2025! Many thanks to all the coauthors for their hard work! Looking forward to seeing everyone in Suzhou😉. arxiv.org/abs/2502.12067
Heming Xia@hemingkx

Does every token in the CoT output contribute equally to deriving the answer? —— We say NO! 🚀 We are excited to introduce TokenSkip, which enables LLMs to skip less important tokens during Chain-of-Thought generation⚡️. 📄 Arxiv: arxiv.org/abs/2502.12067 🧵1/n
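The pruning step is easy to picture. Below is a toy sketch of the idea, not the paper's implementation: it assumes per-token importance scores are already available (TokenSkip's actual scorer and training recipe are in the arXiv link) and skips the fine-tuning stage entirely.

```python
# Toy sketch of TokenSkip-style CoT compression (illustrative, not the paper's code).
def compress_cot(tokens: list[str], scores: list[float], ratio: float) -> list[str]:
    """Keep the top `ratio` fraction of CoT tokens by importance score."""
    k = max(1, int(len(tokens) * ratio))
    keep = set(sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k])
    return [t for i, t in enumerate(tokens) if i in keep]  # preserve original order

cot = "First , compute 12 * 4 = 48 , then add 2 to get 50 .".split()
scores = [0.1, 0.0, 0.3, 0.9, 0.8, 0.9, 0.9, 0.9, 0.0, 0.2, 0.6, 0.9, 0.4, 0.6, 0.9, 0.0]
print(" ".join(compress_cot(cot, scores, ratio=0.5)))  # drops filler, keeps the math
```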

Heming Xia retweeted
Yinghui He@yinghui_he_·
RLVR gives sparse supervision; On-Policy Self-Distillation often requires high-quality demonstrations. Our new method, ✨SD-Zero✨, gets the best of both worlds: we use the model’s self-revision to turn binary rewards into dense token-level supervision. No external teacher. No curated demonstrations.

🚨 Introducing Self-Distillation Zero (SD-Zero), which trains one model to play two roles: (1) a “Generator” that makes attempts, and (2) a “Reviser” that conditions on the generator’s failed/successful attempt + binary reward to produce a better answer. ‼️Even WRONG attempts can become the training signal.‼️

🔗 Paper: arxiv.org/abs/2604.12002

🏆 SD-Zero brings 10%+ improvement over base models (Qwen3-4B, Olmo3-7B) on math & code reasoning, beating GRPO and vanilla On-Policy Self-Distillation under the same training budget. SD-Zero also enables iterative self-evolution.
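For intuition, here is a toy, self-contained rendering of the two-role loop as I read it from this thread; the stub functions stand in for real LLM calls, and the returned record stands in for a token-level distillation target (none of this is the paper's code).

```python
# Toy sketch of the SD-Zero generator/reviser loop (my reading of the tweet).
import random

def generate(prompt):                  # role 1: generator makes an attempt
    return random.choice(["42", "41"])

def revise(prompt, attempt, reward):   # role 2: reviser sees attempt + binary reward
    return "42" if reward == 0 else attempt

def verify(prompt, answer):            # binary RLVR signal
    return 1 if answer == "42" else 0

def sd_zero_step(prompt):
    attempt = generate(prompt)
    reward = verify(prompt, attempt)
    revision = revise(prompt, attempt, reward)
    if verify(prompt, revision):
        # Dense supervision: every token of the verified revision becomes a
        # distillation target; even a WRONG attempt contributed, as the
        # context the reviser improved on.
        return {"prompt": prompt, "target": revision, "from_failed_attempt": reward == 0}
    return None

print(sd_zero_step("What is 6 * 7?"))
```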
Heming Xia retweeted
Shizhe Diao@shizhediao·
RLVR is powerful — but how do you train with multiple rewards effectively? 🤔 🎯GDPO (not GRPO) is coming. We introduce Group reward-Decoupled Normalization Policy Optimization (GDPO), a new multi-reward RL algorithm that consistently improves per-reward convergence over GRPO across a wide range of tasks. (1/n)
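The name suggests where it departs from GRPO: normalize each reward stream within the group before combining, rather than normalizing the already-summed reward once. A minimal numpy sketch of that reading (an inference from the acronym, not the paper's exact algorithm):

```python
# Contrast between GRPO-style and reward-decoupled group normalization (sketch).
import numpy as np

def grpo_advantages(rewards):  # rewards: (group_size, num_rewards)
    total = rewards.sum(axis=1)                            # collapse rewards first...
    return (total - total.mean()) / (total.std() + 1e-8)   # ...then normalize once

def decoupled_advantages(rewards):
    # Normalize each reward stream within the group BEFORE combining, so one
    # high-variance reward cannot drown out the others.
    normed = (rewards - rewards.mean(axis=0)) / (rewards.std(axis=0) + 1e-8)
    return normed.sum(axis=1)

r = np.array([[1.0, 0.01], [0.0, 0.03], [1.0, 0.02], [0.0, 0.00]])  # 2 rewards, group of 4
print(grpo_advantages(r))       # dominated by the first (high-variance) reward
print(decoupled_advantages(r))  # both reward streams shape the advantage
```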
Heming Xia retweeted
Rimsha Bhardwaj@heyrimsha·
Holy shit… Meta might’ve just solved self-improving AI 🤯

Their new paper SPICE (Self-Play in Corpus Environments) basically turns a language model into its own teacher: no humans, no labels, no datasets, just the internet as its training ground.

Here’s the twist: one copy of the model becomes a Challenger that digs through real documents to create hard, fact-grounded reasoning problems. Another copy becomes the Reasoner, trying to solve them without access to the source. They compete, learn, and evolve together: an automatic curriculum with real-world grounding, so it never collapses into hallucinations.

The results are nuts: +9.1% on reasoning benchmarks with Qwen3-4B, +11.9% with OctoThinker-8B, and it beats every prior self-play method like R-Zero and Absolute Zero.

This flips the script on AI self-improvement. Instead of looping on synthetic junk, SPICE grows by mining real knowledge: a closed-loop system with open-world intelligence. If this scales, we might be staring at the blueprint for autonomous, self-evolving reasoning models.
Heming Xia retweeted
Xiang Yue@xiangyue96·
There are competing views on whether RL can genuinely improve a base model's performance (e.g., pass@128). The answer is both yes and no, largely depending on the interplay between pre-training, mid-training, and RL. We trained a few hundred GPT-2-scale LMs on synthetic GSM-like reasoning data from scratch. Here is what we found: 🧵
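For reference, pass@k in such claims is usually computed with the standard unbiased estimator from Chen et al. (2021): sample n completions per problem, count c correct, and average 1 - C(n-c, k) / C(n, k) over problems.

```python
# Standard unbiased pass@k estimator (Chen et al., 2021) for one problem.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:          # fewer failures than draws: some draw must succeed
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=256, c=3, k=128))  # ~0.88: even 3/256 correct gives high pass@128
```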
Heming Xia@hemingkx·
ERNIE 5.0 ranks #1 in China, #2 globally on LMArena. But more importantly, it pushes forward the architectural frontier of multi-modal language modeling. Congrats to the team at @Baidu_Inc!🫡
Heming Xia@hemingkx·
The system scales to 2.4T parameters with ultra-sparse MoE (<3% activation), and Baidu's engineering contributions in training/inference (e.g., expert parallelism, FP8, speculative decoding) are non-trivial and worth independent study.
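As a quick back-of-envelope check on what "<3% activation" means at this scale (assuming the ratio is over total parameters):

```python
# Active parameters per token for an ultra-sparse MoE (rough arithmetic only).
total_params = 2.4e12       # 2.4T total, per the tweet
activation_ratio = 0.03     # "<3% activation"
print(f"active params/token: <= {total_params * activation_ratio / 1e9:.0f}B")  # ~72B
```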
Heming Xia retweeted
𝚐𝔪𝟾𝚡𝚡𝟾
Tina proved that LoRA can match or surpass full-parameter RL. Tora builds directly on that result, turning it into a full framework. Built on torchtune, it extends RL post-training to LoRA, QLoRA, DoRA, and QDoRA under one interface with GRPO, FSDP, and compile support. QLoRA and QDoRA enable 4-bit RL with stable rewards, while DoRA-Cache speeds rollouts by 2–4× under the same setup. Tora establishes a clean, scalable baseline for LoRA in RL post-training. ⮕ link below
𝚐𝔪𝟾𝚡𝚡𝟾@gm8xx8

Tina: Tiny Reasoning Models via LoRA
LoRA-RL tuned 1.5B models on curated reasoning data, achieving +20% gains and 43% Pass@1 (AIME24) at $9 total cost. Outperforms full-parameter RL on DeepSeek-R1-Distill-Qwen-1.5B.
- LoRA-based RL yields better performance with less compute.
- Best checkpoints align with format-reward transitions, not accuracy plateaus.
- Efficiently adapts reasoning structure while preserving core model knowledge.
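As a generic illustration of the LoRA side, here is a Hugging Face peft-style setup; Tora itself is built on torchtune, whose interface differs, so treat this as a sketch of the concept (only the low-rank adapters train, the base weights stay frozen):

```python
# Generic LoRA setup (peft), illustrating why LoRA-RL is cheap: only the
# adapters are trainable. NOT Tora's torchtune interface.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "k_proj", "v_proj", "o_proj"])
model = get_peft_model(base, lora)
model.print_trainable_parameters()  # a tiny fraction of the 1.5B base model
```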

Heming Xia retweeted
Zhaopeng Tu@tuzhaopeng·
What if LLMs could learn like curious children — exploring what surprises them most? 🤖🧒

Introducing CDE (Curiosity-Driven Exploration) — a lightweight framework that leverages a model's intrinsic sense of curiosity to guide exploration in RLVR, solving the "learn too fast, see too little" problem in LLM training.

1️⃣ We formalize curiosity from two perspectives:
🎭 Actor Curiosity: Uses perplexity to identify & reward surprising but correct solutions.
🎯 Critic Curiosity: Uses multi-head variance to spot and explore uncertain, under-explored regions.

2️⃣ We uncover and mitigate "calibration collapse" in standard RLVR:
🤯 Models lose self-awareness, becoming equally confident in both right and wrong answers.
📏 CDE maintains healthy calibration, a key step in curbing LLM hallucinations.

3️⃣ Strong empirical results:
🚀 +3 point improvement over strong RLVR baselines on the challenging AIME math benchmarks.
📈 Pass@16 gains of 8-10 points on multiple datasets.
⚖️ Better exploration-exploitation balance for more robust reasoning.

📃 Paper: arxiv.org/abs/2509.09675
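A small sketch of the two curiosity signals as described in the thread; the shapes and scaling here are stand-ins, not the paper's implementation.

```python
# Sketch of CDE's two curiosity signals (illustrative shapes only).
import torch

def actor_curiosity_bonus(token_logprobs: torch.Tensor, correct: bool) -> float:
    # Perplexity of the model's own solution: high perplexity + correct answer
    # = "surprising but right", which actor curiosity rewards.
    ppl = torch.exp(-token_logprobs.mean()).item()
    return ppl if correct else 0.0

def critic_curiosity(value_heads: torch.Tensor) -> torch.Tensor:
    # Disagreement across multiple value heads flags under-explored states.
    return value_heads.var(dim=0)

logps = torch.log(torch.tensor([0.9, 0.2, 0.8]))  # one unlikely-but-kept token
print(actor_curiosity_bonus(logps, correct=True))
heads = torch.tensor([[0.1, 0.9], [0.5, 0.8], [0.9, 0.85]])  # 3 heads, 2 states
print(critic_curiosity(heads))  # state 0 is uncertain, state 1 is not
```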
Heming Xia retweeted
Dongfu Jiang@DongfuJiang·
🚀 Excited to finally share our paper on VerlTool, released today after months of work since the initial release in late May!

VerlTool is a high-efficiency, easy-to-use framework for Agentic RL with Tool use (ARLT), built on top of VeRL. It currently supports a wide range of tools (including multimodal ones) such as code interpreter, FAISS retriever, Google Search, Bash terminal, SQL executor, image processing, SWE, and more. For each tool, we provide training recipes and detailed analysis, with all code designed to be reproducible and runnable on a single node.

A key design choice is the separation of the RL workflow and the tool server. Every trajectory sends tool calls via a well-designed API interface after encountering an action stop token. The tool server handles requests with either multi-threading or Ray, ensuring high concurrency and stable resource management—for example, our math experiments run stably past 1k steps.

Our goal with VerlTool is to make it easy for the community to add new tools in ARLT training. Developers only need to inherit from BaseTool and adapt minimal code. In fact, you could even give the BaseTool file to GPT/Claude and get almost plug-and-play code.

We also explored important technical issues in Agentic RL, such as how much async rollouts can actually speed things up, or how tool response tokenization may cause off-policy drift. We hope these insights, while modest, can be useful for the community.

📄 HuggingFace Daily Paper: huggingface.co/papers/2509.01…
🛠️ Github: github.com/TIGER-AI-Lab/v…
More details: (0/5)👇
Dongfu Jiang@DongfuJiang

Introducing VerlTool - a unified and easy-to-extend tool agent training framework based on verl.

Recently, there's been a growing trend toward training tool agents with reinforcement learning algorithms like GRPO and PPO. Representative works include SearchR1, ToRL, ReTool, and ToolRL. While these achieve impressive performance, their training code is either not fully open-sourced or too difficult to modify and customize with new tools, creating unexpectedly high engineering costs for the community when exploring new ideas. To address these issues and reduce engineering overhead, we propose verl-tool.

Key Features:
1. 🔧 Complete decoupling of actor rollout and environment interaction - We use verl as a submodule to benefit from ongoing verl repo updates. All tool calling is integrated via a unified API, allowing you to easily add new tools by simply adding a Python file and testing independently.
2. 🌍 Tool-as-environment paradigm - Each tool interaction can modify the environment state. We store and reload environment states for each trajectory. For each training, you can launch …
3. ⚡ Native RL framework for tool-calling agents - verl-tool natively supports multi-turn interactive loops between agents and their tool environments.
4. 📊 User-friendly evaluation suite - Launch your trained model with an OpenAI-compatible API alongside the tool server. Simply send questions and get final outputs, with all interactions handled internally.

We've successfully reproduced ToRL results using our verl-tool framework, demonstrating its correctness and comparable performance on mathematical benchmarks.

VerlTool is an active, ongoing project! We aim to incorporate more tools covering a wide range of use cases and expect they can be trained together in a single framework. Suggestions and contributions are highly welcomed!

Check out our GitHub: github.com/TIGER-AI-Lab/v…
More details: 👇 (0/4)
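To make the extension point concrete, here is a hypothetical BaseTool subclass; the real interface lives in the TIGER-AI-Lab repo, so the names and signatures below are illustrative only.

```python
# Hypothetical sketch of the "inherit from BaseTool" pattern (illustrative names).
import subprocess

class BaseTool:                       # stand-in for the framework's base class
    name: str = "base"
    def execute(self, call: str) -> str:
        raise NotImplementedError

class BashTool(BaseTool):
    name = "bash"
    def execute(self, call: str) -> str:
        # Run the agent's tool call; the returned observation is fed back
        # into the next turn of the trajectory.
        out = subprocess.run(call, shell=True, capture_output=True,
                             text=True, timeout=10)
        return out.stdout + out.stderr

print(BashTool().execute("echo hello from the tool server"))
```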

Heming Xia retweeted
Qian Liu@sivil_taram·
Thanks AK for sharing our work! 🔥

🧵 Back to January when we started this project... we were living a nightmare 😩 Months of watching our multi-turn RL models collapse. Every. Single. Time. 💥 We thought we were doing something wrong... until we discovered other research teams seemed to hit the same invisible wall (Devin, verl, and other reported issues) 🧱 Multi-turn tool reasoning just... BROKE 💔 It's NOT like the elegant simplicity of R1-Zero’s approach. This was pure chaos.

Then came the "aha!" moment of our SimpleTIR 💡✨ The secret was hiding in plain sight: “void turns” - those meaningless steps where the model generates text that leads... absolutely NOWHERE 🕳️ One simple filter changed everything ✨ Our 7B model jumped from 22% (DAPO) to 50% (Multi-Turn Tool Use) on AIME24 📈 No complex algorithms, no fancy techniques. Just removing the void-turn examples that were poisoning the training 🎯 Sometimes, the biggest gains come from understanding what NOT to learn 💡

📄 Paper: huggingface.co/papers/2509.02…
💻 Code: github.com/ltzheng/Simple…
✍️ Blog: simpletir.notion.site/report
AK@_akhaliq

SimpleTIR: End-to-End Reinforcement Learning for Multi-Turn Tool-Integrated Reasoning
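The "void turn" filter is simple enough to sketch. A minimal version under my reading of the thread, where a turn is void if it yields neither a code block nor a final answer (the SimpleTIR repo defines the actual rule):

```python
# Minimal "void turn" trajectory filter (my reading, not the authors' code).
CODE_FENCE = "`" * 3   # markdown fence marking an executable tool call

def is_void(turn: str) -> bool:
    # A void turn produces neither executable code nor a final boxed answer.
    return CODE_FENCE not in turn and "\\boxed" not in turn

def keep_trajectory(turns: list[str]) -> bool:
    return not any(is_void(t) for t in turns)

good = [f"Compute it: {CODE_FENCE}python print(2 + 2) {CODE_FENCE}",
        "The answer is \\boxed{4}."]
bad = ["I will now think about thinking about the problem..."]  # leads nowhere
print(keep_trajectory(good), keep_trajectory(bad))  # True False
```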

Heming Xia retweeted
Junxian He@junxian_he·
Mirage or method? We re-assess a series of RL observations such as spurious reward, one-shot RL, test-time RL, and negative-sample training. 🧐 These approaches were all originally validated on the Qwen+math combination, but do they work in other settings? If not, under which conditions do the conclusions hold?

Unsurprisingly, we find these techniques are ineffective in many settings. However, Qwen is not the true magic here -- the original conclusions hold even for Llama models on certain tasks. Through extensive experiments, we identify model-task alignment as the underlying reason (rather than data contamination): these RL techniques yield divergent conclusions when models perceive the task difficulty differently.

☘️ Luckily, standard RL with gold reward and proper training data size works in all our settings.

🤔 Rather than blaming the previous conclusions, I like the implications of these methods when thinking the other way around: how can we build a base model that can be easily RLed, where these techniques work and relax the requirements for accurate reward and large-scale in-domain training data? It seems mid-training and cold-start SFT are important factors affecting how critical the quality of the subsequent reward and RL data needs to be.

Our paper is at: arxiv.org/abs/2508.21188
Heming Xia retweeted
Meituan LongCat@Meituan_LongCat·
🚀 LongCat-Flash-Chat Launches!
▫️ 560B Total Params | 18.6B-31.3B Dynamic Activation
▫️ Trained on 20T Tokens | 100+ tokens/sec Inference
▫️ High Performance: TerminalBench 39.5 | τ²-Bench 67.7
🔗 Model: huggingface.co/meituan-longca…
💻 Try Now: longcat.ai
Heming Xia retweeted
Rohan Paul@rohanpaul_ai·
The paper shows how an LLM agent keeps improving by learning from its own memory, without changing the base model. It ranks top on GAIA validation at 87.88% Pass@3, with 79.40% on the private test.

Most agent systems either rely on fixed workflows that never adapt, or burn compute to fine-tune model weights. AgentFly stores each solved attempt as a case in episodic memory, then picks similar cases to guide the next plan. They cast it as a memory-augmented decision process, where a learned retrieval policy scores which past cases to reuse. That policy learns online from task rewards, using either simple similarity or a small neural scorer, so case choice keeps improving.

A planner proposes subtasks with those cases, an executor runs tools via the Model Context Protocol, and case, subtask, and tool memories track progress. Because only the memory and the retrieval policy update, the base LLM stays frozen, cost stays low, and the agent adapts continuously.

Across research and question answering, the case memory lifts out-of-distribution accuracy by +4.7% to +9.6%, and hits 95.0% on SimpleQA. The takeaway is practical: teach the agent which past experiences matter and it will plan better without fiddling with weights.

Paper: arxiv.org/abs/2508.16153
Paper Title: "AgentFly: Fine-tuning LLM Agents without Fine-tuning LLMs"
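A compact sketch of the case-memory loop described above, with a similarity-times-reward score standing in for the learned retrieval policy (names and scoring are illustrative; the paper also supports a small neural scorer):

```python
# Sketch of episodic case memory with a reward-weighted retrieval score.
import numpy as np

class CaseMemory:
    def __init__(self):
        self.cases = []  # (embedding, plan, reward) triples from solved attempts

    def add(self, emb, plan, reward):
        self.cases.append((np.asarray(emb, dtype=float), plan, reward))

    def retrieve(self, emb, k=2):
        # Score past cases by similarity x reward. Only this memory (and
        # optionally the scorer) ever updates; the base LLM stays frozen.
        emb = np.asarray(emb, dtype=float)
        ranked = sorted(self.cases, key=lambda c: float(c[0] @ emb) * c[2], reverse=True)
        return [plan for _, plan, _ in ranked[:k]]

mem = CaseMemory()
mem.add([1.0, 0.0], "search arxiv, then summarize", reward=1.0)
mem.add([0.0, 1.0], "open spreadsheet, sum column", reward=1.0)
print(mem.retrieve([0.9, 0.1], k=1))  # similar past case guides the next plan
```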
Heming Xia retweeted
Rohan Paul@rohanpaul_ai·
Another paper claiming a really BIG result: the first method to achieve 99.9% on AIME 2025 with open-source models! 🤯

DeepConf uses a model's own token confidence to keep only its strongest reasoning, reaching 99.9% with GPT-OSS-120B while cutting generated tokens by up to 84.7% compared to standard parallel thinking. Most systems still lean on self-consistency with majority voting, which lifts accuracy but hits diminishing returns and burns a lot of tokens.

🧠 The key idea: DeepConf is a test-time method that scores the model's reasoning locally for confidence, filters weak traces, and often improves accuracy with fewer tokens, without any extra training or tuning.

🧱 Why majority voting hits a wall: parallel thinking samples many chains and votes. Accuracy grows slowly as samples rise, so compute scales linearly and the benefit flattens, which is exactly the pain DeepConf targets.

🔎 The confidence signals:
- Token confidence is the negative mean log probability of the top-k candidates at each step, which gives a direct signal of how sure the model is at that moment.
- Group confidence averages token confidence over a sliding window, so local dips are visible without noise from the whole trace.
- Tail confidence averages the last chunk of tokens, because the ending steps decide the final answer and are where good traces often slip.
- Bottom-10% group confidence looks at the worst parts of a trace, a strong indicator that the overall reasoning is shaky.
- Lowest group confidence picks the single weakest window along a trace, which turns out to be a clean gate for dropping that trace early.

✅ Bottom line: DeepConf is a plug-in test-time compression recipe that filters or halts weak reasoning in place, so teams get higher accuracy and a big token cut without retraining or new hyperparameters.
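Transcribing the thread's definitions into code (my transcription, not the authors' release): token confidence is the negative mean log-probability of the top-k candidates, so a peaked distribution scores higher than a flat one; group confidence is its sliding-window average, and the lowest window acts as the gate for dropping a trace.

```python
# DeepConf-style confidence signals, written out from the thread's definitions.
import numpy as np

def token_confidence(topk_logprobs):            # shape: (steps, k)
    return -np.mean(topk_logprobs, axis=1)      # higher = more peaked = more sure

def group_confidence(token_conf, window=3):     # sliding-window average
    return np.convolve(token_conf, np.ones(window) / window, mode="valid")

def lowest_group_confidence(token_conf, window=3):
    return group_confidence(token_conf, window).min()  # gate for weak traces

# Top-2 probabilities per step; step 3 is a local dip in certainty.
lp = np.log([[0.90, 0.01], [0.85, 0.02], [0.30, 0.25], [0.80, 0.05]])
tc = token_confidence(lp)
print(tc, lowest_group_confidence(tc))
```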
Heming Xia retweeted
Feng Yao@fengyao1909·
Failing on large-scale RL with VeRL? ⚠️ Mixing an inference backend (vLLM/SGLang) with training backends (FSDP/Megatron) secretly turns your RL into off-policy — even if they share the same weights!

📉 Blog: fengyao.notion.site/off-policy-rl
💻 Code: github.com/yaof20/verl/tr…
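One common mitigation is to treat the inference backend as a slightly different behavior policy and re-weight sampled tokens by a clipped importance ratio between trainer and sampler log-probs. A sketch under that assumption (the authors' exact fix is in the linked blog):

```python
# Clipped importance-ratio correction for the rollout/trainer logprob mismatch.
import torch

def corrected_pg_loss(logp_train, logp_rollout, advantages, clip=2.0):
    # exp(logp_train - logp_rollout) == 1 only if both backends truly match.
    ratio = torch.exp(logp_train - logp_rollout).clamp(max=clip)
    return -(ratio * logp_train * advantages).mean()

lt = torch.log(torch.tensor([0.50, 0.40, 0.05]))  # trainer (FSDP/Megatron)
lr = torch.log(torch.tensor([0.52, 0.38, 0.20]))  # sampler (vLLM/SGLang)
adv = torch.tensor([1.0, 1.0, -0.5])
print(corrected_pg_loss(lt, lr, adv))  # the mismatched token is down-weighted
```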
Heming Xia retweeted