Kaiyan Zhang

85 posts

Kaiyan Zhang

@OkhayIea

PhD Student at Tsinghua University.

Joined April 2019
389 Following · 284 Followers
Pinned Tweet
Kaiyan Zhang @OkhayIea
🚀 Excited to share our new survey paper on RL for Large Reasoning Models (LRMs)!

Since early this year, our team has released several RL+LLM works (PRIME, TTRL, SimpleVLA, MARTI, SSRL, HPT), covering dense rewards, self-evolution, embodied AI, multi-agent systems, tool learning, and hybrid post-training. The field is growing rapidly, with new papers and projects popping up every day, so it felt like the right time to systematically review the landscape and reflect on the path towards superintelligence.

In the past two months, together with collaborators from Tsinghua University and Shanghai AI Lab, we organized and summarized the latest RL research for reasoning models into a comprehensive survey. Our paper introduces the fundamentals, problems, resources, applications, and future directions of RL for LRMs, with a special focus on the long-term co-evolution of language models and environments.

The preprint is online; check it out, discuss, and feel free to show support!
📄 Paper: huggingface.co/papers/2509.08…
🔗 GitHub: github.com/TsinghuaC3I/Aw…
5 replies · 76 reposts · 345 likes · 24.9K views
Kaiyan Zhang retweeted
Bingxiang He @HBX_hbx
✨What if the simplest RL recipe is all you need? Introducing JustRL: new SOTA among 1.5B reasoning models with 2× less compute. Stable improvement over 4,000+ steps. No multi-stage pipelines. No dynamic schedules. Just simple RL at scale. 📄 Blog: relieved-cafe-fe1.notion.site/JustRL-Scaling…
7 replies · 52 reposts · 333 likes · 39.6K views
Kaiyan Zhang @OkhayIea
Excited about the surge in Agent Memory research? With breakthroughs in Context Management and Learning from Experience powering self-improving AI agents, check out this curated Awesome list: github.com/TsinghuaC3I/Aw… Essential resources for building smarter, context-aware agents in 2025! #AI #Agents #Memory
0 replies · 1 repost · 9 likes · 445 views
Kaiyan Zhang retweeted
Xuekai Zhu @zhu_xuekai
We introduce FlowRL: ☑️ matching the full reward distribution via flow balancing instead of maximizing rewards in LLM RL.
3 replies · 5 reposts · 21 likes · 3K views
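A minimal sketch of the flow-balancing idea, assuming a trajectory-balance-style squared-residual objective (the function name, the learned log-partition term `log_z`, and the temperature `beta` are illustrative assumptions, not FlowRL's released code): instead of maximizing reward, the policy's sequence log-probability is pushed to track the scaled reward, so the policy matches the reward distribution rather than collapsing onto its mode.

```python
def flow_balance_loss(log_probs, rewards, log_z, beta=1.0):
    """Trajectory-balance-style loss (illustrative): drive
    log_z + log p_theta(y|x) toward r(y)/beta for each sampled
    response, so response probability tracks response reward.

    log_probs: summed log p_theta(y|x) per sampled response
    rewards:   scalar reward per response
    log_z:     learned estimate of the log partition function
    beta:      temperature scaling the reward
    """
    residuals = [log_z + lp - r / beta for lp, r in zip(log_probs, rewards)]
    return sum(d * d for d in residuals) / len(residuals)
```

When the policy already matches the reward distribution, every residual is zero and the loss vanishes; a reward-maximizing objective has no such distribution-matching fixed point.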
Kaiyan Zhang retweeted
机器之心 JIQIZHIXIN @jiqizhixin
Wow, A Survey of Reinforcement Learning for Large Reasoning Models This new survey reveals how RL has pushed LLMs beyond text into logic, math, and code—laying the groundwork for Large Reasoning Models (LRMs). But the road ahead isn’t just about more GPUs: scaling RL for reasoning faces deep challenges in algorithms, data, and infrastructure. By tracing breakthroughs since DeepSeek-R1 and mapping out future directions, researchers argue RL could be the catalyst that accelerates LLMs toward Artificial Superintelligence.
5 replies · 19 reposts · 118 likes · 5.7K views
Kaiyan Zhang retweeted
Ksenia_TuringPost @TheTuringPost
6 recent & free sources to master Reinforcement Learning:
▪️ A Survey of Continual Reinforcement Learning
▪️ Deep Reinforcement Learning course by @huggingface
▪️ Reinforcement Learning Specialization (Coursera, @UAlberta)
▪️ A Technical Survey of RL Techniques for LLMs
▪️ A Survey of RL for Software Engineering
▪️ A Survey of Reinforcement Learning for LRMs
Save the list, and check this out for the links and more: huggingface.co/posts/Kseniase…
5 replies · 73 reposts · 379 likes · 25.5K views
Kaiyan Zhang @OkhayIea
🚀 Excited to share that Awesome-RL-for-LRMs has reached 1,000 stars on GitHub! 🎉 Huge thanks for all your support 🙏 We’ll keep tracking the latest progress in RL for reasoning models, covering algorithms, resources, and applications, and look forward to the journey toward superintelligence. A new version of our survey is on the way, and your feedback is always welcome! 💡 📄 Paper: huggingface.co/papers/2509.08… 🔗 GitHub: github.com/TsinghuaC3I/Aw…
1 reply · 5 reposts · 31 likes · 1.8K views
Kaiyan Zhang retweeted
Ksenia_TuringPost @TheTuringPost
One of the most comprehensive surveys of Reinforcement Learning for LRMs. Covers:
- LLMs ➝ LRMs via RL (math, code, reasoning)
- Reward design, policy optimization, sampling
- RL vs. SFT, training recipes
- Uses: coding, agents, multimodal, robotics, etc.
- Future approaches: continual/memory/model-based RL, pretraining, diffusion, co-design
14 replies · 178 reposts · 829 likes · 51.4K views
Kaiyan Zhang retweeted
elvis @omarsar0
A Survey of Reinforcement Learning for Large Reasoning Models. 100+ pages covering foundational components, core problems, training resources, and applications. Great recaps of RL for LLMs.
15 replies · 98 reposts · 502 likes · 70.7K views
Francesco Bertolotti @f14bertolotti
This is a new 100-page RL for LLM literature review. It appears fairly complete. It also covers static/dynamic data and frameworks. And it has some nice figures! 🔗arxiv.org/abs/2509.08827
9 replies · 117 reposts · 746 likes · 45.8K views
Kaiyan Zhang @OkhayIea
@xieenze_jr Good job! Does the AR baseline use transformers.generate, vLLM, or sglang?
0 replies · 0 reposts · 1 like · 271 views
Enze Xie @xieenze_jr
🚀 Fast-dLLM v2: Parallel Block-Diffusion Decoding for LLMs
⚡️ Highlights 🌟
- Blockwise bidirectional context via complementary masks
- Hierarchical caches (block + sub-block)
- Parallel sub-block decoding + token-shift training
Results 📊
- ~2.5× faster vs. standard AR decoding on A100
- 102.5 tok/s (bs=1), 201.0 tok/s (bs=4), >2× Qwen2.5
- SOTA efficiency–quality trade-off among diffusion LLMs
Webpage 🌐: nvlabs.github.io/Fast-dLLM/v2/
7 replies · 54 reposts · 284 likes · 43.4K views
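The general shape of parallel block decoding can be sketched as a toy loop (illustrative only: Fast-dLLM v2's actual mechanism uses block diffusion with complementary masks and hierarchical caches, none of which is modeled here; `propose` and `accept` are hypothetical callbacks standing in for the draft step and the confidence rule):

```python
def parallel_block_decode(propose, accept, prompt, block_size=4, max_blocks=3):
    """Toy blockwise-parallel decoding loop: each step proposes a whole
    block of tokens at once, then commits only the confident prefix of
    that block, instead of emitting one token per forward pass.

    propose(seq, n): returns up to n (token, confidence) pairs
    accept(conf):    True if a proposed token is confident enough to keep
    """
    seq = list(prompt)
    for _ in range(max_blocks):
        block = propose(seq, block_size)
        for tok, conf in block:
            if not accept(conf):
                break  # stop committing at the first low-confidence token
            seq.append(tok)
    return seq
```

The speedup in such schemes comes from committing several tokens per model call whenever the proposals are confident, which is what the per-batch tok/s numbers above are measuring.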
Kaiyan Zhang retweeted
Yuxin Zuo @zuo_yuxin
🧭 Thinking about proposing a new RL algorithm? We introduce UPGE, a deep dive into post-training, to give you a boost!
🤔 Many recent works mix RL with SFT, but their loss functions are drastically different, so why should they be used together?
We introduce the Unified Policy Gradient Estimator (UPGE), a theoretical framework that unifies post-training algorithms (SFT, PPO, GRPO, etc.) under a common objective.
1️⃣ SFT and RL actually share a common optimization objective.
2️⃣ Combining SFT and RL improves not only Pass@1 but also Pass@k.
👉 Check out the paper for more details on what you can do with UPGE!
1 reply · 9 reposts · 18 likes · 1.4K views
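The unifying observation can be sketched for a softmax policy (a toy illustration of the idea that SFT and policy-gradient updates share one functional form; the function names and the `weight` decomposition are assumptions, not the paper's notation):

```python
def grad_log_pi(probs, action):
    """Gradient of log pi(action) w.r.t. the logits of a softmax
    policy: one_hot(action) - probs."""
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]

def policy_update_direction(probs, action, weight):
    """Unified form: every estimator contributes
    weight * grad log pi(action). SFT on a demonstration token is the
    special case weight = 1 (plain log-likelihood); REINFORCE uses the
    return or advantage as the weight; PPO/GRPO further modulate it
    with importance ratios and clipping."""
    g = grad_log_pi(probs, action)
    return [weight * gi for gi in g]
```

Under this view, mixing SFT and RL just mixes two choices of `weight` on the same gradient direction, which is one intuition for why the combination can improve both Pass@1 and Pass@k.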
Kaiyan Zhang retweeted
Ksenia_TuringPost @TheTuringPost
Recent research papers that you should definitely take a look at:
▪️ Mobile-Agent-v3
▪️ Prompt Orchestration Markup Language
▪️ SSRL: Self-Search Reinforcement Learning
▪️ Atom-Searcher
▪️ MindJourney
▪️ Deep Think with Confidence
▪️ Controlling Multimodal LLMs via Reward-guided Decoding
▪️ CRISP: Persistent Concept Unlearning via Sparse Autoencoders
▪️ Unlearning Comparator
▪️ XQuant
▪️ BeyondWeb
▪️ Retrieval-augmented reasoning with lean LMs
Find the full list here: turingpost.com/p/fod115
11 replies · 54 reposts · 274 likes · 17.6K views
Kaiyan Zhang retweeted
AK @_akhaliq
SSRL: Self-Search Reinforcement Learning
8 replies · 60 reposts · 280 likes · 37.4K views
Kaiyan Zhang @OkhayIea
🚀 New paper: SSRL: Self-Search Reinforcement Learning
Can LLMs serve as simulators of world knowledge for agentic RL, reducing external tool reliance without sacrificing sim2real generalization?
🔍 We introduce Self-Search: structured prompts + sampling to leverage LLMs’ internal knowledge.
🧠 SSRL further optimizes Self-Search with format- and rule-based rewards, no web access required.
🏆 Results: good sim2real generalization and seamless integration with real search engines.
Paper: arxiv.org/abs/2508.10874
Code: github.com/TsinghuaC3I/SS…
#AgenticRL #Sim2Real #LLM #RL #SSRL #AI
1 reply · 10 reposts · 20 likes · 1.4K views
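The format- and rule-based rewards mentioned above can be sketched like this (the tag names `<search>`/`<answer>`, the 0.5/0.5 weighting, and exact-match grading are illustrative assumptions, not SSRL's actual schema):

```python
import re

def format_reward(response):
    """Format reward sketch: checks that the model wrapped its
    self-search trace and final answer in the expected tags."""
    has_search = bool(re.search(r"<search>.*?</search>", response, re.S))
    has_answer = bool(re.search(r"<answer>.*?</answer>", response, re.S))
    return 0.5 * has_search + 0.5 * has_answer

def rule_reward(response, gold_answer):
    """Rule-based outcome reward sketch: exact match on the extracted
    answer, no web access or external verifier needed."""
    m = re.search(r"<answer>(.*?)</answer>", response, re.S)
    return 1.0 if m and m.group(1).strip() == gold_answer else 0.0
```

Because both signals are computed from the response text alone, the whole RL loop can run without calling a real search engine, which is the point of the "no web access required" claim.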
Kaiyan Zhang @OkhayIea
🧵 New drops on RL for LLM Reasoning show how scale, reward design, and agentic capabilities are evolving fast:

Gemini 2.5 (Google): Advanced reasoning, multimodal perception, and long-context capabilities, powered by unseen infrastructure scale. 📄 storage.googleapis.com/deepmind-media…

Kimi-Researcher (Moonshot AI): End-to-end RL enables tool use, planning, and document navigation, pushing toward autonomous agent capabilities. 🌐 moonshotai.github.io/Kimi-Researche…

POLARIS (Inclusion AI): A robust recipe for post-training RL on reasoning models: scalable, stable, and applicable across tasks. 📄 honorable-payment-890.notion.site/POLARIS-A-POst…

Skywork-SWE-32B: A model for software engineering tasks; shows data scaling laws emerge with RL + SWE domain alignment. 🤖 huggingface.co/Skywork/Skywor…

Reasoning360 (LLM360): Cross-domain RL reasoning benchmark across 10 diverse tasks; highlights generalization gaps in current LLMs. 📄 arxiv.org/abs/2506.14965 | 🔗 github.com/LLM360/Reasoni…

Ego-R1 (NTU): Combines tool use & chain-of-thought for ultra-long egocentric video QA (e.g., 20k+ frames). 📄 arxiv.org/abs/2506.13654 | 💻 github.com/egolife-ai/Ego…

Lessons from Verifiable Rewards: Shows grounded RL reward signals boost alignment and reasoning in real-world setups. 📄 arxiv.org/abs/2506.15522

AutoRule (CMU): Extracts logic rules from CoT and integrates them into reward functions, improving preference learning. 📄 arxiv.org/abs/2506.15651 | 💻 github.com/cxcscmu/AutoRu…

Act Only When It Pays (GRESO): Selective rollout strategy for RL that reduces token-level waste and speeds up training. 📄 arxiv.org/abs/2506.02177 | 💻 github.com/Infini-AI-Lab/…

🔗 Full collection on GitHub: github.com/TsinghuaC3I/Aw…
0 replies · 0 reposts · 7 likes · 277 views
Kaiyan Zhang @OkhayIea
🧵 Latest RL-for-LLM Reasoning Papers – June 16-17, 2025

1. Reasoning with Exploration: An Entropy Perspective (RUC & MSRA)
Incorporates entropy into the RL advantage function to promote longer, deeper reasoning chains. Observes direct gains in Pass@K even at high K values.
🔗 arxiv.org/abs/2506.14758

2. Ring-lite: Scalable Reasoning via C3PO-Stabilized RL (Inclusion AI)
A MoE-based LLM using Constrained Contextual Computation Policy Optimization (C3PO) to improve training stability and efficiency. Matches SOTA with only one-third of the active parameters.
🔗 arxiv.org/abs/2506.14731 💻 github.com/inclusionAI/Ri…

3. Reinforcement Learning with Verifiable Rewards… (MRA & PKU)
Introduces a new CoT-Pass@K metric showing RLVR truly improves reasoning integrity; early gains generalize across K.
🔗 arxiv.org/abs/2506.14245

4. Adaptive Guidance Accelerates RL of Reasoning Models (Scale AI)
Proposes "Guide", which uses natural language hints during RL training. Achieves up to +4% improvement across math domains at scale.
🔗 arxiv.org/abs/2506.13923

5. MiniMax-M1: Scaling Test-Time Compute Efficiently (MiniMax)
A hybrid MoE model with "lightning attention" supporting 1M-token contexts. Achieves efficient RL on long-context and software-engineering tasks; code released.
🔗 arxiv.org/abs/2506.13585 💻 github.com/MiniMax-AI/Min…

6. Direct Reasoning Optimization (DRO) (Microsoft & UCLA)
Proposes the R3 reward, a self-generated reflection signal for open-ended reasoning. Demonstrates consistent gains on long-form and math tasks without external rewards.
🔗 arxiv.org/abs/2506.13351

7. AceReason-Nemotron 1.1 (NVIDIA)
Shows synergy between SFT and RL: tuning the sampling temperature so entropy ≈ 0.3 yields a significant boost. The new 7B model achieves SOTA in math & code; released on Hugging Face.
🔗 arxiv.org/abs/2506.13284

✅ Explore all papers, code & models here: github.com/TsinghuaC3I/Aw…
1 reply · 0 reposts · 8 likes · 489 views
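The entropy-in-the-advantage idea from paper 1 above can be sketched as follows (a toy version: the shaping coefficient `alpha` and the exact additive form are assumptions, not the paper's formula):

```python
import math

def entropy(probs):
    """Shannon entropy of a token distribution, in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def shaped_advantage(advantage, token_probs, alpha=0.1):
    """Entropy-shaped advantage sketch: add a small bonus proportional
    to the policy's token entropy, so the RL update keeps rewarding
    exploratory (high-entropy) steps instead of collapsing the chain
    of thought early."""
    return advantage + alpha * entropy(token_probs)
```

A deterministic step (entropy 0) keeps its original advantage, while uncertain steps get a small boost, which is one way an entropy term can encourage longer, deeper reasoning chains.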