Kaiyan Zhang

85 posts

Kaiyan Zhang

@OkhayIea

PhD Student at Tsinghua University.

Joined April 2019
389 Following · 284 Followers
Pinned Tweet
Kaiyan Zhang @OkhayIea
🚀 Excited to share our new survey paper on RL for Large Reasoning Models (LRMs)!

Since early this year, our team has released several RL+LLM works (PRIME, TTRL, SimpleVLA, MARTI, SSRL, HPT), covering dense rewards, self-evolution, embodied AI, multi-agent systems, tool learning, and hybrid post-training. The field is growing rapidly, with new papers and projects popping up every day, so it felt like the right time to systematically review the landscape and reflect on the path towards superintelligence.

In the past two months, together with collaborators from Tsinghua University and Shanghai AI Lab, we organized and summarized the latest RL research for reasoning models into a comprehensive survey. Our paper introduces the fundamentals, problems, resources, applications, and future directions of RL for LRMs, with a special focus on the long-term co-evolution of language models and environments.

The preprint is online; check it out, discuss, and feel free to show support!
📄 Paper: huggingface.co/papers/2509.08…
🔗 GitHub: github.com/TsinghuaC3I/Aw…
5 replies · 76 reposts · 345 likes · 24.9K views
Kaiyan Zhang retweeted
Bingxiang He @HBX_hbx
✨What if the simplest RL recipe is all you need? Introducing JustRL: new SOTA among 1.5B reasoning models with 2× less compute. Stable improvement over 4,000+ steps. No multi-stage pipelines. No dynamic schedules. Just simple RL at scale. 📄 Blog: relieved-cafe-fe1.notion.site/JustRL-Scaling…
7 replies · 52 reposts · 333 likes · 39.6K views
Kaiyan Zhang @OkhayIea
Excited about the surge in Agent Memory research? With breakthroughs in Context Management and Learning from Experience powering self-improving AI agents, check out this curated Awesome list: github.com/TsinghuaC3I/Aw… Essential resources for building smarter, context-aware agents in 2025! #AI #Agents #Memory
0 replies · 1 repost · 9 likes · 445 views
Kaiyan Zhang retweeted
Xuekai Zhu @zhu_xuekai
We introduce FlowRL: ☑️ matching the full reward distribution via flow balancing instead of maximizing rewards in LLM RL.
3 replies · 5 reposts · 21 likes · 3K views
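A minimal sketch of the flow-balancing idea, assuming a trajectory-balance-style squared-residual objective (the function name, the learned log-partition term `log_z`, and the temperature `beta` are illustrative assumptions, not FlowRL's released code): instead of maximizing reward, the policy's sequence log-probability is pushed to track the scaled reward, so the policy matches the reward distribution rather than collapsing onto its mode.

```python
def flow_balance_loss(log_probs, rewards, log_z, beta=1.0):
    """Trajectory-balance-style loss (illustrative): drive
    log_z + log p_theta(y|x) toward r(y)/beta for each sampled
    response, so response probability tracks response reward.

    log_probs: summed log p_theta(y|x) per sampled response
    rewards:   scalar reward per response
    log_z:     learned estimate of the log partition function
    beta:      temperature scaling the reward
    """
    residuals = [log_z + lp - r / beta for lp, r in zip(log_probs, rewards)]
    return sum(d * d for d in residuals) / len(residuals)
```

When the policy already matches the reward distribution, every residual is zero and the loss vanishes; a reward-maximizing objective has no such distribution-matching fixed point.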
Kaiyan Zhang retweeted
机器之心 JIQIZHIXIN @jiqizhixin
Wow, A Survey of Reinforcement Learning for Large Reasoning Models This new survey reveals how RL has pushed LLMs beyond text into logic, math, and code—laying the groundwork for Large Reasoning Models (LRMs). But the road ahead isn’t just about more GPUs: scaling RL for reasoning faces deep challenges in algorithms, data, and infrastructure. By tracing breakthroughs since DeepSeek-R1 and mapping out future directions, researchers argue RL could be the catalyst that accelerates LLMs toward Artificial Superintelligence.
5 replies · 19 reposts · 118 likes · 5.7K views
Kaiyan Zhang retweeted
Ksenia_TuringPost @TheTuringPost
6 recent & free sources to master Reinforcement Learning:
▪️ A Survey of Continual Reinforcement Learning
▪️ Deep Reinforcement Learning course by @huggingface
▪️ Reinforcement Learning Specialization (Coursera, @UAlberta)
▪️ A Technical Survey of RL Techniques for LLMs
▪️ A Survey of RL for Software Engineering
▪️ A Survey of Reinforcement Learning for LRMs
Save the list, and check this out for the links and more: huggingface.co/posts/Kseniase…
5 replies · 73 reposts · 379 likes · 25.5K views
Kaiyan Zhang @OkhayIea
🚀 Excited to share that Awesome-RL-for-LRMs has reached 1,000 stars on GitHub! 🎉 Huge thanks for all your support 🙏 We’ll keep tracking the latest progress in RL for reasoning models, covering algorithms, resources, and applications, and look forward to the journey toward superintelligence. A new version of our survey is on the way, and your feedback is always welcome! 💡 📄 Paper: huggingface.co/papers/2509.08… 🔗 GitHub: github.com/TsinghuaC3I/Aw…
1 reply · 5 reposts · 31 likes · 1.8K views
Kaiyan Zhang retweeted
Ksenia_TuringPost @TheTuringPost
One of the most comprehensive surveys of Reinforcement Learning for LRMs. Covers:
- LLMs ➝ LRMs via RL (math, code, reasoning)
- Reward design, policy optimization, sampling
- RL vs. SFT, training recipes
- Uses: coding, agents, multimodal, robotics, etc.
- Future approaches: continual/memory/model-based RL, pretraining, diffusion, co-design
14 replies · 178 reposts · 829 likes · 51.4K views
Kaiyan Zhang retweeted
elvis @omarsar0
A Survey of Reinforcement Learning for Large Reasoning Models. 100+ pages covering foundational components, core problems, training resources, and applications. Great recaps of RL for LLMs.
15 replies · 98 reposts · 502 likes · 70.7K views
Francesco Bertolotti @f14bertolotti
This is a new 100-page RL for LLM literature review. It appears fairly complete. It also covers static/dynamic data and frameworks. And it has some nice figures! 🔗arxiv.org/abs/2509.08827
9 replies · 117 reposts · 746 likes · 45.8K views
Kaiyan Zhang @OkhayIea
@xieenze_jr Good job! Does the AR baseline use transformers.generate, vLLM, or sglang?
0 replies · 0 reposts · 1 like · 271 views
Enze Xie @xieenze_jr
🚀 Fast-dLLM v2: Parallel Block-Diffusion Decoding for LLMs
⚡️ Highlights 🌟
- Blockwise bidirectional context via complementary masks
- Hierarchical caches (block + sub-block)
- Parallel sub-block decoding + token-shift training
Results 📊
- ~2.5× faster vs. standard AR decoding on A100
- 102.5 tok/s (bs=1), 201.0 tok/s (bs=4), >2× Qwen2.5
- SOTA efficiency–quality trade-off among diffusion LLMs
Webpage 🌐: nvlabs.github.io/Fast-dLLM/v2/
7 replies · 54 reposts · 284 likes · 43.4K views
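The general shape of parallel block decoding can be sketched as a toy loop (illustrative only: Fast-dLLM v2's actual mechanism uses block diffusion with complementary masks and hierarchical caches, none of which is modeled here; `propose` and `accept` are hypothetical callbacks standing in for the draft step and the confidence rule):

```python
def parallel_block_decode(propose, accept, prompt, block_size=4, max_blocks=3):
    """Toy blockwise-parallel decoding loop: each step proposes a whole
    block of tokens at once, then commits only the confident prefix of
    that block, instead of emitting one token per forward pass.

    propose(seq, n): returns up to n (token, confidence) pairs
    accept(conf):    True if a proposed token is confident enough to keep
    """
    seq = list(prompt)
    for _ in range(max_blocks):
        block = propose(seq, block_size)
        for tok, conf in block:
            if not accept(conf):
                break  # stop committing at the first low-confidence token
            seq.append(tok)
    return seq
```

The speedup in such schemes comes from committing several tokens per model call whenever the proposals are confident, which is what the per-batch tok/s numbers above are measuring.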
Kaiyan Zhang retweeted
Yuxin Zuo @zuo_yuxin
🧭 Thinking about proposing a new RL algorithm? We introduce UPGE, a deep dive into post-training, to give you a boost!
🤔 Many recent works mix RL with SFT, but their loss functions are drastically different, so why should they be used together?
We introduce the Unified Policy Gradient Estimator (UPGE), a theoretical framework that unifies post-training algorithms (SFT, PPO, GRPO, etc.) under a common objective.
1️⃣ SFT and RL actually share a common optimization objective.
2️⃣ Combining SFT and RL improves not only Pass@1 but also Pass@k.
👉 Check out the paper for more details on what you can do with UPGE!
1 reply · 9 reposts · 18 likes · 1.4K views
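The unifying observation can be sketched for a softmax policy (a toy illustration of the idea that SFT and policy-gradient updates share one functional form; the function names and the `weight` decomposition are assumptions, not the paper's notation):

```python
def grad_log_pi(probs, action):
    """Gradient of log pi(action) w.r.t. the logits of a softmax
    policy: one_hot(action) - probs."""
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]

def policy_update_direction(probs, action, weight):
    """Unified form: every estimator contributes
    weight * grad log pi(action). SFT on a demonstration token is the
    special case weight = 1 (plain log-likelihood); REINFORCE uses the
    return or advantage as the weight; PPO/GRPO further modulate it
    with importance ratios and clipping."""
    g = grad_log_pi(probs, action)
    return [weight * gi for gi in g]
```

Under this view, mixing SFT and RL just mixes two choices of `weight` on the same gradient direction, which is one intuition for why the combination can improve both Pass@1 and Pass@k.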
Kaiyan Zhang retweeted
Ksenia_TuringPost @TheTuringPost
Recent research papers that you should definitely take a look at:
▪️ Mobile-Agent-v3
▪️ Prompt Orchestration Markup Language
▪️ SSRL: Self-Search Reinforcement Learning
▪️ Atom-Searcher
▪️ MindJourney
▪️ Deep Think with Confidence
▪️ Controlling Multimodal LLMs via Reward-guided Decoding
▪️ CRISP: Persistent Concept Unlearning via Sparse Autoencoders
▪️ Unlearning Comparator
▪️ XQuant
▪️ BeyondWeb
▪️ Retrieval-augmented reasoning with lean LMs
Find the full list here: turingpost.com/p/fod115
11 replies · 54 reposts · 274 likes · 17.6K views
Kaiyan Zhang retweeted
AK @_akhaliq
SSRL: Self-Search Reinforcement Learning
8 replies · 60 reposts · 280 likes · 37.4K views
Kaiyan Zhang @OkhayIea
🚀 New paper: SSRL: Self-Search Reinforcement Learning
Can LLMs serve as simulators of world knowledge for agentic RL, reducing external tool reliance without sacrificing sim2real generalization?
🔍 We introduce Self-Search: structured prompts + sampling to leverage LLMs’ internal knowledge.
🧠 SSRL further optimizes Self-Search with format- and rule-based rewards, no web access required.
🏆 Results: good sim2real generalization and seamless integration with real search engines.
Paper: arxiv.org/abs/2508.10874
Code: github.com/TsinghuaC3I/SS…
#AgenticRL #Sim2Real #LLM #RL #SSRL #AI
1 reply · 10 reposts · 20 likes · 1.4K views
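The format- and rule-based rewards mentioned above can be sketched like this (the tag names `<search>`/`<answer>`, the 0.5/0.5 weighting, and exact-match grading are illustrative assumptions, not SSRL's actual schema):

```python
import re

def format_reward(response):
    """Format reward sketch: checks that the model wrapped its
    self-search trace and final answer in the expected tags."""
    has_search = bool(re.search(r"<search>.*?</search>", response, re.S))
    has_answer = bool(re.search(r"<answer>.*?</answer>", response, re.S))
    return 0.5 * has_search + 0.5 * has_answer

def rule_reward(response, gold_answer):
    """Rule-based outcome reward sketch: exact match on the extracted
    answer, no web access or external verifier needed."""
    m = re.search(r"<answer>(.*?)</answer>", response, re.S)
    return 1.0 if m and m.group(1).strip() == gold_answer else 0.0
```

Because both signals are computed from the response text alone, the whole RL loop can run without calling a real search engine, which is the point of the "no web access required" claim.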
Kaiyan Zhang @OkhayIea
🧵 New drops on RL for LLM Reasoning show how scale, reward design, and agentic capabilities are evolving fast:

Gemini 2.5 (Google): Advanced reasoning, multimodal perception, and long-context capabilities, powered by unseen infrastructure scale. 📄 storage.googleapis.com/deepmind-media…

Kimi-Researcher (Moonshot AI): End-to-end RL enables tool use, planning, and document navigation, pushing toward autonomous agent capabilities. 🌐 moonshotai.github.io/Kimi-Researche…

POLARIS (Inclusion AI): A robust recipe for post-training RL on reasoning models: scalable, stable, and applicable across tasks. 📄 honorable-payment-890.notion.site/POLARIS-A-POst…

Skywork-SWE-32B: A model for software engineering tasks; shows data scaling laws emerge with RL + SWE domain alignment. 🤖 huggingface.co/Skywork/Skywor…

Reasoning360 (LLM360): Cross-domain RL reasoning benchmark across 10 diverse tasks; highlights generalization gaps in current LLMs. 📄 arxiv.org/abs/2506.14965 | 🔗 github.com/LLM360/Reasoni…

Ego-R1 (NTU): Combines tool use & chain-of-thought for ultra-long egocentric video QA (e.g., 20k+ frames). 📄 arxiv.org/abs/2506.13654 | 💻 github.com/egolife-ai/Ego…

Lessons from Verifiable Rewards: Shows grounded RL reward signals boost alignment and reasoning in real-world setups. 📄 arxiv.org/abs/2506.15522

AutoRule (CMU): Extracts logic rules from CoT and integrates them into reward functions, improving preference learning. 📄 arxiv.org/abs/2506.15651 | 💻 github.com/cxcscmu/AutoRu…

Act Only When It Pays (GRESO): Selective rollout strategy for RL that reduces token-level waste and speeds up training. 📄 arxiv.org/abs/2506.02177 | 💻 github.com/Infini-AI-Lab/…

🔗 Full collection on GitHub: github.com/TsinghuaC3I/Aw…
0 replies · 0 reposts · 7 likes · 277 views
Kaiyan Zhang @OkhayIea
🧵 Latest RL-for-LLM Reasoning Papers – June 16-17, 2025

1. Reasoning with Exploration: An Entropy Perspective (RUC & MSRA)
Incorporates entropy into the RL advantage function to promote longer, deeper reasoning chains. Observes direct gains in Pass@K even at high K values.
🔗 arxiv.org/abs/2506.14758

2. Ring-lite: Scalable Reasoning via C3PO-Stabilized RL (Inclusion AI)
A MoE-based LLM using Constrained Contextual Computation Policy Optimization (C3PO) to improve training stability and efficiency. Matches SOTA with only one-third of the active parameters.
🔗 arxiv.org/abs/2506.14731 💻 github.com/inclusionAI/Ri…

3. Reinforcement Learning with Verifiable Rewards… (MRA & PKU)
Introduces a new CoT-Pass@K metric showing RLVR truly improves reasoning integrity; early gains generalize across K.
🔗 arxiv.org/abs/2506.14245

4. Adaptive Guidance Accelerates RL of Reasoning Models (Scale AI)
Proposes "Guide", which uses natural language hints during RL training. Achieves up to +4% improvement across math domains at scale.
🔗 arxiv.org/abs/2506.13923

5. MiniMax-M1: Scaling Test-Time Compute Efficiently (MiniMax)
A hybrid MoE model with "lightning attention" supporting 1M-token contexts. Achieves efficient RL on long-context and software-engineering tasks; code released.
🔗 arxiv.org/abs/2506.13585 💻 github.com/MiniMax-AI/Min…

6. Direct Reasoning Optimization (DRO) (Microsoft & UCLA)
Proposes the R3 reward, a self-generated reflection signal for open-ended reasoning. Demonstrates consistent gains on long-form and math tasks without external rewards.
🔗 arxiv.org/abs/2506.13351

7. AceReason-Nemotron 1.1 (NVIDIA)
Shows synergy between SFT and RL: tuning the sampling temperature so entropy ≈ 0.3 yields a significant boost. The new 7B model achieves SOTA in math & code; released on Hugging Face.
🔗 arxiv.org/abs/2506.13284

✅ Explore all papers, code & models here: github.com/TsinghuaC3I/Aw…
1 reply · 0 reposts · 8 likes · 489 views
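The entropy-in-the-advantage idea from paper 1 above can be sketched as follows (a toy version: the shaping coefficient `alpha` and the exact additive form are assumptions, not the paper's formula):

```python
import math

def entropy(probs):
    """Shannon entropy of a token distribution, in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def shaped_advantage(advantage, token_probs, alpha=0.1):
    """Entropy-shaped advantage sketch: add a small bonus proportional
    to the policy's token entropy, so the RL update keeps rewarding
    exploratory (high-entropy) steps instead of collapsing the chain
    of thought early."""
    return advantage + alpha * entropy(token_probs)
```

A deterministic step (entropy 0) keeps its original advantage, while uncertain steps get a small boost, which is one way an entropy term can encourage longer, deeper reasoning chains.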