Yihao Feng

234 posts

@yihaocs

Palo Alto, CA · Joined September 2013
485 Following · 116 Followers
Yihao Feng retweeted
Hayden Prairie @hayden_prairie
We’ve been thinking a lot about scaling laws, wondering if there is a more effective way to scale FLOPs without increasing parameters. Turns out the answer is YES – by looping blocks of layers during training. We find that predictable scaling laws exist for layer looping, allowing us to use looping to achieve the quality of a Transformer twice the size. Our scaling laws suggest that for a fixed parameter budget, data and looping should be increased in tandem! 🧵👇
Hayden Prairie tweet media
41 replies · 178 reposts · 1.3K likes · 288K views
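A minimal sketch of the looping idea from the thread: reuse one block of layers several times in the forward pass, so FLOPs grow with the loop count while the parameter count stays fixed. The module, sizes, and `n_loops` value below are illustrative, not the paper's implementation.

```python
import torch
import torch.nn as nn

class LoopedEncoder(nn.Module):
    """Reuse one transformer block n_loops times: FLOPs scale with n_loops,
    parameters do not (illustrative sketch, not the paper's code)."""
    def __init__(self, d_model=256, n_heads=4, n_loops=4):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.n_loops = n_loops

    def forward(self, x):
        for _ in range(self.n_loops):   # same weights applied repeatedly
            x = self.block(x)
        return x

x = torch.randn(2, 16, 256)             # (batch, seq, d_model)
print(LoopedEncoder()(x).shape)          # torch.Size([2, 16, 256])
```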
Yihao Feng retweeted
Rosinality @rosinality
The update to the likelihood of correct actions can be very small or even negative, especially in RL with tool use. This might be because tool outputs inject fairly OOD tokens (and the increased uncertainty that comes with them), and because trajectories structurally share prefixes.
Rosinality tweet media
1 reply · 10 reposts · 60 likes · 3K views
Yihao Feng retweeted
Rohan Paul @rohanpaul_ai
The paper shows a way to make LLM reasoning shorter without losing accuracy by rewarding conciseness only when answers are correct. On a 7B model, it reports 8.1% higher accuracy with 19.9% fewer tokens.

Overthinking is the problem: models make long chains of steps that repeat ideas and burn compute. Simple length penalties backfire; they either push very short guesses or destabilize training.

So the authors train a small judge that scores a full solution for repetition, relevance, and brevity. During reinforcement learning, the conciseness bonus applies only if the final answer is correct, which stops reward gaming. The bonus slowly fades as training goes on, and it shrinks on harder questions so longer reasoning stays allowed.

This cleaner signal reduces gradient noise, makes updates steadier, and avoids the collapse seen with plain length penalties. Across math tasks and backbones like Qwen, Llama, and Mistral, outputs get shorter while accuracy holds or improves.

----
Paper – arxiv.org/abs/2511.09158
Paper Title: "Efficient Reasoning via Reward Model"
Rohan Paul tweet media
9 replies · 18 reposts · 111 likes · 8K views
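The reward shaping described in the summary (a conciseness bonus that only applies to correct answers, fades over training, and shrinks with difficulty) can be sketched as a small function. The weights, decay schedule, and difficulty scaling here are placeholders, not the paper's exact formulation.

```python
def shaped_reward(correct: bool, judge_conciseness: float,
                  train_step: int, difficulty: float,
                  total_steps: int = 10_000) -> float:
    """Correctness is the base reward; a conciseness bonus (judge score in [0, 1])
    is added only when the answer is correct, so brevity cannot be reward-hacked.
    The bonus decays over training and shrinks on harder problems (difficulty in [0, 1])."""
    base = 1.0 if correct else 0.0
    if not correct:
        return base
    decay = max(0.0, 1.0 - train_step / total_steps)   # fades as training proceeds
    scale = 1.0 - difficulty                            # allow longer reasoning on hard items
    return base + 0.5 * decay * scale * judge_conciseness

print(shaped_reward(True, judge_conciseness=0.8, train_step=2_000, difficulty=0.3))
print(shaped_reward(False, judge_conciseness=0.9, train_step=2_000, difficulty=0.3))
```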
Yihao Feng retweeted
Rohan Paul @rohanpaul_ai
New Tencent paper shows how smarter data curation makes reinforcement learning code models much stronger. The core idea is to stop obsessing over new reinforcement learning tricks and fix the data and training flow.

The team first fine tunes a 32B model on curated coding data and tags each problem by difficulty. They then run reinforcement learning with real rewards from unit tests that execute the code.

Stage 1 widens the model’s habits by training on a broad mixed set with multiple attempts per prompt. Stage 2 zooms in on a small pool of the toughest problems and spends many more attempts per prompt.

This 2 stage setup raises pass rates on LiveCode, LeetCode, and Codeforces, with the biggest gains on hard sets. The ablations show that skipping either stage hurts results, and that easy samples during Stage 2 waste learning. The largest jump is a 58% relative gain on Codeforces against the same size baseline. The same recipe also helps a larger mixture of experts model, which means the approach scales.

----
Paper – arxiv.org/abs/2511.06307
Paper Title: "DRIVE: Data Curation Best Practices for Reinforcement Learning with Verifiable Reward in Competitive Code Generation"
Rohan Paul tweet media
7 replies · 14 reposts · 98 likes · 9.2K views
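A rough sketch of the two ingredients highlighted above: a verifiable reward from actually executing unit tests, and a Stage 2 pool that keeps only the hardest problems and spends many more attempts on each. Function names, the pass-rate threshold, and the attempt count are hypothetical.

```python
import subprocess, sys, tempfile, textwrap

def unit_test_reward(candidate_code: str, test_code: str, timeout: int = 5) -> float:
    """Verifiable reward: 1.0 if the generated solution passes the unit tests, else 0.0."""
    program = textwrap.dedent(candidate_code) + "\n\n" + textwrap.dedent(test_code)
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path], capture_output=True, timeout=timeout)
        return 1.0 if result.returncode == 0 else 0.0
    except subprocess.TimeoutExpired:
        return 0.0

def stage2_pool(problems, pass_rates, hard_threshold=0.2, attempts=64):
    """Stage 2 of the recipe: keep only low-pass-rate problems and spend many
    more attempts on each (threshold and attempt count are illustrative)."""
    hard = [p for p, r in zip(problems, pass_rates) if r <= hard_threshold]
    return [(p, attempts) for p in hard]

print(stage2_pool(["p1", "p2", "p3"], pass_rates=[0.9, 0.1, 0.0]))
# [('p2', 64), ('p3', 64)]
```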
Yihao Feng retweeted
Rohan Paul @rohanpaul_ai
New Nvidia paper shows how a single LLM can teach itself to reason better. It creates 3 roles from the same model, a Proposer, a Solver, and a Judge.

The Proposer writes hard but solvable questions that stretch the model. The Solver answers those questions with clear steps and final results. The Judge scores both question quality and answer correctness using strict rules. All 3 roles run in a closed loop and learn with reinforcement learning.

The Proposer’s reward blends quality, difficulty, and clean formatting checks. The difficulty reward is higher when the Solver fails, which drives harder tasks. A quality filter blocks weak or confusing questions from entering the training set. The Solver’s reward comes from the Judge’s score plus a formatting check. The Judge is trained to output reliable numeric scores inside simple tags. All role parameters update together using normalized advantages for stability.

On Qwen2.5-3B, the approach lifts average accuracy by 4.54%, and it beats simple supervised fine tuning while rivaling a strong self play baseline without outside tools.

----
Paper – arxiv.org/abs/2510.23595v1
Paper Title: "Multi-Agent Evolve: LLM Self-Improve through Co-evolution"
Rohan Paul tweet media
21 replies · 80 reposts · 435 likes · 21.8K views
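The reward blending described above can be sketched as two small functions: the Proposer's reward mixes quality, difficulty (higher when the Solver fails), and a formatting check, while the Solver's reward is the Judge's score plus a formatting check. The weights are illustrative, not the paper's.

```python
def proposer_reward(quality: float, solver_score: float, format_ok: bool) -> float:
    """Proposer is rewarded for well-formed, high-quality questions that the
    Solver struggles with (difficulty term grows as the Solver's score drops).
    Weights here are illustrative placeholders."""
    difficulty = 1.0 - solver_score          # harder questions -> larger term
    return 0.4 * quality + 0.4 * difficulty + 0.2 * float(format_ok)

def solver_reward(judge_score: float, format_ok: bool) -> float:
    """Solver is scored by the Judge, plus a small formatting check."""
    return 0.8 * judge_score + 0.2 * float(format_ok)

print(proposer_reward(quality=0.9, solver_score=0.2, format_ok=True))
print(solver_reward(judge_score=0.7, format_ok=True))
```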
Yihao Feng retweeted
vLLM @vllm_project
🚀 Excited to share our work on batch-invariant inference in vLLM! Now you can get identical results regardless of batch size with just one flag: VLLM_BATCH_INVARIANT=1 No more subtle differences between bs=1 and bs=N (including prefill!). Let's dive into how we built this 🧵👇
vLLM tweet media
2 replies · 43 reposts · 281 likes · 38.6K views
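A minimal usage sketch of the flag from the announcement: set `VLLM_BATCH_INVARIANT=1` before constructing the engine, then compare outputs at different batch sizes. The model name below is just a placeholder; check the vLLM docs for supported configurations.

```python
import os
os.environ["VLLM_BATCH_INVARIANT"] = "1"   # flag from the announcement; set before importing vLLM

from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct")   # placeholder model, any local model works
params = SamplingParams(temperature=0.0, max_tokens=64)

prompt = "Explain batch-invariant inference in one sentence."
solo = llm.generate([prompt], params)[0].outputs[0].text
batched = llm.generate([prompt] + ["filler"] * 7, params)[0].outputs[0].text

print(solo == batched)   # with the flag set, bs=1 and bs=8 should produce identical text
```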
Yihao Feng retweeted
Rohan Paul @rohanpaul_ai
The paper shows a few reinforcement learning tweaks let small LLM agents use tools better and beat larger ones. A 4B model matches or exceeds 32B agents on hard math, science, and code tasks.

Old training stitched fake tool traces together, which taught clumsy timing for tool calls. They switch to real, end to end tool sessions so the model learns when, why, and how to act. They also build a diverse, model aware reinforcement set that keeps training exploratory and steady.

Algorithm changes are simple: loosen clipping a bit, add a light length penalty near the limit, and train with token level loss. This keeps the model’s action entropy in a healthy range so it keeps trying useful options.

On behavior, deliberate agents think a bit longer, call fewer tools, and win more per call. Reactive agents spam tools and get worse results per call. Long chain of thought models avoid tools on reasoning tasks, so instruction tuned models scale cleaner for agents.

The recipe is real trajectories plus entropy friendly updates plus deliberate tool use. This combo turns compact agents into strong problem solvers with efficient tool calls.

----
Paper – arxiv.org/abs/2510.11701
Paper Title: "Demystifying Reinforcement Learning in Agentic Reasoning"
Rohan Paul tweet media
5 replies · 34 reposts · 182 likes · 10.3K views
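The three algorithm tweaks named in the summary (looser clipping, a light length penalty near the limit, and a token-level loss) can be sketched as one loss function. The clip ranges, penalty weight, and length threshold are illustrative constants, not the paper's values.

```python
import torch

def token_level_clipped_loss(logp_new, logp_old, advantages,
                             lengths, max_len=8192,
                             clip_low=0.2, clip_high=0.28,
                             len_penalty=0.05):
    """Token-level policy loss with a slightly loosened upper clip range and a
    light penalty that only kicks in near the length limit. Constants are
    illustrative. Token tensors are flat (total_tokens,); lengths is per response."""
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_low, 1.0 + clip_high) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()   # mean over tokens, not per sample

    # soft penalty only for responses close to the context limit
    overlong = torch.clamp(lengths.float() / max_len - 0.8, min=0.0)
    return policy_loss + len_penalty * overlong.mean()

logp_new = torch.randn(10, requires_grad=True)
logp_old = logp_new.detach() + 0.05 * torch.randn(10)
adv = torch.randn(10)
lengths = torch.tensor([512, 7900])
print(token_level_clipped_loss(logp_new, logp_old, adv, lengths))
```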
Yihao Feng retweeted
Simon Yu @simon_ycl
💥New Paper

Diversity is the key to everything: creative tasks and RL exploration. Yet most LLMs suffer from mode collapse, always repeating the same answers. Our new paper introduces Verbalized Sampling, a general method to bypass this and unlock your model's true potential. See the discussion 📷👇
Simon Yu tweet media
Quoted tweet: Weiyan Shi @shi_weiyan

New paper: You can make ChatGPT 2x as creative with one sentence. Ever notice how LLMs all sound the same? They know 100+ jokes but only ever tell one. Every blog intro: "In today's digital landscape..." We figured out why – and how to unlock the rest 🔓 Copy-paste prompt: 🧵

6 replies · 18 reposts · 153 likes · 24K views
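The core move in Verbalized Sampling is to ask the model to verbalize several candidate responses together with their probabilities instead of committing to a single answer. The prompt below is a paraphrase of that idea, not the authors' copy-paste prompt.

```python
# A paraphrased verbalized-sampling-style prompt (not the authors' exact wording):
# instead of asking for one answer, ask the model to verbalize a distribution.
def verbalized_sampling_prompt(task: str, k: int = 5) -> str:
    return (
        f"{task}\n\n"
        f"Generate {k} distinct responses. For each response, also give the "
        f"probability you would produce it, so the probabilities sum to 1. "
        f"Format: a numbered list of `response -- probability` pairs."
    )

print(verbalized_sampling_prompt("Tell me a joke about coffee."))
```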
Yihao Feng retweeted
Ant Ling @AntLingAGI
2/5 Continuously evolving deep thinking. Ring-1T scales reinforcement learning from 16B → 100B → trillions of parameters with our icepop algorithm and ASystem. This ensures stable long-term RL training for large MoE models and bridges the gap between training and inference. The result: smoother reasoning, longer-context stability, and stronger alignment under trillion-scale load.
Ant Ling tweet media (2 images)
1 reply · 3 reposts · 37 likes · 5.1K views
Yihao Feng retweeted
God of Prompt @godofprompt
🚨 NVIDIA just did the impossible.

They trained a 12B-parameter language model on 10 trillion tokens entirely in 4-bit precision. It’s called NVFP4, and it might redefine how frontier AI models are trained.

Here’s why this matters:
• NVFP4 delivers 2–3× faster math throughput and 50% less memory vs FP8
• Accuracy? Practically identical. (MMLU-Pro: FP8 = 62.62%, NVFP4 = 62.58%)
• Stability issues? Solved using Random Hadamard transforms, stochastic rounding, and 2D scaling
• Trained entirely on NVIDIA Blackwell GPUs, the first 4-bit run stable across 10T tokens

This is the first successful demonstration of large-scale 4-bit pretraining without losing accuracy. The next generation of frontier models will be faster, cheaper, and greener without compromise.
God of Prompt tweet media
68 replies · 320 reposts · 2K likes · 210.9K views
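One of the listed stability ingredients, stochastic rounding, can be illustrated numerically: values scaled into the FP4 (E2M1) range are rounded up or down to the two neighbouring representable values with probability proportional to distance, so the quantizer is unbiased in expectation. This is a toy sketch, not NVIDIA's training kernels.

```python
import numpy as np

# Representable magnitudes of an FP4 E2M1 element (as used by NVFP4/MXFP4).
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def stochastic_round_fp4(x, rng=np.random.default_rng(0)):
    """Round each value to one of its two neighbouring FP4 grid points, choosing the
    upper one with probability proportional to how close the value is to it, so the
    rounding is unbiased on average. Illustrative sketch only."""
    sign = np.sign(x)
    mag = np.clip(np.abs(x), 0.0, FP4_GRID[-1])
    hi_idx = np.clip(np.searchsorted(FP4_GRID, mag), 1, len(FP4_GRID) - 1)
    lo, hi = FP4_GRID[hi_idx - 1], FP4_GRID[hi_idx]
    p_up = np.where(hi > lo, (mag - lo) / (hi - lo), 0.0)
    return sign * np.where(rng.random(mag.shape) < p_up, hi, lo)

x = np.array([0.26, 1.7, -2.4, 5.9])
print(stochastic_round_fp4(x))                                                    # one draw
print(np.mean([stochastic_round_fp4(np.array([0.26])) for _ in range(10_000)]))   # ~0.26
```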
Yihao Feng retweeted
Infini-AI-Lab @InfiniAILab
🤔 Can we train RL on LLMs with extremely stale data?
🚀 Our latest study says YES! Stale data can be as informative as on-policy data, unlocking more scalable, efficient asynchronous RL for LLMs.

We introduce M2PO, an off-policy RL algorithm that keeps training stable and performant even when using data stale by 256 model updates.

🔗 Notion Blog: m2po.notion.site/rl-stale-m2po
📄 Paper: arxiv.org/abs/2510.01161
💻 GitHub: github.com/Infini-AI-Lab/…

🧵 1/4
Infini-AI-Lab tweet media
3 replies · 39 reposts · 232 likes · 63.3K views
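The stale-data setting can be illustrated with the plain off-policy baseline such methods build on: importance-weight the policy-gradient loss by the ratio between the current policy and the stale behaviour policy, with a clip to bound variance. This is generic off-policy correction, not M2PO's actual stabilization, which is described in the paper.

```python
import torch

def stale_data_policy_loss(logp_current, logp_behavior, advantages, clip=4.0):
    """Generic off-policy correction for rollouts generated by a stale policy:
    importance ratio between the current policy and the (possibly many updates old)
    behaviour policy, clipped to keep variance bounded. Baseline sketch, not M2PO."""
    ratio = torch.exp(logp_current - logp_behavior)   # pi_current / pi_behavior
    ratio = torch.clamp(ratio, max=clip)              # bound the correction
    return -(ratio * advantages).mean()

logp_cur = torch.randn(8, requires_grad=True)
logp_beh = logp_cur.detach() + 0.5 * torch.randn(8)   # stale behaviour-policy log-probs
print(stale_data_policy_loss(logp_cur, logp_beh, torch.randn(8)))
```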
Yihao Feng retweeted
Dan Alistarh @DAlistarh
🚀 We are releasing state-of-the-art post-training quantization (PTQ) algorithms for Microscaling FP4, together with kernels: - First study focused on MXFP4/NVFP4 PTQ for LLMs - New Micro-Rotated (MR) format and GPTQ algorithm - QuTLASS GPU kernels with up to 3.6x speedups.
Dan Alistarh tweet media
2 replies · 28 reposts · 152 likes · 9.4K views
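Microscaling formats like MXFP4/NVFP4 pair small blocks of low-precision elements with a shared scale. A rough numerical sketch of block-wise scaled FP4 quantization follows; the block size and nearest-value rounding are illustrative, and the Micro-Rotated format, GPTQ variant, and QuTLASS kernels from the release are not shown.

```python
import numpy as np

FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])   # E2M1 magnitudes

def quantize_microscaled_fp4(x, block=32):
    """Block-wise scaled FP4: each block of `block` values shares one scale chosen so
    the block max maps onto the largest FP4 value, then every element is rounded to
    the nearest representable FP4 number. Rough sketch of the format, not the kernels."""
    x = x.reshape(-1, block)
    scale = np.abs(x).max(axis=1, keepdims=True) / FP4_GRID[-1] + 1e-12
    scaled = x / scale
    candidates = np.sign(scaled)[..., None] * FP4_GRID          # signed grid per element
    idx = np.abs(scaled[..., None] - candidates).argmin(-1)     # nearest grid point
    q = np.take(FP4_GRID, idx) * np.sign(scaled)
    return (q * scale).reshape(-1)                               # dequantized values

x = np.random.default_rng(0).normal(size=64).astype(np.float32)
x_hat = quantize_microscaled_fp4(x)
print(np.max(np.abs(x - x_hat)))                                 # per-element quantization error
```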
Yihao Feng retweeted
Zhihu Frontier @ZhihuFrontier
🚀 @AntLingAGI open-sourced Ling 2.0 — the first FP8-native mixed-precision training framework for MoE models. Plug-and-play. So how does it work? Zhihu contributor & Ling Team dev 千千 shares his thinking.

🧠 Why it matters: Ling 2.0 is trained natively in FP8, not BF16 with post-hoc quantization. Using fine-grained tile/block-wise scaling, it avoids outlier distortion and retains BF16-level accuracy.

⚙️ Key Results:
• Up to 30-60% throughput gain with MTP
• Still 90-120% faster without MTP
• Training loss curve matches BF16 — virtually no accuracy drop

💡 Innovations include:
• FP8 Optimizer: compresses 1st & 2nd moments to FP8, saving up to 75% of Adam optimizer memory
• On-demand weight transpose: cuts redundant memory for the backward pass
• FP8 routing maps for more efficient MoE dispatching
• Support for the E4M3 FP8 format in backward — better accuracy than E5M2

🔍 Tackling FP8 challenges:
• Quantization errors (e.g. outliers clipped to 0)
• Value distortion (close values collapsed)
Ling 2.0 adds monitoring for quantization loss, underflow, and FP32 recompute deltas — alerts when precision degrades.

💭 Why it matters long-term: In the BF16-vs-FP8 tradeoff, Ling 2.0 emphasizes "lossless precision" first, then optimizes performance. Combined with 3D parallelism and dynamic memory strategies, it's ideal for training large LLMs under memory constraints.

📖 Full technical post: zhuanlan.zhihu.com/p/195129140928…

#FP8 #LLM #AIInfra #OpenSource #AntGroup #Ling2
Zhihu Frontier tweet media (4 images)
1 reply · 3 reposts · 76 likes · 5.5K views
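The FP8-optimizer idea mentioned above (compressing Adam's first and second moments to 8-bit floats between steps) can be sketched with a per-tensor scale and an E4M3 cast. Ling 2.0 uses finer-grained scaling; this toy version only shows the basic memory/accuracy trade, and it assumes a PyTorch build with float8 dtypes.

```python
import torch

E4M3_MAX = 448.0   # largest finite value representable in FP8 E4M3

def to_fp8_e4m3(t: torch.Tensor):
    """Compress a tensor to FP8 E4M3 with one per-tensor scale (sketch of the
    'FP8 optimizer state' idea; Ling 2.0 uses finer-grained scaling).
    Requires a PyTorch build with float8 dtypes (>= 2.1)."""
    scale = t.abs().max().clamp(min=1e-12) / E4M3_MAX
    q = (t / scale).to(torch.float8_e4m3fn)       # stored state: fp8 payload + one scale
    return q, scale

def from_fp8_e4m3(q, scale):
    return q.to(torch.float32) * scale

m = torch.randn(1024) * 1e-3                      # e.g. an Adam first-moment buffer
q, s = to_fp8_e4m3(m)
m_hat = from_fp8_e4m3(q, s)
print((m - m_hat).abs().max())                    # small reconstruction error
print(q.element_size(), m.element_size())         # 1 byte vs 4 bytes per value
```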
Yihao Feng retweeted
Zhangchen Xu @zhangchen_xu
🚀 Want high-quality, realistic, and truly challenging post-training data for the agentic era?

Introducing Toucan-1.5M (huggingface.co/papers/2510.01…) — the largest open tool-agentic dataset yet:
✨ 1.53M real agentic trajectories synthesized by 3 models
✨ Diverse, challenging tasks across wide domains
✨ Built from 495 MCP servers & 2K+ open-source tools

📈 SOTA results on BFCL & MCP-Universe with just SFT.
💡 Data, models, and pipeline are all open!

Key Features:
- Multi-tool & multi-server tasks for advanced planning
- Multi-turn interactions with auto-generated follow-ups (super-long trajectories)
- Single + parallel function calls → efficient tool use
Zhangchen Xu tweet media
6 replies · 72 reposts · 363 likes · 34.9K views
Yihao Feng retweeted
Nouha Dziri @nouhadziri
🚀 Ever wondered how to make RL work on impossibly hard tasks where pass@k = 0%? 🤔 In our new work, we share the RL Grokking Recipe: a training recipe that enables LLMs to solve previously unsolvable coding problems! I will be at #CoLM2025 next week, so happy to chat about it! We also dive into the heated debate: does RL just sharpen previously learnt skills, or can it unlock genuinely new reasoning? 🔥🔥 Read the full blog here: tinyurl.com/ntarc3kw #AI #RL #NLP #reinforcementlearning #llm
Nouha Dziri tweet media
26 replies · 190 reposts · 1.2K likes · 153.8K views
Yihao Feng retweeted
Shenao Zhang @ShenaoZhang
🚀Excited to share our recent research:🚀 “Learning to Reason as Action Abstractions with Scalable Mid-Training RL” We theoretically study 𝙝𝙤𝙬 𝙢𝙞𝙙-𝙩𝙧𝙖𝙞𝙣𝙞𝙣𝙜 𝙨𝙝𝙖𝙥𝙚𝙨 𝙥𝙤𝙨𝙩-𝙩𝙧𝙖𝙞𝙣𝙞𝙣𝙜 𝙍𝙇. The findings lead to a scalable algorithm for learning action hierarchies from expert demonstrations, which we successfully apply to 𝟭𝘽 Python code data. A thread:🧵
Shenao Zhang tweet media
7 replies · 70 reposts · 436 likes · 59.8K views
Yihao Feng retweeted
elvis @omarsar0
Great work showing prompt synthesis as a new scaling axis for reasoning. Good training data is scarce. This work showcases a framework that might make it possible to construct high-quality training problems for reasoning-focused LLMs. Technical details below:
elvis tweet media
20 replies · 69 reposts · 348 likes · 63.6K views
Yihao Feng retweeted
Tanishq Mathew Abraham, Ph.D. @iScienceLuvr
Language Models that Think, Chat Better "This paper shows that the RLVR paradigm is effective beyond verifiable domains, and introduces RL with Model-rewarded Thinking (RLMT) for general-purpose chat capabilities." "RLMT consistently outperforms standard RLHF pipelines. This includes substantial gains of 3–7 points on three chat benchmarks (AlpacaEval2, WildBench, and ArenaHardV2), along with 1–3 point improvements on other tasks like creative writing and general knowledge. Our best 8B model surpasses GPT-4o in chat and creative writing"
Tanishq Mathew Abraham, Ph.D. tweet media
8 replies · 34 reposts · 241 likes · 17.7K views