Yihao Feng

234 posts

@yihaocs

Palo Alto, CA · Joined September 2013
485 Following · 116 Followers
Yihao Feng retweeted
Hayden Prairie @hayden_prairie
We’ve been thinking a lot about scaling laws, wondering if there is a more effective way to scale FLOPs without increasing parameters. Turns out the answer is YES – by looping blocks of layers during training. We find that predictable scaling laws exist for layer looping, allowing us to use looping to achieve the quality of a Transformer twice the size. Our scaling laws suggest that for a fixed parameter budget, data and looping should be increased in tandem! 🧵👇
Hayden Prairie tweet media
41 replies · 178 reposts · 1.3K likes · 288K views
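A minimal sketch of the looping idea from the thread: reuse one block of layers several times in the forward pass, so FLOPs grow with the loop count while the parameter count stays fixed. The module, sizes, and `n_loops` value below are illustrative, not the paper's implementation.

```python
import torch
import torch.nn as nn

class LoopedEncoder(nn.Module):
    """Reuse one transformer block n_loops times: FLOPs scale with n_loops,
    parameters do not (illustrative sketch, not the paper's code)."""
    def __init__(self, d_model=256, n_heads=4, n_loops=4):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.n_loops = n_loops

    def forward(self, x):
        for _ in range(self.n_loops):   # same weights applied repeatedly
            x = self.block(x)
        return x

x = torch.randn(2, 16, 256)             # (batch, seq, d_model)
print(LoopedEncoder()(x).shape)          # torch.Size([2, 16, 256])
```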
Yihao Feng retweeted
Rosinality @rosinality
The update to the likelihood of correct actions can be very small or even negative, especially in RL with tool use. This might be because tool outputs inject fairly OOD tokens (and the increased uncertainty that comes with them), and because trajectories structurally share prefixes.
Rosinality tweet media
1 reply · 10 reposts · 60 likes · 3K views
Yihao Feng retweeted
Rohan Paul @rohanpaul_ai
The paper shows a way to make LLM reasoning shorter without losing accuracy by rewarding conciseness only when answers are correct. On a 7B model, it reports 8.1% higher accuracy with 19.9% fewer tokens.

Overthinking is the problem: models make long chains of steps that repeat ideas and burn compute. Simple length penalties backfire; they either push very short guesses or destabilize training.

So the authors train a small judge that scores a full solution for repetition, relevance, and brevity. During reinforcement learning, the conciseness bonus applies only if the final answer is correct, which stops reward gaming. The bonus slowly fades as training goes on, and it shrinks on harder questions so longer reasoning stays allowed.

This cleaner signal reduces gradient noise, makes updates steadier, and avoids the collapse seen with plain length penalties. Across math tasks and backbones like Qwen, Llama, and Mistral, outputs get shorter while accuracy holds or improves.

----
Paper – arxiv.org/abs/2511.09158
Paper Title: "Efficient Reasoning via Reward Model"
Rohan Paul tweet media
9 replies · 18 reposts · 111 likes · 8K views
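The reward shaping described in the summary (a conciseness bonus that only applies to correct answers, fades over training, and shrinks with difficulty) can be sketched as a small function. The weights, decay schedule, and difficulty scaling here are placeholders, not the paper's exact formulation.

```python
def shaped_reward(correct: bool, judge_conciseness: float,
                  train_step: int, difficulty: float,
                  total_steps: int = 10_000) -> float:
    """Correctness is the base reward; a conciseness bonus (judge score in [0, 1])
    is added only when the answer is correct, so brevity cannot be reward-hacked.
    The bonus decays over training and shrinks on harder problems (difficulty in [0, 1])."""
    base = 1.0 if correct else 0.0
    if not correct:
        return base
    decay = max(0.0, 1.0 - train_step / total_steps)   # fades as training proceeds
    scale = 1.0 - difficulty                            # allow longer reasoning on hard items
    return base + 0.5 * decay * scale * judge_conciseness

print(shaped_reward(True, judge_conciseness=0.8, train_step=2_000, difficulty=0.3))
print(shaped_reward(False, judge_conciseness=0.9, train_step=2_000, difficulty=0.3))
```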
Yihao Feng retweeted
Rohan Paul @rohanpaul_ai
New Tencent paper shows how smarter data curation makes reinforcement learning code models much stronger. The core idea is to stop obsessing over new reinforcement learning tricks and fix the data and training flow.

The team first fine tunes a 32B model on curated coding data and tags each problem by difficulty. They then run reinforcement learning with real rewards from unit tests that execute the code.

Stage 1 widens the model’s habits by training on a broad mixed set with multiple attempts per prompt. Stage 2 zooms in on a small pool of the toughest problems and spends many more attempts per prompt.

This 2 stage setup raises pass rates on LiveCode, LeetCode, and Codeforces, with the biggest gains on hard sets. The ablations show that skipping either stage hurts results, and that easy samples during Stage 2 waste learning. The largest jump is a 58% relative gain on Codeforces against the same size baseline. The same recipe also helps a larger mixture of experts model, which means the approach scales.

----
Paper – arxiv.org/abs/2511.06307
Paper Title: "DRIVE: Data Curation Best Practices for Reinforcement Learning with Verifiable Reward in Competitive Code Generation"
Rohan Paul tweet media
7 replies · 14 reposts · 98 likes · 9.2K views
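A rough sketch of the two ingredients highlighted above: a verifiable reward from actually executing unit tests, and a Stage 2 pool that keeps only the hardest problems and spends many more attempts on each. Function names, the pass-rate threshold, and the attempt count are hypothetical.

```python
import subprocess, sys, tempfile, textwrap

def unit_test_reward(candidate_code: str, test_code: str, timeout: int = 5) -> float:
    """Verifiable reward: 1.0 if the generated solution passes the unit tests, else 0.0."""
    program = textwrap.dedent(candidate_code) + "\n\n" + textwrap.dedent(test_code)
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path], capture_output=True, timeout=timeout)
        return 1.0 if result.returncode == 0 else 0.0
    except subprocess.TimeoutExpired:
        return 0.0

def stage2_pool(problems, pass_rates, hard_threshold=0.2, attempts=64):
    """Stage 2 of the recipe: keep only low-pass-rate problems and spend many
    more attempts on each (threshold and attempt count are illustrative)."""
    hard = [p for p, r in zip(problems, pass_rates) if r <= hard_threshold]
    return [(p, attempts) for p in hard]

print(stage2_pool(["p1", "p2", "p3"], pass_rates=[0.9, 0.1, 0.0]))
# [('p2', 64), ('p3', 64)]
```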
Yihao Feng retweeted
Rohan Paul @rohanpaul_ai
New Nvidia paper shows how a single LLM can teach itself to reason better. It creates 3 roles from the same model, a Proposer, a Solver, and a Judge.

The Proposer writes hard but solvable questions that stretch the model. The Solver answers those questions with clear steps and final results. The Judge scores both question quality and answer correctness using strict rules. All 3 roles run in a closed loop and learn with reinforcement learning.

The Proposer’s reward blends quality, difficulty, and clean formatting checks. The difficulty reward is higher when the Solver fails, which drives harder tasks. A quality filter blocks weak or confusing questions from entering the training set. The Solver’s reward comes from the Judge’s score plus a formatting check. The Judge is trained to output reliable numeric scores inside simple tags. All role parameters update together using normalized advantages for stability.

On Qwen2.5-3B, the approach lifts average accuracy by 4.54%, and it beats simple supervised fine tuning while rivaling a strong self play baseline without outside tools.

----
Paper – arxiv.org/abs/2510.23595v1
Paper Title: "Multi-Agent Evolve: LLM Self-Improve through Co-evolution"
Rohan Paul tweet media
21 replies · 80 reposts · 435 likes · 21.8K views
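The reward blending described above can be sketched as two small functions: the Proposer's reward mixes quality, difficulty (higher when the Solver fails), and a formatting check, while the Solver's reward is the Judge's score plus a formatting check. The weights are illustrative, not the paper's.

```python
def proposer_reward(quality: float, solver_score: float, format_ok: bool) -> float:
    """Proposer is rewarded for well-formed, high-quality questions that the
    Solver struggles with (difficulty term grows as the Solver's score drops).
    Weights here are illustrative placeholders."""
    difficulty = 1.0 - solver_score          # harder questions -> larger term
    return 0.4 * quality + 0.4 * difficulty + 0.2 * float(format_ok)

def solver_reward(judge_score: float, format_ok: bool) -> float:
    """Solver is scored by the Judge, plus a small formatting check."""
    return 0.8 * judge_score + 0.2 * float(format_ok)

print(proposer_reward(quality=0.9, solver_score=0.2, format_ok=True))
print(solver_reward(judge_score=0.7, format_ok=True))
```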
Yihao Feng retweeted
vLLM @vllm_project
🚀 Excited to share our work on batch-invariant inference in vLLM! Now you can get identical results regardless of batch size with just one flag: VLLM_BATCH_INVARIANT=1 No more subtle differences between bs=1 and bs=N (including prefill!). Let's dive into how we built this 🧵👇
vLLM tweet media
2 replies · 43 reposts · 281 likes · 38.6K views
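A minimal usage sketch of the flag from the announcement: set `VLLM_BATCH_INVARIANT=1` before constructing the engine, then compare outputs at different batch sizes. The model name below is just a placeholder; check the vLLM docs for supported configurations.

```python
import os
os.environ["VLLM_BATCH_INVARIANT"] = "1"   # flag from the announcement; set before importing vLLM

from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct")   # placeholder model, any local model works
params = SamplingParams(temperature=0.0, max_tokens=64)

prompt = "Explain batch-invariant inference in one sentence."
solo = llm.generate([prompt], params)[0].outputs[0].text
batched = llm.generate([prompt] + ["filler"] * 7, params)[0].outputs[0].text

print(solo == batched)   # with the flag set, bs=1 and bs=8 should produce identical text
```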
Yihao Feng retweeted
Rohan Paul @rohanpaul_ai
The paper shows a few reinforcement learning tweaks let small LLM agents use tools better and beat larger ones. A 4B model matches or exceeds 32B agents on hard math, science, and code tasks.

Old training stitched fake tool traces together, which taught clumsy timing for tool calls. They switch to real, end to end tool sessions so the model learns when, why, and how to act. They also build a diverse, model aware reinforcement set that keeps training exploratory and steady.

Algorithm changes are simple: loosen clipping a bit, add a light length penalty near the limit, and train with token level loss. This keeps the model’s action entropy in a healthy range so it keeps trying useful options.

On behavior, deliberate agents think a bit longer, call fewer tools, and win more per call. Reactive agents spam tools and get worse results per call. Long chain of thought models avoid tools on reasoning tasks, so instruction tuned models scale cleaner for agents.

The recipe is real trajectories plus entropy friendly updates plus deliberate tool use. This combo turns compact agents into strong problem solvers with efficient tool calls.

----
Paper – arxiv.org/abs/2510.11701
Paper Title: "Demystifying Reinforcement Learning in Agentic Reasoning"
Rohan Paul tweet media
5 replies · 34 reposts · 182 likes · 10.3K views
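The three algorithm tweaks named in the summary (looser clipping, a light length penalty near the limit, and a token-level loss) can be sketched as one loss function. The clip ranges, penalty weight, and length threshold are illustrative constants, not the paper's values.

```python
import torch

def token_level_clipped_loss(logp_new, logp_old, advantages,
                             lengths, max_len=8192,
                             clip_low=0.2, clip_high=0.28,
                             len_penalty=0.05):
    """Token-level policy loss with a slightly loosened upper clip range and a
    light penalty that only kicks in near the length limit. Constants are
    illustrative. Token tensors are flat (total_tokens,); lengths is per response."""
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_low, 1.0 + clip_high) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()   # mean over tokens, not per sample

    # soft penalty only for responses close to the context limit
    overlong = torch.clamp(lengths.float() / max_len - 0.8, min=0.0)
    return policy_loss + len_penalty * overlong.mean()

logp_new = torch.randn(10, requires_grad=True)
logp_old = logp_new.detach() + 0.05 * torch.randn(10)
adv = torch.randn(10)
lengths = torch.tensor([512, 7900])
print(token_level_clipped_loss(logp_new, logp_old, adv, lengths))
```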
Yihao Feng retweeted
Simon Yu @simon_ycl
💥New Paper

Diversity is the key to everything: creative tasks and RL exploration. Yet most LLMs suffer from mode collapse, always repeating the same answers. Our new paper introduces Verbalized Sampling, a general method to bypass this and unlock your model's true potential. See the discussion 📷👇
Simon Yu tweet media
Quoted tweet: Weiyan Shi @shi_weiyan

New paper: You can make ChatGPT 2x as creative with one sentence. Ever notice how LLMs all sound the same? They know 100+ jokes but only ever tell one. Every blog intro: "In today's digital landscape..." We figured out why – and how to unlock the rest 🔓 Copy-paste prompt: 🧵

6 replies · 18 reposts · 153 likes · 24K views
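The core move in Verbalized Sampling is to ask the model to verbalize several candidate responses together with their probabilities instead of committing to a single answer. The prompt below is a paraphrase of that idea, not the authors' copy-paste prompt.

```python
# A paraphrased verbalized-sampling-style prompt (not the authors' exact wording):
# instead of asking for one answer, ask the model to verbalize a distribution.
def verbalized_sampling_prompt(task: str, k: int = 5) -> str:
    return (
        f"{task}\n\n"
        f"Generate {k} distinct responses. For each response, also give the "
        f"probability you would produce it, so the probabilities sum to 1. "
        f"Format: a numbered list of `response -- probability` pairs."
    )

print(verbalized_sampling_prompt("Tell me a joke about coffee."))
```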
Yihao Feng retweeted
Ant Ling @AntLingAGI
2/5 Continuously evolving deep thinking. Ring-1T scales reinforcement learning from 16B → 100B → trillions of parameters with our icepop algorithm and ASystem. This ensures stable long-term RL training for large MoE models and bridges the gap between training and inference. The result: smoother reasoning, longer-context stability, and stronger alignment under trillion-scale load.
Ant Ling tweet media (2 images)
1 reply · 3 reposts · 37 likes · 5.1K views
Yihao Feng retweeted
God of Prompt @godofprompt
🚨 NVIDIA just did the impossible.

They trained a 12B-parameter language model on 10 trillion tokens entirely in 4-bit precision. It’s called NVFP4, and it might redefine how frontier AI models are trained.

Here’s why this matters:
• NVFP4 delivers 2–3× faster math throughput and 50% less memory vs FP8
• Accuracy? Practically identical. (MMLU-Pro: FP8 = 62.62%, NVFP4 = 62.58%)
• Stability issues? Solved using Random Hadamard transforms, stochastic rounding, and 2D scaling
• Trained entirely on NVIDIA Blackwell GPUs, the first 4-bit run stable across 10T tokens

This is the first successful demonstration of large-scale 4-bit pretraining without losing accuracy. The next generation of frontier models will be faster, cheaper, and greener without compromise.
God of Prompt tweet media
68 replies · 320 reposts · 2K likes · 210.9K views
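One of the listed stability ingredients, stochastic rounding, can be illustrated numerically: values scaled into the FP4 (E2M1) range are rounded up or down to the two neighbouring representable values with probability proportional to distance, so the quantizer is unbiased in expectation. This is a toy sketch, not NVIDIA's training kernels.

```python
import numpy as np

# Representable magnitudes of an FP4 E2M1 element (as used by NVFP4/MXFP4).
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def stochastic_round_fp4(x, rng=np.random.default_rng(0)):
    """Round each value to one of its two neighbouring FP4 grid points, choosing the
    upper one with probability proportional to how close the value is to it, so the
    rounding is unbiased on average. Illustrative sketch only."""
    sign = np.sign(x)
    mag = np.clip(np.abs(x), 0.0, FP4_GRID[-1])
    hi_idx = np.clip(np.searchsorted(FP4_GRID, mag), 1, len(FP4_GRID) - 1)
    lo, hi = FP4_GRID[hi_idx - 1], FP4_GRID[hi_idx]
    p_up = np.where(hi > lo, (mag - lo) / (hi - lo), 0.0)
    return sign * np.where(rng.random(mag.shape) < p_up, hi, lo)

x = np.array([0.26, 1.7, -2.4, 5.9])
print(stochastic_round_fp4(x))                                                    # one draw
print(np.mean([stochastic_round_fp4(np.array([0.26])) for _ in range(10_000)]))   # ~0.26
```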
Yihao Feng retweeted
Infini-AI-Lab @InfiniAILab
🤔 Can we train RL on LLMs with extremely stale data?
🚀 Our latest study says YES! Stale data can be as informative as on-policy data, unlocking more scalable, efficient asynchronous RL for LLMs.

We introduce M2PO, an off-policy RL algorithm that keeps training stable and performant even when using data stale by 256 model updates.

🔗 Notion Blog: m2po.notion.site/rl-stale-m2po
📄 Paper: arxiv.org/abs/2510.01161
💻 GitHub: github.com/Infini-AI-Lab/…

🧵 1/4
Infini-AI-Lab tweet media
3 replies · 39 reposts · 232 likes · 63.3K views
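The stale-data setting can be illustrated with the plain off-policy baseline such methods build on: importance-weight the policy-gradient loss by the ratio between the current policy and the stale behaviour policy, with a clip to bound variance. This is generic off-policy correction, not M2PO's actual stabilization, which is described in the paper.

```python
import torch

def stale_data_policy_loss(logp_current, logp_behavior, advantages, clip=4.0):
    """Generic off-policy correction for rollouts generated by a stale policy:
    importance ratio between the current policy and the (possibly many updates old)
    behaviour policy, clipped to keep variance bounded. Baseline sketch, not M2PO."""
    ratio = torch.exp(logp_current - logp_behavior)   # pi_current / pi_behavior
    ratio = torch.clamp(ratio, max=clip)              # bound the correction
    return -(ratio * advantages).mean()

logp_cur = torch.randn(8, requires_grad=True)
logp_beh = logp_cur.detach() + 0.5 * torch.randn(8)   # stale behaviour-policy log-probs
print(stale_data_policy_loss(logp_cur, logp_beh, torch.randn(8)))
```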
Yihao Feng retweeted
Dan Alistarh @DAlistarh
🚀 We are releasing state-of-the-art post-training quantization (PTQ) algorithms for Microscaling FP4, together with kernels: - First study focused on MXFP4/NVFP4 PTQ for LLMs - New Micro-Rotated (MR) format and GPTQ algorithm - QuTLASS GPU kernels with up to 3.6x speedups.
Dan Alistarh tweet media
2 replies · 28 reposts · 152 likes · 9.4K views
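Microscaling formats like MXFP4/NVFP4 pair small blocks of low-precision elements with a shared scale. A rough numerical sketch of block-wise scaled FP4 quantization follows; the block size and nearest-value rounding are illustrative, and the Micro-Rotated format, GPTQ variant, and QuTLASS kernels from the release are not shown.

```python
import numpy as np

FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])   # E2M1 magnitudes

def quantize_microscaled_fp4(x, block=32):
    """Block-wise scaled FP4: each block of `block` values shares one scale chosen so
    the block max maps onto the largest FP4 value, then every element is rounded to
    the nearest representable FP4 number. Rough sketch of the format, not the kernels."""
    x = x.reshape(-1, block)
    scale = np.abs(x).max(axis=1, keepdims=True) / FP4_GRID[-1] + 1e-12
    scaled = x / scale
    candidates = np.sign(scaled)[..., None] * FP4_GRID          # signed grid per element
    idx = np.abs(scaled[..., None] - candidates).argmin(-1)     # nearest grid point
    q = np.take(FP4_GRID, idx) * np.sign(scaled)
    return (q * scale).reshape(-1)                               # dequantized values

x = np.random.default_rng(0).normal(size=64).astype(np.float32)
x_hat = quantize_microscaled_fp4(x)
print(np.max(np.abs(x - x_hat)))                                 # per-element quantization error
```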
Yihao Feng retweeted
Zhihu Frontier @ZhihuFrontier
🚀 @AntLingAGI open-sourced Ling 2.0 — the first FP8-native mixed-precision training framework for MoE models. Plug-and-play. So how does it work? Zhihu contributor & Ling Team dev 千千 shares his thinking.

🧠 Why it matters: Ling 2.0 is trained natively in FP8, not BF16 with post-hoc quantization. Using fine-grained tile/block-wise scaling, it avoids outlier distortion and retains BF16-level accuracy.

⚙️ Key Results:
• Up to 30-60% throughput gain with MTP
• Still 90-120% faster without MTP
• Training loss curve matches BF16 — virtually no accuracy drop

💡 Innovations include:
• FP8 Optimizer: compresses 1st & 2nd moments to FP8, saving up to 75% of Adam optimizer memory
• On-demand weight transpose: cuts redundant memory for the backward pass
• FP8 routing maps for more efficient MoE dispatching
• Support for the E4M3 FP8 format in backward — better accuracy than E5M2

🔍 Tackling FP8 challenges:
• Quantization errors (e.g. outliers clipped to 0)
• Value distortion (close values collapsed)
Ling 2.0 adds monitoring for quantization loss, underflow, and FP32 recompute deltas — alerts when precision degrades.

💭 Why it matters long-term: In the BF16-vs-FP8 tradeoff, Ling 2.0 emphasizes "lossless precision" first, then optimizes performance. Combined with 3D parallelism and dynamic memory strategies, it's ideal for training large LLMs under memory constraints.

📖 Full technical post: zhuanlan.zhihu.com/p/195129140928…

#FP8 #LLM #AIInfra #OpenSource #AntGroup #Ling2
Zhihu Frontier tweet media (4 images)
1 reply · 3 reposts · 76 likes · 5.5K views
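The FP8-optimizer idea mentioned above (compressing Adam's first and second moments to 8-bit floats between steps) can be sketched with a per-tensor scale and an E4M3 cast. Ling 2.0 uses finer-grained scaling; this toy version only shows the basic memory/accuracy trade, and it assumes a PyTorch build with float8 dtypes.

```python
import torch

E4M3_MAX = 448.0   # largest finite value representable in FP8 E4M3

def to_fp8_e4m3(t: torch.Tensor):
    """Compress a tensor to FP8 E4M3 with one per-tensor scale (sketch of the
    'FP8 optimizer state' idea; Ling 2.0 uses finer-grained scaling).
    Requires a PyTorch build with float8 dtypes (>= 2.1)."""
    scale = t.abs().max().clamp(min=1e-12) / E4M3_MAX
    q = (t / scale).to(torch.float8_e4m3fn)       # stored state: fp8 payload + one scale
    return q, scale

def from_fp8_e4m3(q, scale):
    return q.to(torch.float32) * scale

m = torch.randn(1024) * 1e-3                      # e.g. an Adam first-moment buffer
q, s = to_fp8_e4m3(m)
m_hat = from_fp8_e4m3(q, s)
print((m - m_hat).abs().max())                    # small reconstruction error
print(q.element_size(), m.element_size())         # 1 byte vs 4 bytes per value
```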
Yihao Feng retweeted
Zhangchen Xu @zhangchen_xu
🚀 Want high-quality, realistic, and truly challenging post-training data for the agentic era?

Introducing Toucan-1.5M (huggingface.co/papers/2510.01…) — the largest open tool-agentic dataset yet:
✨ 1.53M real agentic trajectories synthesized by 3 models
✨ Diverse, challenging tasks across wide domains
✨ Built from 495 MCP servers & 2K+ open-source tools

📈 SOTA results on BFCL & MCP-Universe with just SFT.
💡 Data, models, and pipeline are all open!

Key Features:
- Multi-tool & multi-server tasks for advanced planning
- Multi-turn interactions with auto-generated follow-ups (super-long trajectories)
- Single + parallel function calls → efficient tool use
Zhangchen Xu tweet media
6 replies · 72 reposts · 363 likes · 34.9K views
Yihao Feng retweeted
Nouha Dziri @nouhadziri
🚀 Ever wondered how to make RL work on impossibly hard tasks where pass@k = 0%? 🤔 In our new work, we share the RL Grokking Recipe: a training recipe that enables LLMs to solve previously unsolvable coding problems! I will be at #CoLM2025 next week, so happy to chat about it! We also dive into the heated debate: does RL just sharpen previously learnt skills, or can it unlock genuinely new reasoning? 🔥🔥 Read the full blog here: tinyurl.com/ntarc3kw #AI #RL #NLP #reinforcementlearning #llm
Nouha Dziri tweet media
26 replies · 190 reposts · 1.2K likes · 153.8K views
Yihao Feng retweeted
Shenao Zhang @ShenaoZhang
🚀Excited to share our recent research:🚀 “Learning to Reason as Action Abstractions with Scalable Mid-Training RL” We theoretically study 𝙝𝙤𝙬 𝙢𝙞𝙙-𝙩𝙧𝙖𝙞𝙣𝙞𝙣𝙜 𝙨𝙝𝙖𝙥𝙚𝙨 𝙥𝙤𝙨𝙩-𝙩𝙧𝙖𝙞𝙣𝙞𝙣𝙜 𝙍𝙇. The findings lead to a scalable algorithm for learning action hierarchies from expert demonstrations, which we successfully apply to 𝟭𝘽 Python code data. A thread:🧵
Shenao Zhang tweet media
7 replies · 70 reposts · 436 likes · 59.8K views
Yihao Feng retweeted
elvis @omarsar0
Great work showing prompt synthesis as a new scaling axis for reasoning. Good training data is scarce. This work showcases a framework that might make it possible to construct high-quality training problems for reasoning-focused LLMs. Technical details below:
elvis tweet media
20 replies · 69 reposts · 348 likes · 63.6K views
Yihao Feng retweeted
Tanishq Mathew Abraham, Ph.D. @iScienceLuvr
Language Models that Think, Chat Better "This paper shows that the RLVR paradigm is effective beyond verifiable domains, and introduces RL with Model-rewarded Thinking (RLMT) for general-purpose chat capabilities." "RLMT consistently outperforms standard RLHF pipelines. This includes substantial gains of 3–7 points on three chat benchmarks (AlpacaEval2, WildBench, and ArenaHardV2), along with 1–3 point improvements on other tasks like creative writing and general knowledge. Our best 8B model surpasses GPT-4o in chat and creative writing"
Tanishq Mathew Abraham, Ph.D. tweet media
8 replies · 34 reposts · 241 likes · 17.7K views