Yunzhen Feng
@feeelix_feng
PhD at CDS, NYU. Ex-Intern at GenAI, FAIR @AIatMeta. Previously undergrad at @PKU1898
134 posts · Joined May 2022 · 844 Following · 532 Followers
Yunzhen Feng retweeted
Yuda Song @yus167
RL on LLMs inefficiently uses one scalar per rollout. But users regularly give much richer feedback: "make it formal," "step 3 is wrong." Can we train LLMs on this human-AI interaction? We introduce RL from Text Feedback, with 1) Self-Distillation; 2) Feedback Modeling (1/n) 🧵
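As a rough sketch of how the self-distillation component could work (my own illustration under assumptions, not the paper's method; the `model.generate` / `model.sft_loss` interface and the revision-prompt format are hypothetical):

```python
# Hedged sketch: self-distillation from text feedback.
# A scalar reward carries one number per rollout; text feedback like
# "step 3 is wrong" can instead be folded back into the policy weights.

def self_distill_step(model, prompt, feedback_fn):
    # 1) Roll out a draft response as usual.
    draft = model.generate(prompt)

    # 2) Collect rich text feedback (from a user or a critic model),
    #    e.g. "make it formal" or "step 3 is wrong".
    feedback = feedback_fn(prompt, draft)

    # 3) Let the model revise its own draft *conditioned on* the feedback.
    revision = model.generate(
        f"{prompt}\n\nDraft answer:\n{draft}\n\n"
        f"Feedback:\n{feedback}\n\nRevised answer:"
    )

    # 4) Distill: fine-tune on (prompt -> revision) with the feedback
    #    removed from context, so the correction is absorbed into the weights.
    return model.sft_loss(prompt, target=revision)
```

Step 4 is what would make this distillation rather than prompting: the model learns to produce the feedback-corrected behavior without needing the feedback at inference time.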
Yunzhen Feng retweeted
Shobhita Sundaram @shobsund
Can a model learn to break its own reasoning plateau? In our new paper, we show that LLMs can be taught with meta-RL to generate their own "stepping stones" that kickstart learning on hard math problems (0/128 success rate) where direct RL fails. Paper 📝: arxiv.org/abs/2601.18778 Blog post 🌐: ssundaram21.github.io/soar/ (1/n)
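As a hedged illustration of the loop (my own sketch; the `proposer`/`learner` split, the meta-reward, and all function names are assumptions, not the paper's algorithm):

```python
# Hypothetical stepping-stone meta-RL loop (illustration only).
def stepping_stone_step(proposer, learner, hard_problem, eval_fn, rl_update):
    # Direct RL fails here: e.g., 0/128 rollouts solve the hard problem,
    # so every group is all-wrong and the learner gets no reward signal.
    base_rate = eval_fn(learner, hard_problem)

    # 1) The model proposes an easier intermediate problem for itself.
    stone = proposer.generate(
        f"Write an easier problem whose solution would help with:\n{hard_problem}"
    )

    # 2) Ordinary RL on the stepping stone, where rewards are reachable.
    rl_update(learner, stone)

    # 3) Meta-reward the proposer by how much the learner then improves
    #    on the *original* hard problem.
    proposer.reinforce(stone, eval_fn(learner, hard_problem) - base_rate)
```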
Yunzhen Feng retweeted
Luhuan Wu @hlws_bot
🚀 ML / Applied Math / Stats PhD Opportunities @JohnsHopkins. I'm recruiting PhD students excited about generative modeling, probabilistic inference, and scientific applications (biochemistry, physics, and more), with strong backgrounds in CS/Math/Stats/basic science and a curiosity for advancing ML and solving real-world problems! Apply to our Applied Mathematics and Statistics PhD program by Dec 15, 2025, and become part of the broader @HopkinsDSAI community! engineering.jhu.edu/ams/academics/…
Julia Kempe @KempeLab
Interested in our work on characteristics of *Efficient LLM Reasoning*? Come to our spotlight poster at the Efficient Reasoning workshop at NeurIPS today, Exhibit Hall F, and talk to @feeelix_feng.
Yunzhen Feng @feeelix_feng
I’ll be at #NeurIPS2025 until 12/7!👋 Please reach out if you want to chat about RL, reasoning, self-evolving, or LLM diversity. My presentations: 🌟 Fri, Dec 5 (11a-2p): Spotlight on Synthetic Data Scheduling, #4108 🌟 Sat, Dec 6 (11:30a & 4:30p): Spotlight on evaluating CoT, Hall F
Yunzhen Feng retweeted
Julia Kempe @KempeLab
I will be recruiting 1-2 PhD students at @NYUDataScience or @NYUCourant CS to work on Machine Learning & applications in NYU's vibrant top ML ecosystem. Check Google Scholar to see our latest research interests. Interested? Please mention my name in your application. Deadline: 12/12.
Yunzhen Feng @feeelix_feng
@zorikgekhman Hey Zorik, thanks for the interest in our work. Could you share your email address?
Zorik Gekhman @zorikgekhman
@feeelix_feng Very interesting work, congrats! I want to implement your review-ratio baseline and wondered if you could share the prompt you used for Llama 4 Maverick to label each chunk as progress or review? And also maybe the code you used for chunking with the keywords?
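For readers trying to reproduce this before the authors reply, a hypothetical keyword-based chunker might look like the following; the marker list and sentence-level splitting are my assumptions, not the paper's keywords or the Llama 4 Maverick prompt:

```python
import re

# Assumed review markers; the paper's actual keyword list is not shown here.
REVIEW_MARKERS = ["wait", "let me check", "double-check", "hmm", "actually"]

def chunk_trace(trace: str):
    """Split a reasoning trace into sentences, tagging each as progress/review."""
    sentences = re.split(r"(?<=[.!?])\s+", trace)
    return [
        ("review" if any(m in s.lower() for m in REVIEW_MARKERS) else "progress", s)
        for s in sentences if s.strip()
    ]

print(chunk_trace("Compute 2+3=5. Wait, let me check that. Yes, 5 is right."))
# [('progress', 'Compute 2+3=5.'), ('review', 'Wait, let me check that.'),
#  ('progress', 'Yes, 5 is right.')]
```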
Yunzhen Feng @feeelix_feng
🔥 NEW PAPER: What makes reasoning traces effective in LLMs? Spoiler: It's NOT length or self-checking. We found a simple graph metric that predicts accuracy better than anything else—and proved it causally. 🧵[1/n]
Yunzhen Feng @feeelix_feng
@jxmnop Instruction following ability in the generation?
dr. jack morris @jxmnop
i've been curious about what information LLMs "forget" during RL. recently i spent time combing through research for examples of things models get worse at after RL. turns out that learning to reason makes models better at pretty much everything. scary realization tbh
Yunzhen Feng retweeted
Nikos Tsilivis @nikostsilivis
RL has led to amazing advances in reasoning domains with LLMs. But why has it been so successful, and why does the length of the response increase during RL? In new work, we introduce a framework to provide conceptual and theoretical answers to these questions.
Yunzhen Feng retweeted
Saining Xie @sainingxie
three years ago, DiT replaced the legacy unet with a transformer-based denoising backbone. we knew the bulky VAEs would be the next to go -- we just waited until we could do it right. today, we introduce Representation Autoencoders (RAE). >> Retire VAEs. Use RAEs. 👇(1/n)
Yunzhen Feng @feeelix_feng
Current GRPO wastes compute on negative groups — when all K samples are wrong, you get zero gradient despite full generation cost. We propose a principled fix by bridging reward modeling and policy optimization: 👉 Penalize highly confident wrong answers more to create signal.🧵
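To make the failure mode concrete, here is a minimal numpy illustration (my own sketch; the confidence-shaping rule and `beta` below are assumptions about the fix, not the paper's exact objective):

```python
import numpy as np

def grpo_advantages(rewards):
    """Vanilla GRPO: z-score the rewards within a group of K rollouts."""
    r = np.asarray(rewards, dtype=float)
    if r.std() == 0:             # all-correct or all-wrong group
        return np.zeros_like(r)  # zero advantage -> zero policy gradient
    return (r - r.mean()) / r.std()

# An all-wrong group of K=4 rollouts: full generation cost, no signal.
print(grpo_advantages([0, 0, 0, 0]))          # [0. 0. 0. 0.]

def confidence_shaped_advantages(rewards, seq_logprobs, beta=0.1):
    """Hypothetical fix: penalize confident wrong answers more, so even
    all-wrong groups carry a within-group ranking and hence a gradient."""
    r = np.asarray(rewards, dtype=float)
    conf = np.exp(np.asarray(seq_logprobs, dtype=float))  # sequence probability
    shaped = r - beta * np.where(r == 0, conf, 0.0)       # shape wrong answers only
    return grpo_advantages(shaped)

# Same all-wrong group; differing confidences now create a signal, and the
# most confident wrong rollout (logprob -2.0) gets the largest penalty.
print(confidence_shaped_advantages([0, 0, 0, 0], [-5.0, -20.0, -8.0, -2.0]))
```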
Yunzhen Feng @feeelix_feng
@siddarthv66 Prompt changes do affect eval, but both the baseline training and our method use the same eval setup, for a fair comparison. The experiments are run on university compute, so I wish I could run more. We are experimenting with LoRA for training.
Siddarth Venkatraman @siddarthv66
Why throw the Numina 1.5 results into the appendix then? Regardless, it's another math dataset with comparable difficulty to MATH-500, which we know is grossly insufficient to evaluate these models. There are so many other eval tasks now in 2025; not including any of them is just being lazy. Even minor prompt changes can account for a 1-2% difference on these math tasks. Sorry for saying it's single seed, but again, good RL practice involves reporting std intervals, and 2 seeds aren't enough. If the improvement were significant it would still be justifiable, but definitely not with 1-2% gains.
Siddarth Venkatraman @siddarthv66
> Method is just hacky reward shaping > Only evaluate on MATH-500 (criminal) > Qwen 2.5 and Llama 3.1, with about 1-2% improvement from baselines (single seed) This is awful experimental practice, really disappointing to see MSL/FAIR turning into a paper mill
Quoting Yunzhen Feng @feeelix_feng's GRPO post above.
Yunzhen Feng @feeelix_feng
@siddarthv66 It was due to the page limit that we put Numina 1.5 in the appendix. The eval challenge is mostly for Llama: its accuracy is <1% on AIME25. We do not want GSM or AMC because they're saturated. What other math benchmarks are there that are not contaminated?