Yunzhen Feng
134 posts

Yunzhen Feng
@feeelix_feng
PhD at CDS, NYU. Ex-Intern at GenAI, FAIR @AIatMeta. Previously undergrad at @PKU1898
Joined May 2022
844 Following · 532 Followers
Yunzhen Feng retweeted

Can a model learn to break its own reasoning plateau?
In our new paper, we show that LLMs can be taught with meta-RL to generate their own "stepping stones" that kickstart learning on hard math problems (0/128 success rate) where direct RL fails.
Paper 📝: arxiv.org/abs/2601.18778
Blog post 🌐: ssundaram21.github.io/soar/
(1/n)
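A minimal sketch of the stepping-stone loop described above, not the paper's actual algorithm; `solve_rate`, `propose_stone`, and `rl_update` are hypothetical callables supplied by the caller:

```python
def train_with_stepping_stones(model, hard_problem, solve_rate, propose_stone,
                               rl_update, k=128, max_rounds=10):
    # Hedged sketch of the "stepping stones" idea from the tweet above.
    # solve_rate, propose_stone, and rl_update are hypothetical helpers.
    for _ in range(max_rounds):
        if solve_rate(model, hard_problem, samples=k) > 0:
            break  # direct RL now gets a nonzero success signal
        # Ask the model itself to generate an easier, related problem it can
        # make progress on, then train on that stepping stone first.
        stone = propose_stone(model, hard_problem)
        model = rl_update(model, stone)
    return rl_update(model, hard_problem)  # finish on the original problem
```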


Yunzhen Feng retweeted

🚀 ML / Applied Math / Stats PhD Opportunities @JohnsHopkins
I'm recruiting PhD students excited about generative modeling, probabilistic inference, and scientific applications (biochemistry, physics, and more), with strong backgrounds in CS/Math/Stats/basic science and a curiosity for advancing ML and solving real-world problems!
Apply to our Applied Mathematics and Statistics PhD program by Dec 15, 2025, and become part of the broader @HopkinsDSAI community!
engineering.jhu.edu/ams/academics/…

@codewithimanshu @KempeLab Best reasoning: be accurate first, then improve efficiency

@KempeLab @feeelix_feng That's a very interesting topic, Julia, but I wonder: does efficiency always equal better reasoning?

Interested in our work on characteristics of *Efficient LLM Reasoning*? Come to our spotlight poster at the Efficient Reasoning workshop at NeurIPS today, Exhibit Hall F, and talk to @feeelix_feng.


I’ll be at #NeurIPS2025 until 12/7!👋
Please reach out if you want to chat about RL, reasoning, self-evolving, or LLM diversity.
My presentations:
🌟 Fri, Dec 5 (11a-2p): Spotlight on Synthetic Data Scheduling, #4108
🌟 Sat, Dec 6 (11:30a & 4:30p): Spotlight on evaluating CoT, Hall F


Yunzhen Feng retweeted

I will be recruiting 1-2 PhD students at @NYUDataScience or @NYUCourant CS to work on machine learning and its applications in NYU's vibrant top ML ecosystem. Check Google Scholar for our latest research interests. Interested? Please mention my name in your application. Deadline: 12/12

@zorikgekhman Hey Zorik, thanks for the interest in our work. Could you share your email address?

@feeelix_feng Very interesting work, congrats @feeelix_feng. I want to implement your review ratio baseline and wondered if you could share the prompt you used for Llama 4 Maverick to label each chunk as progress or review? And maybe also the code you used for chunking with the keywords?
Yunzhen Feng retweeted

@AntChen_ @KempeLab @YaqiDuanPKU @jparag123 @tonyjhartshorn @AIatMeta @NYUDataScience 1) Yes, but p* does not sum to 1 over all o. It represents the probability of correctness of o given q.
2) Yes. For the same reason, we need to scale the policy probability into correctness probability.
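To illustrate the distinction being drawn here (a hedged sketch consistent with this reply, not the paper's exact Eqs. 4-5; the normalizer Z(q) is hypothetical):

```latex
% p*(q,o) is a per-answer correctness probability, so it lies in [0,1]
% but need not sum to 1 over answers o, whereas a policy must:
\[
p^*(q,o) \in [0,1], \qquad \sum_{o} p^*(q,o) \neq 1 \ \text{in general},
\qquad \sum_{o} \pi^*(o \mid q) = 1 .
\]
% Recovering a correctness probability from the policy probability therefore
% needs a q-dependent rescaling:
\[
p^*(q,o) = Z(q)\,\pi^*(o \mid q), \qquad Z(q) = \sum_{o'} p^*(q,o') .
\]
```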

@feeelix_feng @KempeLab @YaqiDuanPKU @jparag123 @tonyjhartshorn @AIatMeta @NYUDataScience Cool! Some naive questions to better understand how you estimate confidence, which seems to rely on D(q):
1) Eq. 4 further normalizes pi* when it is already a proper distribution?
2) Substituting (5) into (4) gives p*(q,o) = \sum_o p*(q,o) pi*(o|q)? Or am I misreading the notation?


@josancamon19 @KempeLab @YaqiDuanPKU @jparag123 @tonyjhartshorn Are you referring to the dynamic sampling in DAPO? In DAPO, they oversample and then filter out all the negative groups. In contrast, we aim to recover training signal from those discarded groups.
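A minimal sketch of that contrast, assuming each group is a dict with a 0/1 `rewards` list (hypothetical field names, not DAPO's or the paper's code):

```python
def dapo_dynamic_sampling(groups):
    # DAPO-style dynamic sampling, as described above: oversample prompts,
    # then drop groups whose 0/1 rewards are all wrong or all correct, since
    # group-relative advantages are identically zero for such groups.
    return [g for g in groups if 0 < sum(g["rewards"]) < len(g["rewards"])]

# In contrast, the approach in this thread keeps the all-negative groups and
# reshapes their advantages to recover a signal; see the sketch under the
# quoted tweet below.
```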

@feeelix_feng @KempeLab @YaqiDuanPKU @jparag123 @tonyjhartshorn how’s this different from DAPO rejection sampling?

@siddarthv66 Prompt changes do affect eval, but both the baseline training and our method use the same eval setup, for a fair comparison.
The experiments are run on university compute, so I wish I could run more. We are experimenting with LoRA for training.

Why throw the Numina 1.5 results into the appendix then? Regardless, it's another math dataset with difficulty comparable to MATH-500, which we know is grossly insufficient for evaluating these models. There are so many other eval tasks now in 2025; not including any of them is just lazy. Even minor prompt changes can account for a 1-2% difference on these math tasks. Sorry for saying it's single seed, but again, good RL practice involves reporting std intervals, and 2 seeds aren't enough. If the improvement were significant it would still be justifiable, but definitely not with 1-2% gains.

> Method is just hacky reward shaping
> Only evaluate on MATH-500 (criminal)
> Qwen 2.5 and Llama 3.1, with about 1-2% improvement from baselines (single seed)
This is awful experimental practice, really disappointing to see MSL/FAIR turning into a paper mill

Yunzhen Feng @feeelix_feng
Current GRPO wastes compute on negative groups: when all K samples are wrong, you get zero gradient despite paying full generation cost. We propose a principled fix by bridging reward modeling and policy optimization: 👉 penalize highly confident wrong answers more to create a signal. 🧵
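A minimal sketch of that failure mode and the style of fix described, assuming 0/1 rewards and per-sequence log-probabilities; the penalty form and `alpha` are illustrative, not the paper's exact objective:

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Standard GRPO group-relative advantages for one group of K samples.
    # If all K rewards are identical (e.g., all wrong -> all zero), the
    # advantages are exactly zero, so the group contributes no gradient
    # despite the full generation cost.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def shaped_advantages(rewards: torch.Tensor,
                      seq_logprobs: torch.Tensor,
                      alpha: float = 0.1) -> torch.Tensor:
    # Hedged sketch of the fix, not the paper's exact objective: for an
    # all-wrong group, penalize answers in proportion to the policy's
    # relative confidence, so the group still yields a learning signal.
    # alpha is a hypothetical scaling knob.
    if torch.all(rewards == 0):  # all-negative group: plain GRPO gives zeros
        confidence = seq_logprobs.softmax(dim=0)  # within-group confidence
        return -alpha * confidence  # most confident wrong answers hit hardest
    return grpo_advantages(rewards)
```

For example, with rewards = torch.tensor([0., 0., 0., 0.]) and seq_logprobs = torch.tensor([-5., -20., -30., -35.]), the first (most confident) wrong answer receives the largest penalty, whereas plain GRPO would return all zeros.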

@siddarthv66 It was for the page limit, so we put Numina 1.5 in the appendix.
The eval challenge is mostly for Llama: its accuracy is <1% on AIME25. We do not want GSM or AMC because they're saturated. What other math benchmarks are there that are not contaminated?

@KempeLab @YaqiDuanPKU @jparag123 @tonyjhartshorn @AIatMeta @NYUDataScience Key observations:
(1) Our method continues to improve accuracy after GRPO saturates ⬆️
(2) Our method improves all Pass@k metrics
This matches our intuition: by learning from negative groups, we get better exploration on the hard problems where it matters most.


@KempeLab @YaqiDuanPKU @jparag123 @tonyjhartshorn @AIatMeta @NYUDataScience We experiment on two different training sets with Llama-3.1-8B and Qwen-2.5-3B 📈
For MATH+DAPO, we run two random seeds. Our method consistently outperforms GRPO throughout training, with significant improvements on hard problems (Levels 4-5).

