Chenghao Yang

229 posts

Chenghao Yang

@chrome1996

Ph.D. student @UChicago Ex-SR @google Ex-Scientist @AWS. Ex-RA @jhuCLSP @columbianlp @TsinghuaNLP. Ex-Intern @IBM @AWS. Opinions are my own.

Chicago, Illinois · Joined March 2017
756 Following · 2.1K Followers
Pinned Tweet
Chenghao Yang
Chenghao Yang@chrome1996·
Have you noticed… 🔍 Aligned LLM generations feel less diverse? 🎯 Base models are decoding-sensitive? 🤔 Generations get more predictable as they progress? 🌲 Tree search fails mid-generation (esp. for reasoning)? We trace these mysteries to LLM probability concentration, and introduce Branching Factor (BF) — a simple measure that captures it all. Key Findings: — BF declines over time → generations become more deterministic — Alignment tuning slashes BF → shrinks the generative horizon — Low BF explains decoding sensitivity → fewer good options to prune — CoT stabilizes generation → shifts key info to late, low-BF regions — Avoid late branching → too many low-probability, low-quality continuations — Alignment surfaces low-entropy paths already latent in base models 📜 Paper: arxiv.org/abs/2506.17871 🌐 Website: yangalan123.github.io/branching_fact… 🎥 2-min explainer below. Joint work w/ @universeinanegg , thanks for the constructive feedback from @PeterWestTM @UChicagoCI @zhaoran_wang @_Hao_Zhu @ZhiyuanCS @TenghaoHuang45 @TuhinChakr
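The measure described in the thread (length-averaged entropy, exponentiated into an effective branch count, per the paper's own summary) can be sketched as follows. This is a minimal illustration assuming that definition, not the paper's exact estimator; the function name and inputs are made up for the example:

```python
import math

def branching_factor(step_distributions):
    """Estimate Branching Factor as exp(length-averaged Shannon entropy),
    i.e. the effective number of plausible continuations per step.

    step_distributions: one next-token probability distribution per
    generated token (each a list of probabilities summing to 1).
    """
    total_entropy = sum(
        -sum(p * math.log(p) for p in dist if p > 0)
        for dist in step_distributions
    )
    return math.exp(total_entropy / len(step_distributions))

# A flat (base-model-like) step vs. a concentrated (aligned-like) step:
flat = [[0.25, 0.25, 0.25, 0.25]]    # uniform over 4 tokens -> BF = 4.0
peaked = [[0.97, 0.01, 0.01, 0.01]]  # probability concentrated -> BF near 1
```

On this toy input, alignment-style concentration drives BF from 4 effective branches down toward 1, which is the "shrinking generative horizon" the thread describes.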
Chenghao Yang
Chenghao Yang@chrome1996·
Nice share! I agree with most of that -- taste, judgment, and those higher-level capabilities will be more important. It would be great if researchers could finally be set free from doing a laundry list of "experiments to make reviewer #2 happy". I have some follow-up thoughts on long-context eval to share (forgive my nerdiness :-) ): 1) There have been debates on long-context eval -- basically, whether those metrics, combined with the eval datasets, really measure a model's long-context capability. I think there is a potential confounder: some prepended long context may actually not be that useful, as the topic can shift (e.g., multi-round interactions, storywriting). This may bring us to the "data-or-model" dilemma when debugging. 2) Predicting w/ and w/o context reminds me of the old days when people tried to pretrain the retriever using inverse-cloze tasks (aclanthology.org/P19-1612/). I experimented with similar ideas years ago, when we only had BART (direct.mit.edu/tacl/article/d…). It could win a lot on retrieval-focused eval (most current long-context evals fall into this category), but the dataset curation could be tricky, as we, in the end, want some downstream utility. Yes, in general, predicting performance using stats unrelated to applications would be wonderful. But as we move towards more grounded applications, like agentic tasks, those "dirty details" ("scaffold", "harness", ...) are hard to ignore, and in many cases this is something with which we could win users and make an impact.
Chenghao Yang
Chenghao Yang@chrome1996·
Glad to know that, and thanks for sharing! Yeah, I believe the RL-ed model does not have that much space remaining to tune. We have done some initial exploration of this and developed BF as a unified explanatory framework. But further research and reports are still needed and are more than welcome!
Erika S
Erika S@E_FutureFan·
@chrome1996 @WenhuChen So alignment doesn't add new paths, just locks us into existing low-entropy ones? That explains why CPT on RL-ed models feels so constrained. My own fine-tuning experiments suddenly make more sense.
Wenhu Chen
Wenhu Chen@WenhuChen·
My 2 cents: continued pre-training can only work with an "actual base model". CPT from an RL-ed model just won't give you much gain. I wonder whether there is any more rigorous study of this.
Chenghao Yang
Chenghao Yang@chrome1996·
Thanks to my friends at @OpenAI @GoogleDeepMind , and all the other passionate readers (and my kind interviewers lol) for the great questions! Here is a quick FAQ: 1) Why should I care about BF dynamics? Isn't it just about lexical-level diversity? How does it relate to application-grounded diversity (semantic uncertainty, artificial hivemind, etc.)? Great question! At first glance, BF shares similarities with lexical diversity, as it captures the length-averaged entropy for the whole space. However: Noise: Lexical diversity is known to be confounded by vocabulary size and generation length, correlating poorly with BF and leading to noisy interpretations. The Upper Bound: BF serves as the upper bound for all application-grounded diversity. "Semantics" comes from domain-specific grouping of outputs. BF demonstrates exactly how many instances exist for you to group in the first place. Control: Model probabilities and entropies are the most direct steering factors for training and inference. Studying the structure of LLM probability helps us actually control model outputs. (P.S. I already have follow-up work on RLVR rollout design based on this. Get ready for some hardcore MLSys acceleration to boost RLVR while maintaining stability and precision! 👀) 2) Is BF influenced by data contamination / seeing the prompt during training? Yes and no—it depends on how you define "influence" and "see." In the Appendix, we show that common data contamination metrics do not correlate well with BF. The BF dynamic is fundamentally tied to the structural progress of model generation, not just memorized data. When benchmarking, we intentionally chose model-task combinations to avoid severe contamination impacts while ensuring broad evaluation coverage. That said, data contamination remains an active, open problem in the field, and we welcome more discussion on this!
Chenghao Yang
Chenghao Yang@chrome1996·
BranchingFactor v1.1 just dropped! 🚀 (Yes — it’s an actively updated paper.) (arxiv.org/abs/2506.17871) As models rely more on post-training, understanding the synergy between pre-training and alignment becomes crucial. Branching Factor (BF) offers a simple way to track the remaining generative potential of a model — since entropy inevitably decreases during generation, BF measures that process. What’s new in v1.1: 1️⃣ Major rewrite We now introduce BF directly — much clearer and easier to read. 2️⃣ Theorem correction + extension Thanks to @StarLi27496427 and Yuwei for catching my misunderstanding of the AEP theorem! We fixed the derivation and extended it to variable-length LLM outputs. The good news: the main result still holds — length-avg log-likelihood can estimate length-avg entropy for sufficiently long generations, in a memory-efficient way. Useful if you want to monitor entropy during training or inference. 3️⃣ Broader evaluation Added experiments on OLMo2 and Qwen3, plus multilingual and long-context tasks. Key findings so far still mostly hold: 📉 BF decreases during generation ✂️ Alignment significantly reduces BF ⚖️ Interestingly, OLMo2 appears less aggressively shrunk by alignment than Qwen3/Llama3 (preliminary observation). 4️⃣ SFT vs RL analysis We started dissecting how SFT and RL affect BF. Early signals from OLMo2: 🧠 Smaller models: BF shrinkage mostly happens during SFT (possible memorization effect). 🏗️ Larger models: SFT and RL have comparable impact. Still very preliminary — but it raises interesting questions about how post-training should scale with model size.
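The entropy-monitoring idea in point 2 (length-averaged negative log-likelihood concentrating around the entropy rate for long generations, via AEP) can be sketched as follows. A toy i.i.d. "model" stands in for an LLM here, and `sample_logprob` is a hypothetical callable, not an interface from the paper:

```python
import math
import random

def estimate_entropy_rate(sample_logprob, n_samples=100, rng=None):
    """AEP-style estimate: for sufficiently long generations,
    -(1/T) * log p(x_1..x_T) concentrates around the per-token entropy
    rate, so averaging it over sampled generations gives a
    memory-efficient entropy monitor (no full vocab distribution needed).

    sample_logprob: callable taking an RNG and returning
    (seq_len, total_logprob) for one sampled generation.
    """
    rng = rng or random.Random(0)
    total = 0.0
    for _ in range(n_samples):
        seq_len, total_logprob = sample_logprob(rng)
        total += -total_logprob / seq_len
    return total / n_samples

# Toy i.i.d. "model": each token uniform over a 4-token vocabulary,
# so the true per-token entropy rate is log(4).
def toy_sampler(rng, vocab=4, length=50):
    return length, length * math.log(1.0 / vocab)
```

For this degenerate toy model the estimate matches log(4) exactly; for a real LLM the per-sample values fluctuate, and the average converges as generations get longer.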
Chenghao Yang
Chenghao Yang@chrome1996·
@JustinLin610 Heartbreaking news. Big thanks for all your work on Qwen! Qwen is nothing without its people. Wish you all the best!
Junyang Lin
Junyang Lin@JustinLin610·
me stepping down. bye my beloved qwen.
Chenghao Yang
Chenghao Yang@chrome1996·
I will be doing my PhD defense today! Come and learn about my Grounded Alignment work! Detailed information (w/ Zoom) below:
Candidate: Chenghao Yang
Date: Friday, January 30, 2026
Time: 2 pm CST
Location: John Crerar Library 298
Zoom: uchicago.zoom.us/j/96014992390?…
Meeting ID: 960 1499 2390
Passcode: 644684
Chenghao Yang
Chenghao Yang@chrome1996·
@m2saxon @jxmnop @universeinanegg Thanks, Michael! This blog looks really nice! Fun fact: when I initiated my branching factor project with @universeinanegg, I was actually thinking about persona collapse. Later, we decided to generalize our findings, and that's when Branching Factor came out!
dr. jack morris
dr. jack morris@jxmnop·
there are so many ways to make an "AI assistant", and yet all the ones that exist have almost the same personality how does post-training turn all LLMs into emojipilled markdownslop infodumpers? no human speaks like this. is this somehow the 'high-reward regime' of RLHF?
Chenghao Yang
Chenghao Yang@chrome1996·
@zhuokaiz Exciting! Glad we all thought about model collaboration! My collaborator @YichenZW has a work on collaborating base and aligned models, achieving a better diversity-quality trade-off with user-defined routers. Check it out! (He is looking for interns!): twitter.com/YichenZW/statu…
Yichen (Zach) Wang@YichenZW

Lack of diversity in your LLM generation? (also noted by Artificial Hivemind, best paper @NeurIPSConf) Time to bring your base model back! An inference-time, token-level collaboration between a base and an aligned model can optimize and control diversity and quality!

Zhuokai Zhao
Zhuokai Zhao@zhuokaiz·
Meta × TBD Lab × CMU × UChicago × UMaryland In our latest work, we introduce Token-Level LLM Collaboration via FusionRoute 📝: arxiv.org/pdf/2601.05106 LLMs have come a long way, but we continue to face the same trade-off: – one huge model that kind of does everything, but is expensive and inefficient, or – many small specialist models that are cheap, but brittle outside their comfort zones We’ve tried a lot of things in between — model merging, MoE, sequence-level agents, token-level routing, controlled decoding, etc. Each helps a bit, but all come with real limitations. A key realization behind FusionRoute is: Pure token-level model selection is fundamentally limited, unless you assume unrealistically strong global coverage. We show this formally. And then we fix it by letting the same router also generate. Concretely, FusionRoute is a lightweight router LLM that – performs token-level model selection, and – directly contributes complementary logits to refine or correct the selected specialist when it fails So it's not "routing + another model" — the router itself is part of the decoding policy as well. This turns token-level collaboration from a brittle "pick-an-expert" problem into a strictly more expressive policy. No joint training of specialized models. No model merging. No full multi-agent rollouts. In our experiments, FusionRoute works across math, coding, instruction following, and consistently outperforms sequence-level collaboration, prior token-level methods, model merging, and even direct fine-tuning. Feeling especially timely as LLM systems (e.g., GPT-5) move toward routing-based, heterogeneous model stacks (whether prompt-level or test-time).
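The two-part idea in the thread (a router that both selects a specialist per token and contributes its own complementary logits) can be sketched as a toy illustration. This is not the paper's FusionRoute implementation; the function name, the `alpha` mixing weight, and the input shapes are all assumptions for the example:

```python
import numpy as np

def fused_next_token_logits(specialist_logits, selection_scores,
                            router_token_logits, alpha=0.5):
    """Toy sketch of token-level collaboration: the router (a) selects a
    specialist for this token and (b) mixes in its own complementary
    logits to refine the selected specialist's prediction.

    specialist_logits: dict name -> logit vector over the vocabulary.
    selection_scores: router's per-specialist selection scores.
    router_token_logits: router's own logits over the vocabulary.
    alpha: assumed mixing weight for the router's contribution.
    """
    names = list(specialist_logits)
    chosen = names[int(np.argmax(selection_scores))]
    # The router is part of the decoding policy, not just a dispatcher:
    fused = (1 - alpha) * specialist_logits[chosen] + alpha * router_token_logits
    return chosen, fused

specialists = {"math": np.array([2.0, 0.0]), "code": np.array([0.0, 2.0])}
chosen, fused = fused_next_token_logits(
    specialists, np.array([1.0, 0.0]), np.array([0.0, 1.0]))
# chosen == "math"; fused == [1.0, 0.5]
```

The fused logits differ from any single specialist's, which is the sense in which this policy is strictly more expressive than pure pick-an-expert routing.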
Chenghao Yang
Chenghao Yang@chrome1996·
Check out @YichenZW's great work on collaborating base and aligned models to achieve a better diversity-quality trade-off! Yichen is an amazing collaborator with strong passion and clear communication. He is looking for a Summer 2026 research intern. Don’t miss out!
Yichen (Zach) Wang@YichenZW

Lack of diversity in your LLM generation? (also noted by Artificial Hivemind, best paper @NeurIPSConf) Time to bring your base model back! An inference-time, token-level collaboration between a base and an aligned model can optimize and control diversity and quality!

Yufei Tian
Yufei Tian@yufei_t·
👋👋Like everyone, I’m heading to SD for #NeurIPS! First ML conference since wrapping up my PhD and joining OpenAI. These days I think about: - post-training for personal & private context - long-horizon, proactive, personalized agents - memory, skills, cognition & context engineering - maintaining diversity & creativity for LLMs Please please overwhelm me with interesting papers! If any of the above resonates (or you’re curious about life/work at OpenAI), come say hi! ☕️🌮 Would love to meet old and new friends.
Jasmine Wang
Jasmine Wang@j_asminewang·
Today, OpenAI is launching a new Alignment Research blog: a space for publishing more of our work on alignment and safety more frequently, and for a technical audience. alignment.openai.com
Chenghao Yang
Chenghao Yang@chrome1996·
@uzaymacar Sorry if my responses are a bit broken -- I am at NeurIPS. Happy to chat virtually if anything was not clear! I am also preparing more updates on this paper, including multilingual evals, a rewritten theory framing, and a preliminary dissection of SFT vs. RL impacts on BF reduction!
Uzay Macar
Uzay Macar@uzaymacar·
Cool, I'll be sure to check this out! Some of your findings there (BF declining over time & CoT stabilizing generation) seem related to what we observed in arxiv.org/abs/2506.19143. One thing I'd be curious about re: BF is whether you observed non-monotonicity at specific steps. Our resampling curves showed spikes/dips around planning and uncertainty-management steps in the CoT, e.g., the model exploring a sub-optimal approach to a math problem and then eventually backtracking to the correct approach.
Chenghao Yang
Chenghao Yang@chrome1996·
Definitely agree! A single sample often yields noisy interpretations, and we may miss a lot of hidden information on alternative "branches" (I love the tree-like animation!). We have a similar study on the "branching structure" of LLM outputs: x.com/chrome1996/sta…. Our study shows that, while tree search may look prohibitive, for aligned models, where probability gets highly concentrated, we can actually recover most of the information with only a few rollouts, since a handful of samples already covers the high-probability mass!
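The few-rollouts point can be illustrated numerically: for a concentrated (low-BF) output distribution, a handful of distinct samples already covers most of the probability mass, while a flat (high-BF) one needs many more. These are toy distributions, not the paper's models:

```python
def samples_to_cover(probs, mass=0.9):
    """Smallest number of distinct outputs whose total probability
    reaches `mass` (small tolerance guards against float round-off)."""
    covered, n = 0.0, 0
    for p in sorted(probs, reverse=True):
        covered += p
        n += 1
        if covered + 1e-12 >= mass:
            return n
    return n

concentrated = [0.7, 0.2, 0.05, 0.05]  # aligned-like, low BF
flat = [0.1] * 10                      # base-like, high BF

# samples_to_cover(concentrated) -> 2; samples_to_cover(flat) -> 9
```

Two outputs already cover 90% of the concentrated distribution, versus nine for the flat one, which is why a few rollouts suffice once alignment has concentrated the probability.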
Uzay Macar@uzaymacar

New paper: You can’t interpret LLM reasoning from one chain-of-thought. You must study a distribution of possible trajectories! Repeated sampling reveals: self-preservation doesn’t drive LLM blackmail, unfaithful reasoning reflects a biased path, & resampling steers behavior. 🧵

Chenghao Yang
Chenghao Yang@chrome1996·
@uzaymacar Also, I read that thought anchors paper! Great work! I previously shared it within the UChicago community with folks interested in MechInterp lol
Chenghao Yang
Chenghao Yang@chrome1996·
Yeah, spikes in uncertainty can happen and correlate strongly with planning and route-switching behaviors -- the RLVR community finds this as well (e.g., arxiv.org/abs/2506.01939). Our BF-reduction observation indicates that those high-uncertainty parts are more likely to occur at the beginning of the generation (the "planning phase"), across closed and open-ended domains. I have a follow-up paper that uses this dynamic to encourage diverse reasoning-trajectory generation (while staying roughly on-policy) in the RL process: x.com/chrome1996/sta…
Chenghao Yang
Chenghao Yang@chrome1996·
Definitely agree on the distributional perspective! A single sample often yields noisy outputs, and we may miss a lot of hidden information on alternative "branches" (I love the tree-like animation!). We have a similar study on the "branching structure" of LLM outputs: x.com/chrome1996/sta…. Our study shows that, while tree search may look prohibitive, for aligned models, where probability gets highly concentrated, we can actually recover most of the information with only a few rollouts, since a handful of samples already covers the high-probability mass!
Uzay Macar
Uzay Macar@uzaymacar·
New paper: You can’t interpret LLM reasoning from one chain-of-thought. You must study a distribution of possible trajectories! Repeated sampling reveals: self-preservation doesn’t drive LLM blackmail, unfaithful reasoning reflects a biased path, & resampling steers behavior. 🧵
Chenghao Yang retweeted
NewInML @ NeurIPS 2025
NewInML @ NeurIPS 2025@NewInML·
We are starting in 15 mins!! Join us for the #NewInML workshop happening now at @NeurIPSConf! Location: Upper Ballroom 31ABC, San Diego Convention Centre. We have an amazing speaker lineup that you don’t want to miss!
Chenghao Yang retweeted
Arian Khorasani 🦅
Arian Khorasani 🦅@Arian_Khorasani·
📢Our @NewInML workshop will be happening today at the Room Upper 31ABC, San Diego Convention Center, starting at 12 PM! If you're at @NeurIPSConf, it's a great opportunity for you to join us! We also have amazing speakers! We're looking forward to welcoming you! #NeurIPS2025
Arian Khorasani 🦅@Arian_Khorasani

🚨 New in ML Workshop at @NeurIPSConf We're so excited to invite you to the New In ML Workshop (@NewInML), taking place on Tuesday, December 2nd, 2025, at the San Diego Convention Center! Great opportunity, specifically for people who are new in machine learning! Details🧵
