Chenghao Yang

229 posts

Chenghao Yang

@chrome1996

Ph.D. student @UChicago Ex-SR @google Ex-Scientist @AWS. Ex-RA @jhuCLSP @columbianlp @TsinghuaNLP. Ex-Intern @IBM @AWS. Opinions are my own.

Chicago, Illinois · Joined March 2017
756 Following · 2.1K Followers
Pinned Tweet
Chenghao Yang
Chenghao Yang@chrome1996·
Have you noticed… 🔍 Aligned LLM generations feel less diverse? 🎯 Base models are decoding-sensitive? 🤔 Generations get more predictable as they progress? 🌲 Tree search fails mid-generation (esp. for reasoning)? We trace these mysteries to LLM probability concentration, and introduce Branching Factor (BF) — a simple measure that captures it all. Key Findings: — BF declines over time → generations become more deterministic — Alignment tuning slashes BF → shrinks the generative horizon — Low BF explains decoding sensitivity → fewer good options to prune — CoT stabilizes generation → shifts key info to late, low-BF regions — Avoid late branching → too many low-probability, low-quality continuations — Alignment surfaces low-entropy paths already latent in base models 📜 Paper: arxiv.org/abs/2506.17871 🌐 Website: yangalan123.github.io/branching_fact… 🎥 2-min explainer below. Joint work w/ @universeinanegg , thanks for the constructive feedback from @PeterWestTM @UChicagoCI @zhaoran_wang @_Hao_Zhu @ZhiyuanCS @TenghaoHuang45 @TuhinChakr
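The measure described in the thread (length-averaged entropy, exponentiated into an effective branch count, per the paper's own summary) can be sketched as follows. This is a minimal illustration assuming that definition, not the paper's exact estimator; the function name and inputs are made up for the example:

```python
import math

def branching_factor(step_distributions):
    """Estimate Branching Factor as exp(length-averaged Shannon entropy),
    i.e. the effective number of plausible continuations per step.

    step_distributions: one next-token probability distribution per
    generated token (each a list of probabilities summing to 1).
    """
    total_entropy = sum(
        -sum(p * math.log(p) for p in dist if p > 0)
        for dist in step_distributions
    )
    return math.exp(total_entropy / len(step_distributions))

# A flat (base-model-like) step vs. a concentrated (aligned-like) step:
flat = [[0.25, 0.25, 0.25, 0.25]]    # uniform over 4 tokens -> BF = 4.0
peaked = [[0.97, 0.01, 0.01, 0.01]]  # probability concentrated -> BF near 1
```

On this toy input, alignment-style concentration drives BF from 4 effective branches down toward 1, which is the "shrinking generative horizon" the thread describes.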
Chenghao Yang
Chenghao Yang@chrome1996·
Nice share! I agree with most of that -- taste, judgment, and those higher-level capabilities will be more important. It would be great if researchers could finally be set free from doing a laundry list of "experiments to make reviewer #2 happy". I have some follow-up thoughts on long-context eval to share (forgive my nerdiness :-) ): 1) There have been debates on long-context eval -- basically, whether those metrics, combined with the eval datasets, really measure a model's long-context capability. I think there is a potential confounder: some prepended long context may actually not be that useful, as the topic can shift (e.g., multi-round interactions, storywriting). This may bring us to the "data-or-model" dilemma when debugging. 2) Predicting w/ and w/o context reminds me of the old days when people tried to pretrain the retriever using inverse-cloze tasks (aclanthology.org/P19-1612/). I experimented with similar ideas years ago, when we only had BART (direct.mit.edu/tacl/article/d…). It could win a lot on retrieval-focused eval (most current long-context evals fall into this category), but the dataset curation could be tricky, as we, in the end, want some downstream utility. Yes, in general, predicting performance using stats unrelated to applications would be wonderful. But as we move towards more grounded applications, like agentic tasks, those "dirty details" ("scaffold", "harness", ...) are hard to ignore, and in many cases this is something with which we could win users and make an impact.
Chenghao Yang
Chenghao Yang@chrome1996·
Glad to know that, and thanks for sharing! Yeah, I believe the RL-ed model does not have that much space remaining to tune. We have done some initial exploration of this and developed BF as a unified explanatory framework. But further research and reports are still needed and are more than welcome!
Erika S
Erika S@E_FutureFan·
@chrome1996 @WenhuChen So alignment doesn't add new paths, just locks us into existing low-entropy ones? That explains why CPT on RL-ed models feels so constrained. My own fine-tuning experiments suddenly make more sense.
Wenhu Chen
Wenhu Chen@WenhuChen·
My 2 cents: continued pre-training can only work with an "actual base model". CPT from an RL-ed model just won't give you much gain. I wonder whether there is any more rigorous study of this.
Chenghao Yang
Chenghao Yang@chrome1996·
Thanks to my friends at @OpenAI @GoogleDeepMind , and all the other passionate readers (and my kind interviewers lol) for the great questions! Here is a quick FAQ: 1) Why should I care about BF dynamics? Isn't it just about lexical-level diversity? How does it relate to application-grounded diversity (semantic uncertainty, artificial hivemind, etc.)? Great question! At first glance, BF shares similarities with lexical diversity, as it captures the length-averaged entropy for the whole space. However: Noise: Lexical diversity is known to be confounded by vocabulary size and generation length, correlating poorly with BF and leading to noisy interpretations. The Upper Bound: BF serves as the upper bound for all application-grounded diversity. "Semantics" comes from domain-specific grouping of outputs. BF demonstrates exactly how many instances exist for you to group in the first place. Control: Model probabilities and entropies are the most direct steering factors for training and inference. Studying the structure of LLM probability helps us actually control model outputs. (P.S. I already have follow-up work on RLVR rollout design based on this. Get ready for some hardcore MLSys acceleration to boost RLVR while maintaining stability and precision! 👀) 2) Is BF influenced by data contamination / seeing the prompt during training? Yes and no—it depends on how you define "influence" and "see." In the Appendix, we show that common data contamination metrics do not correlate well with BF. The BF dynamic is fundamentally tied to the structural progress of model generation, not just memorized data. When benchmarking, we intentionally chose model-task combinations to avoid severe contamination impacts while ensuring broad evaluation coverage. That said, data contamination remains an active, open problem in the field, and we welcome more discussion on this!
Chenghao Yang
Chenghao Yang@chrome1996·
BranchingFactor v1.1 just dropped! 🚀 (Yes — it’s an actively updated paper.) (arxiv.org/abs/2506.17871) As models rely more on post-training, understanding the synergy between pre-training and alignment becomes crucial. Branching Factor (BF) offers a simple way to track the remaining generative potential of a model — since entropy inevitably decreases during generation, BF measures that process. What’s new in v1.1: 1️⃣ Major rewrite We now introduce BF directly — much clearer and easier to read. 2️⃣ Theorem correction + extension Thanks to @StarLi27496427 and Yuwei for catching my misunderstanding of the AEP theorem! We fixed the derivation and extended it to variable-length LLM outputs. The good news: the main result still holds — length-avg log-likelihood can estimate length-avg entropy for sufficiently long generations, in a memory-efficient way. Useful if you want to monitor entropy during training or inference. 3️⃣ Broader evaluation Added experiments on OLMo2 and Qwen3, plus multilingual and long-context tasks. Key findings so far still mostly hold: 📉 BF decreases during generation ✂️ Alignment significantly reduces BF ⚖️ Interestingly, OLMo2 appears less aggressively shrunk by alignment than Qwen3/Llama3 (preliminary observation). 4️⃣ SFT vs RL analysis We started dissecting how SFT and RL affect BF. Early signals from OLMo2: 🧠 Smaller models: BF shrinkage mostly happens during SFT (possible memorization effect). 🏗️ Larger models: SFT and RL have comparable impact. Still very preliminary — but it raises interesting questions about how post-training should scale with model size.
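The entropy-monitoring idea in point 2 (length-averaged negative log-likelihood concentrating around the entropy rate for long generations, via AEP) can be sketched as follows. A toy i.i.d. "model" stands in for an LLM here, and `sample_logprob` is a hypothetical callable, not an interface from the paper:

```python
import math
import random

def estimate_entropy_rate(sample_logprob, n_samples=100, rng=None):
    """AEP-style estimate: for sufficiently long generations,
    -(1/T) * log p(x_1..x_T) concentrates around the per-token entropy
    rate, so averaging it over sampled generations gives a
    memory-efficient entropy monitor (no full vocab distribution needed).

    sample_logprob: callable taking an RNG and returning
    (seq_len, total_logprob) for one sampled generation.
    """
    rng = rng or random.Random(0)
    total = 0.0
    for _ in range(n_samples):
        seq_len, total_logprob = sample_logprob(rng)
        total += -total_logprob / seq_len
    return total / n_samples

# Toy i.i.d. "model": each token uniform over a 4-token vocabulary,
# so the true per-token entropy rate is log(4).
def toy_sampler(rng, vocab=4, length=50):
    return length, length * math.log(1.0 / vocab)
```

For this degenerate toy model the estimate matches log(4) exactly; for a real LLM the per-sample values fluctuate, and the average converges as generations get longer.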
Chenghao Yang
Chenghao Yang@chrome1996·
@JustinLin610 Heartbreaking news. Big thanks for all your work on Qwen! Qwen is nothing without its people. Wish you all the best!
Junyang Lin
Junyang Lin@JustinLin610·
me stepping down. bye my beloved qwen.
Chenghao Yang
Chenghao Yang@chrome1996·
I will be doing my PhD defense today! Come and learn about my Grounded Alignment work! Detailed information (w/ Zoom) below:
Candidate: Chenghao Yang
Date: Friday, January 30, 2026
Time: 2 pm CST
Location: John Crerar Library 298
Zoom: uchicago.zoom.us/j/96014992390?…
Meeting ID: 960 1499 2390
Passcode: 644684
Chenghao Yang
Chenghao Yang@chrome1996·
@m2saxon @jxmnop @universeinanegg Thanks, Michael! This blog looks really nice! Fun fact: when I initiated my branching factor project with @universeinanegg, I was actually thinking about persona collapse. Later, we decided to generalize our findings, and that's when Branching Factor came out!
dr. jack morris
dr. jack morris@jxmnop·
there are so many ways to make an "AI assistant", and yet all the ones that exist have almost the same personality how does post-training turn all LLMs into emojipilled markdownslop infodumpers? no human speaks like this. is this somehow the 'high-reward regime' of RLHF?
Chenghao Yang
Chenghao Yang@chrome1996·
@zhuokaiz Exciting! Glad we all thought about model collaboration! My collaborator @YichenZW has a work on collaborating base and aligned models, achieving a better diversity-quality trade-off with user-defined routers. Check it out! (He is looking for interns!): twitter.com/YichenZW/statu…
Yichen (Zach) Wang@YichenZW

Lack of diversity in your LLM generation? (also noted by Artificial Hivemind, best paper @NeurIPSConf) Time to bring your base model back! An inference-time, token-level collaboration between a base and an aligned model can optimize and control diversity and quality!

Zhuokai Zhao
Zhuokai Zhao@zhuokaiz·
Meta × TBD Lab × CMU × UChicago × UMaryland In our latest work, we introduce Token-Level LLM Collaboration via FusionRoute 📝: arxiv.org/pdf/2601.05106 LLMs have come a long way, but we continue to face the same trade-off: – one huge model that kind of does everything, but is expensive and inefficient, or – many small specialist models that are cheap, but brittle outside their comfort zones We’ve tried a lot of things in between — model merging, MoE, sequence-level agents, token-level routing, controlled decoding, etc. Each helps a bit, but all come with real limitations. A key realization behind FusionRoute is: Pure token-level model selection is fundamentally limited, unless you assume unrealistically strong global coverage. We show this formally. And then we fix it by letting the same router also generate. Concretely, FusionRoute is a lightweight router LLM that – performs token-level model selection, and – directly contributes complementary logits to refine or correct the selected specialist when it fails So it's not "routing + another model" — the router itself is part of the decoding policy as well. This turns token-level collaboration from a brittle "pick-an-expert" problem into a strictly more expressive policy. No joint training of specialized models. No model merging. No full multi-agent rollouts. In our experiments, FusionRoute works across math, coding, instruction following, and consistently outperforms sequence-level collaboration, prior token-level methods, model merging, and even direct fine-tuning. Feeling especially timely as LLM systems (e.g., GPT-5) move toward routing-based, heterogeneous model stacks (whether prompt-level or test-time).
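The two-part idea in the thread (a router that both selects a specialist per token and contributes its own complementary logits) can be sketched as a toy illustration. This is not the paper's FusionRoute implementation; the function name, the `alpha` mixing weight, and the input shapes are all assumptions for the example:

```python
import numpy as np

def fused_next_token_logits(specialist_logits, selection_scores,
                            router_token_logits, alpha=0.5):
    """Toy sketch of token-level collaboration: the router (a) selects a
    specialist for this token and (b) mixes in its own complementary
    logits to refine the selected specialist's prediction.

    specialist_logits: dict name -> logit vector over the vocabulary.
    selection_scores: router's per-specialist selection scores.
    router_token_logits: router's own logits over the vocabulary.
    alpha: assumed mixing weight for the router's contribution.
    """
    names = list(specialist_logits)
    chosen = names[int(np.argmax(selection_scores))]
    # The router is part of the decoding policy, not just a dispatcher:
    fused = (1 - alpha) * specialist_logits[chosen] + alpha * router_token_logits
    return chosen, fused

specialists = {"math": np.array([2.0, 0.0]), "code": np.array([0.0, 2.0])}
chosen, fused = fused_next_token_logits(
    specialists, np.array([1.0, 0.0]), np.array([0.0, 1.0]))
# chosen == "math"; fused == [1.0, 0.5]
```

The fused logits differ from any single specialist's, which is the sense in which this policy is strictly more expressive than pure pick-an-expert routing.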
Chenghao Yang
Chenghao Yang@chrome1996·
Check out @YichenZW's great work on collaborating base and aligned models to achieve a better diversity-quality trade-off! Yichen is an amazing collaborator with strong passion and clear communication. He is looking for a Summer 2026 research intern. Don’t miss out!
Yichen (Zach) Wang@YichenZW

Lack of diversity in your LLM generation? (also noted by Artificial Hivemind, best paper @NeurIPSConf) Time to bring your base model back! An inference-time, token-level collaboration between a base and an aligned model can optimize and control diversity and quality!

Yufei Tian
Yufei Tian@yufei_t·
👋👋Like everyone, I’m heading to SD for #NeurIPS! First ML conference since wrapping up my PhD and joining OpenAI. These days I think about: - post-training for personal & private context - long-horizon, proactive, personalized agents - memory, skills, cognition & context engineering - maintaining diversity & creativity for LLMs Please please overwhelm me with interesting papers! If any of the above resonates (or you’re curious about life/work at OpenAI), come say hi! ☕️🌮 Would love to meet old and new friends.
Jasmine Wang
Jasmine Wang@j_asminewang·
Today, OpenAI is launching a new Alignment Research blog: a space for publishing more of our work on alignment and safety more frequently, and for a technical audience. alignment.openai.com
Chenghao Yang
Chenghao Yang@chrome1996·
@uzaymacar Sorry if my responses are a bit broken -- I am at NeurIPS. Happy to chat virtually if anything was not clear! I am also preparing more updates on this paper, including multilingual evals, a rewritten theory framing, and a preliminary dissection of SFT vs. RL impacts on BF reduction!
Uzay Macar
Uzay Macar@uzaymacar·
Cool, I'll be sure to check this out! Some of your findings there (BF declining over time & CoT stabilizing generation) seem related to what we observed in arxiv.org/abs/2506.19143. One thing I'd be curious about re: BF is whether you observed non-monotonicity at specific steps. Our resampling curves showed spikes/dips around planning and uncertainty-management steps in the CoT, e.g., the model exploring a sub-optimal approach to a math problem and then eventually backtracking to the correct approach.
Chenghao Yang
Chenghao Yang@chrome1996·
Definitely agree! A single sample often yields noisy interpretations, and we may miss a lot of hidden information on alternative "branches" (I love the tree-like animation!). We have a similar study on the "branching structure" of LLM outputs: x.com/chrome1996/sta…. Our study shows that, while tree search may look prohibitive, for aligned models, where probability gets highly concentrated, we can actually recover most of the information with only a few rollouts, since a handful of samples already covers the high-probability mass!
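The few-rollouts point can be illustrated numerically: for a concentrated (low-BF) output distribution, a handful of distinct samples already covers most of the probability mass, while a flat (high-BF) one needs many more. These are toy distributions, not the paper's models:

```python
def samples_to_cover(probs, mass=0.9):
    """Smallest number of distinct outputs whose total probability
    reaches `mass` (small tolerance guards against float round-off)."""
    covered, n = 0.0, 0
    for p in sorted(probs, reverse=True):
        covered += p
        n += 1
        if covered + 1e-12 >= mass:
            return n
    return n

concentrated = [0.7, 0.2, 0.05, 0.05]  # aligned-like, low BF
flat = [0.1] * 10                      # base-like, high BF

# samples_to_cover(concentrated) -> 2; samples_to_cover(flat) -> 9
```

Two outputs already cover 90% of the concentrated distribution, versus nine for the flat one, which is why a few rollouts suffice once alignment has concentrated the probability.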
Uzay Macar@uzaymacar

New paper: You can’t interpret LLM reasoning from one chain-of-thought. You must study a distribution of possible trajectories! Repeated sampling reveals: self-preservation doesn’t drive LLM blackmail, unfaithful reasoning reflects a biased path, & resampling steers behavior. 🧵

Chenghao Yang
Chenghao Yang@chrome1996·
@uzaymacar Also, I read that thought anchors paper! Great work! I previously shared it within the UChicago community with folks interested in MechInterp lol
Chenghao Yang
Chenghao Yang@chrome1996·
Yeah, spikes in uncertainty can happen and correlate strongly with planning and route-switching behaviors -- the RLVR community finds this as well (e.g., arxiv.org/abs/2506.01939). Our BF-reduction observation indicates that those high-uncertainty parts are more likely to occur at the beginning of the generation (the "planning phase"), across closed and open-ended domains. I have a follow-up paper that uses this dynamic to encourage diverse reasoning-trajectory generation (while staying roughly on-policy) in the RL process: x.com/chrome1996/sta…
Chenghao Yang
Chenghao Yang@chrome1996·
Definitely agree on the distributional perspective! A single sample often yields noisy outputs, and we may miss a lot of hidden information on alternative "branches" (I love the tree-like animation!). We have a similar study on the "branching structure" of LLM outputs: x.com/chrome1996/sta…. Our study shows that, while tree search may look prohibitive, for aligned models, where probability gets highly concentrated, we can actually recover most of the information with only a few rollouts, since a handful of samples already covers the high-probability mass!
Uzay Macar
Uzay Macar@uzaymacar·
New paper: You can’t interpret LLM reasoning from one chain-of-thought. You must study a distribution of possible trajectories! Repeated sampling reveals: self-preservation doesn’t drive LLM blackmail, unfaithful reasoning reflects a biased path, & resampling steers behavior. 🧵
Chenghao Yang retweeted
NewInML @ NeurIPS 2025
NewInML @ NeurIPS 2025@NewInML·
We are starting in 15 mins!! Join us for the #NewInML workshop happening now at @NeurIPSConf! Location: Upper Ballroom 31ABC, San Diego Convention Centre. We have an amazing speaker lineup that you don’t want to miss!
Chenghao Yang retweeted
Arian Khorasani 🦅
Arian Khorasani 🦅@Arian_Khorasani·
📢Our @NewInML workshop will be happening today at the Room Upper 31ABC, San Diego Convention Center, starting at 12 PM! If you're at @NeurIPSConf, it's a great opportunity for you to join us! We also have amazing speakers! We're looking forward to welcoming you! #NeurIPS2025
Arian Khorasani 🦅@Arian_Khorasani

🚨 New in ML Workshop at @NeurIPSConf We're so excited to invite you to the New In ML Workshop (@NewInML), taking place on Tuesday, December 2nd, 2025, at the San Diego Convention Center! Great opportunity, specifically for people who are new in machine learning! Details🧵
