Howard Chen

160 posts


@__howardchen

Understanding machine intelligence. PhD @Princeton

Joined August 2013
1.5K Following · 1.2K Followers
Lili Yu@liliyu_lili·
Career update: I’ve joined @thinkymachines. Building ambitious multimodal AI—language, vision, and audio working together from day one—with a kind, world-class team of researchers & builders.
Howard Chen@__howardchen·
Very beautiful, very powerful.
Chieh-Hsin (Jesse) Lai@JCJesseLai

Tired of going back to the original papers again and again? Our monograph is a systematic, fundamental recipe you can rely on! 📘 We’re excited to release 《The Principles of Diffusion Models》— with @DrYangSong, @gimdong58085414, @mittu1204, and @StefanoErmon. It traces the core ideas that shaped diffusion modeling and explains how today’s models work, why they work, and where they’re heading. 🧵You’ll find the link and a few highlights in the thread. We’d love to hear your thoughts and join some discussions! ⚡ Stay tuned for our markdown version, where you can drop your comments!

Howard Chen@__howardchen·
@DimitrisPapail Essentially an updated version of the Shakespeare-typing monkey? But now the monkey gets rewarded and learns (slowly, from scratch). Conceptually, though, it feels more like "inverse RL", where the reward function is simply exact match against the expert demos rather than a learned one.
Dimitris Papailiopoulos@DimitrisPapail·
Another interesting observation: performing SGD on the cross-entropy loss over your text corpus is equivalent to REINFORCE, i.e., on-policy policy gradient, with the binary reward "Did my model generate text from the corpus?"
Dimitris Papailiopoulos@DimitrisPapail

Why is cross-entropy a good loss for language pretraining? (Caveat: this is all known, btw; interestingly, even though there are many viewpoints and intuitions on "why x-ent", they can all be arrived at from a single starting point.)

Here's a simple first-principles derivation that doesn't assume anything about the data distribution. It comes from a very reasonable operational requirement :) "I want my model to sound intelligent." But we can't measure that, so we ask: "I want my model to sound like a human." Although we have access to all texts ever written, we can't quite measure that either, so we instead ask: "I want my model to be as likely as possible to generate one of the texts ever written." Or more bluntly: "I want my model to memorize the training data."

Consider this thought experiment. Given a dataset S of all text ever written by humans, we perform independent trials for each "text" in S:
Sample "sample text" from our model Pr( ; W)
Check: did "sample text" exactly match the original?
Note: we do not condition on anything! We just ask, of all the stuff the model could generate, did we get "text"?

Define success as the event E = "all per-sample checks succeed". The probability of E is the product of the probabilities your model W assigns to the ground-truth texts:

Pr(E) = Π_{text in S} Pr(text; W)

Maximizing log Pr(E) over W gives you the cross-entropy objective. How do you optimize this with SGD?
sample a text from the corpus
compute grad log Pr(token | prefix) for every prefix of the text
update the model

What's elegant is that this same procedure simultaneously:
1) Minimizes the description length of the data under model P( ; W) (compression view)
2) Minimizes the KL divergence to the true distribution, if one exists (though we never assumed one)
3) Implements maximum likelihood estimation

The derivation is straightforward and well-known, but it highlights something important: cross-entropy emerges naturally from wanting exact reproduction of the training data.

P.S. You could have instead asked to maximize Pr(text generated by the model is in the ground truth). Interestingly, optimizing this can lead to mode collapse, since an optimal solution is to always predict a single piece of text from the corpus. Yet the gradients again look like cross-entropy, but with a multiplying factor, i.e., Pr(text; W) grad log Pr(text; W).
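A quick numeric sketch of the two gradient forms discussed above (not from the thread; a toy categorical "model" over a five-text universe in numpy, with a hypothetical two-text corpus S). The cross-entropy gradient sums grad log p over the corpus, while on-policy REINFORCE with the binary in-corpus reward averages, in expectation, to the p-weighted version from the P.S.:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "model": a categorical distribution over a 5-text universe,
# parameterized by logits; the "corpus" S is texts {0, 2}.
logits = np.array([0.5, -1.0, 1.5, 0.0, -0.5])
S = [0, 2]

p = np.exp(logits - logits.max())
p /= p.sum()

# Softmax identity: grad_logits log p[x] = onehot(x) - p
def grad_log_p(x):
    g = -p.copy()
    g[x] += 1.0
    return g

# Cross-entropy / max-likelihood gradient: grad log Pr(E) summed over the corpus.
ce_grad = sum(grad_log_p(x) for x in S)

# The P.S. objective, Pr(model sample lands in S), carries the multiplying
# factor p[x] in its gradient: sum_x p[x] * grad log p[x].
ps_grad = sum(p[x] * grad_log_p(x) for x in S)

# On-policy REINFORCE with binary reward r(y) = 1[y in S] estimates ps_grad:
n = 200_000
ys = rng.choice(len(p), size=n, p=p)
mc = sum(grad_log_p(y) for y in ys if y in S) / n

print(np.max(np.abs(mc - ps_grad)))  # near 0: estimator matches the factored form
```

The check makes the P.S. concrete: the Monte-Carlo policy-gradient estimate converges to the p-weighted gradient, not the unweighted cross-entropy one, which is exactly the "multiplying factor" distinction.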

Howard Chen reposted
Jiayi Geng@JiayiiGeng·
I'm thrilled to share that I've moved to Pittsburgh and joined NeuLab at CMU as a research intern this summer, advised by @gneubig! I'll also start my PhD @LTIatCMU this fall. Feel free to reach out if you're interested in chatting about multi-agent systems, LLMs for scientific discovery, or cognitive science! Special thanks to all the amazing people who've inspired and supported me throughout my master's journey at Princeton, especially my advisors @danqi_chen and Tom Griffiths (@cocosci_lab), and my mentor @__howardchen! I'm deeply grateful for their incredible guidance and encouragement!🐯🎓
Howard Chen reposted
Noam Razin@noamrazin·
The success of RLHF depends heavily on the quality of the reward model (RM), but how should we measure this quality? 📰 We study what makes a good RM from an optimization perspective. Among other results, we formalize why more accurate RMs are not necessarily better teachers! 🧵
Howard Chen reposted
Alex Wettig@_awettig·
🤔 Ever wondered how prevalent some type of web content is during LM pre-training? In our new paper, we propose WebOrganizer which *constructs domains* based on the topic and format of CommonCrawl web pages 🌐 Key takeaway: domains help us curate better pre-training data! 🧵/N
Howard Chen@__howardchen·
@abacaj Smells like model’s reward hacking party lol
anton@abacaj·
The perfect reward function doesn't exis....
Howard Chen@__howardchen·
@archit_sharma97 Not exactly though? I mean the KL is itself E_{y~\pi}[log \pi / \pi_0] so if you want to wrap the expectation outside then it should be log \pi/\pi_0 in the bracket?
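The point in this reply, that the log-ratio sits inside the expectation, is easy to verify numerically. A toy check (not from the thread; two hypothetical categorical policies in numpy) comparing the exact KL with its Monte-Carlo form E_{y~π}[log π(y)/π₀(y)]:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two toy categorical policies over 5 outcomes.
pi  = np.array([0.40, 0.25, 0.15, 0.12, 0.08])
pi0 = np.array([0.20, 0.20, 0.20, 0.20, 0.20])

# Exact KL(pi || pi0) = sum_y pi(y) log(pi(y)/pi0(y))
exact = np.sum(pi * np.log(pi / pi0))

# Monte-Carlo form: KL is an expectation *under pi* of the log-ratio,
# E_{y~pi}[log pi(y)/pi0(y)], i.e., log pi/pi0 goes inside the bracket.
samples = rng.choice(len(pi), size=100_000, p=pi)
mc = np.mean(np.log(pi[samples] / pi0[samples]))

print(abs(mc - exact))  # small: the sampled log-ratio recovers the KL
```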
Archit Sharma@archit_sharma97·
I don't have a paper to write this in but there is an interesting property when thinking about iterative RL(HF) algorithms. It seems natural to use an improved policy to sample new data online when training LLMs -- turns out that this just lowers the weight on the KL constraint!
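One way to see the claim on a toy example, assuming the standard closed-form optimum of the KL-regularized objective, π*(y) ∝ π_ref(y) exp(r(y)/β) (an assumption of this sketch, not stated in the tweet). Iterating the update with the improved policy as the new reference composes the exponential tilts, which is the same as a single round with β halved, i.e., a lower weight on the KL constraint:

```python
import numpy as np

rng = np.random.default_rng(0)

r = rng.normal(size=6)            # arbitrary rewards over 6 outcomes
pi0 = rng.dirichlet(np.ones(6))   # base (reference) policy
beta = 1.0

def kl_opt(pi_ref, r, beta):
    """Closed-form optimum of E_pi[r] - beta * KL(pi || pi_ref)."""
    w = pi_ref * np.exp(r / beta)
    return w / w.sum()

pi1 = kl_opt(pi0, r, beta)        # one round of KL-regularized optimization
pi2 = kl_opt(pi1, r, beta)        # second round, referenced to the improved pi1

# Two rounds at beta match one round at beta/2 (half the KL weight):
direct = kl_opt(pi0, r, beta / 2)
print(np.allclose(pi2, direct))   # True
```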
Howard Chen@__howardchen·
Are we so back or not?
Howard Chen@__howardchen·
At #COLM2024 Monday to Wednesday. Excited to meet new people and catch up with old friends!
Howard Chen reposted
Tianyu Gao@gaotianyu1350·
Very proud to introduce two of our recent long-context works: HELMET (best long-context benchmark imo): shorturl.at/JnBHD ProLong (a cont’d training & SFT recipe + a SoTA 512K 8B model): shorturl.at/XQV7a Here is a story of how we arrived there
Howard Chen reposted
Alex Wettig@_awettig·
How to train long-context LMs? (and beat Llama-3.1 🏆) Many takeaways from our new paper! - Focus on diverse & reliable evaluations (not just perplexity) - Find good sources of long data and high-quality short data - ... A 🧵 on how we produced ProLong, a SoTA 8B 512K model