Howard Chen

160 posts


@__howardchen

Understanding machine intelligence. PhD @Princeton

Joined August 2013
1.5K Following · 1.2K Followers
Lili Yu@liliyu_lili·
Career update: I’ve joined @thinkymachines. Building ambitious multimodal AI—language, vision, and audio working together from day one—with a kind, world-class team of researchers & builders.
Howard Chen@__howardchen·
Very beautiful, very powerful.
Chieh-Hsin (Jesse) Lai@JCJesseLai

Tired of going back to the original papers again and again? Our monograph is a systematic, fundamental recipe you can rely on! 📘 We’re excited to release 《The Principles of Diffusion Models》— with @DrYangSong, @gimdong58085414, @mittu1204, and @StefanoErmon. It traces the core ideas that shaped diffusion modeling and explains how today’s models work, why they work, and where they’re heading. 🧵You’ll find the link and a few highlights in the thread. We’d love to hear your thoughts and join some discussions! ⚡ Stay tuned for our markdown version, where you can drop your comments!

Howard Chen@__howardchen·
@DimitrisPapail Essentially an updated version of the Shakespeare-typing monkey? But now the monkey gets rewarded and learns (slowly, from scratch). Conceptually, though, it feels more like "inverse RL", where the reward function is simply exact match against the expert demos rather than a learned one.
Dimitris Papailiopoulos@DimitrisPapail·
Another interesting observation: performing SGD on the cross-entropy loss over your text corpus is equivalent to REINFORCE, i.e., on-policy policy gradient, with the binary reward "Did my model generate text from the corpus?"
Dimitris Papailiopoulos@DimitrisPapail

Why is cross-entropy a good loss for language pretraining? (Caveat: this is all known, btw; interestingly, even though there are many viewpoints and intuitions on "why x-ent", they can all be arrived at from a single starting point.)

Here's a simple first-principles derivation that doesn't assume anything about the data distribution. It comes from a very reasonable operational requirement :) "I want my model to sound intelligent." But we can't measure that, so we ask: "I want my model to sound like a human." Although we have access to all texts ever written, we can't quite measure that either, so we instead ask: "I want my model to be as likely as possible to generate one of the texts ever written." Or more bluntly: "I want my model to memorize the training data."

Consider this thought experiment. Given a dataset S of all text ever written by humans, we perform independent trials for each "text" in S:
Sample "sample text" from our model Pr( ; W)
Check: did "sample text" exactly match the original?
Note: we do not condition on anything! We just ask, of all the stuff the model could generate, did we get "text"?

Define success as the event E = "all per-sample checks succeed". The probability of E is the product of the probabilities your model W assigns to the ground-truth texts:

Pr(E) = Π_{text in S} Pr(text; W)

Maximizing log Pr(E) over W gives you the cross-entropy objective. How do you optimize this with SGD?
sample a text from the corpus
compute grad log Pr(token | prefix) for every prefix of the text
update the model

What's elegant is that this same procedure simultaneously:
1) Minimizes the description length of the data under model P( ; W) (compression view)
2) Minimizes the KL divergence to the true distribution, if one exists (though we never assumed one)
3) Implements maximum likelihood estimation

The derivation is straightforward and well-known, but it highlights something important: cross-entropy emerges naturally from wanting exact reproduction of the training data.

P.S. You could have instead asked to maximize Pr(text generated by the model is in the ground truth). Interestingly, optimizing this can lead to mode collapse, since an optimal solution is to always predict a single piece of text from the corpus. Yet the gradients again look like cross-entropy, but with a multiplying factor, i.e., Pr(text; W) grad log Pr(text; W).
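A quick numeric sketch of the two gradient forms discussed above (not from the thread; a toy categorical "model" over a five-text universe in numpy, with a hypothetical two-text corpus S). The cross-entropy gradient sums grad log p over the corpus, while on-policy REINFORCE with the binary in-corpus reward averages, in expectation, to the p-weighted version from the P.S.:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "model": a categorical distribution over a 5-text universe,
# parameterized by logits; the "corpus" S is texts {0, 2}.
logits = np.array([0.5, -1.0, 1.5, 0.0, -0.5])
S = [0, 2]

p = np.exp(logits - logits.max())
p /= p.sum()

# Softmax identity: grad_logits log p[x] = onehot(x) - p
def grad_log_p(x):
    g = -p.copy()
    g[x] += 1.0
    return g

# Cross-entropy / max-likelihood gradient: grad log Pr(E) summed over the corpus.
ce_grad = sum(grad_log_p(x) for x in S)

# The P.S. objective, Pr(model sample lands in S), carries the multiplying
# factor p[x] in its gradient: sum_x p[x] * grad log p[x].
ps_grad = sum(p[x] * grad_log_p(x) for x in S)

# On-policy REINFORCE with binary reward r(y) = 1[y in S] estimates ps_grad:
n = 200_000
ys = rng.choice(len(p), size=n, p=p)
mc = sum(grad_log_p(y) for y in ys if y in S) / n

print(np.max(np.abs(mc - ps_grad)))  # near 0: estimator matches the factored form
```

The check makes the P.S. concrete: the Monte-Carlo policy-gradient estimate converges to the p-weighted gradient, not the unweighted cross-entropy one, which is exactly the "multiplying factor" distinction.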

Howard Chen reposted
Jiayi Geng@JiayiiGeng·
I'm thrilled to share that I've moved to Pittsburgh and joined NeuLab at CMU as a research intern this summer, advised by @gneubig! I'll also start my PhD @LTIatCMU this fall. Feel free to reach out if you're interested in chatting about multi-agent systems, LLMs for scientific discovery, or cognitive science! Special thanks to all the amazing people who've inspired and supported me throughout my master's journey at Princeton, especially my advisors @danqi_chen and Tom Griffiths (@cocosci_lab), and my mentor @__howardchen! I'm deeply grateful for their incredible guidance and encouragement!🐯🎓
Howard Chen reposted
Noam Razin@noamrazin·
The success of RLHF depends heavily on the quality of the reward model (RM), but how should we measure this quality? 📰 We study what makes a good RM from an optimization perspective. Among other results, we formalize why more accurate RMs are not necessarily better teachers! 🧵
Howard Chen reposted
Alex Wettig@_awettig·
🤔 Ever wondered how prevalent some type of web content is during LM pre-training? In our new paper, we propose WebOrganizer which *constructs domains* based on the topic and format of CommonCrawl web pages 🌐 Key takeaway: domains help us curate better pre-training data! 🧵/N
Howard Chen@__howardchen·
@abacaj Smells like model’s reward hacking party lol
anton@abacaj·
The perfect reward function doesn't exis....
Howard Chen@__howardchen·
@archit_sharma97 Not exactly though? I mean the KL is itself E_{y~\pi}[log \pi / \pi_0] so if you want to wrap the expectation outside then it should be log \pi/\pi_0 in the bracket?
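The point in this reply, that the log-ratio sits inside the expectation, is easy to verify numerically. A toy check (not from the thread; two hypothetical categorical policies in numpy) comparing the exact KL with its Monte-Carlo form E_{y~π}[log π(y)/π₀(y)]:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two toy categorical policies over 5 outcomes.
pi  = np.array([0.40, 0.25, 0.15, 0.12, 0.08])
pi0 = np.array([0.20, 0.20, 0.20, 0.20, 0.20])

# Exact KL(pi || pi0) = sum_y pi(y) log(pi(y)/pi0(y))
exact = np.sum(pi * np.log(pi / pi0))

# Monte-Carlo form: KL is an expectation *under pi* of the log-ratio,
# E_{y~pi}[log pi(y)/pi0(y)], i.e., log pi/pi0 goes inside the bracket.
samples = rng.choice(len(pi), size=100_000, p=pi)
mc = np.mean(np.log(pi[samples] / pi0[samples]))

print(abs(mc - exact))  # small: the sampled log-ratio recovers the KL
```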
Archit Sharma@archit_sharma97·
I don't have a paper to write this in but there is an interesting property when thinking about iterative RL(HF) algorithms. It seems natural to use an improved policy to sample new data online when training LLMs -- turns out that this just lowers the weight on the KL constraint!
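One way to see the claim on a toy example, assuming the standard closed-form optimum of the KL-regularized objective, π*(y) ∝ π_ref(y) exp(r(y)/β) (an assumption of this sketch, not stated in the tweet). Iterating the update with the improved policy as the new reference composes the exponential tilts, which is the same as a single round with β halved, i.e., a lower weight on the KL constraint:

```python
import numpy as np

rng = np.random.default_rng(0)

r = rng.normal(size=6)            # arbitrary rewards over 6 outcomes
pi0 = rng.dirichlet(np.ones(6))   # base (reference) policy
beta = 1.0

def kl_opt(pi_ref, r, beta):
    """Closed-form optimum of E_pi[r] - beta * KL(pi || pi_ref)."""
    w = pi_ref * np.exp(r / beta)
    return w / w.sum()

pi1 = kl_opt(pi0, r, beta)        # one round of KL-regularized optimization
pi2 = kl_opt(pi1, r, beta)        # second round, referenced to the improved pi1

# Two rounds at beta match one round at beta/2 (half the KL weight):
direct = kl_opt(pi0, r, beta / 2)
print(np.allclose(pi2, direct))   # True
```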
Howard Chen@__howardchen·
Are we so back or not?
Howard Chen@__howardchen·
At #COLM2024 Monday to Wednesday. Excited to meet new people and catch up with old friends!
Howard Chen reposted
Tianyu Gao@gaotianyu1350·
Very proud to introduce two of our recent long-context works: HELMET (best long-context benchmark imo): shorturl.at/JnBHD ProLong (a cont’d training & SFT recipe + a SoTA 512K 8B model): shorturl.at/XQV7a Here is a story of how we arrived there
Howard Chen reposted
Alex Wettig@_awettig·
How to train long-context LMs? (and beat Llama-3.1 🏆) Many takeaways from our new paper! - Focus on diverse & reliable evaluations (not just perplexity) - Find good sources of long data and high-quality short data - ... A 🧵 on how we produced ProLong, a SoTA 8B 512K model