Margaret Li @ Neurips ‘25

70 posts


@margs_li

👩‍💻 PhD student @UWCSE / @UWNLP & @MetaAI. Formerly RE @FacebookAI Research, @Penn CS | 🏂💃🧋🥯 certified bi-coastal bb IAH/PEK/PHL/NYC/SFO/SEA

Joined June 2019
136 Following · 1K Followers
Pinned Tweet
Margaret Li @ Neurips ‘25 @margs_li
We nearly drove ourselves insane trying to reproduce scaling laws papers 📉 So of course we wrote a paper about it 😵‍💫 1/9
Margaret Li @ Neurips ‘25 @margs_li
(4) How are we optimizing the fit? (Loss? Optimizer? Initialization?) 📈: Do we need to perform a grid search over our initializations? How many points should we try? What happens if we initialize from a hypothesized law (e.g., Chinchilla)? 8/9
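The initialization question above can be made concrete with a small sketch: fit a saturating power law over a grid of starting points and keep the best fit. Everything here is illustrative — the functional form is a generic Chinchilla-style `L(N) = E + A / N^alpha`, and the data points and grid values are made up, not from the paper.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothesized Chinchilla-style form: L(N) = E + A / N**alpha
def scaling_law(n, E, A, alpha):
    return E + A / n ** alpha

# Synthetic "training runs" (illustrative numbers, not real measurements).
rng = np.random.default_rng(0)
sizes = np.array([1e7, 3e7, 1e8, 3e8, 1e9, 3e9])
losses = scaling_law(sizes, 1.7, 400.0, 0.34) * (1 + 0.01 * rng.standard_normal(sizes.size))

# Grid search over initializations; keep the fit with the lowest residual.
best = None
for E0 in (1.0, 2.0):
    for A0 in (100.0, 1000.0):
        for a0 in (0.2, 0.5):
            try:
                popt, _ = curve_fit(scaling_law, sizes, losses,
                                    p0=(E0, A0, a0), maxfev=10000)
            except RuntimeError:
                continue  # this initialization failed to converge
            resid = np.sum((scaling_law(sizes, *popt) - losses) ** 2)
            if best is None or resid < best[0]:
                best = (resid, popt)

E, A, alpha = best[1]
print(f"fitted: E={E:.2f}, A={A:.1f}, alpha={alpha:.2f}")
```

Because power-law fits are ill-conditioned, different initializations can land in different local minima — which is exactly why the choice of grid matters for reproducing a reported law.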
Margaret Li @ Neurips ‘25
RLHF-aligned LMs excel at long-form generation, but how? We show that current models rely on anchor spans ⚓: strings that occur across many samples for the same prompt, forming an implicit outline (visualized in the attached figure).
Margaret Li @ Neurips ‘25
@HJCH0 We're trying to minimize the influence of, e.g., vocabulary distribution on our long-form generation diversity metrics. p=0.9 for RLHF models reflects common practice; for base models, we truncate using the p that most closely matches the RLHF model's statistics. More info in the paper!
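The truncation being discussed is standard nucleus (top-p) sampling: keep the smallest set of tokens whose cumulative probability reaches p, then renormalize. A minimal sketch, with a made-up toy distribution (the function name and numbers are illustrative, not from the paper):

```python
import numpy as np

def top_p_truncate(probs, p):
    """Keep the smallest set of tokens whose cumulative probability
    reaches p, then renormalize (standard nucleus-sampling truncation)."""
    order = np.argsort(probs)[::-1]       # tokens by descending probability
    cum = np.cumsum(probs[order])
    cutoff = np.searchsorted(cum, p) + 1  # smallest prefix covering mass p
    keep = order[:cutoff]
    out = np.zeros_like(probs)
    out[keep] = probs[keep] / probs[keep].sum()
    return out

probs = np.array([0.5, 0.25, 0.15, 0.07, 0.03])
print(top_p_truncate(probs, 0.9))  # the two lowest-probability tokens drop to 0
```

Lowering p cuts more of the tail, which is why matching p across base and RLHF models (rather than fixing one value) makes their diversity statistics comparable.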
Justin Cho @HJCH0
@margs_li Interesting! Quick question, why do you use different p values for the base model vs the RLHF model?
Margaret Li @ Neurips ‘25
We use span-alignment algorithms from bioinformatics to quantify the implicit outline that RLHF'd models use. It turns out that even when using truncated sampling to compensate for differences in diversity, aligned models exhibit significantly more overlap than their base LMs.
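To make the overlap idea concrete, here is a toy stand-in for the measurement: score two samples for the same prompt by the fraction of tokens that fall in shared contiguous spans. This uses stdlib `difflib` rather than the paper's actual bioinformatics alignment, and the function name and example sentences are invented for illustration.

```python
from difflib import SequenceMatcher

def span_overlap(a_tokens, b_tokens):
    """Fraction of tokens covered by shared contiguous spans between two
    samples — a toy proxy for bioinformatics-style span alignment."""
    sm = SequenceMatcher(None, a_tokens, b_tokens, autojunk=False)
    matched = sum(block.size for block in sm.get_matching_blocks())
    return matched / max(len(a_tokens), len(b_tokens))

# Two hypothetical generations sharing an "anchor span" opening.
a = "first , the model states its thesis . second , it lists evidence .".split()
b = "first , the model states its thesis . finally , it concludes .".split()
print(span_overlap(a, b))
```

A high score across many sample pairs for the same prompt is what would indicate anchor spans: recurring strings acting as an implicit outline.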
Margaret Li @ Neurips ‘25
Yes, the sneak peek is a joke, generated by @AnthropicAI's Claude. The message is not, though! We're super excited to discuss modular / sparse LLMs and how we train them ☺️
Margaret Li @ Neurips ‘25
Sneak peek: "My fellow AI practitioners, I come to you today to spread the good news of embarrassingly parallel training of expert models. Too often we limit ourselves to single monolithic models. No more I say! The path to AI enlightenment is through specialization."
Stanford NLP Group@stanfordnlp

For this week's NLP Seminar, we are excited to host @margs_li and @ssgrn ! The talk will happen Thursday at 11 AM PT. Non-Stanford affiliates registration link: forms.gle/cvGobkVshhJcvN…. Information will be sent out one hour before the talk.

Margaret Li @ Neurips ‘25 retweeted
Terra Blevins @TerraBlvns
New paper alert!! ✨ Translate to Disambiguate: Zero-shot Multilingual Word Sense Disambiguation with Pretrained Language Models (PLMs) ✨ We evaluate how well PLMs translate words in context and then leverage this prompting setup to perform zero-shot WSD on 18 languages! 1/n
Margaret Li @ Neurips ‘25 retweeted
Mitchell Wortsman @Mitchnw
Sharing our project on 1) accelerating and 2) stabilizing training for large language-vision models. 1) Towards accelerating training, we introduce SwitchBack, a linear layer for int8 quantized training which matches bfloat16 within 0.1 for CLIP ViT-Huge. arxiv.org/abs/2304.13013
Margaret Li @ Neurips ‘25
@suchenzang @jefrankle @MosaicML differentiating between this and just having, e.g., a code model, a C4 model, a papers model, etc., because I'm interested in how the different restrictions interact — and I want more careful curation and comparable results on the same tasks
Jonathan Frankle @jefrankle
Would anybody be interested in a couple dozen 1B, llama-style (waaaay past Chinchilla) language models trained on different data mixes? I don't know if this question has been well-studied before.