Margaret Li @ Neurips ‘25

70 posts


@margs_li

👩‍💻 PhD student @UWCSE / @UWNLP & @MetaAI. Formerly RE @FacebookAI Research, @Penn CS | 🏂💃🧋🥯 certified bi-coastal bb IAH/PEK/PHL/NYC/SFO/SEA

Joined June 2019
136 Following · 1K Followers
Pinned Tweet
Margaret Li @ Neurips ‘25 @margs_li
We nearly drove ourselves insane trying to reproduce scaling laws papers 📉 So of course we wrote a paper about it 😵‍💫 1/9
Margaret Li @ Neurips ‘25 @margs_li
(4) How are we optimizing the fit? (Loss? Optimizer? Initialization?) 📈: Do we need to perform a grid search over our initializations? How many points should we try? What happens if we initialize from a hypothesized law (e.g., Chinchilla)? 8/9
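The initialization question above can be made concrete with a small sketch: fit a saturating power law over a grid of starting points and keep the best fit. Everything here is illustrative — the functional form is a generic Chinchilla-style `L(N) = E + A / N^alpha`, and the data points and grid values are made up, not from the paper.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothesized Chinchilla-style form: L(N) = E + A / N**alpha
def scaling_law(n, E, A, alpha):
    return E + A / n ** alpha

# Synthetic "training runs" (illustrative numbers, not real measurements).
rng = np.random.default_rng(0)
sizes = np.array([1e7, 3e7, 1e8, 3e8, 1e9, 3e9])
losses = scaling_law(sizes, 1.7, 400.0, 0.34) * (1 + 0.01 * rng.standard_normal(sizes.size))

# Grid search over initializations; keep the fit with the lowest residual.
best = None
for E0 in (1.0, 2.0):
    for A0 in (100.0, 1000.0):
        for a0 in (0.2, 0.5):
            try:
                popt, _ = curve_fit(scaling_law, sizes, losses,
                                    p0=(E0, A0, a0), maxfev=10000)
            except RuntimeError:
                continue  # this initialization failed to converge
            resid = np.sum((scaling_law(sizes, *popt) - losses) ** 2)
            if best is None or resid < best[0]:
                best = (resid, popt)

E, A, alpha = best[1]
print(f"fitted: E={E:.2f}, A={A:.1f}, alpha={alpha:.2f}")
```

Because power-law fits are ill-conditioned, different initializations can land in different local minima — which is exactly why the choice of grid matters for reproducing a reported law.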
Margaret Li @ Neurips ‘25
RLHF-aligned LMs excel at long-form generation, but how? We show that current models rely on anchor spans ⚓: strings that occur across many samples for the same prompt, forming an implicit outline (visualized in the attached figure).
Margaret Li @ Neurips ‘25
@HJCH0 We're trying to minimize the influence of, e.g., vocabulary distribution on our long-form generation diversity metrics. p=0.9 for RLHF models reflects common practice; for base models, we truncate using the p that most closely matches the RLHF model's statistics. More info in the paper!
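The truncation being discussed is standard nucleus (top-p) sampling: keep the smallest set of tokens whose cumulative probability reaches p, then renormalize. A minimal sketch, with a made-up toy distribution (the function name and numbers are illustrative, not from the paper):

```python
import numpy as np

def top_p_truncate(probs, p):
    """Keep the smallest set of tokens whose cumulative probability
    reaches p, then renormalize (standard nucleus-sampling truncation)."""
    order = np.argsort(probs)[::-1]       # tokens by descending probability
    cum = np.cumsum(probs[order])
    cutoff = np.searchsorted(cum, p) + 1  # smallest prefix covering mass p
    keep = order[:cutoff]
    out = np.zeros_like(probs)
    out[keep] = probs[keep] / probs[keep].sum()
    return out

probs = np.array([0.5, 0.25, 0.15, 0.07, 0.03])
print(top_p_truncate(probs, 0.9))  # the two lowest-probability tokens drop to 0
```

Lowering p cuts more of the tail, which is why matching p across base and RLHF models (rather than fixing one value) makes their diversity statistics comparable.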
Justin Cho @HJCH0
@margs_li Interesting! Quick question, why do you use different p values for the base model vs the RLHF model?
Margaret Li @ Neurips ‘25
We use span-alignment algorithms from bioinformatics to quantify the implicit outline that RLHF'd models use. It turns out that even when using truncated sampling to compensate for differences in diversity, aligned models exhibit significantly more overlap than their base LMs.
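To make the overlap idea concrete, here is a toy stand-in for the measurement: score two samples for the same prompt by the fraction of tokens that fall in shared contiguous spans. This uses stdlib `difflib` rather than the paper's actual bioinformatics alignment, and the function name and example sentences are invented for illustration.

```python
from difflib import SequenceMatcher

def span_overlap(a_tokens, b_tokens):
    """Fraction of tokens covered by shared contiguous spans between two
    samples — a toy proxy for bioinformatics-style span alignment."""
    sm = SequenceMatcher(None, a_tokens, b_tokens, autojunk=False)
    matched = sum(block.size for block in sm.get_matching_blocks())
    return matched / max(len(a_tokens), len(b_tokens))

# Two hypothetical generations sharing an "anchor span" opening.
a = "first , the model states its thesis . second , it lists evidence .".split()
b = "first , the model states its thesis . finally , it concludes .".split()
print(span_overlap(a, b))
```

A high score across many sample pairs for the same prompt is what would indicate anchor spans: recurring strings acting as an implicit outline.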
Margaret Li @ Neurips ‘25
Yes, the sneak peek is a joke, generated by @AnthropicAI's Claude. The message is not, though! We're super excited to discuss modular / sparse LLMs and how we train them ☺️
Margaret Li @ Neurips ‘25
Sneak peek: "My fellow AI practitioners, I come to you today to spread the good news of embarrassingly parallel training of expert models. Too often we limit ourselves to single monolithic models. No more I say! The path to AI enlightenment is through specialization."
Stanford NLP Group@stanfordnlp

For this week's NLP Seminar, we are excited to host @margs_li and @ssgrn ! The talk will happen Thursday at 11 AM PT. Non-Stanford affiliates registration link: forms.gle/cvGobkVshhJcvN…. Information will be sent out one hour before the talk.

Margaret Li @ Neurips ‘25 retweeted
Terra Blevins @TerraBlvns
New paper alert!! ✨ Translate to Disambiguate: Zero-shot Multilingual Word Sense Disambiguation with Pretrained Language Models (PLMs) ✨ We evaluate how well PLMs translate words in context and then leverage this prompting setup to perform zero-shot WSD on 18 languages! 1/n
Margaret Li @ Neurips ‘25 retweeted
Mitchell Wortsman @Mitchnw
Sharing our project on 1) accelerating and 2) stabilizing training for large language-vision models. 1) Towards accelerating training, we introduce SwitchBack, a linear layer for int8 quantized training which matches bfloat16 within 0.1 for CLIP ViT-Huge. arxiv.org/abs/2304.13013
Margaret Li @ Neurips ‘25
@suchenzang @jefrankle @MosaicML differentiating between this and just having, e.g., a code model, a C4 model, a papers model, etc., because I'm interested in how the different restrictions interact — and I want more careful curation and comparable results on the same tasks
Jonathan Frankle @jefrankle
Would anybody be interested in a couple dozen 1B, llama-style (waaaay past Chinchilla) language models trained on different data mixes? I don't know if this question has been well-studied before.