Gregory Yauney

27 posts


@gyauney

Cornell PhD student working on ML and NLP

Joined June 2010
140 Following · 230 Followers

Gregory Yauney @gyauney
Our Pretrainer's Guide won an ✨outstanding paper award✨ at #NAACL2024 today! Big congrats to all the coauthors, especially @ShayneRedford (who led this big project), @emilyrreif, @katherine1ee, @dmimno, and @daphneipp! Thanks @naaclmeeting!
Quoting @gyauney:

Come talk to us about pretraining data curation at #NAACL2024 at 2pm at poster session 2! We're presenting A Pretrainer's Guide to Training Data. Paper: aclanthology.org/2024.naacl-lon…

Gregory Yauney retweeted
Shayne Longpre @ShayneRedford
Super appreciative of the recognition from #NAACL2024 — our Pretrainer's Guide won an 🌟Outstanding Paper Award🌟🏆 This was a year-long analysis of pretraining age, quality, and toxicity data filters. Gratitude to our team 🙏🏼 @gyauney @emilyrreif @katherine1ee @ada_rob @denny_zhou @barret_zoph @_jasonwei Kevin @dmimno @daphneipp arxiv.org/abs/2305.13169
Quoting @ShayneRedford:

#NewPaperAlert When and where does pretraining (PT) data matter? We conduct the largest published PT data study, varying: 1⃣ Corpus age 2⃣ Quality/toxicity filters 3⃣ Domain composition We have several recs for model creators… 📜: bit.ly/3WxsxyY 1/ 🧵

Gregory Yauney @gyauney
In the paper, we show that this max random baseline can be a better predictor of whether the best prompt will outperform random guessing on an unseen set. You can use this baseline right away on your own classification tasks! Code: github.com/gyauney/max-ra…
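
The expected maximum accuracy over t independent random classifiers has a simple closed form via the binomial CDF, so the baseline can be computed exactly. Below is a minimal sketch in Python, assuming the standard binomial model for a random guesser (each of the n validation examples is classified correctly with probability p, e.g. p = 1/k for balanced k-way classification); the function name max_random_baseline is illustrative, not necessarily the API of the linked repo.

```python
import numpy as np
from scipy.stats import binom

def max_random_baseline(n: int, p: float, t: int) -> float:
    """Expected maximum accuracy over t independent random classifiers,
    each evaluated on the same n-example validation set."""
    # cdf[a + 1] = P(one random classifier gets <= a examples correct)
    cdf = binom.cdf(np.arange(-1, n + 1), n, p)
    # P(max over t classifiers = a) = F(a)^t - F(a - 1)^t
    expected_correct = sum(
        a * (cdf[a + 1] ** t - cdf[a] ** t) for a in range(n + 1)
    )
    return expected_correct / n

# Trying 10 prompts on a 100-example binary task raises the bar to beat
# well above the naive 1/k baseline.
print(max_random_baseline(n=100, p=0.5, t=10))  # ≈ 0.58, vs. 0.5 naive
```

With 10 prompts on a 100-example binary task, the accuracy to beat is roughly 0.58 rather than 0.5, which is exactly the gap the thread warns about.
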

Gregory Yauney @gyauney
This problem goes away if you have a large validation set, but for the kind of fast-moving settings where in-context learning shines, that’s not always feasible. And there’s nothing wrong with trying lots of prompts! You just have to make sure you factor that into your baseline.

Gregory Yauney @gyauney
Evaluating many prompts on small few-shot datasets can make you think you’ve beaten random guessing when you haven’t! @dmimno and I study a simple drop-in replacement random baseline that protects against validation set reuse and small datasets: arxiv.org/pdf/2404.13020

Gregory Yauney retweeted
Shayne Longpre @ShayneRedford
#NewPaperAlert When and where does pretraining (PT) data matter? We conduct the largest published PT data study, varying: 1⃣ Corpus age 2⃣ Quality/toxicity filters 3⃣ Domain composition We have several recs for model creators… 📜: bit.ly/3WxsxyY 1/ 🧵

Gregory Yauney retweeted
Emily Reif @emilyrreif
When and where does pretraining data matter? New paper on how varying the pretraining data of LLMs affects downstream performance: bit.ly/3WxsxyY But first, what do we know about the data itself? 1/ 🧵

Gregory Yauney @gyauney
Fine-tuned language models make a hard text classification task like MNLI easy, but why? (New work with @dmimno.)