Gregory Yauney

27 posts


@gyauney

Cornell PhD student working on ML and NLP

Joined June 2010
140 Following · 230 Followers

Gregory Yauney @gyauney
Our Pretrainer's Guide won an ✨outstanding paper award✨ at #NAACL2024 today! Big congrats to all the coauthors, especially @ShayneRedford (who led this big project), @emilyrreif, @katherine1ee, @dmimno, and @daphneipp! Thanks @naaclmeeting!
Quoting @gyauney:

Come talk to us about pretraining data curation at #NAACL2024 at 2pm at poster session 2! We're presenting A Pretrainer's Guide to Training Data. Paper: aclanthology.org/2024.naacl-lon…

Gregory Yauney retweeted
Shayne Longpre @ShayneRedford
Super appreciative of the recognition from #NAACL2024 — our Pretrainer's Guide won an 🌟Outstanding Paper Award🌟🏆 This was a year-long analysis of pretraining age, quality, and toxicity data filters. Gratitude to our team 🙏🏼 @gyauney @emilyrreif @katherine1ee @ada_rob @denny_zhou @barret_zoph @_jasonwei Kevin @dmimno @daphneipp arxiv.org/abs/2305.13169
Quoting @ShayneRedford:

#NewPaperAlert When and where does pretraining (PT) data matter? We conduct the largest published PT data study, varying: 1⃣ Corpus age 2⃣ Quality/toxicity filters 3⃣ Domain composition We have several recs for model creators… 📜: bit.ly/3WxsxyY 1/ 🧵

Gregory Yauney @gyauney
In the paper, we show that this max random baseline can be a better predictor of whether the best prompt will outperform random guessing on an unseen set. You can use this baseline right away on your own classification tasks! Code: github.com/gyauney/max-ra…
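
The expected maximum accuracy over t independent random classifiers has a simple closed form via the binomial CDF, so the baseline can be computed exactly. Below is a minimal sketch in Python, assuming the standard binomial model for a random guesser (each of the n validation examples is classified correctly with probability p, e.g. p = 1/k for balanced k-way classification); the function name max_random_baseline is illustrative, not necessarily the API of the linked repo.

```python
import numpy as np
from scipy.stats import binom

def max_random_baseline(n: int, p: float, t: int) -> float:
    """Expected maximum accuracy over t independent random classifiers,
    each evaluated on the same n-example validation set."""
    # cdf[a + 1] = P(one random classifier gets <= a examples correct)
    cdf = binom.cdf(np.arange(-1, n + 1), n, p)
    # P(max over t classifiers = a) = F(a)^t - F(a - 1)^t
    expected_correct = sum(
        a * (cdf[a + 1] ** t - cdf[a] ** t) for a in range(n + 1)
    )
    return expected_correct / n

# Trying 10 prompts on a 100-example binary task raises the bar to beat
# well above the naive 1/k baseline.
print(max_random_baseline(n=100, p=0.5, t=10))  # ≈ 0.58, vs. 0.5 naive
```

With 10 prompts on a 100-example binary task, the accuracy to beat is roughly 0.58 rather than 0.5, which is exactly the gap the thread warns about.
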

Gregory Yauney @gyauney
This problem goes away if you have a large validation set, but for the kind of fast-moving settings where in-context learning shines, that’s not always feasible. And there’s nothing wrong with trying lots of prompts! You just have to make sure you factor that into your baseline.

Gregory Yauney @gyauney
Evaluating many prompts on small few-shot datasets can make you think you’ve beaten random guessing when you haven’t! @dmimno and I study a simple drop-in replacement random baseline that protects against validation set reuse and small datasets: arxiv.org/pdf/2404.13020

Gregory Yauney retweeted
Shayne Longpre @ShayneRedford
#NewPaperAlert When and where does pretraining (PT) data matter? We conduct the largest published PT data study, varying: 1⃣ Corpus age 2⃣ Quality/toxicity filters 3⃣ Domain composition We have several recs for model creators… 📜: bit.ly/3WxsxyY 1/ 🧵

Gregory Yauney retweeted
Emily Reif @emilyrreif
When and where does pretraining data matter? New paper on how varying the pretraining data of LLMs affects downstream performance: bit.ly/3WxsxyY But first, what do we know about the data itself? 1/ 🧵

Gregory Yauney @gyauney
Fine-tuned language models make a hard text classification task like MNLI easy, but why? (New work with @dmimno.)