Stephanie Schoch
@stephschoch

29 posts

PhD candidate working on NLP and data contribution estimation @CS_UVA. Member of @UVA_ILP.

Joined September 2019
187 Following · 69 Followers
Stephanie Schoch reposted
Andrew Lampinen @AndrewLampinen
How do language models generalize from information they learn in-context vs. via finetuning? We show that in-context learning can generalize more flexibly, illustrating key differences in the inductive biases of these modes of learning — and ways to improve finetuning. Thread: 1/
[image]
8 replies · 150 reposts · 763 likes · 102.4K views
Stephanie Schoch @stephschoch
Had a great time presenting this work at the NAACL 2025 Insights Workshop yesterday! We adapted a Monte Carlo sampling method to analyze the impact of the number of in-context examples. aclanthology.org/2025.insights-…
0 replies · 0 reposts · 1 like · 73 views
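The Monte Carlo analysis described in the tweet above can be sketched roughly as follows. This is not the paper's actual method; the function name, the `evaluate` callback, and the toy demonstration pool are all hypothetical stand-ins for real prompting and scoring:

```python
import random

def mc_icl_curve(pool, evaluate, ks, n_samples=100, seed=0):
    """Monte Carlo estimate of expected score as a function of the
    number of in-context examples k: for each k, average `evaluate`
    over random k-element subsets drawn from the demonstration pool."""
    rng = random.Random(seed)
    curve = {}
    for k in ks:
        total = 0.0
        for _ in range(n_samples):
            demos = rng.sample(pool, k)  # one random k-shot prompt
            total += evaluate(demos)
        curve[k] = total / n_samples
    return curve

# Toy usage: a fake evaluator whose score depends only on |demos|.
pool = list(range(10))
curve = mc_icl_curve(pool, lambda demos: float(len(demos)), ks=[0, 2, 4])
print(curve)  # {0: 0.0, 2: 2.0, 4: 4.0}
```

In a real study, `evaluate` would build a prompt from `demos`, query the model on a held-out test set, and return accuracy; the Monte Carlo averaging isolates the effect of k from the choice of which examples are sampled.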
Stephanie Schoch @stephschoch
I’ll be presenting our work “In-Context Learning (and Unlearning) of Length Biases” at NAACL 25 in Hall 3 from 11AM-12:30PM today. Looking forward to chatting about ICL with everyone!
[image]
0 replies · 0 reposts · 5 likes · 115 views
Stephanie Schoch reposted
Alon Albalak @AlbalakAlon
@cwolferesearch If you thought the information on data they release is interesting, you should check out our recent survey on data for LLMs. We include a TON more information about data processing, and most of the information Meta includes in the release isn't particularly new. twitter.com/AlbalakAlon/st…
Alon Albalak @AlbalakAlon

{UCSB|AI2|UW|Stanford|MIT|UofT|Vector|Contextual AI} present a survey on 🔎Data Selection for LLMs🔍. Training data is a closely guarded secret in industry 🤫; with this work we narrow the knowledge gap, advocating for open, responsible, collaborative progress. arxiv.org/abs/2402.16827

1 reply · 12 reposts · 55 likes · 11.6K views
Stephanie Schoch reposted
Rafael Rafailov @ NeurIPS @rm_rafailov
From the LLaMA 3 blog post: they use a combination of rejection sampling, DPO, and PPO for post-training. Really interested to know which tasks/parts of the process each algorithm benefits the most.
[image]
3 replies · 14 reposts · 118 likes · 71.7K views
Stephanie Schoch reposted
Cameron R. Wolfe, Ph.D. @cwolferesearch
LLaMA-3 is a prime example of why training a good LLM is almost entirely about data quality…

TL;DR: Meta released LLaMA-3-8B/70B today and 95% of the technical info we have so far is related to data quality:
- 15T tokens of pretraining data
- More code during pretraining (leads to better reasoning capabilities)
- More efficient tokenizer with larger vocabulary
- Super sophisticated (including LLM components) data quality filtering
- Extensive empirical analysis of data mixture
- Focus on quality filtering of post-training data (for SFT/RLHF/DPO)

All of the cool stuff in this report is related to how to curate data effectively for pre/post-training! This really shows that data curation/filtering is the most difficult and impactful aspect of training foundation models.

(1) Model architecture: Only 5 sentences are provided about the model architecture, which simply state that LLaMA-3 uses a standard decoder-only architecture with grouped query attention to improve inference efficiency (and a longer 8K context). It’s pretty clear that model architectures are becoming standardized, and most of the research focus is going into constructing datasets. In fact, the main architecture modification made by LLaMA-3 is a more efficient tokenizer!

“Llama 3 uses a tokenizer with a vocabulary of 128K tokens that encodes language much more efficiently, which leads to substantially improved model performance.” - from LLaMA-3 blog

(2) Better tokenizer: LLaMA-3 comes with a custom tokenizer with a vocabulary of 128K tokens (LLaMA-2 had a vocabulary of 32K tokens). This tokenizer is more token efficient (i.e., fewer tokens are necessary to encode the same piece of text relative to LLaMA-2), which makes inference more efficient. The authors also note that the new tokenizer improves performance! In other words, making sure that we are encoding the model’s input data correctly is super important.

(3) Massive pretraining corpus: LLaMA-3 is pretrained over 15T tokens of text (5% non-English), which is a 7X improvement over LLaMA-2 and even larger than the 12T pretraining corpus of DBRX. The pretraining corpus also has 4X more code relative to LLaMA-2 (this was a big criticism of LLaMA-2). With this in mind, it’s not a surprise that LLaMA-3 has strong reasoning/code capabilities—several papers have correlated pretraining on code with better downstream reasoning in LLMs.

“We found that previous generations of Llama are surprisingly good at identifying high-quality data, hence we used Llama 2 to generate the training data for the text-quality classifiers that are powering Llama 3.” - from LLaMA-3 blog

(4) Filtering pretraining data: Few concrete details are provided on the filtering process for the pretraining corpus of LLaMA-3, but it’s clear that a lot of filtering is done. These filters include heuristic filters, NSFW filters, semantic deduplication, and text classifiers to predict data quality. Plus, the authors note that LLaMA-2 is very good at detecting text quality, so they use these models in the filtering process (see above). The authors also mention that they do extensive empirical analysis to figure out the correct data mixture (DBRX also mentions this is hugely important).

(5) Overtraining: Chinchilla proposed the compute-optimal training regime for LLMs, but recent work indicates that pretty much everyone overtrains their LLMs relative to the compute-optimal ratio. LLaMA-3 is pretrained on two orders of magnitude more data (for the 8B model) beyond the compute-optimal ratio, and we still see log-linear improvements. Sure, we could train a larger model on fewer tokens and achieve similar performance while spending less on training compute. But this doesn’t consider inference costs! We will almost always pay for more training compute if it means we can deploy a smaller model with the same performance.

“The quality of the prompts that are used in SFT and the preference rankings that are used in PPO and DPO has an outsized influence on the performance of aligned models.” - from LLaMA-3 blog

(6) Post-training data quality: Even beyond pretraining, data quality is pivotal for LLaMA-3! The model is aligned with a combination of SFT, rejection sampling, PPO, and DPO. During alignment, the authors claim that the quality of supervised/preference data is super important. In fact, the biggest quality improvements in LLaMA-3 came from curating this data and performing multiple rounds of quality assurance on human annotations!
[image]
21 replies · 209 reposts · 874 likes · 105.9K views
Stephanie Schoch reposted
Jason Stock @itsstock
Chat with MLX 🚀: a high-performance macOS app linking your local docs to a custom large language model (LLM) on your machine 🧵 Now open-source in beta! github.com/mlx-chat/mlx-c… Collaboratively built by @itsstock & @parkersmith
[image]
4 replies · 18 reposts · 99 likes · 12.6K views
Stephanie Schoch reposted
Matthew Berman @MatthewBerman
OpenAI just dropped their Prompt Engineering guide. Here are 6 strategies they recommend for getting better results from LLMs:
67 replies · 591 reposts · 5.3K likes · 2M views
Stephanie Schoch reposted
Alon Jacovi @alon_jacovi
Worried about test data being used in training? The LLM world is going through a data contamination crisis. Here's us trying to do something about it: Paper: arxiv.org/abs/2305.10160 Blog: medium.com/@alonjacovi/st… w/ @clu_avi @omerNLP @yoavgo
[image]
7 replies · 69 reposts · 257 likes · 48.3K views
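One simple way to probe for the contamination the tweet above worries about is an n-gram overlap check between a training corpus and a test set. This is far weaker than what the linked paper proposes (it argues for protecting test data itself, e.g. not uploading it in plain text); the function names and toy data below are illustrative only:

```python
def ngrams(text, n):
    """Set of whitespace-token n-grams in `text`."""
    toks = text.split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def contamination_rate(train_docs, test_docs, n=8):
    """Fraction of test docs sharing at least one n-gram with training data."""
    train_grams = set()
    for doc in train_docs:
        train_grams |= ngrams(doc, n)
    hits = sum(1 for doc in test_docs if ngrams(doc, n) & train_grams)
    return hits / len(test_docs)

# Toy example with trigrams: one of two test docs leaks from training.
train = ["the quick brown fox jumps over the lazy dog"]
test = ["we saw the quick brown fox yesterday",
        "a totally unrelated sentence here"]
print(contamination_rate(train, test, n=3))  # 0.5
```

Real contamination audits use longer n-grams, normalization, and near-duplicate detection; exact n-gram matching only catches verbatim leakage.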
Stephanie Schoch reposted
Yangfeng Ji @yangfeng_ji
Our group released Valda, a Python package for data valuation in machine learning. It supports five methods (LOO, Influence Function, TMC-Shapley, Beta-Shapley, and CS-Shapley) via a unified API. Please try it out if you are interested: uvanlp.org/valda/ @stephschoch
2 replies · 5 reposts · 43 likes · 7.7K views
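The tweet above announces Valda; rather than guess at its actual API, here is a from-scratch sketch of one of the listed methods, TMC-Shapley (truncated Monte Carlo data Shapley). The `utility` callback is a hypothetical stand-in for retraining a model on a subset and scoring it on a validation set:

```python
import random

def tmc_shapley(points, utility, n_perms=200, tolerance=1e-4, seed=0):
    """Truncated Monte Carlo estimate of data Shapley values.

    points  : list of data-point identifiers
    utility : callable mapping a subset of points -> performance score
    """
    rng = random.Random(seed)
    values = {p: 0.0 for p in points}
    full_score = utility(points)
    for t in range(1, n_perms + 1):
        perm = points[:]
        rng.shuffle(perm)
        prev, subset = utility([]), []
        for p in perm:
            # Truncation: once the subset performs ~as well as the full
            # data, remaining marginal contributions are treated as 0.
            if abs(full_score - prev) < tolerance:
                marginal = 0.0
            else:
                subset.append(p)
                score = utility(subset)
                marginal, prev = score - prev, score
            # Incremental mean over sampled permutations.
            values[p] += (marginal - values[p]) / t
    return values

# Toy usage: with an additive utility, each point's Shapley value
# equals its own contribution exactly.
weights = {"a": 1.0, "b": 2.0, "c": 3.0}
vals = tmc_shapley(list(weights), lambda s: sum(weights[p] for p in s))
print(vals)  # {'a': 1.0, 'b': 2.0, 'c': 3.0}
```

The truncation step is what makes the Monte Carlo estimate tractable: in real use, each `utility` call is a model retraining, so skipping near-zero marginals late in each permutation saves most of the compute.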
Stephanie Schoch reposted
Yangfeng Ji @yangfeng_ji
Our work on class-wise Shapley values for data valuation has been accepted to #NeurIPS2022! Congratulations to my student @stephschoch and collaborator @haifengxu0! See you in New Orleans!
1 reply · 4 reposts · 35 likes
Stephanie Schoch reposted
siggen_acl @siggen_acl
INLG 2022 will be held 18-22 July at Colby College (Waterville, Maine, USA)! Calls for papers, workshops, etc. are available at inlgmeeting.github.io/calls.html
0 replies · 19 reposts · 32 likes
Stephanie Schoch reposted
UVA ILP @UVA_ILP
UVA ILP Lab Group Photo: Fall 2021
[image]
2 replies · 2 reposts · 38 likes
Stephanie Schoch reposted
INLG 2026 @inlgmeeting
The commendation for outstanding position paper goes to "Underreporting of errors in NLG output, and what to do about it" by van Miltenburg, Clinciu, Dušek, Gkatzia, Inglis, Leppänen, Mahamood, Manning, Schoch, Thomson, & Wen
1 reply · 4 reposts · 16 likes