Stephanie Schoch
@stephschoch

29 posts

PhD candidate working on NLP and data contribution estimation @CS_UVA. Member of @UVA_ILP.

Joined September 2019
187 Following · 69 Followers
Stephanie Schoch reposted
Andrew Lampinen @AndrewLampinen
How do language models generalize from information they learn in-context vs. via finetuning? We show that in-context learning can generalize more flexibly, illustrating key differences in the inductive biases of these modes of learning — and ways to improve finetuning. Thread: 1/
[image]
8 replies · 150 reposts · 763 likes · 102.4K views
Stephanie Schoch @stephschoch
Had a great time presenting this work at the NAACL 2025 Insights Workshop yesterday! We adapted a Monte Carlo sampling method to analyze the impact of the number of in-context examples. aclanthology.org/2025.insights-…
0 replies · 0 reposts · 1 like · 73 views
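The Monte Carlo analysis described in the tweet above can be sketched roughly as follows. This is not the paper's actual method; the function name, the `evaluate` callback, and the toy demonstration pool are all hypothetical stand-ins for real prompting and scoring:

```python
import random

def mc_icl_curve(pool, evaluate, ks, n_samples=100, seed=0):
    """Monte Carlo estimate of expected score as a function of the
    number of in-context examples k: for each k, average `evaluate`
    over random k-element subsets drawn from the demonstration pool."""
    rng = random.Random(seed)
    curve = {}
    for k in ks:
        total = 0.0
        for _ in range(n_samples):
            demos = rng.sample(pool, k)  # one random k-shot prompt
            total += evaluate(demos)
        curve[k] = total / n_samples
    return curve

# Toy usage: a fake evaluator whose score depends only on |demos|.
pool = list(range(10))
curve = mc_icl_curve(pool, lambda demos: float(len(demos)), ks=[0, 2, 4])
print(curve)  # {0: 0.0, 2: 2.0, 4: 4.0}
```

In a real study, `evaluate` would build a prompt from `demos`, query the model on a held-out test set, and return accuracy; the Monte Carlo averaging isolates the effect of k from the choice of which examples are sampled.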
Stephanie Schoch @stephschoch
I’ll be presenting our work “In-Context Learning (and Unlearning) of Length Biases” at NAACL 25 in Hall 3 from 11AM-12:30PM today. Looking forward to chatting about ICL with everyone!
[image]
0 replies · 0 reposts · 5 likes · 115 views
Stephanie Schoch reposted
Alon Albalak @AlbalakAlon
@cwolferesearch If you thought the information on data they release is interesting, you should check out our recent survey on data for LLMs. We include a TON more information about data processing, and most of the information Meta includes in the release isn't particularly new. twitter.com/AlbalakAlon/st…
Alon Albalak @AlbalakAlon

{UCSB|AI2|UW|Stanford|MIT|UofT|Vector|Contextual AI} present a survey on 🔎Data Selection for LLMs🔍. Training data is a closely guarded secret in industry 🤫; with this work we narrow the knowledge gap, advocating for open, responsible, collaborative progress. arxiv.org/abs/2402.16827

1 reply · 12 reposts · 55 likes · 11.6K views
Stephanie Schoch reposted
Rafael Rafailov @ NeurIPS @rm_rafailov
From the LLaMA 3 blog post: they use a combination of rejection sampling, DPO, and PPO for post-training. Really interested to know which tasks/parts of the process each algorithm benefits the most.
[image]
3 replies · 14 reposts · 118 likes · 71.7K views
Stephanie Schoch reposted
Cameron R. Wolfe, Ph.D. @cwolferesearch
LLaMA-3 is a prime example of why training a good LLM is almost entirely about data quality…

TL;DR: Meta released LLaMA-3-8B/70B today and 95% of the technical info we have so far is related to data quality:
- 15T tokens of pretraining data
- More code during pretraining (leads to better reasoning capabilities)
- More efficient tokenizer with larger vocabulary
- Super sophisticated (including LLM components) data quality filtering
- Extensive empirical analysis of data mixture
- Focus on quality filtering of post-training data (for SFT/RLHF/DPO)

All of the cool stuff in this report is related to how to curate data effectively for pre/post-training! This really shows that data curation/filtering is the most difficult and impactful aspect of training foundation models.

(1) Model architecture: Only 5 sentences are provided about the model architecture, which simply state that LLaMA-3 uses a standard decoder-only architecture with grouped query attention to improve inference efficiency (and a longer 8K context). It’s pretty clear that model architectures are becoming standardized, and most of the research focus is going into constructing datasets. In fact, the main architecture modification made by LLaMA-3 is a more efficient tokenizer!

“Llama 3 uses a tokenizer with a vocabulary of 128K tokens that encodes language much more efficiently, which leads to substantially improved model performance.” - from LLaMA-3 blog

(2) Better tokenizer: LLaMA-3 comes with a custom tokenizer with a vocabulary of 128K tokens (LLaMA-2 had a vocabulary of 32K tokens). This tokenizer is more token efficient (i.e., fewer tokens are necessary to encode the same piece of text relative to LLaMA-2), which makes inference more efficient. The authors also note that the new tokenizer improves performance! In other words, making sure that we are encoding the model’s input data correctly is super important.

(3) Massive pretraining corpus: LLaMA-3 is pretrained over 15T tokens of text (5% non-English), which is a 7X improvement over LLaMA-2 and even larger than the 12T pretraining corpus of DBRX. The pretraining corpus also has 4X more code relative to LLaMA-2 (this was a big criticism of LLaMA-2). With this in mind, it’s not a surprise that LLaMA-3 has strong reasoning/code capabilities—several papers have correlated pretraining on code with better downstream reasoning in LLMs.

“We found that previous generations of Llama are surprisingly good at identifying high-quality data, hence we used Llama 2 to generate the training data for the text-quality classifiers that are powering Llama 3.” - from LLaMA-3 blog

(4) Filtering pretraining data: Few concrete details are provided on the filtering process for the pretraining corpus of LLaMA-3, but it’s clear that a lot of filtering is done. These filters include heuristic filters, NSFW filters, semantic deduplication, and text classifiers to predict data quality. Plus, the authors note that LLaMA-2 is very good at detecting text quality, so they use these models in the filtering process (see above). The authors also mention that they do extensive empirical analysis to figure out the correct data mixture (DBRX also mentions this is hugely important).

(5) Overtraining: Chinchilla proposed the compute-optimal training regime for LLMs, but recent work indicates that pretty much everyone overtrains their LLMs relative to the compute-optimal ratio. LLaMA-3 is pretrained on two orders of magnitude more data (for the 8B model) beyond the compute-optimal ratio, and we still see log-linear improvements. Sure, we could train a larger model on fewer tokens and achieve similar performance while spending less on training compute. But this doesn’t consider inference costs! We will almost always pay for more training compute if it means we can deploy a smaller model with the same performance.

“The quality of the prompts that are used in SFT and the preference rankings that are used in PPO and DPO has an outsized influence on the performance of aligned models.” - from LLaMA-3 blog

(6) Post-training data quality: Even beyond pretraining, data quality is pivotal for LLaMA-3! The model is aligned with a combination of SFT, rejection sampling, PPO, and DPO. During alignment, the authors claim that the quality of supervised/preference data is super important. In fact, the biggest quality improvements in LLaMA-3 came from curating this data and performing multiple rounds of quality assurance on human annotations!
[image]
21 replies · 209 reposts · 874 likes · 105.9K views
Stephanie Schoch reposted
Jason Stock @itsstock
Chat with MLX 🚀: a high-performance macOS app linking your local docs to a custom large language model (LLM) on your machine 🧵 Now open-source in beta! github.com/mlx-chat/mlx-c… Collaboratively built by @itsstock & @parkersmith
[image]
4 replies · 18 reposts · 99 likes · 12.6K views
Stephanie Schoch reposted
Matthew Berman @MatthewBerman
OpenAI just dropped their Prompt Engineering guide. Here are 6 strategies they recommend for getting better results from LLMs:
67 replies · 591 reposts · 5.3K likes · 2M views
Stephanie Schoch reposted
Alon Jacovi @alon_jacovi
Worried about test data being used in training? The LLM world is going through a data contamination crisis. Here's us trying to do something about it: Paper: arxiv.org/abs/2305.10160 Blog: medium.com/@alonjacovi/st… w/ @clu_avi @omerNLP @yoavgo
[image]
7 replies · 69 reposts · 257 likes · 48.3K views
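One simple way to probe for the contamination the tweet above worries about is an n-gram overlap check between a training corpus and a test set. This is far weaker than what the linked paper proposes (it argues for protecting test data itself, e.g. not uploading it in plain text); the function names and toy data below are illustrative only:

```python
def ngrams(text, n):
    """Set of whitespace-token n-grams in `text`."""
    toks = text.split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def contamination_rate(train_docs, test_docs, n=8):
    """Fraction of test docs sharing at least one n-gram with training data."""
    train_grams = set()
    for doc in train_docs:
        train_grams |= ngrams(doc, n)
    hits = sum(1 for doc in test_docs if ngrams(doc, n) & train_grams)
    return hits / len(test_docs)

# Toy example with trigrams: one of two test docs leaks from training.
train = ["the quick brown fox jumps over the lazy dog"]
test = ["we saw the quick brown fox yesterday",
        "a totally unrelated sentence here"]
print(contamination_rate(train, test, n=3))  # 0.5
```

Real contamination audits use longer n-grams, normalization, and near-duplicate detection; exact n-gram matching only catches verbatim leakage.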
Stephanie Schoch reposted
Yangfeng Ji @yangfeng_ji
Our group released Valda, a Python package for data valuation in machine learning. It supports five methods (LOO, Influence Function, TMC-Shapley, Beta-Shapley, and CS-Shapley) via a unified API. Please try it out if you are interested: uvanlp.org/valda/ @stephschoch
2 replies · 5 reposts · 43 likes · 7.7K views
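The tweet above announces Valda; rather than guess at its actual API, here is a from-scratch sketch of one of the listed methods, TMC-Shapley (truncated Monte Carlo data Shapley). The `utility` callback is a hypothetical stand-in for retraining a model on a subset and scoring it on a validation set:

```python
import random

def tmc_shapley(points, utility, n_perms=200, tolerance=1e-4, seed=0):
    """Truncated Monte Carlo estimate of data Shapley values.

    points  : list of data-point identifiers
    utility : callable mapping a subset of points -> performance score
    """
    rng = random.Random(seed)
    values = {p: 0.0 for p in points}
    full_score = utility(points)
    for t in range(1, n_perms + 1):
        perm = points[:]
        rng.shuffle(perm)
        prev, subset = utility([]), []
        for p in perm:
            # Truncation: once the subset performs ~as well as the full
            # data, remaining marginal contributions are treated as 0.
            if abs(full_score - prev) < tolerance:
                marginal = 0.0
            else:
                subset.append(p)
                score = utility(subset)
                marginal, prev = score - prev, score
            # Incremental mean over sampled permutations.
            values[p] += (marginal - values[p]) / t
    return values

# Toy usage: with an additive utility, each point's Shapley value
# equals its own contribution exactly.
weights = {"a": 1.0, "b": 2.0, "c": 3.0}
vals = tmc_shapley(list(weights), lambda s: sum(weights[p] for p in s))
print(vals)  # {'a': 1.0, 'b': 2.0, 'c': 3.0}
```

The truncation step is what makes the Monte Carlo estimate tractable: in real use, each `utility` call is a model retraining, so skipping near-zero marginals late in each permutation saves most of the compute.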
Stephanie Schoch reposted
Yangfeng Ji @yangfeng_ji
Our work on class-wise Shapley values for data valuation has been accepted to #NeurIPS2022! Congratulations to my student @stephschoch and collaborator @haifengxu0! See you in New Orleans!
1 reply · 4 reposts · 35 likes
Stephanie Schoch reposted
siggen_acl @siggen_acl
INLG 2022 will be held 18-22 July at Colby College (Waterville, Maine, USA)! Calls for papers, workshops, etc. are available at inlgmeeting.github.io/calls.html
0 replies · 19 reposts · 32 likes
Stephanie Schoch reposted
UVA ILP @UVA_ILP
UVA ILP Lab Group Photo: Fall 2021
[image]
2 replies · 2 reposts · 38 likes
Stephanie Schoch reposted
INLG 2026 @inlgmeeting
The commendation for outstanding position paper goes to "Underreporting of errors in NLG output, and what to do about it" by van Miltenburg, Clinciu, Dušek, Gkatzia, Inglis, Leppänen, Mahamood, Manning, Schoch, Thomson, & Wen
1 reply · 4 reposts · 16 likes