
Dataset curation for language models has long relied on brittle, hand-crafted rules. It's time for a more principled, automated approach.
Enter DataRater: a meta-learning framework that learns to value data based on downstream training efficiency. Great summary by Luisa below 👇
Luisa Zintgraf@luisa_zintgraf
Excited to share our new paper, "DataRater: Meta-Learned Dataset Curation"! We explore a fundamental question: How can we *automatically* learn which data is most valuable for training foundation models? Paper: arxiv.org/pdf/2505.17895 to appear @NeurIPSConf Thread 👇
English