DatologyAI
@datologyai
183 posts
DatologyAI builds tools to automatically select and optimize the best data on which to train AI models, leading to better, smaller models which train faster.

Redwood City, CA · Joined September 2023
12 Following · 2.9K Followers
DatologyAI retweeted
Pratyush Maini @pratyushmaini
If I had to compress my PhD into one idea, it is this: "The data a model sees early in training leaves an imprint on its representations that is very hard to undo later." This thread runs through Rephrasing the Web, Safety Pretraining, and TOFU. This is the Finetuner's Fallacy 🧵
21 replies · 55 reposts · 728 likes · 54.8K views
DatologyAI @datologyai
New Datology Research: we expose "The Finetuner's Fallacy." The standard approach to domain adaptation (pretrain on web data, finetune on your data) is leaving performance on the table. Mixing just 1-5% domain data into pretraining, then finetuning, produces a strictly better model:
◾ 1.75x fewer tokens to reach the same domain loss
◾ a 1B SPT model outperforms a 3B finetuned-only model
◾ +6pts MATH accuracy at 200B pretraining tokens
◾ less forgetting of general knowledge
Tested across chemistry, symbolic music, and formal math proofs; SPT wins on every metric. Led by @_christinabaek and @pratyushmaini, with the full Datology team. (A toy sketch of the mixing step follows below.)
[image]
4 replies · 28 reposts · 228 likes · 53.5K views
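The recipe above (mix a small fraction of domain data into pretraining, then finetune) is easy to prototype. A minimal sketch, assuming two document lists and a fixed 2% mix rate; the corpora, the rate, and the sampler are illustrative, not Datology's actual pipeline:

```python
import random

def mixed_stream(web_docs, domain_docs, domain_frac=0.02, seed=0):
    """Yield pretraining documents, drawing from the domain corpus with
    probability `domain_frac` and from general web data otherwise.

    A toy version of the 1-5% domain-mixing recipe described above;
    a real pipeline would operate on tokenized shards, not Python lists.
    """
    rng = random.Random(seed)
    while True:
        if rng.random() < domain_frac:
            yield rng.choice(domain_docs)  # in-domain (e.g., formal proofs)
        else:
            yield rng.choice(web_docs)     # general web text

# Roughly 2 of every 100 sampled documents come from the domain corpus.
stream = mixed_stream(["web doc A", "web doc B"], ["proof X"], domain_frac=0.02)
sample = [next(stream) for _ in range(100)]
print(sum(doc == "proof X" for doc in sample))
```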
DatologyAI retweeted
Matthew Leavitt @leavittron
Two nursing home residents are eating lunch. One says, "Boy, the food at this place is terrible." The other says, "Yeah, I know, and such small portions, too."
This is the multilingual data problem: the data is bad, AND there's not enough of it. Yesterday at @datologyai we released ÜberWeb, our study of multilingual curation that gets 4-10x training-FLOPs improvements on multilingual benchmarks compared to strong public baselines like Qwen3-1.7B and Tiny Aya Base.
[image]
1 reply · 9 reposts · 38 likes · 3.5K views
DatologyAI @datologyai
New research! ÜberWeb: multilingual data curation across 13 languages and 20 trillion tokens. The "curse of multilinguality" is largely a data quality problem, and it's fixable. tl;dr: we get 4-10x training-efficiency improvements over models like Qwen3 and Tiny Aya. (A toy per-language filtering sketch follows below.)
[image]
4 replies · 12 reposts · 81 likes · 11.2K views
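The posts don't spell out ÜberWeb's method, so the sketch below only illustrates the general shape of per-language curation: score documents for quality and apply per-language thresholds instead of one global cutoff. The `Doc` schema, the scores, and the thresholds are all assumptions for illustration:

```python
from dataclasses import dataclass

@dataclass
class Doc:
    text: str
    lang: str        # e.g. "de", "hi"; assumed output of a language-ID model
    quality: float   # assumed score in [0, 1] from some quality classifier

def curate(docs, thresholds, default_threshold=0.5):
    """Keep documents whose quality clears a per-language bar.

    A toy illustration of treating multilingual quality as a
    per-language curation problem: each language gets its own
    threshold rather than one global cutoff. Not ÜberWeb's actual
    method, just the shape of the idea.
    """
    return [d for d in docs
            if d.quality >= thresholds.get(d.lang, default_threshold)]

corpus = [Doc("guter Text", "de", 0.9), Doc("spam spam", "de", 0.1)]
print(len(curate(corpus, thresholds={"de": 0.6})))  # -> 1
```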
DatologyAI @datologyai
This curation approach also powers @arcee_ai's Trinity Large Base, which shows exceptionally strong multilingual performance relative to its compute budget.
[image]
1 reply · 0 reposts · 2 likes · 177 views
DatologyAI @datologyai
Proud to power @arcee_ai's Trinity Large: a 400B/A13B MoE model trained on 17T tokens of our curated data (including 8T synthetic), delivering true frontier-level performance with just 13B active parameters. Incredible collaboration with @PrimeIntellect and @arcee_ai! The most exciting part? This is just a preview model. We can't wait to see how strong it is after full post-training!
Arcee.ai @arcee_ai
Today, we’re releasing the first weights from Trinity Large, our first frontier-scale model in the Trinity MoE family.
0 replies · 3 reposts · 24 likes · 2.2K views
DatologyAI @datologyai
Our team at DatologyAI is excited to launch Data & Dialogue, a new executive dinner series bringing together leaders for an intimate evening of great food, candid conversation, and fresh perspectives on data and AI. Join us on February 26 at 5:30 PM for our first gathering and a relaxed discussion with peers. Join the waitlist here: lnkd.in/gVDKHWVZ
[image]
0 replies · 3 reposts · 18 likes · 4.5K views
DatologyAI retweeted
Ari Morcos @arimorcos
Great data curation isn't just for training! We @datologyai just released DatBench, a refined VLM eval suite with a simple motivation: VLM evals are broken. They're noisy, often measure the wrong thing, and expensive, often consuming ~20% of training compute. No longer!
[image]
5 replies · 17 reposts · 122 likes · 14.3K views
DatologyAI @datologyai
Benchmarks can make VLMs look smarter than they are. DatBench shows why: MCQ formats invite guessing, many items are solvable without vision (yes, "blind-solvable"), and label noise/ambiguity can blur real progress. We curate evals to be faithful + discriminative, and cut eval cost by 13× by selecting the highest-signal examples (up to 50×).
Blog: datologyai.com/blog/datbench-…
HuggingFace: huggingface.co/datasets/Datol…
GitHub: github.com/datologyai/Dat…
Huge shoutout to the team for continuing to raise the bar on what rigorous VLM evaluation looks like! Want to work on cool stuff like this full-time? We're hiring: datologyai.com/careers#open-p… (A toy blind-solvability check follows below.)
Siddharth Joshi @sjoshi804
Presenting DatBench (arxiv.org/abs/2601.02316…): a VLM eval suite that isn't plagued by the usual data-quality issues and is statistically guaranteed to maximally discriminate among models. Check out more details in 🧵👇! (ofc every release needs a milkshake meme)
0 replies · 0 reposts · 7 likes · 783 views
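One failure mode named above, "blind-solvable" items, lends itself to a generic check: answer each question with the image withheld, and flag items a text-only model still gets right. A minimal sketch; the item schema and the `text_only_answer` callable are assumptions for illustration, not DatBench's actual API:

```python
def flag_blind_solvable(items, text_only_answer, n_trials=3):
    """Return eval items answerable *without* the image.

    `items`: dicts with "question", "choices", "answer" keys (an assumed
    schema). `text_only_answer`: any callable mapping (question, choices)
    to a predicted choice, e.g. an LLM prompted with the text alone.
    Items answered correctly in every blind trial measure text priors
    rather than visual understanding, so they're flagged for removal.
    """
    return [
        item for item in items
        if all(
            text_only_answer(item["question"], item["choices"]) == item["answer"]
            for _ in range(n_trials)
        )
    ]

# Toy run: a "model" that always picks the first choice.
items = [{"question": "What color is the sky in the image?",
          "choices": ["blue", "green"], "answer": "blue"}]
print(flag_blind_solvable(items, lambda q, c: c[0]))  # item is flagged
```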
DatologyAI retweeted
Alexander Doria @Dorialexander
A new @datologyai release addressing a major issue for synthetic pipelines: fast, accurate text retrieval AT SCALE. (A generic retrieval sketch follows below.)
[image]
0 replies · 7 reposts · 81 likes · 8.4K views
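The post gives no detail on how the release works, so the following is only a generic sketch of the underlying task: nearest-neighbor text retrieval over embeddings with FAISS. The random vectors stand in for a real text-embedding model, and at corpus scale one would swap the exact index for an approximate one (e.g., IVF or HNSW):

```python
import numpy as np
import faiss  # pip install faiss-cpu

# Toy corpus embeddings; a real pipeline would embed actual documents.
d, n = 384, 10_000
rng = np.random.default_rng(0)
corpus = rng.standard_normal((n, d)).astype("float32")
faiss.normalize_L2(corpus)            # unit vectors: inner product == cosine

index = faiss.IndexFlatIP(d)          # exact search; use IndexIVFFlat/IndexHNSWFlat at scale
index.add(corpus)

query = rng.standard_normal((1, d)).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 5)  # top-5 most similar documents
print(ids[0], scores[0])
```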