DatologyAI
@datologyai
183 posts
DatologyAI builds tools to automatically select and optimize the best data on which to train AI models, leading to better, smaller models which train faster.

Redwood City, CA · Joined September 2023
12 Following · 2.9K Followers
DatologyAI retweeted
Pratyush Maini @pratyushmaini
If I had to compress my PhD into one idea, it is this: "The data a model sees early in training leaves an imprint on its representations that is very hard to undo later." This thread runs through Rephrasing the Web, Safety Pretraining, and TOFU. This is the Finetuner's Fallacy 🧵
21 replies · 55 reposts · 728 likes · 54.8K views
DatologyAI @datologyai
New Datology Research: we expose "The Finetuner's Fallacy." The standard approach to domain adaptation (pretrain on web data, finetune on your data) is leaving performance on the table. Mixing just 1-5% domain data into pretraining, then finetuning, produces a strictly better model:
◾ 1.75x fewer tokens to reach the same domain loss
◾ a 1B SPT model outperforms a 3B finetuned-only model
◾ +6pts MATH accuracy at 200B pretraining tokens
◾ less forgetting of general knowledge
Tested across chemistry, symbolic music, and formal math proofs; SPT wins on every metric. Led by @_christinabaek and @pratyushmaini, with the full Datology team. (A toy sketch of the mixing step follows below.)
[image]
4 replies · 28 reposts · 228 likes · 53.5K views
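The recipe above (mix a small fraction of domain data into pretraining, then finetune) is easy to prototype. A minimal sketch, assuming two document lists and a fixed 2% mix rate; the corpora, the rate, and the sampler are illustrative, not Datology's actual pipeline:

```python
import random

def mixed_stream(web_docs, domain_docs, domain_frac=0.02, seed=0):
    """Yield pretraining documents, drawing from the domain corpus with
    probability `domain_frac` and from general web data otherwise.

    A toy version of the 1-5% domain-mixing recipe described above;
    a real pipeline would operate on tokenized shards, not Python lists.
    """
    rng = random.Random(seed)
    while True:
        if rng.random() < domain_frac:
            yield rng.choice(domain_docs)  # in-domain (e.g., formal proofs)
        else:
            yield rng.choice(web_docs)     # general web text

# Roughly 2 of every 100 sampled documents come from the domain corpus.
stream = mixed_stream(["web doc A", "web doc B"], ["proof X"], domain_frac=0.02)
sample = [next(stream) for _ in range(100)]
print(sum(doc == "proof X" for doc in sample))
```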
DatologyAI retweeted
Matthew Leavitt @leavittron
Two nursing home residents are eating lunch. One says, "Boy, the food at this place is terrible." The other says, "Yeah, I know, and such small portions, too."
This is the multilingual data problem: the data is bad, AND there's not enough of it. Yesterday at @datologyai we released ÜberWeb, our study of multilingual curation that gets 4-10x training-FLOPs improvements on multilingual benchmarks compared to strong public baselines like Qwen3-1.7B and Tiny Aya Base.
[image]
1 reply · 9 reposts · 38 likes · 3.5K views
DatologyAI @datologyai
New research! ÜberWeb: multilingual data curation across 13 languages and 20 trillion tokens. The "curse of multilinguality" is largely a data quality problem, and it's fixable. tl;dr: we get 4-10x training-efficiency improvements over models like Qwen3 and Tiny Aya. (A toy per-language filtering sketch follows below.)
[image]
4 replies · 12 reposts · 81 likes · 11.2K views
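The posts don't spell out ÜberWeb's method, so the sketch below only illustrates the general shape of per-language curation: score documents for quality and apply per-language thresholds instead of one global cutoff. The `Doc` schema, the scores, and the thresholds are all assumptions for illustration:

```python
from dataclasses import dataclass

@dataclass
class Doc:
    text: str
    lang: str        # e.g. "de", "hi"; assumed output of a language-ID model
    quality: float   # assumed score in [0, 1] from some quality classifier

def curate(docs, thresholds, default_threshold=0.5):
    """Keep documents whose quality clears a per-language bar.

    A toy illustration of treating multilingual quality as a
    per-language curation problem: each language gets its own
    threshold rather than one global cutoff. Not ÜberWeb's actual
    method, just the shape of the idea.
    """
    return [d for d in docs
            if d.quality >= thresholds.get(d.lang, default_threshold)]

corpus = [Doc("guter Text", "de", 0.9), Doc("spam spam", "de", 0.1)]
print(len(curate(corpus, thresholds={"de": 0.6})))  # -> 1
```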
DatologyAI @datologyai
This curation approach also powers @arcee_ai's Trinity Large Base, which shows exceptionally strong multilingual performance relative to its compute budget.
[image]
1 reply · 0 reposts · 2 likes · 177 views
DatologyAI @datologyai
Proud to power @arcee_ai's Trinity Large: a 400B/A13B MoE model trained on 17T tokens of our curated data (including 8T synthetic), delivering true frontier-level performance with just 13B active parameters. Incredible collaboration with @PrimeIntellect and @arcee_ai! The most exciting part? This is just a preview model. We can't wait to see how strong it is after full post-training!
Arcee.ai @arcee_ai
Today, we’re releasing the first weights from Trinity Large, our first frontier-scale model in the Trinity MoE family.
0 replies · 3 reposts · 24 likes · 2.2K views
DatologyAI @datologyai
Our team at DatologyAI is excited to launch Data & Dialogue, a new executive dinner series bringing together leaders for an intimate evening of great food, candid conversation, and fresh perspectives on data and AI. Join us on February 26 at 5:30 PM for our first gathering and a relaxed discussion with peers. Join the waitlist here: lnkd.in/gVDKHWVZ
[image]
0 replies · 3 reposts · 18 likes · 4.5K views
DatologyAI retweeted
Ari Morcos @arimorcos
Great data curation isn't just for training! We @datologyai just released DatBench, a refined VLM eval suite with a simple motivation: VLM evals are broken. They're noisy, often measure the wrong thing, and expensive, often consuming ~20% of training compute. No longer!
[image]
5 replies · 17 reposts · 122 likes · 14.3K views
DatologyAI @datologyai
Benchmarks can make VLMs look smarter than they are. DatBench shows why: MCQ formats invite guessing, many items are solvable without vision (yes, "blind-solvable"), and label noise/ambiguity can blur real progress. We curate evals to be faithful + discriminative, and cut eval cost by 13× by selecting the highest-signal examples (up to 50×).
Blog: datologyai.com/blog/datbench-…
HuggingFace: huggingface.co/datasets/Datol…
GitHub: github.com/datologyai/Dat…
Huge shoutout to the team for continuing to raise the bar on what rigorous VLM evaluation looks like! Want to work on cool stuff like this full-time? We're hiring: datologyai.com/careers#open-p… (A toy blind-solvability check follows below.)
Siddharth Joshi @sjoshi804
Presenting DatBench (arxiv.org/abs/2601.02316…): a VLM eval suite that isn't plagued by the usual data-quality issues and is statistically guaranteed to maximally discriminate among models. Check out more details in 🧵👇! (ofc every release needs a milkshake meme)
0 replies · 0 reposts · 7 likes · 783 views
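One failure mode named above, "blind-solvable" items, lends itself to a generic check: answer each question with the image withheld, and flag items a text-only model still gets right. A minimal sketch; the item schema and the `text_only_answer` callable are assumptions for illustration, not DatBench's actual API:

```python
def flag_blind_solvable(items, text_only_answer, n_trials=3):
    """Return eval items answerable *without* the image.

    `items`: dicts with "question", "choices", "answer" keys (an assumed
    schema). `text_only_answer`: any callable mapping (question, choices)
    to a predicted choice, e.g. an LLM prompted with the text alone.
    Items answered correctly in every blind trial measure text priors
    rather than visual understanding, so they're flagged for removal.
    """
    return [
        item for item in items
        if all(
            text_only_answer(item["question"], item["choices"]) == item["answer"]
            for _ in range(n_trials)
        )
    ]

# Toy run: a "model" that always picks the first choice.
items = [{"question": "What color is the sky in the image?",
          "choices": ["blue", "green"], "answer": "blue"}]
print(flag_blind_solvable(items, lambda q, c: c[0]))  # item is flagged
```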
DatologyAI retweeted
Alexander Doria @Dorialexander
A new @datologyai release addressing a major issue for synthetic pipelines: fast, accurate text retrieval AT SCALE. (A generic retrieval sketch follows below.)
[image]
0 replies · 7 reposts · 81 likes · 8.4K views
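The post gives no detail on how the release works, so the following is only a generic sketch of the underlying task: nearest-neighbor text retrieval over embeddings with FAISS. The random vectors stand in for a real text-embedding model, and at corpus scale one would swap the exact index for an approximate one (e.g., IVF or HNSW):

```python
import numpy as np
import faiss  # pip install faiss-cpu

# Toy corpus embeddings; a real pipeline would embed actual documents.
d, n = 384, 10_000
rng = np.random.default_rng(0)
corpus = rng.standard_normal((n, d)).astype("float32")
faiss.normalize_L2(corpus)            # unit vectors: inner product == cosine

index = faiss.IndexFlatIP(d)          # exact search; use IndexIVFFlat/IndexHNSWFlat at scale
index.add(corpus)

query = rng.standard_normal((1, d)).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 5)  # top-5 most similar documents
print(ids[0], scores[0])
```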