Anshuman Suri

Ricardo Monti@RicardoMonti9·5d

@pratyushmaini This meme has @iamgroot42 written all over it (it is excellent)

Christina Baek@_christinabaek

Pratyush Maini@pratyushmaini·5d

Models are typically specialized to new domains by finetuning on small, high-quality datasets. We find that repeating the same dataset 10–50× starting from pretraining leads to substantially better downstream performance, in some cases outperforming larger models. 🧵

New Datology Research: We expose "The Finetuner's Fallacy".
The standard approach to domain adaptation (pretrain on web data, finetune on your data) is leaving performance on the table. Mixing just 1-5% domain data into pretraining, then finetuning, produces a strictly better model:
◾ 1.75x fewer tokens to reach the same domain loss
◾ 1B SPT model outperforms a 3B finetuned-only model
◾ +6pts MATH accuracy at 200B pretraining tokens
◾ Less forgetting of general knowledge
Tested across chemistry, symbolic music, and formal math proofs. SPT wins on every metric. Led by @_christinabaek and @pratyushmaini, with the full Datology team.

Anshuman Suri@iamgroot42·5d

DatologyAI@datologyai

Anshuman Suri@iamgroot42·17 Mar

@Utd_Y14 My guess is that they decouple Paul's own role from 'Children of Dune' and somehow merge it with hers, and also looking at Paul's teeth in one of the posters, merge the Worm's role into what Paul will be in this one (given they're pitching it as a "conclusion")?

Bobby Samuels@BobbySamuels

Y?@Utd_Y14·17 Mar

Compelled to see how they handle Chani, given this is Messiah they’re adapting. The whole relationship between Paul and Chani there is based on Chani being a plank of wood who is fine with being Paul’s concubine, in contrast to the films, where she’s got a sense of urgency and actively holds Paul accountable. Plus she dips at the end of Part Two.

Film Updates@FilmUpdates

Zendaya in the new poster for ‘DUNE: PART THREE’ In theaters December 18.

Anshuman Suri retweeted

Ari Morcos@arimorcos·12 Mar

Love seeing the timeline wake up to the fact that data is the most underinvested area in ML. But let’s set the record straight: the world’s premier data research company isn't hypothetical. It already exists. It’s called @datologyai, and we’ve been building it for 2.5 years. 🧵

x.com/i/article/2030…

Anshuman Suri@iamgroot42·11 Mar

@sarahookr 👀 it's also pretty cheap to deanonymize models, even for vision-based models (arxiv.org/pdf/2601.09647). Infinite money? 📈📈📈

@RicardoMonti9 @chhaviyadav_ @datologyai @KaleighMentzer @agcrnz

Sara Hooker@sarahookr·11 Mar

Woah. Just saw Kalshi settles bets based on lmarena. Do they realize how prone to manipulation lmarena results are? Like, if I placed a big enough bet, I could just pay annotators to skew the market towards the model I wanted to win.

Anshuman Suri@iamgroot42·22 Feb

GIF

Ricardo Monti@RicardoMonti9·22 Feb

@chhaviyadav_ @iamgroot42 @datologyai @KaleighMentzer @agcrnz .@iamgroot42 has only been with @datologyai for a few weeks but is already crushing 🚀🚀🚀

Anshuman Suri@iamgroot42·20 Feb

memes apart, another banger by @datologyai, pushing yet another pareto frontier (this time for multilingual curation)! 📈 read more at datologyai.com/blog/berweb-in… for a whole lot of knowledge nuggets, led by @RicardoMonti9 @KaleighMentzer @agcrnz 💪

1/ People often think better multilingual models must come at the cost of English performance. Not true. The constraint isn’t capacity, it’s data quality, and we can fix it. Today @datologyAI shares ÜberWeb: a year of multilingual curation lessons, scaled to 20T+ tokens.

Anshuman Suri@iamgroot42·19 Feb

10/ We are using these insights to empower our customers. This is the curation strategy used by @arcee to train the Trinity Large family of models. Chat with them here: chat.arcee.ai. They are natively multilingual!

Anshuman Suri@iamgroot42·19 Feb

@RicardoMonti9

Ricardo Monti@RicardoMonti9·19 Feb

I may have to tweet this weekly.. @iamgroot42 truly a generational talent

this week I have observed first hand the elite meme game possessed by @iamgroot42 .. truly a generational talent

Anshuman Suri@iamgroot42·19 Feb

@datologyai looking at multilingual data

1/ People often think better multilingual models must come at the cost of English performance. Not true. The constraint isn’t capacity, it’s data quality, and we can fix it. Today @datologyAI shares ÜberWeb: a year of multilingual curation lessons, scaled to 20T+ tokens.

@Matthewagi @RicardoMonti9 @eliebakouch @datologyai

Anshuman Suri@iamgroot42·19 Feb

Matt@Matthewagi·18 Feb

@RicardoMonti9 @eliebakouch @datologyai Mfw it's data again

Ricardo Monti@RicardoMonti9·18 Feb

1/ People often think better multilingual models must come at the cost of English performance. Not true. The constraint isn’t capacity, it’s data quality, and we can fix it. Today @datologyAI shares ÜberWeb: a year of multilingual curation lessons, scaled to 20T+ tokens.

Anshuman Suri@iamgroot42·19 Feb

@datologyai

DatologyAI@datologyai·18 Feb

New research! ÜberWeb: multilingual data curation across 13 languages and 20 trillion tokens. The "curse of multilinguality" is largely a data quality problem, and it's fixable. tl;dr: we get 4-10x training efficiency improvements over models like Qwen3 and Tiny Aya

Anshuman Suri retweeted

Kaleigh Mentzer@KaleighMentzer·18 Feb

🌎Making your model multilingual doesn't have to sacrifice English performance—you just need better data. @agcrnz, @RicardoMonti9, and I have been working on curating the best possible multilingual data with the team @datologyai, and it works! Check out the results 👇

1/ People often think better multilingual models must come at the cost of English performance. Not true. The constraint isn’t capacity, it’s data quality, and we can fix it. Today @datologyAI shares ÜberWeb: a year of multilingual curation lessons, scaled to 20T+ tokens.

Anshuman Suri@iamgroot42·18 Feb

8/ Lesson 4: our results hold at frontier scale (10s of trillions of tokens!) We curated a 20T-token dataset (~8% multilingual) and trained 3B/8B models on a random 1T subset. We define a new Pareto frontier, in some cases matching baselines with 4–10× fewer training FLOPs.
