Manan Dey retweeted

Thrilled our global data ecosystem audit was accepted to #ICLR2025!
Empirically, we find:
1⃣ Soaring synthetic text data: ~10M tokens (pre-2018) to 100B+ (2024).
2⃣ YouTube is now 70%+ of speech/video data but could block third-party collection.
3⃣ <0.2% of data from Africa/South America.
1/
