Icea
@Icea75
@hubxyz team | AI Filmmaking | Raw milk and disruptive tech

Are we running out of data to train language models? State-of-the-art LLMs are trained on datasets of tens of trillions of words, and dataset sizes are growing 2-3x per year. Our new ICML paper estimates when we might exhaust all the text data on the internet. 1/12
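The arithmetic behind that question is a simple exponential extrapolation: if the stock of usable text is fixed and training sets keep growing at a steady multiple, you can solve for the crossover year. A minimal Python sketch; the stock, current-usage, and growth figures below are illustrative assumptions, not the paper's estimates:

```python
import math

# Illustrative assumptions (not the paper's figures):
stock_tokens = 300e12    # assumed total stock of public human text (~300T tokens)
current_usage = 15e12    # assumed size of today's largest training sets (~15T tokens)
growth_per_year = 2.5    # midpoint of the 2-3x annual growth mentioned in the tweet

# Training-set size after t years: current_usage * growth_per_year ** t.
# Exhaustion when that equals the stock, so t = log(stock / usage) / log(growth).
years = math.log(stock_tokens / current_usage) / math.log(growth_per_year)
print(f"At these rates, public text is exhausted in ~{years:.1f} years")  # ~3.3
```

Under these toy numbers the runway is only a few years; the paper's actual answer depends on its measured stock and growth rates.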

Larry Ellison ($ORCL) highlighted something critical: models like ChatGPT, Gemini, Grok, and Llama are all trained on largely the same public internet data. When everyone trains on the same information, models inevitably converge, which is why AI is moving toward commoditization. The real moat isn't the model itself; it's the proprietary data behind it. Companies that can train on exclusive datasets gain an advantage competitors can't replicate, and data no one else has is what lets you dominate your market.

Anthropic CEO Dario Amodei: AI progress isn't magic, it's just compute, data, and training. "All the cleverness, all the techniques, all of the 'we need a new method,' doesn't matter very much. There are only a few things that matter, and I listed seven of them. One is how much raw compute you have. Two is the quantity of data you have. Three is the quality and distribution of the data, meaning it needs to be broad. Four is how long you train for. Five is an objective function that can scale to the moon. Pre-training is one such objective; another is an RL-style objective: you have a goal and you go reach it. That includes more objective rewards (like math and coding) and more subjective rewards (like RL from human feedback, or higher-order versions of that). Six and seven are about normalization and conditioning: keeping numerical stability so the huge blob of compute flows cleanly instead of blowing up or getting weird." From the 'Dwarkesh Patel' YouTube channel (link in comment)
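Points five through seven are concrete enough to sketch. Below is a minimal PyTorch toy (my own illustration, not anything from the interview): a next-token cross-entropy loss as the scalable pre-training objective, with LayerNorm and gradient clipping standing in for the normalization/conditioning that keeps training numerically stable:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyLM(nn.Module):
    """Toy stand-in for a language model, just enough to show the objective."""
    def __init__(self, vocab_size=256, d_model=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.ff = nn.Linear(d_model, d_model)
        self.norm = nn.LayerNorm(d_model)   # conditioning: keep activations well-scaled
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):
        h = self.embed(tokens)
        h = self.norm(F.gelu(self.ff(h)))   # normalize so the compute "flows cleanly"
        return self.head(h)

model = TinyLM()
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

tokens = torch.randint(0, 256, (8, 33))      # fake batch: 8 sequences, 33 tokens each
inputs, targets = tokens[:, :-1], tokens[:, 1:]

# The pre-training objective: predict the next token at every position.
logits = model(inputs)
loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # guard against blow-ups
opt.step()
```

An RL-style objective would swap the cross-entropy for a reward signal (a verifiable check for math/coding, or a learned preference model for RLHF), but the compute, data, and stability concerns stay the same.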