Josh McGrath (@j_mcgraph) - Twitter Profile

Josh McGrath@j_mcgraph·4h

@BarrAlexandra the gummy bear depletion is an entire floor of endurance athletes making sure they don't bonk

English

123

Josh McGrath@j_mcgraph·12h

@code_star Renaissance Technologies for the modern man

English

Cody Blakeney@code_star·1d

@j_mcgraph Running a joke hedge fund for the bois

English

Cody Blakeney@code_star·1d

I used to immediately spend all my X money on drinks with friends (because I don’t make much and that’s fun). But now they offer 6% APY, so I have to be financially responsible with my “dumb jokes” money

English

1.5K

Josh McGrath@j_mcgraph·1d

@aidan_mclau Never underestimate how fast I can construct a normalcy field. I’m already mad when planes don’t have starlink

English

234

Aidan McLaughlin@aidan_mclau·1d

the most radical belief someone with short timelines can have is that things will go normally

English

422

16.7K

Josh McGrath@j_mcgraph·1d

I keep finding that the smarter the model gets, the more I need sites and richer interactions to consume what it’s telling me

jason@jxnlco

vibe coding a bunch of sites so i can learn how to play drums

English

507

Josh McGrath@j_mcgraph·1d

@antoniogm rally raid also included

English

133

Antonio García Martínez (agm.eth)@antoniogm·1d

My big contrarian take is that rally car racing is the last real form of motorsport.

maybestains2@maybestains2

Darling.....

English

415

47.2K

Josh McGrath@j_mcgraph·1d

Andrew Ambrosino@ajambrosino

can we have an "unhinged model/reasoning picker" faceoff? similar to the old worst volume control ui competition

ZXX

1.7K

Josh McGrath@j_mcgraph·1d

Vo2 Max? LOC max

swedishasian67@michellezfr

Wearing noise cancelling masks to talk to Claude is crazy

3.5K

Josh McGrath@j_mcgraph·1d

@Sauers_ @JayaGup10 I’m sorry

English

Sauers@Sauers_·1d

@j_mcgraph @JayaGup10 Me

Jaya Gupta@JayaGup10·2d

Friend is willing to exchange 6 H100s for 3 World Cup Final tickets Lmk

English

134

33.6K

Josh McGrath@j_mcgraph·2d

@ninklefitz Did u solo a 300k??

English

528

Nicole Fitzgerald@ninklefitz·2d

yeah ok you ship but do you send

English

657

56.7K

Josh McGrath@j_mcgraph·2d

@ninklefitz @agniv_s What food stops did you like?

English

Nicole Fitzgerald@ninklefitz·2d

@agniv_s strava.com/routes/3511462… mostly road plus probably <5km total of random gravel

English

1.7K

Josh McGrath@j_mcgraph·3d

@ericmitchellai No u won’t

English

645

Eric@ericmitchellai·3d

Drinking game: take a shot every time the announcer refers to Messi as "little" or "small" or "compact" or otherwise physically insignificant. Will report back

English

6.6K

Josh McGrath@j_mcgraph·4d

@jam3scampbell Bro what’s matplotlib

English

1.6K

James Campbell@jam3scampbell·4d

weird to think ML research once meant spending time learning about matplotlib Figure vs Axes

English

1.4K

69.7K

Josh McGrath@j_mcgraph·4d

Grifters are crazy, they literally learned how to reward hack enough people to live off it

English

314

Josh McGrath@j_mcgraph·4d

@BarrAlexandra Surprising you think so!!

English

295

Alexandra Barr@BarrAlexandra·5d

tldr: human data & data acq good

will depue@willdepue

A Stargate for Data

Labs are on a trajectory towards >$100B/year of data spend by 2030. As we begin the trillion-dollar compute project, we need to think about the equivalent civilizational-scale effort for the other core ingredient: data.

At the foundation of the scaling revolution is a simple empirical law: deep neural networks improve smoothly, near magically, as you scale two things in proportion — (1) the size of the model and (2) the amount of data you train on. And despite the scaling laws being brutally diminishing, we’ve successfully bitten the bullet of logarithmic scaling with exponentially larger clusters and datasets, and received incredible new capabilities in return.

But this exponential scaling is bound to hit some limits. Oddly enough, compute has compounded fairly smoothly without limit, with trillions flowing into hypercluster buildout. Instead, we’re starting to hit the limits of an exponential demand for data. Gone are the days of being purely in the compute-limited regime, where we had effectively infinite internet data but never enough GPUs, we’re now entering a data-limited regime.

Luckily, this limitation is coinciding with staggering improvements in AI capabilities. Incredibly, we seem to have a real line of sight towards automating a majority of knowledge work with the methods we have today. RL + pretraining, and the data for each, will be generally sufficient to achieve most economically valuable tasks, given some minimal algorithmic progress and continued compute scaling.

In a data-limited world, economic progress & scientific acceleration will be directly bottlenecked by our coverage in each domain. We need to see data collection as imperative, deserving the same civilizational ambition we’ve given compute.

The internet as a one-time subsidy

It’s underrated how much all progress in AI owes to the blessing of the internet, this one-time civilizational subsidy to deep learning, decades of unintentional accumulation of a perfect dataset: every book, blog post, image, video, paper, discussion, etc. all digitized and freely available. Without the internet, we’d likely see comparably minimal progress in AI today, and in fact, if you notice where systems currently underperform, it’s almost always a domain where web coverage is limited and data is private, expensive, non-digitized, or non-existent.

But we’re running out of it. There are only about 300 trillion tokens of useful public human text, and the internet doesn’t produce nearly enough new high-quality data to match what scaling demands — we’re soon to hit the limits of public data for pretraining. And though the advent of RL bought us reprieve — chain-of-thought RL needed a new form of untapped data, gradable math & coding tasks, also available online — we’re quickly running dry of hard tasks for RL as well.

Why do we need so much data anyways? Humans learn comparably in far less time, needing just one textbook where language models might need the equivalent of hundreds to learn a new topic. It’s possible we discover methods that are massively more data efficient — synthetic data, data efficient architectures, other exotic algorithms — but fundamental progress is slow and highly unpredictable, and the recipe we have just works today. And, while I’m wary of getting too deep here, even arbitrary data efficiency can’t replace data that just doesn’t exist in the first place. There’s a massive amount of missing information on the web: the dark matter of the internet — tacit knowledge, undocumented processes, etc. — most of which was never published and lives only inside organizations, the physical world, or just in people’s heads.

I’ll leave it here and say, for reasons far longer than I can fit in this post [1], it’s best to operate on the assumption that our insatiable desire for data will continue as it has for the last decade.

There will be >$100B/year in data spend by 2030

We’re not screwed yet, of course. Only a fraction of useful data in the world is on the public internet, the rest is stored inside private datasets, corporations, personal archives, universities, governments, and otherwise. Labs can and will continue to license these private datasets, or create them from scratch, like Anthropic’s book scanning project. And we’ll increasingly task human experts to manufacture new high-quality data, with a large fraction of hard RL training tasks already being sourced this way.

But collecting this data, unlike before, will be expensive. As the free internet dries up and demand for data rises, we should see labs investing equally in data as compute, likely spending a significant fraction of their compute budgets on data. As we see trillions spent on compute, we should also expect hundreds of billions spent on data (human data & collection budgets), given their equivalent importance. And, notably, data spend is already tracking this way: total data spend across vendors, not counting internal lab efforts, is already roughly $7 billion per year. It’s quite reasonable we’ll see >10x by 2030.

Data is the moat

Data becoming increasingly private will also majorly shift the competitive landscape. While compute is a commodity — everyone buys the same chips and builds the same clusters — data really isn’t. The big reason why frontier models have felt eerily similar to one another, until now, is they were trained on substantially the same internet (pretraining data variability across labs seems pretty low). As labs diverge onto more exclusive, manually collected corpora, I think models will begin to increasingly diverge.

OpenAI pulling ahead in mathematics and Anthropic in cybersecurity isn’t an accident. I really think laser-focused collection of high-quality midtraining tokens, custom RL tasks, environments, with dedicated research effort, has driven much of the visible progress in the last year. James Betker has an excellent blog about “the ‘it’ in a model is the dataset”: model architecture and compute buy you efficiency and order-of-magnitude performance, but ultimately, models, of any architecture, are such incredible approximators of their dataset that the core meat of a model boils down to just that, nothing else. Data is a major moat.

AGI long, ASI short

As I’ve tweeted before, I’m confident that, despite the narrative, the data labeling industry will continue to fuel great businesses and be an excellent AGI long, ASI short. The argument is just: By the time the AGI labs no longer need data, it’s probably over for everything else too [2]. In this frame, the last companies left should be the data companies, as the last speck of economically relevant data is sucked in. And these companies are already among some of the fastest-growing companies in history: Mercor, founded three years ago, is rumored to be doing $2 billion in revenue with something like a few million expert labelers under contract.

While these businesses are very non-stationary (what type of data is needed shifts constantly), I don’t think that diminishes their value. The long-tail of the economy is long, and the value isn’t diminishing as you extend farther into more obscure information: as models get more capable, the value of the marginal dataset goes up, not down. Automating a full job means covering its full distribution of tasks, tools, edge-cases, and long-horizon loops. There’s some O-ring logic to it: a dataset that buys a 1% bump can justify a previously unjustifiable collection cost when it’s the difference between a system that does 99% of a job and one that does all of it [3].
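
The O-ring logic above can be made concrete with a toy calculation. This is only a sketch under an assumption that is not in the post: a job's automated value is modeled as the product of per-task success rates, so one weak task drags down the whole job. The function names `job_value` and `marginal_gain` and all the numbers are hypothetical, chosen for illustration.

```python
from math import prod


def job_value(task_success, full_value=1.0):
    """Toy O-ring value: the job pays off in proportion to the product of
    per-task success rates (an assumed model, not taken from the post)."""
    return full_value * prod(task_success)


def marginal_gain(task_success, index, bump):
    """Value added by raising one task's success rate by `bump` (capped at 1.0)."""
    improved = list(task_success)
    improved[index] = min(1.0, improved[index] + bump)
    return job_value(improved) - job_value(task_success)


if __name__ == "__main__":
    # Same one-point bump on the last task, applied to two hypothetical systems.
    weak_system = [0.5, 0.5, 0.90]    # other tasks still unreliable
    strong_system = [1.0, 1.0, 0.99]  # one percent away from doing the whole job

    print("gain for weak system:  ", marginal_gain(weak_system, 2, 0.01))
    print("gain for strong system:", marginal_gain(strong_system, 2, 0.01))
```

Under this assumed model the identical bump is worth about four times more to the stronger system, which is one way to read the claim that the value of the marginal dataset rises as models get more capable.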

The competitive dynamics of the data industry are still evolving but as demand for data is increasingly niche, ultra high-quality, expert-generated, I think we’ll see real consolidation. Again, contra-narrative, we’ll probably see true competitive differentiation built on brand, quality control of data (which, from personal experience, can vary massively), as well as in network effects from the talent networks themselves over time. We’ve already seen rapidly shifting data type demand work in favor of incumbents, benefiting those with early knowledge of where the market is headed.

The binding constraint

It’s truly remarkable that we seem to have the recipe — pretraining + RL — to absorb most economically valuable work, despite being far from a lot of what we expected from “AGI”. The same way chess engines revealed we never needed general intelligence to solve chess, as we originally thought, we’ll soon realize that software, mathematics, and the vast majority of the economy (including physical, just running ~3 years behind!) are the same. If recursive self-improvement or some other algorithmic breakthrough arrives, that’s wonderful, but we really don’t have to wait for it.

The binding constraint between here and an automated economy isn’t that, it’s data coverage: every app, workflow, edge case, process, etc. sitting in private stores or someone’s head. Ultimately, while we make tremendous strides in more efficient model architectures, and clusters like Stargate equip us with zettaflop-scale compute, we really aren’t making rapid progress collecting the data we lack. We’ll soon live in a world where we have the methods & compute to accelerate scientific progress or economic growth, but not the data. And we’re already there today: frontier models would surely be as good at accounting/many medical tasks/legal advice as they are at software engineering if we only had the same pretraining & RL coverage as we did for code.

I really want to drill this in: The speed at which we automate the economy is going to be directly rate-limited by our ability to collect data about it.

Worth noting that under this assumption, with data as defensible and directly proportional to economic & scientific progress, data should also be considered a national strategic asset like compute. Imagine what we’d do in a world where we had a Manhattan Project-effort for AI and needed to mobilize data collection as a limiting factor. We should be concerned about China, with greater state capacity and authoritarian economic control, being capable of mobilizing data collection at national scale, potentially compounding their economy and scientific output faster than us down the line.

A Stargate for data

I’m leaving my complete ideas for a future post, as this one is already far too long, so I’d really like to pose the question here. Stargate exists because we organized trillions of dollars, international strategy, gigawatts around compute as a fundamental ingredient. What would equivalent ambition look like for data?

Obviously, scaling data collection, a heterogeneous mass of information across the economy, isn’t going to be as clear as scaling compute, as a homogenous infrastructural effort. A core division will be first, coverage — all uncaptured knowledge sitting across the economy/science/physical world and all that simply isn’t recorded — and, secondly, sheer volume in the domains we already train on: more hard math tasks, more high-quality web text, way more coding data, more legal drafts, etc.

I have a post coming soon which breaks down my proposals. There’s a lot of room for creativity. Quickly, we’ll probably want to start with a deep census of what we have and what we’re missing, predict what the 2030 model will still be bad at and work backward to what we should be collecting today. You can probably license a large amount, leveraging high lab valuations to buy datasets or companies altogether.

There’s an adversarial nature to a lot of this collection with firms, so there’s lots of engineering to do this correctly. We should go convince important companies to turn off deletion policies, even if we’re not buying from them yet. Data flywheels in consumer products will be massive. Confidential training, government legislation for grant-funded research, running companies at a loss for their data, etc.

We’re headed towards hundreds of billions in expenditure, national prioritization, and major data limitation on the horizon. We have a great opportunity to think creatively about what a megaproject for data would look like: How do we, deliberately this time, construct the next internet’s worth of data?

Footnotes:

[1]: I’ll probably soon publish my much longer post explaining my position on data efficiency and why the value of this data is still pretty high in most worlds regardless of new algorithms.

[2]: The “AGI freeroll” bet: heads you win, tails ASI flips the world upside down anyways.

[3]: We already see a glint of validation of this point, given the data market is strongly tilting towards ultra-high-quality agentic data, rather than unskilled labeling — niche expert workflows, live environments, and evaluations requiring increasingly obscure talent & knowledge — yet shows increasing, not decreasing, revenues.
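
As a rough check on the spend figures quoted in the post (roughly $7 billion per year across vendors today, >$100B/year by 2030), the sketch below computes the growth multiple and the constant yearly growth rate those two numbers would imply. The post does not state a base year, so the horizons tried here are assumptions, and the function names `growth_multiple` and `required_annual_growth` are hypothetical.

```python
def growth_multiple(current_spend, target_spend):
    """How many times larger the target spend is than the current spend."""
    return target_spend / current_spend


def required_annual_growth(current_spend, target_spend, years):
    """Constant yearly growth rate that compounds current_spend into target_spend."""
    return (target_spend / current_spend) ** (1.0 / years) - 1.0


CURRENT_SPEND = 7e9   # "roughly $7 billion per year" (from the post)
TARGET_SPEND = 100e9  # ">$100B/year" by 2030 (from the post)

if __name__ == "__main__":
    multiple = growth_multiple(CURRENT_SPEND, TARGET_SPEND)
    print(f"multiple needed: {multiple:.1f}x")
    for years in (3, 4, 5):  # assumed horizons; the post gives no base year
        rate = required_annual_growth(CURRENT_SPEND, TARGET_SPEND, years)
        print(f"{years} years -> {rate:.0%} per year")
```

A 10x increase from $7 billion lands at about $70 billion, so the >$100B/year headline corresponds to a multiple of roughly 14x, consistent with the post's ">10x" wording.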

English

13.5K

Josh McGrath@j_mcgraph·5d

@john__allard We actually have one of those! openai.com/careers/

English

1.2K