Brendan Farmer
@bfarmer

548 posts

works on ZK, prev mir

Joined October 2018
865 Following · 8.3K Followers

Brendan Farmer@bfarmer·
@yenwod_ No, it’s all vibes 😅 but that is a good idea. What about you?

yenwod@yenwod_·
@bfarmer speaking of benchmarks, do you guys do any bespoke LLM benchmarks in your codebase? e.g. to understand which models perform better on your typical bugfixes, or whether adding docs, changing instructions, or reorganizing code impacts coding agent performance?

Brendan Farmer@bfarmer·
I think Opus 4.6 overperformance and GPT-5.4 underperformance should be read as failures of the benchmark. imo citing the METR doubling timeline is statistically illiterate. If you look at the task distribution wrt duration: the assignment of durations doesn't seem rigorous, there aren't enough long-duration tasks, there's bias in how tasks are constructed (clustered around particular skills/capabilities), and the eval is susceptible to benchmaxxing. I'm not a statistician, but it's unclear to me that logistic regression is appropriate. I think it's a useful measure of whether models are improving, but improved scores indicate improvements in specific skills/capabilities, not reliability as a function of duration.
METR@METR_Evals

We ran GPT-5.4 (xhigh) on our tasks. Its time-horizon depends greatly on our treatment of reward hacks: the point estimate would be 5.7hrs (95% CI of 3hrs to 13.5hrs) under our standard methodology, but 13hrs (95% CI of 5hrs to 74hrs) if we allow reward hacks.
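For context, a minimal sketch of how a time-horizon number like this gets produced, assuming METR's published approach of fitting a logistic regression of task success against log task duration and reading off the 50% point; the task data below is invented:

```python
# Sketch: fit success probability vs. log2(task duration) and report the duration
# at which the fitted probability crosses 50% (the "time horizon"). Data is fake.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical eval results: (task duration in minutes, success 0/1)
durations_min = np.array([2, 5, 10, 15, 30, 60, 120, 240, 480, 960], dtype=float)
successes     = np.array([1, 1, 1,  1,  1,  1,  0,   1,   0,   0])

X = np.log2(durations_min).reshape(-1, 1)
model = LogisticRegression().fit(X, successes)

# 50% horizon: solve intercept + coef * log2(t) = 0 for t
b0, b1 = model.intercept_[0], model.coef_[0][0]
print(f"50% time horizon ≈ {2 ** (-b0 / b1):.0f} minutes")
```

With only a handful of long-duration tasks, the fitted slope (and hence the horizon and its confidence interval) hinges on a few data points, which is part of the objection above.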

Brendan Farmer@bfarmer·
“Inference is low-margin” is maybe not quite right. Inference is low-margin relative to zero-marginal-cost software businesses, but it’s plausible that compute profit margins can be robust for inference (especially as hardware improves). OAI cited a 70% compute margin in 2025, but this is based on a blend that includes monthly subscriptions with low usage. No lab, afaik, has released a per-token API profit margin. Margins are more complicated for inference vs widgets in a factory because R&D spend is so high (and increasing superlinearly), and R&D is a mandatory continuing cost to retain usage. I don’t think it’s quite right to say that frontier labs have an amazing business compared to, say, producing widgets in a factory, because margins are (guessing) ~33% given the structure of the business. Also, as models rely increasingly on test-time compute, does this meaningfully reduce compute margins?
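A toy illustration of the blending concern, with entirely made-up numbers (this is not OAI's actual revenue mix):

```python
# Why a blended "compute margin" can overstate per-token API margins: flat-rate
# subscriptions with light usage have very high margins, and mixing them with
# API traffic lifts the blended figure. All numbers are hypothetical.

def margin(revenue, compute_cost):
    return (revenue - compute_cost) / revenue

sub_revenue, sub_compute = 100.0, 10.0   # subscriptions: little compute per revenue dollar
api_revenue, api_compute = 100.0, 50.0   # API: paid per token, heavier compute per dollar

print(f"subscription margin:    {margin(sub_revenue, sub_compute):.0%}")   # 90%
print(f"per-token API margin:   {margin(api_revenue, api_compute):.0%}")   # 50%
print(f"blended compute margin: {margin(sub_revenue + api_revenue, sub_compute + api_compute):.0%}")  # 70%
```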

Brendan Farmer@bfarmer·
This post reminds me that I don't understand the economics of frontier labs. Inference is a low-margin business to begin with, and OpenAI and Anthropic must continuously invest huge amounts in R&D compute to maintain pricing power (if cheaper open-source models catch up, frontier labs' margins should approach zero). This is made worse because training cost increases superlinearly, so labs are caught in a rapidly compressing window as open source models catch up on one side, while the amount of capital/compute/electricity required to meaningfully improve models goes vertical on the other. MSFT and GOOG are ~zero-marginal-cost, high-growth, cash-generating businesses valued at 20-30x earnings. It's unclear that labs should even command 20-30x earnings, much less infinity-x earnings and 30x+ ARR. Seems pretty clear privates are overpriced.
Gaurav Ahuja@gauravahuja

One of these two groups is mispriced.

Private AI labs: OpenAI valued around $840B, Anthropic north of $600B on secondaries. Both at 30x+ ARR.

Public giants: Microsoft at ~$3T on 23x forward earnings. Amazon at ~$2.3T on 28x.

Microsoft likely owns ~25% of OpenAI. Amazon likely owns ~15% of Anthropic and ~5% of OpenAI.

If private investors are pricing these labs for a $5T+ venture-style outcome then… Microsoft’s implied stake in a $5T OpenAI is $1.25T embedded inside a $3T company. Amazon’s combined stakes embed roughly $1T inside a $2.3T company.

Publics too cheap on AI exposure? Or privates/secondaries in bubble territory? Which breaks first?
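The quoted arithmetic, spelled out (the stake percentages and the $5T outcome are the quoted tweet's assumptions; I'm also assuming Anthropic reaches the same $5T outcome, which is what reproduces its ~$1T figure):

```python
# Implied value of the public companies' stakes under the quoted venture-style outcome.
T = 1e12
openai_outcome = 5 * T       # hypothetical $5T OpenAI outcome (per the quoted tweet)
anthropic_outcome = 5 * T    # assumed so Amazon's combined stakes come to ~$1T

msft_implied = 0.25 * openai_outcome
amzn_implied = 0.15 * anthropic_outcome + 0.05 * openai_outcome

print(f"MSFT implied stake:  ${msft_implied / T:.2f}T (vs ~$3T market cap)")
print(f"AMZN implied stakes: ${amzn_implied / T:.2f}T (vs ~$2.3T market cap)")
```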

Brendan Farmer@bfarmer·
Assume this is bait, but "great men sitting around moaning about their feelings" is basically as old as western civilization. See Achilles in the Iliad sitting around moaning about his wounded pride and grief. It's difficult to find great men and women of the past who *didn't* write letters moaning about their feelings.
Marc Andreessen 🇺🇸@pmarca

It is 100% true that great men and women of the past were not sitting around moaning about their feelings. I regret nothing.

Brendan Farmer retweeted
Robin Salen@RobinSalen·
🚀 New Plonky3 release just dropped. This is probably our most impactful and ambitious release so far:
- MUCH faster lookups
- High-arity folding
- N-ary Merkle trees + Merkle caps
- Major Poseidon2 optimizations
- Poseidon1 support
- And many more…
Let’s break it down 👇

Brendan Farmer@bfarmer·
Hegseth [yelling]: and if you won't train WarClaude... we have an effective altruist who will.
[image attached]

Brendan Farmer@bfarmer·
Dario has claimed that inference revenue exceeds training cost for Anthropic's models, so training yields a positive return on investment. I'm curious whether "training cost" refers to the final run or to total R&D compute cost (up to ~4-5x greater, including experiments, failed runs, etc.). It's definitely the case that Anthropic has negative cashflows, and this is mainly because training cost for the next model is increasing at a high rate, but I'm curious whether there's positive ROI on total R&D compute spend for current models.
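A toy version of why the two readings differ (the numbers are placeholders, not Anthropic's):

```python
# ROI on a model under two readings of "training cost": final run only, vs total
# R&D compute (~4-5x the final run, per the tweet). All numbers are hypothetical.
final_run_cost = 1.0            # normalize the final training run to 1
rd_multiplier = 4.5             # experiments, failed runs, etc.
inference_gross_profit = 2.0    # hypothetical lifetime gross profit of the model

print(f"ROI vs final run only:    {inference_gross_profit / final_run_cost - 1:+.0%}")
print(f"ROI vs total R&D compute: {inference_gross_profit / (final_run_cost * rd_multiplier) - 1:+.0%}")
```

The same revenue can look like a comfortably positive return on the final run and a clearly negative return on total R&D compute.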

Brendan Farmer@bfarmer·
@AlexGodofsky @sporadica Sure, don't disagree, said above that usage might increase exponentially. Just saying that if training costs increase at 10x/year and inference costs decrease at 10x/year, then without margin growth, positive returns on training assume a lot of utilization growth.
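The arithmetic behind that claim, under the tweet's stylized assumptions:

```python
# If per-token price falls 10x/year while training cost rises 10x/year, and margins
# are held fixed, token volume must grow ~100x/year for inference revenue to keep
# pace with training cost. Rates are the tweet's stylized assumptions.
price_decline = 10.0      # per-token price falls 10x per year
training_growth = 10.0    # training cost rises 10x per year

# revenue = price * tokens; keeping revenue / training_cost constant requires:
required_token_growth = price_decline * training_growth
print(f"required token-volume growth: ~{required_token_growth:.0f}x per year")
```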

Alex Godofsky@AlexGodofsky·
@bfarmer @sporadica They don't have to grow margins if they can grow utilization instead, and the whole theory of the enterprise is that smarter models will see wildly growing utilization (plus growth from general technology diffusion).

Brendan Farmer@bfarmer·
Training costs aren't public, but it looks like Grok 4 cost ~$500m and GPT-5 cost ~$500m; add up failed runs + experiments and I would imagine total cost is $1-2b (OAI's R&D budget is $9b/year). Dario said in 2024 that they had models in development that would cost >$1b to train. I'm not applying current training costs to revenue from older models - Opus 4.5 was released in November and I'm using December ARR. Obviously there's a mix of revenue from different models, but it's fair as a rough comparison. I think positive contribution margins are believable (but he does have to say that), but my point is that the business model is fairly uncertain: training costs continue to increase (significantly) while inference costs are decreasing, and Anthropic has to grow margins in a competitive environment to maintain inference revenue growth relative to training costs.

Brendan Farmer@bfarmer·
I think that expected future returns to model training might be high, but it's very unclear that current returns are > 0. Each frontier model has a useful lifespan of maybe ~6 months, and current models cost $1-2b to train (speculating). Anthropic claims 40% gross margins and $9b in 12/2025 ARR. So it's unclear that gross profits for Opus 4.5 will exceed training cost. You might say that it's a safe bet that inference costs will decline heavily (I'd agree), but I don't think it follows that labs can directly translate cost savings into improved margins in a competitive market. On the other hand, training costs definitely increase, so it's unclear to me that training models will yield high returns to labs. Maybe as task complexity and usage increase exponentially, inference revenue still keeps pace with training cost?
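The rough arithmetic behind that doubt, using the figures cited in the tweet (the training cost is speculative):

```python
# Gross profit over a ~6-month frontier-model lifespan vs a $1-2B training cost.
arr = 9.0e9               # $9B ARR (December run rate, per the tweet)
gross_margin = 0.40       # claimed gross margin
lifespan_years = 0.5      # ~6-month useful lifespan of a frontier model

gross_profit = arr * lifespan_years * gross_margin
print(f"gross profit over lifespan: ${gross_profit / 1e9:.1f}B")  # $1.8B
# vs a speculative $1-2B training cost: roughly break-even, before any R&D overhead
```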

Alex Godofsky@AlexGodofsky·
@sporadica The returns to model training are high. They're so high that it would be absurd to limit your model training investment to what you can internally finance from inference profits. This is the point of having investors.

Brendan Farmer@bfarmer·
Plenty of caveats, but there are also fairly extensive theoretical and empirical results suggesting superlinearly increasing costs for RL (definitely true for pretraining). Even if there isn't a theoretical basis for a glass fence, costs might increase by too many OOMs to be practical on a reasonable timescale.

Peli Grietzer@peligrietzer·
@xuanalogue @gleech @littmath i feel like there's some miscommunication with both you and gavin. i don't think anything in post suggests that we know there's a limit to how much generalization you can get with current methods. the post says we don't know that there isn't

Brendan Farmer@bfarmer·
I was randomly thinking about this exact example: it's somewhat easy (though impressive) to imagine LLMs proving isolated lemmas in math, or coding. I think it's harder to imagine an LLM spontaneously picturing itself in an accelerating elevator, realizing that's indistinguishable from gravity, and then formulating GR. Conceptual creativity like this seems really difficult to capture via training.
Dustin@r0ck3t23

Demis Hassabis just defined the real test for AGI. It’s more brutal than anyone expected.

Train AI on all human knowledge. Cut it off at 1911. See if it independently discovers general relativity like Einstein did in 1915. If it can, we have AGI. If not, we’re still building pattern matchers.

Hassabis: “My definition of AGI has never changed. A system that can exhibit all the cognitive capabilities that humans can.” Not bar exams. Not coding competitions. All cognitive capabilities.

Hassabis: “The brain is the only existence proof we have, maybe in the universe, of a general intelligence.” That’s why DeepMind studies neuroscience. Not for inspiration. For data. The human brain is the only confirmed evidence that general intelligence is physically possible. If you want to build it, you study the only example that exists.

Hassabis: “True creativity, continual learning, long-term planning. They’re not good at those things.” Current systems are impressive and broken simultaneously.

Hassabis: “They can get gold medals in international math olympiad questions, but they can still fall over on relatively simple math problems if you pose it in a certain way.” Jagged intelligence. Brilliant in narrow domains. Incompetent when approached differently. That inconsistency is the tell. A true general intelligence doesn’t spike in one direction and collapse in another.

The Einstein test cuts through all of it. No benchmarks. No leaderboards. No carefully curated evals. Just a model, a knowledge cutoff, and the question of whether it can do what one human did alone in 1915.

Hassabis: “Training an AI system with a knowledge cutoff of 1911 and seeing if it could come up with general relativity like Einstein did in 1915. That’s the true test of whether we have a full AGI system.”

Current models can’t. They remix brilliantly. They don’t generate paradigm-shifting theories from first principles.

Hassabis: “I think we’re still a few years away from that.” A few years. Not decades.

The system that can be Einstein once can be Einstein a thousand times simultaneously across every domain. That’s not AGI anymore. That’s the beginning of something we don’t have words for yet. When that test gets passed, we won’t need a press release to know what happened.

Brendan Farmer@bfarmer·
I'm thinking about software engineering relative to, say, math research. My argument is that the act of writing software is more decomposable into discrete and simple steps. Ex: map from a spec to an interface, match the interface to a known pattern, write code to properly call the interface, debug. Compilers are basically a perfect verifier for RL, there aren't that many patterns in SWE, and there's a ton of training data, so it makes sense that SWE benefits hugely from RL. If you're correct that verbal reasoning from software RL 1) translates ~well to other domains and 2) improves the reasoning abilities of the underlying model, then I would expect to see transfer to domains like math research (but models are kinda bad at this and make reasoning mistakes all the time, despite math-specific RL). Not an expert, but there seems to be some current research suggesting that RLVR doesn't improve pass@k over the base model (at large k), but does improve pass@1. This sorta makes plausible intuitive sense to me: RLVR can't explore beyond the reasoning capabilities of the base model, so if the architecture or training data isn't providing some capability, then RL can't learn it.
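A toy model of that pass@1 vs pass@k intuition (not taken from any particular paper; the numbers are invented):

```python
# Two problems: one whose solution is already in the base model's support (small
# per-sample success rate) and one it can't solve at all. RLVR reweights mass toward
# the known solution but adds nothing new, so pass@1 improves while pass@k at large k
# converges back to what the base model's support allows.
def pass_at_k(p, k):
    return 1 - (1 - p) ** k

base = [0.05, 0.0]   # per-sample success rates for the two problems
rl   = [0.40, 0.0]   # RL sharpens the first, can't touch the second

for k in (1, 16, 256):
    b = sum(pass_at_k(p, k) for p in base) / len(base)
    r = sum(pass_at_k(p, k) for p in rl) / len(rl)
    print(f"k={k:>3}  base pass@k={b:.2f}  RL pass@k={r:.2f}")
```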

Matthew Honnibal@honnibal·
@bfarmer I think you're thinking about tracing through the code, but there's tonnes of verbal reasoning in things like debugging. I have this hypothesis, does this observation contradict it? What can I do to narrow the search space? etc

Matthew Honnibal@honnibal·
I still see a lot of people discussing LLMs as next-token predictors, which is by now quite a misunderstanding. A related opinion is that LLM progress will probably plateau. This post explains why I don't think the "plateau" argument holds up. honnibal.dev/blog/ai-bubble

Brendan Farmer@bfarmer·
Sure, but I don't think it's straightforward to claim that abstract verbal reasoning in coding necessarily transfers to other domains. The model could be learning the reasoning moves that are relevant to coding (and a mapping from those moves to language), but in a way that doesn't generalize. Or maybe it generalizes, but doesn't scale with compute as well as pretraining. To me, it's plausible that coding requires fairly basic reasoning moves and RLVR reweights a model's solution mass to yield improved accuracy, but RL doesn't help the base model learn new reasoning moves. I think it's still an open question whether RL can improve the base model's reasoning ability efficiently enough to extend it to AGI.

Matthew Honnibal@honnibal·
@bfarmer Using the coding models it feels pretty clear to me it learns abstract verbal reasoning pretty well. The thing about transfer learning is, generalisable skills are generally available. You can get them wherever they're convenient

Brendan Farmer@bfarmer·
First Proof discussion: codeberg.org/tgkolda/1stpro… LLMs only solved problems 9 and 10, with the caveat that important ideas for both were previously published. OpenAI claims to have solved more problems using an unreleased model, but they may have violated the benchmark rules by having human experts direct the model to expand upon some proofs and select the best proof from a set. The rules as written are that an AI should not rely on human input for any mathematical idea or content, but it seems possible that a human directing an LLM to expand on a certain section of a proof is implicitly providing mathematical input. It will be interesting to see whether future rounds enforce autonomous solutions and independent verification more strictly.

Brendan Farmer@bfarmer·
Hmm, so an H100 runs at 700W, so 8xH100 = 5.6kW, which is ~2.5 OOMs above 20W for current LLMs. If we assume a few OOMs (3-6?) increase in inference cost to get to an Einstein, that's a significant cost. It also might be the case that a million Einsteins is not very helpful, as the work of RSI is not very parallelizable, and what you'd actually want is an Einstein that operates OOMs faster than humans, pushing cost further. Also, if we're considering RSI, then we have to include training in the loop, which should require OOMs more power/chips to achieve a linear capability increase. To me, this suggests that even after RSI, power + chips will still be constraints. Otherwise, I think you need to assume that there are significant algorithmic improvements that are just out of reach and would be quickly discovered by RSI. Maybe I'm thinking about this incorrectly though, not my field!
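The order-of-magnitude arithmetic, made explicit (the extra OOMs of inference cost for an "Einstein" are the speculative part):

```python
# Gap between an 8xH100 node and a 20W human brain, plus speculative extra OOMs
# of inference cost to reach Einstein-level work.
import math

watts_8xh100 = 8 * 700            # 5.6 kW
brain_watts = 20
gap = math.log10(watts_8xh100 / brain_watts)
print(f"current gap: ~{gap:.1f} OOMs")

for extra_ooms in (3, 6):         # speculative range from the tweet
    print(f"+{extra_ooms} OOMs of inference cost -> ~{gap + extra_ooms:.1f} OOMs above 20W")
```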

ueaj@_ueaj·
@bfarmer @connorshepherd Well my point was that another o1-preview-tier innovation would get us much closer to 20W/Einstein, at least a few OOMs, which means OOMs more Einsteins than your opponent

Brendan Farmer@bfarmer·
Maybe, but my argument is that the first iteration of RSI will probably be >> 20W/Einstein, so the speed at which labs can produce Einsteins is limited by power and chips. I don't think that RSI => unbounded immediate capability growth. Moreover, different RSI implementations might have different trajectories, so being first to RSI doesn't imply being the permanently winning lab.

ueaj@_ueaj·
@bfarmer @connorshepherd I don't think it will. If we actually did make a digital Einstein, down to the 20W/Einstein energy efficiency, then we could just make ~infinity of them