Brendan Farmer
@bfarmer

548 posts

works on ZK, prev mir

Joined October 2018
865 Following · 8.3K Followers

Brendan Farmer@bfarmer·
@yenwod_ No, it’s all vibes 😅 but that is a good idea. What about you?

yenwod@yenwod_·
@bfarmer speaking of benchmarks, do you guys do any bespoke LLM benchmarks in your codebase? e.g. to understand which models perform better on your typical bugfixes, or whether adding docs, changing instructions, or reorganizing code impacts coding agent performance?

Brendan Farmer@bfarmer·
I think Opus 4.6 overperformance and GPT-5.4 underperformance should be read as failures of the benchmark. imo citing the METR doubling timeline is statistically illiterate. If you look at the task distribution wrt duration: the assignment of durations doesn't seem rigorous, there aren't enough long-duration tasks, there's bias in how tasks are constructed (clustered around particular skills/capabilities), and the eval is susceptible to benchmaxxing. I'm not a statistician, but it's unclear to me that logistic regression is appropriate. I think it's a useful measure of whether models are improving, but improved scores indicate improvements in specific skills/capabilities, not reliability as a function of duration.
METR@METR_Evals

We ran GPT-5.4 (xhigh) on our tasks. Its time-horizon depends greatly on our treatment of reward hacks: the point estimate would be 5.7hrs (95% CI of 3hrs to 13.5hrs) under our standard methodology, but 13hrs (95% CI of 5hrs to 74hrs) if we allow reward hacks.
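For context, a minimal sketch of how a time-horizon number like this gets produced, assuming METR's published approach of fitting a logistic regression of task success against log task duration and reading off the 50% point; the task data below is invented:

```python
# Sketch: fit success probability vs. log2(task duration) and report the duration
# at which the fitted probability crosses 50% (the "time horizon"). Data is fake.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical eval results: (task duration in minutes, success 0/1)
durations_min = np.array([2, 5, 10, 15, 30, 60, 120, 240, 480, 960], dtype=float)
successes     = np.array([1, 1, 1,  1,  1,  1,  0,   1,   0,   0])

X = np.log2(durations_min).reshape(-1, 1)
model = LogisticRegression().fit(X, successes)

# 50% horizon: solve intercept + coef * log2(t) = 0 for t
b0, b1 = model.intercept_[0], model.coef_[0][0]
print(f"50% time horizon ≈ {2 ** (-b0 / b1):.0f} minutes")
```

With only a handful of long-duration tasks, the fitted slope (and hence the horizon and its confidence interval) hinges on a few data points, which is part of the objection above.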

Brendan Farmer@bfarmer·
“Inference is low-margin” is maybe not quite right. Inference is low-margin relative to zero-marginal-cost software businesses, but it’s plausible that compute profit margins can be robust for inference (especially as hardware improves). OAI cited a 70% compute margin in 2025, but this is based on a blend that includes monthly subscriptions with low usage. No lab, afaik, has released a per-token API profit margin. Margins are more complicated for inference vs widgets in a factory because R&D spend is so high (and increasing superlinearly), and R&D is a mandatory continuing cost to retain usage. I don’t think it’s quite right to say that frontier labs have an amazing business compared to, say, producing widgets in a factory, because margins are (guessing) ~33% given the structure of the business. Also, as models rely increasingly on test-time compute, does this meaningfully reduce compute margins?
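A toy illustration of the blending concern, with entirely made-up numbers (this is not OAI's actual revenue mix):

```python
# Why a blended "compute margin" can overstate per-token API margins: flat-rate
# subscriptions with light usage have very high margins, and mixing them with
# API traffic lifts the blended figure. All numbers are hypothetical.

def margin(revenue, compute_cost):
    return (revenue - compute_cost) / revenue

sub_revenue, sub_compute = 100.0, 10.0   # subscriptions: little compute per revenue dollar
api_revenue, api_compute = 100.0, 50.0   # API: paid per token, heavier compute per dollar

print(f"subscription margin:    {margin(sub_revenue, sub_compute):.0%}")   # 90%
print(f"per-token API margin:   {margin(api_revenue, api_compute):.0%}")   # 50%
print(f"blended compute margin: {margin(sub_revenue + api_revenue, sub_compute + api_compute):.0%}")  # 70%
```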

Brendan Farmer@bfarmer·
This post reminds me that I don't understand the economics of frontier labs. Inference is a low-margin business to begin with, and OpenAI and Anthropic must continuously invest huge amounts in R&D compute to maintain pricing power (if cheaper open-source models catch up, frontier labs' margins should approach zero). This is made worse because training cost increases superlinearly, so labs are caught in a rapidly compressing window as open source models catch up on one side, while the amount of capital/compute/electricity required to meaningfully improve models goes vertical on the other. MSFT and GOOG are ~zero-marginal-cost, high-growth, cash-generating businesses valued at 20-30x earnings. It's unclear that labs should even command 20-30x earnings, much less infinity-x earnings and 30x+ ARR. Seems pretty clear privates are overpriced.
Gaurav Ahuja@gauravahuja

One of these two groups is mispriced.

Private AI labs: OpenAI valued around $840B, Anthropic north of $600B on secondaries. Both at 30x+ ARR.

Public giants: Microsoft at ~$3T on 23x forward earnings. Amazon at ~$2.3T on 28x.

Microsoft likely owns ~25% of OpenAI. Amazon likely owns ~15% of Anthropic and ~5% of OpenAI.

If private investors are pricing these labs for a $5T+ venture-style outcome then… Microsoft’s implied stake in a $5T OpenAI is $1.25T embedded inside a $3T company. Amazon’s combined stakes embed roughly $1T inside a $2.3T company.

Publics too cheap on AI exposure? Or privates/secondaries in bubble territory? Which breaks first?
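The quoted arithmetic, spelled out (the stake percentages and the $5T outcome are the quoted tweet's assumptions; I'm also assuming Anthropic reaches the same $5T outcome, which is what reproduces its ~$1T figure):

```python
# Implied value of the public companies' stakes under the quoted venture-style outcome.
T = 1e12
openai_outcome = 5 * T       # hypothetical $5T OpenAI outcome (per the quoted tweet)
anthropic_outcome = 5 * T    # assumed so Amazon's combined stakes come to ~$1T

msft_implied = 0.25 * openai_outcome
amzn_implied = 0.15 * anthropic_outcome + 0.05 * openai_outcome

print(f"MSFT implied stake:  ${msft_implied / T:.2f}T (vs ~$3T market cap)")
print(f"AMZN implied stakes: ${amzn_implied / T:.2f}T (vs ~$2.3T market cap)")
```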

Brendan Farmer@bfarmer·
Assume this is bait, but "great men sitting around moaning about their feelings" is basically as old as western civilization. See Achilles in the Iliad sitting around moaning about his wounded pride and grief. It's difficult to find great men and women of the past who *didn't* write letters moaning about their feelings.
Marc Andreessen 🇺🇸@pmarca

It is 100% true that great men and women of the past were not sitting around moaning about their feelings. I regret nothing.

Brendan Farmer retweeted
Robin Salen@RobinSalen·
🚀 New Plonky3 release just dropped. This is probably our most impactful and ambitious release so far:
- MUCH faster lookups
- High-arity folding
- N-ary Merkle trees + Merkle caps
- Major Poseidon2 optimizations
- Poseidon1 support
- And many more…
Let’s break it down 👇

Brendan Farmer@bfarmer·
Hegseth [yelling]: and if you won't train WarClaude... we have an effective altruist who will.
[image attached]

Brendan Farmer@bfarmer·
Dario has claimed that inference revenue exceeds training cost for Anthropic's models, so training yields a positive return on investment. I'm curious whether "training cost" refers to the final run or to total R&D compute cost (up to ~4-5x greater, including experiments, failed runs, etc.). It's definitely the case that Anthropic has negative cashflows, and this is mainly because training cost for the next model is increasing at a high rate, but I'm curious whether there's positive ROI on total R&D compute spend for current models.
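A toy version of why the two readings differ (the numbers are placeholders, not Anthropic's):

```python
# ROI on a model under two readings of "training cost": final run only, vs total
# R&D compute (~4-5x the final run, per the tweet). All numbers are hypothetical.
final_run_cost = 1.0            # normalize the final training run to 1
rd_multiplier = 4.5             # experiments, failed runs, etc.
inference_gross_profit = 2.0    # hypothetical lifetime gross profit of the model

print(f"ROI vs final run only:    {inference_gross_profit / final_run_cost - 1:+.0%}")
print(f"ROI vs total R&D compute: {inference_gross_profit / (final_run_cost * rd_multiplier) - 1:+.0%}")
```

The same revenue can look like a comfortably positive return on the final run and a clearly negative return on total R&D compute.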

Brendan Farmer@bfarmer·
@AlexGodofsky @sporadica Sure, don't disagree, said above that usage might increase exponentially. Just saying that if training costs increase at 10x/year and inference costs decrease at 10x/year, then without margin growth, positive returns on training assume a lot of utilization growth.
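The arithmetic behind that claim, under the tweet's stylized assumptions:

```python
# If per-token price falls 10x/year while training cost rises 10x/year, and margins
# are held fixed, token volume must grow ~100x/year for inference revenue to keep
# pace with training cost. Rates are the tweet's stylized assumptions.
price_decline = 10.0      # per-token price falls 10x per year
training_growth = 10.0    # training cost rises 10x per year

# revenue = price * tokens; keeping revenue / training_cost constant requires:
required_token_growth = price_decline * training_growth
print(f"required token-volume growth: ~{required_token_growth:.0f}x per year")
```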

Alex Godofsky@AlexGodofsky·
@bfarmer @sporadica They don't have to grow margins if they can grow utilization instead, and the whole theory of the enterprise is that smarter models will see wildly growing utilization (plus growth from general technology diffusion).

Brendan Farmer@bfarmer·
Training costs aren't public, but it looks like Grok 4 cost ~$500m and GPT-5 cost ~$500m; add up failed runs + experiments and I would imagine total cost is $1-2b (OAI's R&D budget is $9b/year). Dario said in 2024 that they had models in development that would cost >$1b to train. I'm not applying current training costs to revenue from older models - Opus 4.5 was released in November and I'm using December ARR. Obviously there's a mix of revenue from different models, but it's fair as a rough comparison. I think positive contribution margins are believable (but he does have to say that), but my point is that the business model is fairly uncertain: training costs continue to increase (significantly) while inference costs are decreasing, and Anthropic has to grow margins in a competitive environment to maintain inference revenue growth relative to training costs.

Brendan Farmer@bfarmer·
I think that expected future returns to model training might be high, but it's very unclear that current returns are > 0. Each frontier model has a useful lifespan of maybe ~6 months, and current models cost $1-2b to train (speculating). Anthropic claims 40% gross margins and $9b in 12/2025 ARR. So it's unclear that gross profits for Opus 4.5 will exceed training cost. You might say that it's a safe bet that inference costs will decline heavily (I'd agree), but I don't think it follows that labs can directly translate cost savings into improved margins in a competitive market. On the other hand, training costs definitely increase, so it's unclear to me that training models will yield high returns to labs. Maybe as task complexity and usage increase exponentially, inference revenue still keeps pace with training cost?
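The rough arithmetic behind that doubt, using the figures cited in the tweet (the training cost is speculative):

```python
# Gross profit over a ~6-month frontier-model lifespan vs a $1-2B training cost.
arr = 9.0e9               # $9B ARR (December run rate, per the tweet)
gross_margin = 0.40       # claimed gross margin
lifespan_years = 0.5      # ~6-month useful lifespan of a frontier model

gross_profit = arr * lifespan_years * gross_margin
print(f"gross profit over lifespan: ${gross_profit / 1e9:.1f}B")  # $1.8B
# vs a speculative $1-2B training cost: roughly break-even, before any R&D overhead
```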

Alex Godofsky@AlexGodofsky·
@sporadica The returns to model training are high. They're so high that it would be absurd to limit your model training investment to what you can internally finance from inference profits. This is the point of having investors.

Brendan Farmer@bfarmer·
Plenty of caveats, but there are also fairly extensive theoretical and empirical results suggesting superlinearly increasing costs for RL (definitely true for pretraining). Even if there isn't a theoretical basis for a glass fence, costs might increase by too many OOMs to be practical on a reasonable timescale.

Peli Grietzer@peligrietzer·
@xuanalogue @gleech @littmath i feel like there's some miscommunication with both you and gavin. i don't think anything in post suggests that we know there's a limit to how much generalization you can get with current methods. the post says we don't know that there isn't

Brendan Farmer@bfarmer·
I was randomly thinking about this exact example: it's somewhat easy (though impressive) to imagine LLMs proving isolated lemmas in math, or coding. I think it's harder to imagine an LLM spontaneously picturing itself in an accelerating elevator, realizing that's indistinguishable from gravity, and then formulating GR. Conceptual creativity like this seems really difficult to capture via training.
Dustin@r0ck3t23

Demis Hassabis just defined the real test for AGI. It’s more brutal than anyone expected.

Train AI on all human knowledge. Cut it off at 1911. See if it independently discovers general relativity like Einstein did in 1915. If it can, we have AGI. If not, we’re still building pattern matchers.

Hassabis: “My definition of AGI has never changed. A system that can exhibit all the cognitive capabilities that humans can.” Not bar exams. Not coding competitions. All cognitive capabilities.

Hassabis: “The brain is the only existence proof we have, maybe in the universe, of a general intelligence.” That’s why DeepMind studies neuroscience. Not for inspiration. For data. The human brain is the only confirmed evidence that general intelligence is physically possible. If you want to build it, you study the only example that exists.

Hassabis: “True creativity, continual learning, long-term planning. They’re not good at those things.” Current systems are impressive and broken simultaneously.

Hassabis: “They can get gold medals in international math olympiad questions, but they can still fall over on relatively simple math problems if you pose it in a certain way.” Jagged intelligence. Brilliant in narrow domains. Incompetent when approached differently. That inconsistency is the tell. A true general intelligence doesn’t spike in one direction and collapse in another.

The Einstein test cuts through all of it. No benchmarks. No leaderboards. No carefully curated evals. Just a model, a knowledge cutoff, and the question of whether it can do what one human did alone in 1915.

Hassabis: “Training an AI system with a knowledge cutoff of 1911 and seeing if it could come up with general relativity like Einstein did in 1915. That’s the true test of whether we have a full AGI system.”

Current models can’t. They remix brilliantly. They don’t generate paradigm-shifting theories from first principles.

Hassabis: “I think we’re still a few years away from that.” A few years. Not decades.

The system that can be Einstein once can be Einstein a thousand times simultaneously across every domain. That’s not AGI anymore. That’s the beginning of something we don’t have words for yet. When that test gets passed, we won’t need a press release to know what happened.

Brendan Farmer@bfarmer·
I'm thinking about software engineering relative to, say, math research. My argument is that the act of writing software is more decomposable into discrete and simple steps. Ex: map from a spec to an interface, match the interface to a known pattern, write code to properly call the interface, debug. Compilers are basically a perfect verifier for RL, there aren't that many patterns in SWE, and there's a ton of training data, so it makes sense that SWE benefits hugely from RL. If you're correct that verbal reasoning from software RL 1) translates ~well to other domains and 2) improves the reasoning abilities of the underlying model, then I would expect to see transfer to domains like math research (but models are kinda bad at this and make reasoning mistakes all the time, despite math-specific RL). Not an expert, but there seems to be some current research suggesting that RLVR doesn't improve pass@k over the base model (at large k), but does improve pass@1. This sorta makes plausible intuitive sense to me: RLVR can't explore beyond the reasoning capabilities of the base model, so if the architecture or training data isn't providing some capability, then RL can't learn it.
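A toy model of that pass@1 vs pass@k intuition (not taken from any particular paper; the numbers are invented):

```python
# Two problems: one whose solution is already in the base model's support (small
# per-sample success rate) and one it can't solve at all. RLVR reweights mass toward
# the known solution but adds nothing new, so pass@1 improves while pass@k at large k
# converges back to what the base model's support allows.
def pass_at_k(p, k):
    return 1 - (1 - p) ** k

base = [0.05, 0.0]   # per-sample success rates for the two problems
rl   = [0.40, 0.0]   # RL sharpens the first, can't touch the second

for k in (1, 16, 256):
    b = sum(pass_at_k(p, k) for p in base) / len(base)
    r = sum(pass_at_k(p, k) for p in rl) / len(rl)
    print(f"k={k:>3}  base pass@k={b:.2f}  RL pass@k={r:.2f}")
```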

Matthew Honnibal@honnibal·
@bfarmer I think you're thinking about tracing through the code, but there's tonnes of verbal reasoning in things like debugging. I have this hypothesis, does this observation contradict it? What can I do to narrow the search space? etc

Matthew Honnibal@honnibal·
I still see a lot of people discussing LLMs as next-token predictors, which is by now quite a misunderstanding. A related opinion is that LLM progress will probably plateau. This post explains why I don't think the "plateau" argument holds up. honnibal.dev/blog/ai-bubble

Brendan Farmer@bfarmer·
Sure, but I don't think it's straightforward to claim that abstract verbal reasoning in coding necessarily transfers to other domains. The model could be learning the reasoning moves that are relevant to coding (and a mapping from those moves to language), but in a way that doesn't generalize. Or maybe it generalizes, but doesn't scale with compute as well as pretraining. To me, it's plausible that coding requires fairly basic reasoning moves and RLVR reweights a model's solution mass to yield improved accuracy, but RL doesn't help the base model learn new reasoning moves. I think it's still an open question whether RL can improve the base model's reasoning ability efficiently enough to extend it to AGI.

Matthew Honnibal@honnibal·
@bfarmer Using the coding models it feels pretty clear to me it learns abstract verbal reasoning pretty well. The thing about transfer learning is, generalisable skills are generally available. You can get them wherever they're convenient

Brendan Farmer@bfarmer·
First Proof discussion: codeberg.org/tgkolda/1stpro… LLMs only solved problems 9 and 10, with the caveat that important ideas for both were previously published. OpenAI claims to have solved more problems using an unreleased model, but they may have violated the benchmark rules by having human experts direct the model to expand upon some proofs and select the best proof from a set. The rules as written are that an AI should not rely on human input for any mathematical idea or content, but it seems possible that a human directing an LLM to expand on a certain section of a proof is implicitly providing mathematical input. It will be interesting to see whether future rounds enforce autonomous solutions and independent verification more strictly.

Brendan Farmer@bfarmer·
Hmm, so an H100 runs at 700W, so 8xH100 = 5.6kW, which is ~2.5 OOMs above 20W for current LLMs. If we assume a few OOMs (3-6?) increase in inference cost to get to an Einstein, that's a significant cost. It also might be the case that a million Einsteins is not very helpful, as the work of RSI is not very parallelizable, and what you'd actually want is an Einstein that operates OOMs faster than humans, pushing cost further. Also, if we're considering RSI, then we have to include training in the loop, which should require OOMs more power/chips to achieve a linear capability increase. To me, this suggests that even after RSI, power + chips will still be constraints. Otherwise, I think you need to assume that there are significant algorithmic improvements that are just out of reach and would be quickly discovered by RSI. Maybe I'm thinking about this incorrectly though, not my field!
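The order-of-magnitude arithmetic, made explicit (the extra OOMs of inference cost for an "Einstein" are the speculative part):

```python
# Gap between an 8xH100 node and a 20W human brain, plus speculative extra OOMs
# of inference cost to reach Einstein-level work.
import math

watts_8xh100 = 8 * 700            # 5.6 kW
brain_watts = 20
gap = math.log10(watts_8xh100 / brain_watts)
print(f"current gap: ~{gap:.1f} OOMs")

for extra_ooms in (3, 6):         # speculative range from the tweet
    print(f"+{extra_ooms} OOMs of inference cost -> ~{gap + extra_ooms:.1f} OOMs above 20W")
```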

ueaj@_ueaj·
@bfarmer @connorshepherd Well my point was that another o1-preview-tier innovation would get us much closer to 20W/Einstein, at least a few OOMs, which means OOMs more Einsteins than your opponent

Brendan Farmer@bfarmer·
Maybe, but my argument is that the first iteration of RSI will probably be >> 20W/Einstein, so the speed at which labs can produce Einsteins is limited by power and chips. I don't think that RSI => unbounded immediate capability growth. Moreover, different RSI implementations might have different trajectories, so being first to RSI doesn't imply being the permanently winning lab.

ueaj@_ueaj·
@bfarmer @connorshepherd I don't think it will. If we actually did make a digital Einstein, down to the 20W/Einstein energy efficiency, then we could just make ~infinity of them