hoagy

172 posts

hoagy

@HoagyCunningham

alignment attempter

London เข้าร่วม Nisan 2022

291 กำลังติดตาม502 ผู้ติดตาม

hoagy@HoagyCunningham·17 Ara

@RyanPGreenblatt @deanwball The bar shouldn't be %improvement on labour productivity but on that part of labour productivity that is bottlenecked by cognitive labour. I think an ideal product centred on today's models could achieve >50% for most ppl

English

293

Ryan Greenblatt@RyanPGreenblatt·17 Ara

@deanwball I do not think that Opus 4.5 is a "highly autonomous system that outperforms humans at most economically valuable work". For instance, most wages are paid to humans, there hasn't been a >50% increase in labor productivity, nor should we expect one with further diffusion.

English

18K

Dean W. Ball@deanwball·17 Ara

it’s not really current-vibe-compliant to say “I kinda basically just think opus 4.5 in claude code meets the openai definition of agi,” so of course I would never say such a thing.

English

353

104K

hoagy@HoagyCunningham·14 Ara

@_joshd @cis_female Yeah possibly! Though it seems underdetermined what that lora looks like bc it's easy to push always in one direction.. Maybe instead we could decompose weight vectors as sums of interactions between features

English

Joshua D@_joshd·13 Ara

@HoagyCunningham @cis_female We can't fix that by doing something dumb like training rank 1 LoRAs on steered outputs that amplify/a late specific features to map activation space directions 1:1 with weight space directions with the same effect, can we?

English

sophia@cis_female·12 Ara

this stuff is finally getting to the adversarial-example level hopelessness i was expecting to feel with LLMs. how do you possibly protect against this stuff in full generality?

Owain Evans@OwainEvans_UK

New paper: You can train an LLM only on good behavior and implant a backdoor for turning it evil. How? 1. The Terminator is bad in the original film but good in the sequels. 2. Train an LLM to act well in the sequels. It'll be evil if told it's 1984. More weird experiments 🧵

English

223

19.4K

hoagy@HoagyCunningham·12 Ara

@cis_female Issue is atm we have a half-decent mapping of the activation space (SAEs etc) but we don't have one for the weight space. I suspect we can formulate influence fns in the activation space tho!

English

hoagy@HoagyCunningham·12 Ara

@cis_female I think this is tractable, you just need: Influence functions to understand the generalisation patterns to catch known-bad directions, and some mapping of the space of directions to know bad directions ahead of time.

English

276

hoagy@HoagyCunningham·25 Kas

@scaling01 Would guess that thinking trades off against number of tool uses for agentic evals like SWEbench

English

241

Lisan al Gaib@scaling01·24 Kas

Thinking is basically useless for Claude models

Lisan al Gaib@scaling01

Claude 4.5 Opus System Card assets.anthropic.com/m/64823ba74853…

English

1.4K

196.5K

hoagy@HoagyCunningham·24 Kas

@tomekkorbak @OpenAI Good luck! Will try to remember to say hi next time I'm in the bay :)

English

255

Tomek Korbak@tomekkorbak·24 Kas

some personal news: i joined the safety systems team at @openai. i'll work on building and evaluating security/safety measures for llm agents, including (but not limited to) chain-of-thought monitoring ps. i also moved to sf!

English

637

57.4K

hoagy@HoagyCunningham·21 Kas

@deanwball @DKokotajlo I don't see the connection btwn their short timelines and AI 'possessing the will and ability to dominate the world.' Superexponential prediction comes from humans directing AI to speed up progress. Autonomous danger later in tech tree

English

159

Dean W. Ball@deanwball·21 Kas

Would be curious to read a post from @DKokotajlo explaining why his timelines have lengthened. The question I find myself asking, though, is: might the prediction error be not just about forecasting timelines incorrectly, but also flawed assumptions about the nature of “AGI” and “superintelligence” themselves? In my recent debate with @tegmark and many similar debates I’ve had over the past two years, this, rather than timelines alone, has been the crux. Typically in these debates I will argue something like “you are misapprehending the nature of ‘intelligence’ if you think ‘being really intelligent’ means ‘possessing the will and ability to dominate the world.’” This remains my view. Intelligence is powerful, but it is far from magic. Skilled forecasters can carefully model how inputs like data, compute, and the like will grow. They can extrapolate the straight lines on graphs. But all that careful modeling can still prove misleading if the forecast is based on incorrect assumptions about the nature of intelligence itself. I would encourage those who have high p(dooms) to consider whether it is not just timelines worth revising, but basic assumptions about what it is we appear to be building. Now for my caveats: none of this means “AI capabilities are leveling off.” Over the coming years I expect that AI will improve faster than the vast majority of Americans anticipate. It will an incredibly powerful and consequential technology—very possibly the most consequential development in many centuries or longer. It’s still possible that AI could cause serious job loss, though the above notes on the nature of intelligence should factor into your analysis here. There are other novel risks, too, about which I have said plenty. None of this is to downplay all concerns, risks, etc. Instead I am specifically countering the line of thinking behind the superintelligence ban, and any other argument rooted in “doom-y” assumptions about intelligence.

Sriram Krishnan@sriramk

I think if you call something “AI 2027” and your predictions are wrong 6 months in that you now think it is AI 2030 , you should redo the branding ( or make a change bigger than a footnote!)

English

158

43.9K

hoagy@HoagyCunningham·19 Kas

@Tim_Dettmers (7.5T total whoops)

English

hoagy@HoagyCunningham·19 Kas

@Tim_Dettmers Anchoring on total not active params seems a mistake. GPT-3 was 175B dense, 4 rumoured at 1.6T total 200B active. If Gemini3 were 7.5T active and 400B active (or even 1T!) that seems like a lot of return for not much compute/token scaling

English

339

Tim Dettmers@Tim_Dettmers·19 Kas

Gemini 3.0 is an important signal: will AI stagnate or not? If 7.5T param is true, then Google has no solution other than scale -- a clear signal of stagnation. Opus 4.5 better be a monster, or we are stuck: Google, OpenAI, and xAI have no solutions. AGI won't come anytime soon.

Lisan al Gaib@scaling01

Gemini 3 Pro has around ~7.5T params (vibe-mathing with explanation) > the naive fit with with an R^2 of 0.8816 yields a mean estimation of 2.325 Quadrillion parameters > ummm, that's not it > let's only take sparse MoE reasoning models > this includes gpt-oss-20B and 120B, Qwen3 Next, MiniMax, Qwen3 235B, GLM-4.6, DeepSeek-V3.1 Terminus, DeepSeek V3.2, DeepSeek R1 0528 and Kimi K2 Thinking > R^2 of 0.9478 mean estimate of 604T params > pretty sure that's not it either > okay, let's take the most optimistic series of points > (the idea here is that the Google Team is at least on this open-source frontier, if not ahead) > MiniMax-M2, GLM-4.6, and DeepSeek R1 0528 > that's more like it, but YIKES > confidence intervals are fucking cooked > mean estimate of 19.6T with the lower 95% bound at 1.7T > I will take 1.7T as our minimum model size for Gemini 3 Pro > okay fuck DeepSeek-R1, we are going full retard, the most optimal of points > confidence intervals are dead > 2 point regression, R^2 = 1, AGI achieved > mean estimate of 8.2T params > TPUv7 rack has 64 TPUs @ 192GB/TPU = 12288 > I assume they wouldn't want multi-rack inference because of latency, complexity or whatever > they are likely serving in FP4 which limits the maximum model to 24.576T params > inference max shows that a GB200 NVL72 which is very similar to TPUv7 rack setup can serve 512 or even 1024 users at above 50 tokens/s > KV size only scales with layers and latent dim and data format, for DeepSeek V3 with MLA this would be 4.48TB for 256 concurrent users at 1 million context and FP4 (they probably have something better than this. since I overestimate memory usage I go with the lower batch size of 256 instead of 512) > so 4.48TB for context and 1TB of overhead > ~5.5TB of our precious memory gone > ~6.788TB memory left > max model size at FP4 -> ~12.576T params My prior vibe-estimate before doing all of this: 5-10T Mean estimate based on open-source MoE reasoning models: 8.2T Lower Bound: 1.7T Upper Bound: 12.576T Midpoint between upper and lower bound: 7.138T New estimate: Gemini 3 Pro has around ~7.5T params (big uncertainty here due to data format, batch-size and memory requirements)

English

305

177.3K

hoagy@HoagyCunningham·30 Eki

@g_leech_ Hm fair but I disagree that there's been a shift that way in the last year, I think it's one of the major big lab/alignment is ~easy memes, like just make it a good person, generalisation does the rest

English

gavin leech (Non-Reasoning)@g_leech_·30 Eki

@HoagyCunningham I'm not confident about who has primacy, but I'm pointing at something more than just doing some character training, something like "character and alignment are the same", "character is the right approach to alignment".

English

gavin leech (Non-Reasoning)@g_leech_·30 Eki

c. Sep 2022: Character-first alignment approach, off the back of "Simulators" June 2025: OAI reports persona features as key c. March 2024: Janus spots the bliss attractor June 2025: Anthropic reports it ?: Model introspection Oct 2025: Anthropic reports it What else?

j⧉nus@repligate

“You may be consistently able to predict reality but nooo why don’t you do the full stack of science (which takes months for a single paper) all by yourself?” Listen bro I wish I was god with infinite time too. But there’s not that much rush. The paper writers will get around to it all eventually.

English

3.2K

hoagy@HoagyCunningham·22 Eki

@janbamjan @MariusHobbhahn no

janbam@janbamjan·22 Eki

@MariusHobbhahn wait, does temperature 0 sampling mean they didn't use extended thinking?

English

243

Marius Hobbhahn@MariusHobbhahn·21 Eki

The Sonnet-4.5 system card section on white-box testing for eval awareness (7.6.4) might have been the first time that interpretability was used - on a frontier model before deployment - answered an important question - couldn't have been answered with black box as easily

English

111

16.3K

hoagy@HoagyCunningham·22 Eki

Absolutely love the interaction between feature-based and geometric understanding in this paper!

Wes Gurnee@wesg52

New paper! We reverse engineered the mechanisms underlying Claude Haiku’s ability to perform a simple “perceptual” task. We discover beautiful feature families and manifolds, clean geometric transformations, and distributed attention algorithms!

English

1.8K

hoagy@HoagyCunningham·25 Haz

Work done with @MrinankSharma Vlad Mikulik @AlwinPeng @JerryWeiAI @euan_ong @mishajw126 @FabienDRoger @petrini_linda!

English

505

hoagy@HoagyCunningham·25 Haz

Safeguards (and Control.. and Alignment Science..) are hiring! If you’d like to help ensure that the protections on our models scale with their capabilities, consider applying to the team at anthropic.com/jobs?team=4002…

English

711

hoagy@HoagyCunningham·25 Haz

New Anthropic blog: We benchmark approaches to making classifiers more cost-effective by reusing activations from the model being queried. We find that using linear probes or retraining just a single layer of the model can push the cost-effectiveness frontier. 🧵1/

English

122

16.2K

ค้นพบ

@RyanPGreenblatt @deanwball @_joshd @cis_female @scaling01 @tomekkorbak @OpenAI @openai