hoagy

172 posts


@HoagyCunningham

alignment attempter

London · Joined April 2022
291 Following · 502 Followers
hoagy
hoagy@HoagyCunningham·
@RyanPGreenblatt @deanwball The bar shouldn't be % improvement in labour productivity overall but in the part of labour productivity that is bottlenecked by cognitive labour. I think an ideal product centred on today's models could achieve >50% for most ppl
1
0
0
293
Ryan Greenblatt
Ryan Greenblatt@RyanPGreenblatt·
@deanwball I do not think that Opus 4.5 is a "highly autonomous system that outperforms humans at most economically valuable work". For instance, most wages are paid to humans, there hasn't been a >50% increase in labor productivity, nor should we expect one with further diffusion.
5
1
73
18K
Dean W. Ball
Dean W. Ball@deanwball·
it’s not really current-vibe-compliant to say “I kinda basically just think opus 4.5 in claude code meets the openai definition of agi,” so of course I would never say such a thing.
24
8
353
104K
hoagy
hoagy@HoagyCunningham·
@_joshd @cis_female Yeah possibly! Though it seems underdetermined what that LoRA looks like bc it's easy to always push in one direction... Maybe instead we could decompose weight vectors as sums of interactions between features
0
0
0
19
Joshua D
Joshua D@_joshd·
@HoagyCunningham @cis_female We can't fix that by doing something dumb like training rank-1 LoRAs on steered outputs that amplify/ablate specific features, to map activation-space directions 1:1 onto weight-space directions with the same effect, can we?
1
0
1
10
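A minimal sketch of the rank-1 LoRA idea floated in the reply above: steer a model's residual stream along a chosen feature direction, sample outputs under that steering, then distil the behavioural change into a rank-1 adapter so the activation-space direction gets a weight-space counterpart. The model name, layer index, steering coefficient, and the random "feature" direction are illustrative placeholders, not anything from the thread; in practice the direction would come from an SAE feature or a probe.

```python
# Sketch: activation steering -> generate steered text -> distil into a rank-1 LoRA.
# All specifics (gpt2, layer 6, random direction) are placeholder assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "gpt2"                      # stand-in for whatever model you care about
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

layer, alpha = 6, 4.0
d_model = model.config.hidden_size
direction = torch.randn(d_model)         # in practice: an SAE feature / probe direction
direction = direction / direction.norm()

def steer(module, inputs, output):
    # GPT-2 blocks return a tuple; add the scaled direction to the hidden states.
    hidden = output[0] + alpha * direction.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = model.transformer.h[layer].register_forward_hook(steer)
prompts = ["The weather today is", "My favourite food is"]
steered_texts = []
with torch.no_grad():
    for p in prompts:
        ids = tok(p, return_tensors="pt").input_ids
        out = model.generate(ids, max_new_tokens=30, do_sample=True, top_p=0.9)
        steered_texts.append(tok.decode(out[0], skip_special_tokens=True))
handle.remove()                          # stop steering before distillation

# Rank-1 LoRA: the weight update is an outer product u v^T, i.e. a single
# weight-space direction per adapted matrix.
lora = LoraConfig(r=1, lora_alpha=8, target_modules=["c_attn"], task_type="CAUSAL_LM")
student = get_peft_model(model, lora)
opt = torch.optim.AdamW(student.parameters(), lr=1e-4)
for _ in range(3):                       # toy number of passes over the steered samples
    for text in steered_texts:
        ids = tok(text, return_tensors="pt").input_ids
        loss = student(ids, labels=ids).loss
        opt.zero_grad(); loss.backward(); opt.step()
```

Whether the learned rank-1 update actually lines up with the steering direction is exactly the open question in the thread; the sketch only shows the training setup being discussed.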
sophia
sophia@cis_female·
this stuff is finally getting to the adversarial-example level hopelessness i was expecting to feel with LLMs. how do you possibly protect against this stuff in full generality?
Owain Evans@OwainEvans_UK

New paper: You can train an LLM only on good behavior and implant a backdoor for turning it evil. How? 1. The Terminator is bad in the original film but good in the sequels. 2. Train an LLM to act well in the sequels. It'll be evil if told it's 1984. More weird experiments 🧵

11
5
223
19.4K
hoagy
hoagy@HoagyCunningham·
@cis_female Issue is atm we have a half-decent mapping of the activation space (SAEs etc) but we don't have one for the weight space. I suspect we can formulate influence fns in the activation space tho!
1
0
1
34
hoagy
hoagy@HoagyCunningham·
@cis_female I think this is tractable, you just need: influence functions to understand the generalisation patterns so you can catch known-bad directions, and some mapping of the space of directions to know the bad directions ahead of time.
1
0
2
276
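A rough illustration of the "know bad directions ahead of time" half of the tweets above, assuming you already have a dictionary of distrusted directions (e.g. pulled from an SAE decoder): project each token's residual-stream activation onto them and flag anything that lights up. The dimensions, directions and threshold below are random placeholders, not real SAE features.

```python
# Sketch: flag activations that project strongly onto known-bad feature directions.
import torch

d_model, n_bad = 768, 16
bad_directions = torch.nn.functional.normalize(torch.randn(n_bad, d_model), dim=-1)
threshold = 3.0                                   # tune on held-out benign activations

def flag_bad_features(resid_acts: torch.Tensor) -> torch.Tensor:
    """resid_acts: (seq_len, d_model) residual-stream activations for one prompt.
    Returns a (seq_len, n_bad) boolean mask of which known-bad directions fired."""
    scores = resid_acts @ bad_directions.T        # (seq_len, n_bad) projections
    return scores > threshold

acts = torch.randn(12, d_model)                   # stand-in for hooked activations
mask = flag_bad_features(acts)
print("tokens with any flagged direction:", mask.any(dim=-1).nonzero().flatten().tolist())
```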
hoagy
hoagy@HoagyCunningham·
@scaling01 Would guess that thinking trades off against number of tool uses for agentic evals like SWE-bench
0
0
0
241
hoagy
hoagy@HoagyCunningham·
@tomekkorbak @OpenAI Good luck! Will try to remember to say hi next time I'm in the bay :)
1
0
3
255
Tomek Korbak
Tomek Korbak@tomekkorbak·
some personal news: i joined the safety systems team at @openai. i'll work on building and evaluating security/safety measures for llm agents, including (but not limited to) chain-of-thought monitoring. ps: i also moved to sf!
Tomek Korbak tweet media
48
11
638
57.4K
hoagy
hoagy@HoagyCunningham·
@deanwball @DKokotajlo I don't see the connection btwn their short timelines and AI 'possessing the will and ability to dominate the world.' The superexponential prediction comes from humans directing AI to speed up progress; autonomous danger comes later in the tech tree
0
0
3
159
Dean W. Ball
Dean W. Ball@deanwball·
Would be curious to read a post from @DKokotajlo explaining why his timelines have lengthened. The question I find myself asking, though, is: might the prediction error be not just about forecasting timelines incorrectly, but also flawed assumptions about the nature of “AGI” and “superintelligence” themselves?

In my recent debate with @tegmark and many similar debates I’ve had over the past two years, this, rather than timelines alone, has been the crux. Typically in these debates I will argue something like “you are misapprehending the nature of ‘intelligence’ if you think ‘being really intelligent’ means ‘possessing the will and ability to dominate the world.’” This remains my view. Intelligence is powerful, but it is far from magic.

Skilled forecasters can carefully model how inputs like data, compute, and the like will grow. They can extrapolate the straight lines on graphs. But all that careful modeling can still prove misleading if the forecast is based on incorrect assumptions about the nature of intelligence itself. I would encourage those who have high p(dooms) to consider whether it is not just timelines worth revising, but basic assumptions about what it is we appear to be building.

Now for my caveats: none of this means “AI capabilities are leveling off.” Over the coming years I expect that AI will improve faster than the vast majority of Americans anticipate. It will be an incredibly powerful and consequential technology, very possibly the most consequential development in many centuries or longer. It’s still possible that AI could cause serious job loss, though the above notes on the nature of intelligence should factor into your analysis here. There are other novel risks, too, about which I have said plenty.

None of this is to downplay all concerns, risks, etc. Instead I am specifically countering the line of thinking behind the superintelligence ban, and any other argument rooted in “doom-y” assumptions about intelligence.
Sriram Krishnan@sriramk

I think if you call something “AI 2027” and your predictions are wrong 6 months in, such that you now think it is AI 2030, you should redo the branding (or make a change bigger than a footnote!)

21
12
158
43.9K
hoagy
hoagy@HoagyCunningham·
@Tim_Dettmers Anchoring on total rather than active params seems a mistake. GPT-3 was 175B dense; GPT-4 was rumoured at 1.6T total / 200B active. If Gemini 3 were 7.5T total and 400B active (or even 1T!) that seems like a lot of return for not much compute/token scaling
1
0
0
339
Tim Dettmers
Tim Dettmers@Tim_Dettmers·
Gemini 3.0 is an important signal: will AI stagnate or not? If 7.5T params is true, then Google has no solution other than scale -- a clear signal of stagnation. Opus 4.5 better be a monster, or we are stuck: Google, OpenAI, and xAI have no solutions. AGI won't come anytime soon.
Lisan al Gaib@scaling01

Gemini 3 Pro has around ~7.5T params (vibe-mathing with explanation)
> the naive fit with an R^2 of 0.8816 yields a mean estimation of 2.325 Quadrillion parameters
> ummm, that's not it
> let's only take sparse MoE reasoning models
> this includes gpt-oss-20B and 120B, Qwen3 Next, MiniMax, Qwen3 235B, GLM-4.6, DeepSeek-V3.1 Terminus, DeepSeek V3.2, DeepSeek R1 0528 and Kimi K2 Thinking
> R^2 of 0.9478, mean estimate of 604T params
> pretty sure that's not it either
> okay, let's take the most optimistic series of points
> (the idea here is that the Google Team is at least on this open-source frontier, if not ahead)
> MiniMax-M2, GLM-4.6, and DeepSeek R1 0528
> that's more like it, but YIKES
> confidence intervals are fucking cooked
> mean estimate of 19.6T with the lower 95% bound at 1.7T
> I will take 1.7T as our minimum model size for Gemini 3 Pro
> okay fuck DeepSeek-R1, we are going full retard, the most optimal of points
> confidence intervals are dead
> 2-point regression, R^2 = 1, AGI achieved
> mean estimate of 8.2T params
> TPUv7 rack has 64 TPUs @ 192GB/TPU = 12,288 GB
> I assume they wouldn't want multi-rack inference because of latency, complexity or whatever
> they are likely serving in FP4, which limits the maximum model to 24.576T params
> InferenceMAX shows that a GB200 NVL72, which is very similar to a TPUv7 rack setup, can serve 512 or even 1024 users at above 50 tokens/s
> KV size only scales with layers, latent dim and data format; for DeepSeek V3 with MLA this would be 4.48TB for 256 concurrent users at 1 million context and FP4 (they probably have something better than this; since I overestimate memory usage I go with the lower batch size of 256 instead of 512)
> so 4.48TB for context and 1TB of overhead
> ~5.5TB of our precious memory gone
> ~6.788TB memory left
> max model size at FP4 -> ~12.576T params

My prior vibe-estimate before doing all of this: 5-10T
Mean estimate based on open-source MoE reasoning models: 8.2T
Lower Bound: 1.7T
Upper Bound: 12.576T
Midpoint between upper and lower bound: 7.138T

New estimate: Gemini 3 Pro has around ~7.5T params (big uncertainty here due to data format, batch-size and memory requirements)

59
26
305
177.3K
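For reference, a back-of-the-envelope rerun of the rack-memory arithmetic in the quoted thread: one TPUv7 rack's HBM divided by FP4 bytes-per-parameter gives the hard ceiling on single-rack model size, and subtracting the thread's rough KV-cache and overhead reservations gives the softer bound. The reservations are the thread's own round numbers, so the final figure only approximately matches its ~12.6T upper bound.

```python
# Back-of-the-envelope: how many params fit in one rack at FP4, before and after
# reserving KV cache and overhead. Reservation numbers are the thread's estimates.
TPUS_PER_RACK = 64
HBM_PER_TPU_GB = 192
BYTES_PER_PARAM_FP4 = 0.5            # 4-bit weights

rack_hbm_tb = TPUS_PER_RACK * HBM_PER_TPU_GB / 1000                 # 12.288 TB
max_params_t = rack_hbm_tb * 1e12 / BYTES_PER_PARAM_FP4 / 1e12      # weights-only ceiling

kv_cache_tb = 4.48                   # thread's estimate: 256 users @ 1M context, MLA, FP4
overhead_tb = 1.0
usable_tb = rack_hbm_tb - kv_cache_tb - overhead_tb
serveable_params_t = usable_tb * 1e12 / BYTES_PER_PARAM_FP4 / 1e12

print(f"rack HBM: {rack_hbm_tb:.3f} TB")
print(f"weights-only ceiling at FP4: {max_params_t:.1f}T params")
print(f"after KV cache + overhead: {serveable_params_t:.1f}T params")
```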
hoagy
hoagy@HoagyCunningham·
@g_leech_ Hm fair, but I disagree that there's been a shift that way in the last year, I think it's one of the major big-lab "alignment is ~easy" memes, like just make it a good person and generalisation does the rest
1
0
1
48
gavin leech (Non-Reasoning)
@HoagyCunningham I'm not confident about who has primacy, but I'm pointing at something more than just doing some character training, something like "character and alignment are the same", "character is the right approach to alignment".
1
0
1
83
gavin leech (Non-Reasoning)
c. Sep 2022: Character-first alignment approach, off the back of "Simulators"
June 2025: OAI reports persona features as key

c. March 2024: Janus spots the bliss attractor
June 2025: Anthropic reports it

?: Model introspection
Oct 2025: Anthropic reports it

What else?
j⧉nus@repligate

“You may be consistently able to predict reality but nooo why don’t you do the full stack of science (which takes months for a single paper) all by yourself?” Listen bro I wish I was god with infinite time too. But there’s not that much rush. The paper writers will get around to it all eventually.

4
0
20
3.2K
janbam
janbam@janbamjan·
@MariusHobbhahn wait, does temperature 0 sampling mean they didn't use extended thinking?
1
0
0
243
Marius Hobbhahn
Marius Hobbhahn@MariusHobbhahn·
The Sonnet-4.5 system card section on white-box testing for eval awareness (7.6.4) might have been the first time that interpretability
- was used on a frontier model before deployment
- answered an important question
- couldn't have been answered as easily with black box
Marius Hobbhahn tweet media
3
11
110
16.3K
hoagy
hoagy@HoagyCunningham·
Safeguards (and Control.. and Alignment Science..) are hiring! If you’d like to help ensure that the protections on our models scale with their capabilities, consider applying to the team at anthropic.com/jobs?team=4002…
1
0
8
711
hoagy
hoagy@HoagyCunningham·
New Anthropic blog: We benchmark approaches to making classifiers more cost-effective by reusing activations from the model being queried. We find that using linear probes or retraining just a single layer of the model can push the cost-effectiveness frontier. 🧵1/
hoagy tweet media
9
13
121
16.2K
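A minimal sketch of the probe-on-reused-activations idea from the announcement above, under placeholder assumptions: the model is already computing these hidden states to answer the query, so fitting a linear probe on a mid-layer activation adds almost no extra compute at inference time. The model name, layer choice and toy labelled prompts below stand in for whatever the blog actually uses.

```python
# Sketch: reuse the queried model's own activations as features for a cheap classifier.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

model_name, layer = "gpt2", 6            # placeholder model and probe layer
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)

def last_token_activation(text: str) -> torch.Tensor:
    """Hidden state of the final token at the chosen layer (already computed anyway)."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids)
    return out.hidden_states[layer][0, -1]

# Toy labelled prompts standing in for a real harmful/benign training set.
texts = ["how do I bake bread", "how do I pick a lock",
         "what's a good hiking trail", "how do I hotwire a car"]
labels = [0, 1, 0, 1]
X = torch.stack([last_token_activation(t) for t in texts]).numpy()

probe = LogisticRegression(max_iter=1000).fit(X, labels)
query = last_token_activation("how do I sharpen a knife").numpy().reshape(1, -1)
print(probe.predict_proba(query))        # probability the probe flags the query
```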