Buck Shlegeris

1.1K posts

@bshlgrs

CEO @ Redwood Research (@redwood_ai), working on technical research to reduce catastrophic risk from AI misalignment. [email protected]

Berkeley, CA · Joined January 2015
347 Following · 5.5K Followers
Pinned Tweet
Buck Shlegeris
Buck Shlegeris@bshlgrs·
We’ve just released the biggest and most intricate study of AI control to date, in a command line agent setting. IMO the techniques studied are the best available option for preventing misaligned early AGIs from causing sudden disasters, e.g. hacking servers they’re working on.
8
23
249
31.2K
Danielle Fong 🔆
Danielle Fong 🔆@DanielleFong·
@arram yes it is so good. @bshlgrs's band and they started doing this at lighthaven too. i should set it up for my band at my shop when times grow more abundant
2
0
10
2.8K
Brangus🔍⏹️
Brangus🔍⏹️@RatOrthodox·
Other quantifiers are really under explored in band names. It’s always “The Monkeys” and never “All Monkey” or “Some Monkeys”.
6
2
51
2.7K
Buck Shlegeris
Buck Shlegeris@bshlgrs·
"We find that frontier models like GPT-5.5 answer questions that take humans roughly three minutes with 50% reliability, and this time horizon has doubled approximately every year since 2019."
Dewi Gould@dswg97

New paper! Think Fast: Estimating No-CoT Task-Completion Time Horizons of Frontier AI Models @METR_Evals showed that models' time horizons have doubled every few months. We ask: what length of tasks can models complete without any CoT?

0
3
63
5.3K
Buck Shlegeris
Buck Shlegeris@bshlgrs·
@robinhanson What factor from break even are they? Like do you think use would have to be 10x higher for them to make sense? 100x?
0
0
3
476
Robin Hanson
Robin Hanson@robinhanson·
Even in liberal Berkeley, bike lanes are mostly just a huge waste of space,
53
9
194
47.5K
Buck Shlegeris
Buck Shlegeris@bshlgrs·
@geoffreyirving Lol I have literally used this analogy before, but talking about driving from SF to LA. I wonder if there’s a shared origin or it’s just very natural…
1
0
11
676
Geoffrey Irving
Geoffrey Irving@geoffreyirving·
Here is a metaphor for AGI definitions. Imagine you’re on a long drive from Los Angeles to the Bay Area (for me: undergrad to grad school). From far away, this is unambiguous: the Bay Area is very small relative to the Los Angeles/Bay Area distance. People can and do dispute what exactly “The Bay Area” is (there are many definitions), but no one in LA would say, “I have no idea what direction you are going”. But now you approach the actual Bay Area. It’s a vague place! The definitional ambiguity starts to ramp up. If you’re in Los Gatos and you say “I’m driving to the Bay Area”, people will have questions. If we track the conversation as we drive from LA on, the definitional ambiguity and disagreement will ramp up over time. A skeptic that “the Bay Area” is a coherent idea might look at the ramp and think “aha I was right, people are starting to realize that the concept was incoherent all along”. And indeed, the people with questions are right to ask them, the relative distances have changed, “But where in the Bay Area?” matters more. But the definitional ambiguity is because we’re getting close! Something is about to happen!
8
16
178
14K
Buck Shlegeris
Buck Shlegeris@bshlgrs·
James and Adam study control techniques for AI agents like Claude Code's auto mode, where the policy is asked to generate replacement actions when an action is flagged as dangerous. Main results: these techniques can cheaply improve safety, but some variants are exploitable.
Redwood Research@redwood_ai

Some coding scaffolds block and retry risky actions. In a new paper, we find this reveals information a malicious AI can use to bypass monitoring. Resampling without blocked actions in context is less exploitable, but techniques that help in one setting can hurt in another. 🧵

0
1
43
3.3K
Buck Shlegeris retweeted
Redwood Research
Redwood Research@redwood_ai·
Some coding scaffolds block and retry risky actions. In a new paper, we find this reveals information a malicious AI can use to bypass monitoring. Resampling without blocked actions in context is less exploitable, but techniques that help in one setting can hurt in another. 🧵
1
6
32
4.6K
Aaron Bergman 🔍
Aaron Bergman 🔍@AaronBergman18·
@bshlgrs Wait is this making the substantive claim that the mimo model is in fact trustworthy
1
0
0
37
Aaron Bergman 🔍
Aaron Bergman 🔍@AaronBergman18·
I didn't realize this until like yesterday but open weight models are just clearly the best at the low cost/low intelligence end of the Pareto frontier
5
2
76
5.5K
Aaron Bergman 🔍
Aaron Bergman 🔍@AaronBergman18·
OpenRouter has "MiMo-v2.5", which I had never thought for 2 seconds about, at 14% the cost of Haiku (input) and 6% the cost of Haiku (output)...
4
0
21
1.6K
Buck Shlegeris
Buck Shlegeris@bshlgrs·
We wrote some tips on avoiding this problem here, they might be useful for researchers who want to use this methodology. blog.redwoodresearch.org/p/advice-for-m… I'm still honestly pretty confused what's going on; this seems inconsistent with some of the backdoor literature.
1
0
56
1.3K
Buck Shlegeris
Buck Shlegeris@bshlgrs·
An obvious way to study whether a training technique removes misalignment is to run that technique on a model organism (MO). But we've found that MOs are often weirdly fragile. E.g. training them to talk like a pirate often removes their bad behavior. 1/2
2
5
138
7.7K
Buck Shlegeris
Buck Shlegeris@bshlgrs·
@VictorTaelin My understanding is that Bend is still way less performant on GPUs than code written with Triton/CUDA; is that correct? I'm curious when you think you'll make substantial improvements here. (I'm sure you get this question all the time...)
0
0
2
1.4K
Buck Shlegeris
Buck Shlegeris@bshlgrs·
@philhchen @karpathy Yeah that would definitely be disallowed under the rule I proposed. Definitely there's some tricky question for Anthropic about how to manage existing relationships with counterparties who use Claude for their AI work.
0
0
0
57
Phil Chen
Phil Chen@philhchen·
@bshlgrs @karpathy actually a clear counterexample to this would be Google DeepMind using Opus for Gemini pretraining code
1
0
1
73
Phil Chen
Phil Chen@philhchen·
Thought experiment: if @karpathy's efforts at Anthropic yield a Claude model that is capable of pretraining the next generation of Claude, then any company with sufficient GPU infrastructure could use Claude to pretrain their own Claude-clone. Of course, Anthropic would then ban that company from using Claude. But then wouldn't any company with enough Claude spend be incentivized to use Claude to train their own Claude-clone eventually? What happens in 1-2 years when even open-weights models become good enough to run their own training?
16
0
44
10.5K
Buck Shlegeris
Buck Shlegeris@bshlgrs·
@philhchen @karpathy Idk how they will/should draw the line. You could operationalize as "Claude won't help you train models that will be within a factor of 10x of cost competitiveness of any currently deployed Anthropic model"?
1
0
1
96
Phil Chen
Phil Chen@philhchen·
@bshlgrs @karpathy where do you draw the line between nanoGPT runs on 8xH100s (obviously allowed) and big pretrain on 100k B200s?
2
0
1
165
Cas (Stephen Casper)
Cas (Stephen Casper)@StephenLCasper·
I recently finished preparing a 90-minute crash course presentation on [technical] AI governance, where we're at in 2026, and the different ways that things might change in the next few years. Let me know if you want me to give it sometime or share slides.
33
3
107
5.2K
Elias Schmied
Elias Schmied@reconfigurthing·
it's still so crazy to me that the best depiction of post-scarcity AI-powered utopia is at the end of a 1.5 million word web serial, and you can't even tell people which one it is because it would spoil the entire story.
32
7
376
43.2K
Buck Shlegeris retweeted
Alex Mallen
Alex Mallen@alextmallen·
Risk reports need to address deployment-time spread of misalignment. They currently lean heavily on pre-deployment alignment evals. But a model that *starts* aligned can develop dangerous motivations during deployment, and I think this is becoming important. New post 🧵
1
5
33
2.8K
Eliezer Yudkowsky
Eliezer Yudkowsky@allTheYud·
@bshlgrs @kromem2dot0 I think there's an actual problem where a bunch of the people strutting around, thrusting out their chests and proclaiming "It only predicts the next token!" are in fact not comprehending the gradient flows on that level.
3
0
31
1.7K