Buck Shlegeris

1.1K posts

@bshlgrs

CEO@Redwood Research (@redwood_ai), working on technical research to reduce catastrophic risk from AI misalignment. [email protected]

Berkeley, CA · Joined January 2015

347 Following · 5.5K Followers

Pinned Tweet

Buck Shlegeris@bshlgrs·16 Apr

We’ve just released the biggest and most intricate study of AI control to date, in a command line agent setting. IMO the techniques studied are the best available option for preventing misaligned early AGIs from causing sudden disasters, e.g. hacking servers they’re working on.

249

31.2K

Buck Shlegeris@bshlgrs·2d

@arram @DanielleFong If you have a venue, I can probably provide a band!

2.7K

Arram@arram·2d

@DanielleFong @bshlgrs I'm going to do this

2.8K

Danielle Fong 🔆@DanielleFong·2d

@arram yes it is so good. @bshlgrs's band and they started doing this at lighthaven too. i should set it up for my band at my shop when times grow more abundant

2.8K

Buck Shlegeris@bshlgrs·2d

@RatOrthodox All Dogs?

204

Brangus🔍⏹️@RatOrthodox·2d

Other quantifiers are really under explored in band names. It’s always “The Monkeys” and never “All Monkey” or “Some Monkeys”.

2.7K

Buck Shlegeris@bshlgrs·10 Jun

"We find that frontier models like GPT-5.5 answer questions that take humans roughly three minutes with 50% reliability, and this time horizon has doubled approximately every year since 2019."

Dewi Gould@dswg97

New paper! Think Fast: Estimating No-CoT Task-Completion Time Horizons of Frontier AI Models @METR_Evals showed that models' time horizons have doubled every few months. We ask: what length of tasks can models complete without any CoT?

5.3K

Buck Shlegeris@bshlgrs·10 Jun

@robinhanson What factor from break even are they? Like do you think use would have to be 10x higher for them to make sense? 100x?

476

Robin Hanson@robinhanson·10 Jun

Even in liberal Berkeley, bike lanes are mostly just a huge waste of space,

194

47.5K

Buck Shlegeris@bshlgrs·7 Jun

@geoffreyirving Lol I have literally used this analogy before, but talking about driving from SF to LA. I wonder if there’s a shared origin or it’s just very natural…

676

Geoffrey Irving@geoffreyirving·6 Jun

Here is a metaphor for AGI definitions.

Imagine you’re on a long drive from Los Angeles to the Bay Area (for me: undergrad to grad school). From far away, this is unambiguous: the Bay Area is very small relative to the Los Angeles/Bay Area distance. People can and do dispute what exactly “The Bay Area” is (there are many definitions), but no one in LA would say, “I have no idea what direction you are going”.

But now you approach the actual Bay Area. It’s a vague place! The definitional ambiguity starts to ramp up. If you’re in Los Gatos and you say “I’m driving to the Bay Area”, people will have questions.

If we track the conversation as we drive from LA on, the definitional ambiguity and disagreement will ramp up over time. A skeptic that “the Bay Area” is a coherent idea might look at the ramp and think “aha I was right, people are starting to realize that the concept was incoherent all along”. And indeed, the people with questions are right to ask them, the relative distances have changed, “But where in the Bay Area?” matters more.

But the definitional ambiguity is because we’re getting close! Something is about to happen!

178

14K

Buck Shlegeris@bshlgrs·29 May

James and Adam study control techniques for AI agents like Claude Code's auto mode, where the policy is asked to generate replacement actions when an action is flagged as dangerous. Main results: these techniques can cheaply improve safety, but some variants are exploitable.

Redwood Research@redwood_ai

Some coding scaffolds block and retry risky actions. In a new paper, we find this reveals information a malicious AI can use to bypass monitoring. Resampling without blocked actions in context is less exploitable, but techniques that help in one setting can hurt in another. 🧵

3.3K

Buck Shlegeris retweeted

Redwood Research@redwood_ai·29 May

4.6K

Buck Shlegeris@bshlgrs·29 May

@AaronBergman18 no, not at all, it just happened to be a convenient cheap model

Aaron Bergman 🔍@AaronBergman18·29 May

@bshlgrs Wait is this making the substantive claim that the mimo model is in fact trustworthy

Aaron Bergman 🔍@AaronBergman18·29 May

I didn't realize this until like yesterday but open weight models are just clearly the best at the low cost/low intelligence end of the Pareto frontier

5.5K

Buck Shlegeris@bshlgrs·29 May

@AaronBergman18 I also had never heard of it til my coworkers used it as the main trusted model in a control paper arxiv.org/abs/2605.26047

155

Aaron Bergman 🔍@AaronBergman18·29 May

OpenRouter has "MiMo-v2.5", which I had never thought for 2 seconds about, at 14% the cost of Haiku (input) and 6% the cost of Haiku (output)...

1.6K

Buck Shlegeris@bshlgrs·28 May

We wrote some tips on avoiding this problem here, they might be useful for researchers who want to use this methodology. blog.redwoodresearch.org/p/advice-for-m… I'm still honestly pretty confused what's going on; this seems inconsistent with some of the backdoor literature.

1.3K

Buck Shlegeris@bshlgrs·28 May

An obvious way to study whether a training technique removes misalignment is to run that technique on a model organism (MO). But we've found that MOs are often weirdly fragile. E.g. training them to talk like a pirate often removes their bad behavior. 1/2

138

7.7K

Buck Shlegeris@bshlgrs·28 May

@VictorTaelin My understanding is that Bend is still way less performant on GPUs than code written with Triton/CUDA; is that correct? I'm curious when you think you'll make substantial improvements here. (I'm sure you get this question all the time...)

1.4K

Taelin@VictorTaelin·28 May

Bend2 is:
- 90% as fast as C single core (& faster on GPU)
- safer than Rust; it is a literal proof language
- compiler scales just like Go's
- no first class C++ support. no. just no
Should be released last week if I didn't fuck up
As soon as I trust my own monster codebase

Stefan@schteppe

- fast like C
- memory safe like Rust
- fast compilation like Go
- 1st class C++ support like Swift
Who’s building this?

643

58K

Buck Shlegeris@bshlgrs·26 May

@philhchen @karpathy Yeah that would definitely be disallowed under the rule I proposed. Definitely there's some tricky question for Anthropic about how to manage existing relationships with counterparties who use Claude for their AI work.

Phil Chen@philhchen·26 May

@bshlgrs @karpathy actually a clear counterexample to this would be Google DeepMind using Opus for Gemini pretraining code

Phil Chen@philhchen·26 May

Thought experiment: if @karpathy's efforts at Anthropic yield a Claude model that is capable of pretraining the next generation of Claude, then any company with sufficient GPU infrastructure could use Claude to pretrain their own Claude-clone. Of course, Anthropic would then ban that company from using Claude. But then wouldn't any company with enough Claude spend be incentivized to use Claude to train their own Claude-clone eventually? What happens in 1-2 years when even open-weights models become good enough to run their own training?

10.5K

Buck Shlegeris@bshlgrs·26 May

@philhchen @karpathy Idk how they will/should draw the line. You could operationalize as "Claude won't help you train models that will be within a factor of 10x of cost competitiveness of any currently deployed Anthropic model"?

Phil Chen@philhchen·26 May

@bshlgrs @karpathy where do you draw the line between nanoGPT runs on 8xH100s (obviously allowed) and big pretrain on 100k B200s?

165

Buck Shlegeris@bshlgrs·22 May

@StephenLCasper I'd love a copy of the slides

280

Cas (Stephen Casper)@StephenLCasper·22 May

I recently finished preparing a 90-minute crash course presentation on [technical] AI governance, where we're at in 2026, and the different ways that things might change in the next few years. Let me know if you want me to give it sometime or share slides.

107

5.2K

Buck Shlegeris@bshlgrs·19 May

@reconfigurthing Yeah this is fucking tragic

727

Elias Schmied@reconfigurthing·18 May

it's still so crazy to me that the best depiction of post-scarcity AI-powered utopia is at the end of a 1.5 million word web serial, and you can't even tell people which one it is because it would spoil the entire story.

376

43.2K

Buck Shlegeris retweeted

Alex Mallen@alextmallen·15 May

Risk reports need to address deployment-time spread of misalignment. They currently lean heavily on pre-deployment alignment evals. But a model that *starts* aligned can develop dangerous motivations during deployment, and I think this is becoming important. New post 🧵

2.8K

Buck Shlegeris@bshlgrs·7 May

We reviewed OpenAI's blog post “Investigating the consequences of accidentally grading CoT during RL”. blog.redwoodresearch.org

8.5K

Buck Shlegeris@bshlgrs·5 May

@allTheYud @kromem2dot0 For sure! Though I'm not sure that fixing that particular misconception would address their true crux.

510

Eliezer Yudkowsky@allTheYud·5 May

@bshlgrs @kromem2dot0 I think there's an actual problem where a bunch of the people strutting around, thrusting out their chests and proclaiming "It only predicts the next token!" are in fact not comprehending the gradient flows on that level.

1.7K

Eliezer Yudkowsky@allTheYud·5 May

Everyone bragging that THEY understand how AI works and THEY know it can't be conscious, explain right now from memory why it was very clever that the positional encoding in the original transformers paper used both sines and cosines.

Lucas Meijer@lucasmeijer

Everybody who thinks ai is conscious has to do a mandatory from scratch transformer implementation. There are only floats and multiplications.

116

824

223.1K
