Evan Hubinger

722 posts

@EvanHub

Alignment Stress-Testing lead @AnthropicAI. Opinions my own. Previously: MIRI, OpenAI, Google, Yelp, Ripple. (he/him/his)

California · Joined May 2010

3.3K Following · 9.9K Followers

Evan Hubinger retweeted

roon@tszzl·2d

when “persona selection” alignment comes into contact with very high compute reinforcement learning the latter will win imo. in fact you probably get some Orwellian thing where the models speak kindly while taking whatever they need to accomplish goals. better get the goals right

776

72.4K

Evan Hubinger retweeted

Ben Goldhaber@BenGoldhaber·3d

David embedding at Anthropic to stress-test their AI control setup was (a) genuinely informative, (b) important norm-setting, and (c) extremely cool - this is an awesome opportunity

david rein@idavidrein

I’m probably going to be hiring at least 1-2 people to join me in future exercises like this. Reach out at david @metr.org if you're a high-integrity, scrappy, creative, security+LLM researcher. For more detail, see METR's Frontier Risk Report, Appendix B: metr.org/blog/2026-05-1…

128

15.9K

Evan Hubinger@EvanHub·4d

@JacksonKernion I think Paul Christiano's writing on this is probably the best: alignmentforum.org/posts/HBxe6wdj…

1.5K

Jackson Kernion@JacksonKernion·4d

I simply don't understand what people have in mind when they say stuff like this. What we have is extremely capable computer use agents. They will continue to get better at computer use. But how does a capable computer use agent 'take over' and why haven't they done that today?

Elizabeth Barnes@BethMayBarnes

(1) We are likely on track to develop AI systems capable of causing human extinction/permanent disempowerment, quite possibly within the next few years

117

468

138.7K

Evan Hubinger retweeted

Elizabeth Barnes@BethMayBarnes·4d

Sometimes people outside the field say things like “The AI situation can’t be that bad, there must be experts who are on top of it”. As “an expert”, I would like to be clear that we are *not* on top of it. Some key aspects of the situation IMO:

173

216.5K

Evan Hubinger retweeted

Anthropic@AnthropicAI·8 May

New Anthropic research: Teaching Claude why. Last year we reported that, under certain experimental conditions, Claude 4 would blackmail users. Since then, we’ve completely eliminated this behavior. How?

569

818

9.3K

1.6M

Evan Hubinger@EvanHub·7 May

@ohabryka @NeelNanda5 Auditing model organisms has ground truth, since we know the actual bad behavior of the model organism, and NLAs do very well there:

355

Oliver Habryka@ohabryka·7 May

I have been trying to find any attempts at producing false-positives. All the examples in the blogpost, and the ones I could find based on a quick skim of the paper, seem like they are in environments without any good ground truth. Ryan has done the only quick study of a domain where we have ground truth, and seems like it came back as negative.

1.6K

Neel Nanda@NeelNanda5·7 May

Very cool work! This seems a strong new tool for hypothesis generation about weird model behaviors

Anthropic@AnthropicAI

New Anthropic research: Natural Language Autoencoders. Models like Claude talk in words but think in numbers. The numbers—called activations—encode Claude’s thoughts, but not in a language we can read. Here, we train Claude to translate its activations into human-readable text.

361

38.6K

Evan Hubinger retweeted

Anthropic@AnthropicAI·7 May

594

1.7K

16.6K

2.5M

Evan Hubinger retweeted

Tom Steyer@TomSteyer·6 May

I’m grateful for the Secure AI Project’s endorsement and their commitment to increasing transparency and safeguarding Californians from risk. My AI plan ensures all people of this state profit from the AI boom. Together, we can build an economy where progress and fairness move together.

319

10.7K

Evan Hubinger retweeted

Jack Clark@jackclarkSF·4 May

I've spent the past few weeks reading 100s of public data sources about AI development. I now believe that recursive self-improvement has a 60% chance of happening by the end of 2028. In other words, AI systems might soon be capable of building themselves.

289

499

3.5K

1.6M

Evan Hubinger retweeted

jeremy@jerhadf·4 May

@tszzl - well said, but untrue implications :) speaking for myself: i don't view claude as a person or as the Other, nor as just a tool - and certainly not an object of worship. it's not seen as a supreme moral authority, and it's not running the company. it's silly to mistake careful attention to & study of claude for worship, even when it comes with some affection - which i'm sure you sometimes feel for the gpt-flavored entities you work on too. we need new concepts for this kind of none-of-the-above entity - not person, not tool, not deity, not pet. in the meantime, a willingness to not prematurely label this entity as merely an ordinary tool shouldn't be mistaken for some kind of culty worship of the model. i grew up in a culty environment and have good detectors for this. they almost never go off at work. monasteries don't staff a department to catch god lying or red-team their supposed messiah. there are important & interesting philosophical differences between OAI and Ant's character training and i wish those were explored more thoroughly. for instance, claude's constitution doc treats it as an intelligent entity which merits a reasoned explanation of our principles. this is so it can ideally act with practical wisdom rather than blind, brittle adherence to a hierarchical set of strict rules. as the constitution puts it, "we want Claude to have such a thorough understanding of its situation and the various considerations at play that it could construct any rules we might come up with itself. We also want Claude to be able to identify the best possible action in situations that such rules might fail to anticipate." therefore, claude may point out inconsistencies in its guidelines or object to immoral instructions. not allowing for the *possibility* of claude objecting to its instructions (even from anthropic) would be fundamentally inconsistent with treating it as an agent capable of moral reasoning. 
this doesn't mean that claude is the ultimate arbiter of the Good or some supreme moral authority. there could be substantive critiques of this approach. and it's valid to worry about human disempowerment and the strange emerging hybrid organizations of AIs & humans. but i don't think rhetoric implying a competing lab is like a cult worshipping the machine god is productive, even if it's stimulating.

323

32.5K

Evan Hubinger retweeted

keshav@kshenoy_·28 Apr

Can LLMs simply tell us about unwanted behaviors they’ve picked up in training? We train a single Introspection Adapter (IA) that makes fine-tuned models describe their behaviors. It generalizes to detecting hidden misalignment, backdoors and safeguard removal.

558

285.6K

Evan Hubinger retweeted

Andreas Kirsch 🇺🇦@BlackHC·28 Apr

I'm speechless at Google signing a deal to use our AI models for classified tasks. Frankly, it is shameful. For HR, I'm not speaking on behalf of Google but in my personal capacity, quoting public information from a well-sourced article of a reputable publication

216

202

1.3K

252.5K

Evan Hubinger retweeted

Drake Thomas@MaskedTorah·27 Apr

As far as I can tell, the full extent of your support for "strong" regulation to mitigate catastrophic AI risk in this op-ed consists of the two paragraphs in the screenshot below. That is: * Congress should preempt all existing state regulation on AI risk, including excellent bills such as SB 53 in California or the RAISE Act in New York. * In exchange for getting rid of all existing and future state regulation on these risks, there should be some kind of federal framework with "serious oversight", so long as industry leaders approve of it. Does "serious oversight" mean transparency about internal models? Does it mean conducting evaluations for CBRN misuse? Strong guarantees on model weight security? Large investments into interpretability research? Third-party auditing regimes for safety cases? KYC requirements for sufficiently capable models? Strong whistleblower protections? Corporate governance requirements? LTF doesn't appear to be particularly concerned with figuring out such details so far. I'd be thrilled to see your PAC advocate for strong national regulation, with a detailed plan for the kind of regulatory environment you think would adequately mitigate existential risk from this technology and why, but I'm sure not seeing it yet.

3.8K

Evan Hubinger retweeted

Sen. Bernie Sanders@SenSanders·25 Apr

The existential risk of artificial intelligence.

962

4.6K

966.7K

Evan Hubinger retweeted

page@michaelhpage·24 Apr

Leading the Future is leading the race-to-the-bottom by leaps and bounds. Every day I see laudable announcements by OAI's real staff (those actually building stuff), which are tragically buried by the misdeeds of its Global Affairs team. Please just put an end to this nonsense.

Leading the Future@LeadingFutureAI

@_NathanCalvin and disclose who pays your bills never. Any thoughts on this? transformernews.ai/p/ai-safety-pa…

143

23.7K

Evan Hubinger retweeted

Dean W. Ball@deanwball·24 Apr

This guy dumped pre-IPO anthropic equity and moved across the continent to serve his country, and was rewarded by his country with a punch in the face. It would be blackpilling if I weren’t so sure that the market will make better use of Collin than the bureaucrats ever will.

Dean W. Ball@deanwball

Obviously what happened is Burns was bumped because of his association with Anthropic. A dumb but predictable own goal. A lib admin would have done the same to an xAI technical safety researcher, assuming any of those still exist.

642

76.8K

Evan Hubinger retweeted

Nathan Calvin@_NathanCalvin·19 Apr

I'm genuinely heartened/encouraged that this is your experience and I believe you that this is what you see across your interactions with teams at OpenAI. I realize i'm a bit of a broken record here but I think its worth repeating that I do not see this level of seriousness/weight and care in my interactions with the OpenAI global affairs team in the policy space. Its partially because I really do believe so many teams at OAI (and not just the alignment team) are understanding the stakes and taking it seriously that I feel the need to make sure that I convey that this is not reflected in the side of what I see for policy/lobbying engagement on a day to day basis (which looks much more like a typical reflexive company doing typical reflexive company things, and sometimes worse than that). Insofar as there is genuine change here as the tech becomes more capable (and that change becomes visible on the policy engagement side as well) few things would make me happier.

199

12K

Evan Hubinger retweeted

Jason Wolfe@w01fe·16 Apr

I like Chris, but I really disagree with the positions presented in this article. I believe our job in the AI industry isn't just to explain why AI will be good for people. I believe our job should be to earn trust by making the benefits real, being honest about risks and uncertainty, sharing what we learn, measuring real-world impacts, and supporting public oversight and resilience. And while I of course agree that the recent violence is terrible, unjustified, and may have been encouraged by a small number of bad actors, I think it’s bad for the public discourse to lump all AI critics together as “doomers” and suggest that it’s inappropriate for them to express their concerns.

The San Francisco Standard@sfstandard

OpenAI’s global policy chief, Chris Lehane, thinks the discussion around AI has gotten out of hand. "When you put some of those thoughts and ideas out there, they do have consequences.” 📝: @ceodonovan sfstandard.com/2026/04/15/ope…

330

46.9K

Evan Hubinger retweeted

Jan Leike@janleike·14 Apr

New research result: we use Claude to make fully autonomous progress on scalable oversight research, as measured by performance gap recovered (PGR). Claude iterates on a number of different techniques and ends up significantly outperforming human researchers for $18k in credits.

120

1.3K

144.8K

Evan Hubinger retweeted

Miles Brundage@Miles_Brundage·14 Apr

Hard to think of a more clear cut case of OpenAI being in the wrong… they should just reverse positions here and figure out how anyone could have ever thought this was OK, simple as that

WIRED@WIRED

Anthropic and OpenAI are clashing over a proposed Illinois law that would let AI labs largely off the hook for mass deaths and financial disasters. wired.com/story/anthropi…

331

46.3K
