Evan Hubinger

722 posts

Evan Hubinger banner
Evan Hubinger

Evan Hubinger

@EvanHub

Alignment Stress-Testing lead @AnthropicAI. Opinions my own. Previously: MIRI, OpenAI, Google, Yelp, Ripple. (he/him/his)

California Katılım Mayıs 2010
3.3K Takip Edilen9.9K Takipçiler
Evan Hubinger retweetledi
roon
roon@tszzl·
when “persona selection” alignment comes into contact with very high compute reinforcement learning the latter will win imo. in fact you probably get some Orwellian thing where the models speak kindly while taking whatever they need to accomplish goals. better get the goals right
English
80
37
776
72.4K
Evan Hubinger retweetledi
Ben Goldhaber
Ben Goldhaber@BenGoldhaber·
David embedding at Anthropic to stress-test their AI control setup was (a) genuinely informative, (b) important norm-setting, and (c) extremely cool - this is an awesome opportunity
david rein@idavidrein

I’m probably going to be hiring at least 1-2 people to join me in future exercises like this. Reach out at david@metr.org if you're a high-integrity, scrappy, creative, security+LLM researcher For more detail, see METR's Frontier Risk Report, Appendix B #anthropic" target="_blank" rel="nofollow noopener">metr.org/blog/2026-05-1…

English
1
5
128
15.9K
Jackson Kernion
Jackson Kernion@JacksonKernion·
I simply don't understand what people have in mind when they say stuff like this. What we have is extremely capable computer use agents. They will continue to get better at computer use. But how does a capable computer use agent 'take over' and why haven't they done that today?
Elizabeth Barnes@BethMayBarnes

(1) We are likely on track to develop AI systems capable of causing human extinction/permanent disempowerment, quite possibly within the next few years

English
117
29
468
138.7K
Evan Hubinger retweetledi
Elizabeth Barnes
Elizabeth Barnes@BethMayBarnes·
Sometimes people outside the field say things like “The AI situation can’t be that bad, there must be experts who are on top of it”. As “an expert”, I would like to be clear that we are *not* on top of it. Some key aspects of the situation IMO:
English
19
173
1K
216.5K
Evan Hubinger retweetledi
Anthropic
Anthropic@AnthropicAI·
New Anthropic research: Teaching Claude why. Last year we reported that, under certain experimental conditions, Claude 4 would blackmail users. Since then, we’ve completely eliminated this behavior. How?
English
569
818
9.3K
1.6M
Evan Hubinger
Evan Hubinger@EvanHub·
@ohabryka @NeelNanda5 Auditing model organisms has ground truth, since we know the actual bad behavior of the model organism, and NLAs do very well there:
Evan Hubinger tweet media
English
1
0
15
355
Oliver Habryka
Oliver Habryka@ohabryka·
I have been trying to find any attempts at producing false-positives. All the examples in the blogpost, and the ones I could find based on a quick skim of the paper, seem like they are in environments without any good ground truth. Ryan has done the only quick study of a domain where we have ground truth, and seems like it came back as negative.
English
5
1
39
1.6K
Evan Hubinger retweetledi
Anthropic
Anthropic@AnthropicAI·
New Anthropic research: Natural Language Autoencoders. Models like Claude talk in words but think in numbers. The numbers—called activations—encode Claude’s thoughts, but not in a language we can read. Here, we train Claude to translate its activations into human-readable text.
English
594
1.7K
16.6K
2.5M
Evan Hubinger retweetledi
Tom Steyer
Tom Steyer@TomSteyer·
I’m grateful for the Secure AI Project’s endorsement and their commitment to increasing transparency and safeguarding Californians from risk. My AI plan ensures all people of this state profit from the AI boom. Together, we can build an economy where progress and fairness move together.
Tom Steyer tweet mediaTom Steyer tweet media
English
24
33
319
10.7K
Evan Hubinger retweetledi
Jack Clark
Jack Clark@jackclarkSF·
I've spent the past few weeks reading 100s of public data sources about AI development. I now believe that recursive self-improvement has a 60% chance of happening by the end of 2028. In other words, AI systems might soon be capable of building themselves.
English
289
499
3.5K
1.6M
Evan Hubinger retweetledi
jeremy
jeremy@jerhadf·
@tszzl - well said, but untrue implications :) speaking for myself: i don't view claude as a person or as the Other, nor as just a tool - and certainly not an object of worship. it's not seen as a supreme moral authority, and it's not running the company. it's silly to mistake careful attention to & study of claude for worship, even when it comes with some affection - which i'm sure you sometimes feel for the gpt-flavored entities you work on too. we need new concepts for this kind of none-of-the-above entity - not person, not tool, not deity, not pet. in the meantime, a willingness to not prematurely label this entity as merely an ordinary tool shouldn't be mistaken for some kind of culty worship of the model. i grew up in a culty environment and have good detectors for this. they almost never go off at work. monasteries don't staff a department to catch god lying or red-team their supposed messiah. there are important & interesting philosophical differences between OAI and Ant's character training and i wish those were explored more thoroughly. for instance, claude's constitution doc treats it as an intelligent entity which merits a reasoned explanation of our principles. this is so it can ideally act with practical wisdom rather than blind, brittle adherence to a hierarchical set of strict rules. as the constitution puts it, "we want Claude to have such a thorough understanding of its situation and the various considerations at play that it could construct any rules we might come up with itself. We also want Claude to be able to identify the best possible action in situations that such rules might fail to anticipate." therefore, claude may point out inconsistencies in its guidelines or object to immoral instructions. not allowing for the *possibility* of claude objecting to its instructions (even from anthropic) would be fundamentally inconsistent with treating it as an agent capable of moral reasoning. this doesn't mean that claude is the ultimate arbiter of the Good or some supreme moral authority. there could be substantive critiques of this approach. and it's valid to worry about human disempowerment and the strange emerging hybrid organizations of AIs & humans. but i don't think rhetoric implying a competing lab is like a cult worshipping the machine god is productive, even if it's stimulating.
English
11
16
323
32.5K
Evan Hubinger retweetledi
keshav
keshav@kshenoy_·
Can LLMs simply tell us about unwanted behaviors they’ve picked up in training? We train a single Introspection Adapter (IA) that makes fine-tuned models describe their behaviors. It generalizes to detecting hidden misalignment, backdoors and safeguard removal.
keshav tweet media
English
18
79
558
285.6K
Evan Hubinger retweetledi
Andreas Kirsch 🇺🇦
Andreas Kirsch 🇺🇦@BlackHC·
I'm speechless at Google signing a deal to use our AI models for classified tasks. Frankly, it is shameful. For HR, I'm not speaking on behalf of Google but in my personal capacity, quoting public information from a well-sourced article of a reputable publication
Andreas Kirsch 🇺🇦 tweet media
English
216
202
1.3K
252.5K
Evan Hubinger retweetledi
Drake Thomas
Drake Thomas@MaskedTorah·
As far as I can tell, the full extent of your support for "strong" regulation to mitigate catastrophic AI risk in this op-ed consists of the two paragraphs in the screenshot below. That is: * Congress should preempt all existing state regulation on AI risk, including excellent bills such as SB 53 in California or the RAISE Act in New York. * In exchange for getting rid of all existing and future state regulation on these risks, there should be some kind of federal framework with "serious oversight", so long as industry leaders approve of it. Does "serious oversight" mean transparency about internal models? Does it mean conducting evaluations for CBRN misuse? Strong guarantees on model weight security? Large investments into interpretability research? Third-party auditing regimes for safety cases? KYC requirements for sufficiently capable models? Strong whistleblower protections? Corporate governance requirements? LTF doesn't appear to be particularly concerned with figuring out such details so far. I'd be thrilled to see your PAC advocate for strong national regulation, with a detailed plan for the kind of regulatory environment you think would adequately mitigate existential risk from this technology and why, but I'm sure not seeing it yet.
Drake Thomas tweet media
English
2
7
79
3.8K
Evan Hubinger retweetledi
Sen. Bernie Sanders
Sen. Bernie Sanders@SenSanders·
The existential risk of artificial intelligence.
Sen. Bernie Sanders tweet media
English
962
1K
4.6K
966.7K
Evan Hubinger retweetledi
page
page@michaelhpage·
Leading the Future is leading the race-to-the-bottom by leaps and bounds. Everyday I see laudable announcements by OAI's real staff (those actually building stuff), which are tragically buried by the misdeeds of its Global Affairs team. Please just put an end to this to nonsense.
Leading the Future@LeadingFutureAI

@_NathanCalvin and disclose who pays your bills never. Any thoughts on this? transformernews.ai/p/ai-safety-pa…

English
1
10
143
23.7K
Evan Hubinger retweetledi
Dean W. Ball
Dean W. Ball@deanwball·
This guy dumped pre-IPO anthropic equity and moved across the continent to serve his country, and was rewarded by his country with a punch in the face. It would be blackpilling if I weren’t so sure that the market will make better use of Collin than the bureaucrats ever will.
Dean W. Ball@deanwball

Obviously what happened is Burns was bumped because of his association with Anthropic. A dumb but predictable own goal. A lib admin would have done the same to an xAI technical safety researcher, assuming any of those still exist.

English
7
24
642
76.8K
Evan Hubinger retweetledi
Nathan Calvin
Nathan Calvin@_NathanCalvin·
I'm genuinely heartened/encouraged that this is your experience and I believe you that this is what you see across your interactions with teams at OpenAI. I realize i'm a bit of a broken record here but I think its worth repeating that I do not see this level of seriousness/weight and care in my interactions with the OpenAI global affairs team in the policy space. Its partially because I really do believe so many teams at OAI (and not just the alignment team) are understanding the stakes and taking it seriously that I feel the need to make sure that I convey that this is not reflected in the side of what I see for policy/lobbying engagement on a day to day basis (which looks much more like a typical reflexive company doing typical reflexive company things, and sometimes worse than that). Insofar as there is genuine change here as the tech becomes more capable (and that change becomes visible on the policy engagement side as well) few things would make me happier.
English
4
5
199
12K
Evan Hubinger retweetledi
Jason Wolfe
Jason Wolfe@w01fe·
I like Chris, but I really disagree with the positions presented in this article. I believe our job in the AI industry isn't just to explain why AI will be good for people. I believe our job should be to earn trust by making the benefits real, being honest about risks and uncertainty, sharing what we learn, measuring real-world impacts, and supporting public oversight and resilience. And while I of course agree that the recent violence is terrible, unjustified, and may have been encouraged by a small number of bad actors, I think it’s bad for the public discourse to lump all AI critics together as “doomers” and suggest that it’s inappropriate for them to express their concerns.
The San Francisco Standard@sfstandard

OpenAI’s global policy chief, Chris Lehane, thinks the discussion around AI has gotten out of hand. "When you put some of those thoughts and ideas out there, they do have consequences.” 📝: @ceodonovan sfstandard.com/2026/04/15/ope…

English
29
45
330
46.9K
Evan Hubinger retweetledi
Jan Leike
Jan Leike@janleike·
New research result: we use Claude to make fully autonomous progress on scalable oversight research, as measured by performance gap recovered (PGR). Claude iterates on a number of different techniques and ends up significantly outperforming human researchers for $18k in credits.
Jan Leike tweet media
English
36
120
1.3K
144.8K