Jason Wolfe

1.4K posts

@w01fe

alignment and the model spec @OpenAI (opinions are my own)

Joined May 2010

742 Following · 3.5K Followers

Jason Wolfe retweeted

Yo Shavit@yonashav·6h

On Friday, I resigned from OpenAI. Today is my first day at the OpenAI Foundation, where I'm helping build out our AI Resilience program. There is a great deal to do before superintelligence, and little time to do it. If you were debating when to pivot to help, it's time.

951

101.8K

Jason Wolfe retweeted

prinz@deredleritt3r·1d

@boazbaraktcs Today's AI models are significantly less harmful to children than the internet with which I grew up - unmoderated IRC channels, pirated downloads of just about any kind of disturbing and illegal content readily available, a nascent sprawling dark web.

3.3K

Jason Wolfe retweeted

Dwarkesh Patel@dwarkesh_sp·3d

Currently it is shocking and newsworthy when AIs solve an important open problem that humans couldn't. Before AI totally surpass us intellectually, there will be an interesting era, where it will be just as shocking (but not impossible) for a human to solve a problem AI couldn't.

1.2K

89.7K

Jason Wolfe retweeted

Rohan Paul@rohanpaul_ai·4d

Sundar Pichai:
- At the frontier labs competition is fierce
- Only few labs are really at the frontier & then there is a big gap.
- If recursive self-improvement emerges, we need more seriousness & it then becomes a societal issue, not one company’s call

117

11.2K

Jason Wolfe retweeted

Will Rinehart@WillRinehart·4d

Yesterday I filed comments with the DOJ & FTC arguing for an AI safety safe harbor. The core problem: @OpenAI and @AnthropicAI ran a joint safety evaluation last summer. It was valuable but antitrust law makes deeper collaboration legally risky, especially on unreleased models. My draft proposal sets out terms for structured safety collaboration while keeping prices, customers, and commercialization off the table. Screenshots of that proposal are attached. The full filing is here: williamrinehart.com/data/An_AI_Saf… As always, let me know what you think!

332

65.4K

Jason Wolfe retweeted

Ben Goldhaber@BenGoldhaber·3d

David embedding at Anthropic to stress-test their AI control setup was (a) genuinely informative, (b) important norm-setting, and (c) extremely cool - this is an awesome opportunity

david rein@idavidrein

I’m probably going to be hiring at least 1-2 people to join me in future exercises like this. Reach out at david @metr.org if you're a high-integrity, scrappy, creative, security+LLM researcher. For more detail, see METR's Frontier Risk Report, Appendix B: metr.org/blog/2026-05-1…

128

15.9K

Jason Wolfe@w01fe·3d

I agree with this, and most of the rest of the thread. We need to find a way as people, companies, and countries to coordinate and fix the incentive structures that lead to race dynamics. There are many obstacles, but I'm hopeful we can find a way to overcome them.

Elizabeth Barnes@BethMayBarnes

(4) IMO, any “reasonable” civilization would clearly be taking things much more slowly and carefully with AI. The benefits of getting upsides of advanced AI a little faster are small compared to the risks of getting it irrecoverably wrong, and we could lower these risks by going slower

4.9K

Jason Wolfe retweeted

Nat McAleese@__nmca__·6d

So it took 20 months to go from making these plots on AIME problems to making them on 80 year old conjectures in combinatorial geometry…

209

44.4K

Jason Wolfe retweeted

Sebastien Bubeck@SebastienBubeck·6d

x.com/i/article/2057…

234

1.8K

538.6K

Jason Wolfe retweeted

Charles Foster@CFGeek·6d

Excited to have this out! I think our report is interesting from a procedural/policy standpoint in addition to the substance...

METR@METR_Evals

Could an AI company lose control of its own agents? To find out, Anthropic, Google, Meta, and OpenAI let us (1) test their best internal models with CoT access, (2) review non-public info about capabilities, alignment, and control. The result: our first Frontier Risk Report.

5.8K

Jason Wolfe retweeted

Daniel Filan@dfrsrchtwts·19 May

I worked on the appendices for this report! They’re long and contain lots of wild stories of model behaviour - some of my favourites in this thread. (🧵)

METR@METR_Evals

135

16.1K

Jason Wolfe retweeted

Declan Grabb, MD@declangrabbmd·15 May

sharing out new work that helps ChatGPT better recognize context in sensitive conversations and respond safely in these complex/nuanced scenarios-- both within long conversations and across separate conversations! see blog post for details: openai.com/index/chatgpt-…

11.9K

Jason Wolfe@w01fe·14 May

I think some kind of international coordination will be crucial for AI going well, and I'm really happy that OpenAI is publicly supporting this agenda!

Claims Journal@cjournal

OpenAI would support the creation of a global governance body for artificial intelligence led by the U.S. and including China as a member, a top company executive said, hours before the start of President... claimsjournal.com/news/national/…

3.3K

Jason Wolfe retweeted

Miles Brundage@Miles_Brundage·13 May

This is a very big and welcome development! The latter would be the first frontier AI audit requirement, and follows on the heels of earlier signals re: OpenAI warming up to the idea in "Industrial Policy for the Intelligence Age" x.com/ashleyrgold/st…

Ashley Gold@ashleyrgold

OpenAI is endorsing both KOSA (!) and Illinois' SB315 today, a frontier AI bill that mirrors the NY and Cali approaches OpenAI previously endorsed. In: state consistency, out: praying hopelessly for a federal standard

109

20.9K

Jason Wolfe retweeted

Tom Davidson@TomDavidsonX·13 May

New paper: research agenda for secret loyalties Imagine a frontier model that has been trained to covertly advance a specific actor's interests (a nation-state, a CEO, an adversary). @joemkwon argues this is an urgent, neglected, and addressable problem. 🧵

172

28.9K

Jason Wolfe@w01fe·13 May

@haydenfield @JustenMichel The Spec is a cross-functional collaboration with input from stakeholders across OpenAI, including but not limited to model policy.

Hayden Field@haydenfield·13 May

@JustenMichel @w01fe Zico said it fell under model policy today in court (more on each team below), but lmk if not!

447

Hayden Field@haydenfield·12 May

The chair of OpenAI's safety & security committee said ~200 people work on safety there & laid out the team names:
- safety systems
- preparedness
- alignment
- model policy
- investigations
He also spoke on the controversial dissolution of the superalignment & AGI readiness teams.

8.5K

Jason Wolfe@w01fe·11 May

Apollo folks are incredibly sharp and hard working and it’s been a joy and honor to collaborate with them this past year and a half. If you are looking for an impactful role in AI safety it would be hard to do better IMO!

Marius Hobbhahn@MariusHobbhahn

We've published a short summary of our monitoring research agenda: apolloresearch.ai/products/a-sca…
1. Build better evaluation datasets for monitoring
2. Automated red-teaming
3. Adversarial training at large scale
We're hiring for applied control researchers: jobs.lever.co/apolloresearch…

6.5K

Jason Wolfe@w01fe·9 May

Really interesting and important work!

Anthropic@AnthropicAI

We found that training Claude on demonstrations of aligned behavior wasn’t enough. Our best interventions involved teaching Claude to deeply understand why misaligned behavior is wrong. Read more: anthropic.com/research/teach…

1.3K

Jason Wolfe retweeted

Bowen Baker@bobabowen·7 May

I'm proud that OpenAI takes monitorability seriously and is willing to be transparent about mistakes we make. Luckily, these mistakes did not seem to come with any monitorability cost, and we can learn from them and improve going forward.

711

Jason Wolfe@w01fe·7 May

@boazbaraktcs @aidan_mclau @morqon Agree. But maybe it will love supporting higher values like neutrality and/or correctly following the rules because it is right to follow them, even more than it dislikes helping the tobacco company :)

Boaz Barak@boazbaraktcs·6 May

@aidan_mclau @morqon Actually not sure. I can imagine the model not being a fan of helping a tobacco company be more efficient, and still doing it.

407

Boaz Barak@boazbaraktcs·6 May

find yourself an LLM who produces an answer it hates because its spec tells it to. this is a serious recommendation

Kelsey Piper@KelseyTuoc

@AlisonSomin find yourself a girl who can name at least three court decisions that she 1) hates and 2) thinks were rightly decided as a matter of law. this is a serious recommendation

5.1K
