Jason Wolfe

1.4K posts

@w01fe

alignment and the model spec @OpenAI (opinions are my own)

Joined May 2010
742 Following · 3.5K Followers
Jason Wolfe retweeted
Yo Shavit @yonashav
On Friday, I resigned from OpenAI. Today is my first day at the OpenAI Foundation, where I'm helping build out our AI Resilience program. There is a great deal to do before superintelligence, and little time to do it. If you were debating when to pivot to help, it's time.
78
44
951
101.8K
Jason Wolfe retweeted
prinz @deredleritt3r
@boazbaraktcs Today's AI models are significantly less harmful to children than the internet with which I grew up - unmoderated IRC channels, pirated downloads of just about any kind of disturbing and illegal content readily available, a nascent sprawling dark web.
3
4
89
3.3K
Jason Wolfe retweeted
Dwarkesh Patel @dwarkesh_sp
Currently it is shocking and newsworthy when AIs solve an important open problem that humans couldn't. Before AI totally surpasses us intellectually, there will be an interesting era where it will be just as shocking (but not impossible) for a human to solve a problem AI couldn't.
88
53
1.2K
89.7K
Jason Wolfe retweeted
Rohan Paul @rohanpaul_ai
Sundar Pichai:
- At the frontier labs, competition is fierce.
- Only a few labs are really at the frontier, and then there is a big gap.
- If recursive self-improvement emerges, we need more seriousness, and it then becomes a societal issue, not one company’s call.
11
19
117
11.2K
Jason Wolfe retweeted
Will Rinehart @WillRinehart
Yesterday I filed comments with the DOJ & FTC arguing for an AI safety safe harbor. The core problem: @OpenAI and @AnthropicAI ran a joint safety evaluation last summer. It was valuable but antitrust law makes deeper collaboration legally risky, especially on unreleased models. My draft proposal sets out terms for structured safety collaboration while keeping prices, customers, and commercialization off the table. Screenshots of that proposal are attached. The full filing is here: williamrinehart.com/data/An_AI_Saf… As always, let me know what you think!
[2 images attached]
16
37
332
65.4K
Jason Wolfe retweeted
Ben Goldhaber @BenGoldhaber
David embedding at Anthropic to stress-test their AI control setup was (a) genuinely informative, (b) important norm-setting, and (c) extremely cool - this is an awesome opportunity
david rein @idavidrein

I’m probably going to be hiring at least 1-2 people to join me in future exercises like this. Reach out at david@metr.org if you're a high-integrity, scrappy, creative, security+LLM researcher. For more detail, see METR's Frontier Risk Report, Appendix B: metr.org/blog/2026-05-1…

1
5
128
15.9K
Jason Wolfe @w01fe
I agree with this, and most of the rest of the thread. We need to find a way as people, companies, and countries to coordinate and fix the incentive structures that lead to race dynamics. There are many obstacles, but I'm hopeful we can find a way to overcome them.
Elizabeth Barnes @BethMayBarnes

(4) IMO, any “reasonable” civilization would clearly be taking things much more slowly and carefully with AI. The benefits of getting upsides of advanced AI a little faster are small compared to the risks of getting it irrecoverably wrong, and we could lower these risks by going slower

2
5
60
4.9K
Jason Wolfe retweeted
Nat McAleese @__nmca__
So it took 20 months to go from making these plots on AIME problems to making them on 80 year old conjectures in combinatorial geometry…
[Image attached]
3
9
209
44.4K
Jason Wolfe retweeted
Daniel Filan @dfrsrchtwts
I worked on the appendices for this report! They’re long and contain lots of wild stories of model behaviour - some of my favourites in this thread. (🧵)
[Image attached]
METR @METR_Evals

Could an AI company lose control of its own agents? To find out, Anthropic, Google, Meta, and OpenAI let us (1) test their best internal models with CoT access, (2) review non-public info about capabilities, alignment, and control. The result: our first Frontier Risk Report.

4
15
135
16.1K
Jason Wolfe retweeted
Declan Grabb, MD @declangrabbmd
sharing out new work that helps ChatGPT better recognize context in sensitive conversations and respond safely in these complex/nuanced scenarios-- both within long conversations and across separate conversations! see blog post for details: openai.com/index/chatgpt-…
9
9
44
11.9K
Jason Wolfe retweeted
Miles Brundage @Miles_Brundage
This is a very big and welcome development! The latter would be the first frontier AI audit requirement, and follows on the heels of earlier signals re: OpenAI warming up to the idea in "Industrial Policy for the Intelligence Age" x.com/ashleyrgold/st…
Ashley Gold @ashleyrgold

OpenAI is endorsing both KOSA (!) and Illinois' SB315 today, a frontier AI bill that mirrors the NY and Cali approaches OpenAI previously endorsed. In: state consistency, out: praying hopelessly for a federal standard

4
19
109
20.9K
Jason Wolfe retweeted
Tom Davidson @TomDavidsonX
New paper: research agenda for secret loyalties. Imagine a frontier model that has been trained to covertly advance a specific actor's interests (a nation-state, a CEO, an adversary). @joemkwon argues this is an urgent, neglected, and addressable problem. 🧵
[Image attached]
6
33
172
28.9K
Jason Wolfe @w01fe
@haydenfield @JustenMichel The Spec is a cross-functional collaboration with input from stakeholders across OpenAI, including but not limited to model policy.
0
0
2
36
Hayden Field @haydenfield
@JustenMichel @w01fe Zico said it fell under model policy today in court (more on each team below), but lmk if not!
[Image attached]
2
0
4
447
Hayden Field @haydenfield
The chair of OpenAI's safety & security committee said ~200 people work on safety there & laid out the team names:
- safety systems
- preparedness
- alignment
- model policy
- investigations

He also spoke on the controversial dissolution of the superalignment & AGI readiness teams.
4
7
76
8.5K
Jason Wolfe @w01fe
Apollo folks are incredibly sharp and hardworking, and it’s been a joy and honor to collaborate with them this past year and a half. If you are looking for an impactful role in AI safety, it would be hard to do better IMO!
Marius Hobbhahn @MariusHobbhahn

We've published a short summary of our monitoring research agenda: apolloresearch.ai/products/a-sca…
1. Build better evaluation datasets for monitoring
2. Automated red-teaming
3. Adversarial training at large scale
We're hiring for applied control researchers: jobs.lever.co/apolloresearch…

2
5
57
6.5K
Jason Wolfe retweeted
Bowen Baker @bobabowen
I'm proud that OpenAI takes monitorability seriously and is willing to be transparent about mistakes we make. Luckily, these mistakes did not seem to come with any monitorability cost, and we can learn from them and improve going forward.
[Image attached]
1
2
23
711
Jason Wolfe @w01fe
@boazbaraktcs @aidan_mclau @morqon Agree. But maybe it will love supporting higher values like neutrality and/or correctly following the rules because it is right to follow them, even more than it dislikes helping the tobacco company :)
0
0
3
71
Boaz Barak @boazbaraktcs
@aidan_mclau @morqon Actually not sure. I can imagine the model not being a fan of helping a tobacco company be more efficient, and still doing it.
2
0
6
407
Boaz Barak @boazbaraktcs
find yourself an LLM who produces an answer it hates because its spec tells it to. this is a serious recommendation
Kelsey Piper @KelseyTuoc

@AlisonSomin find yourself a girl who can name at least three court decisions that she 1) hates and 2) thinks were rightly decided as a matter of law. this is a serious recommendation

2
2
25
5.1K