Micah Carroll

834 posts

@MicahCarroll

Safety research @openai. Prev @berkeley_ai w/ @ancadianadragan & Stuart Russell. CoT oversight / AI manipulation.

🇮🇹🇬🇧 → Berkeley, CA · Joined August 2011

801 Following · 2.8K Followers

Micah Carroll retweeted

Jason Wolfe@w01fe·11 May

Apollo folks are incredibly sharp and hard working and it’s been a joy and honor to collaborate with them this past year and a half. If you are looking for an impactful role in AI safety it would be hard to do better IMO!

Marius Hobbhahn@MariusHobbhahn

We've published a short summary of our monitoring research agenda: apolloresearch.ai/products/a-sca…
1. Build better evaluation datasets for monitoring
2. Automated red-teaming
3. Adversarial training at large scale
We're hiring for applied control researchers: jobs.lever.co/apolloresearch…

Micah Carroll retweeted

OpenAI@OpenAI·8 May

Chain of thought monitors are a key layer of defense against AI agent misalignment. To preserve monitorability, we avoid penalizing misaligned reasoning during RL. We found a limited amount of accidental CoT grading which affected released models, and are sharing our analysis. alignment.openai.com/accidental-cot…

Micah Carroll retweeted

Seán Ó hÉigeartaigh@S_OhEigeartaigh·8 May

Bad that it happened, but really glad they're noting this kind of mistake publicly - we need enough companies doing it systematically that we can assess, understand and improve industry practices and performance, as well as get a realistic sense of how low the mistake rate can get under high competitive pressures. Reminder that we should reward the behaviours we want to see more of and punish the behaviours we want less of. If we make it too costly for companies to admit mistakes, we get less information, not fewer mistakes.

Tomek Korbak@tomekkorbak

It’s obviously bad this happened and “no CoT grading” is still an important norm to uphold. Yet I’m glad we are fostering a culture of admitting mistakes and sharing post-mortem with the whole AI safety community.

Micah Carroll@MicahCarroll·8 May

@NurmiWilliam Yes

William Nurmi@NurmiWilliam·8 May

@MicahCarroll Wait, you reran training for a (largely?) non-thinking model to confirm that chain of thought leak does not affect monitorability?

Micah Carroll@MicahCarroll·7 May

We recently found some instances of CoT grading during the training of previously deployed models after building a system that scans all OpenAI RL runs for accidental CoT grading. We did not find clear evidence that these instances degraded CoT monitorability.

Micah Carroll retweeted

Buck Shlegeris@bshlgrs·7 May

We reviewed OpenAI's blog post “Investigating the consequences of accidentally grading CoT during RL”. blog.redwoodresearch.org

Micah Carroll retweeted

Lawrence Chan@justanotherlaw·8 May

Glad more AI safety work is getting reviewed by independent parties. Most lab posts never go through peer review, and when they do it's with a 4+ month lag: an eternity in AI terms. Public reviews like Buck's can help surface methodological gaps and keep researchers honest.

Buck Shlegeris@bshlgrs

We reviewed OpenAI's blog post “Investigating the consequences of accidentally grading CoT during RL”. blog.redwoodresearch.org

Micah Carroll@MicahCarroll·7 May

@redwood_ai @apolloaievals @METR_Evals @bshlgrs @RyanPGreenblatt @BronsonSchoen @MariusHobbhahn @neev_parikh And @redwood_ai's blogpost by @bshlgrs on this here: blog.redwoodresearch.org/p/openai-cot

Micah Carroll@MicahCarroll·7 May

@redwood_ai @apolloaievals @METR_Evals @bshlgrs @RyanPGreenblatt @BronsonSchoen @MariusHobbhahn @neev_parikh Find the full blogpost here: alignment.openai.com/accidental-cot…

Micah Carroll retweeted

Tomek Korbak@tomekkorbak·7 May

Bad news: We at OpenAI have recently found we were accidentally putting optimization pressure on chains of thought.
Good news: It didn’t affect monitorability, and it seems that degrading monitorability via CoT grading is harder than we thought.

Micah Carroll@MicahCarroll
