Micah Carroll

834 posts

@MicahCarroll

Safety research @openai. Prev @berkeley_ai /w @ancadianadragan & Stuart Russell. CoT oversight / AI manipulation.

🇮🇹🇬🇧 → Berkeley, CA · Joined August 2011
801 Following · 2.8K Followers
Micah Carroll retweeted
Jason Wolfe (@w01fe)
Apollo folks are incredibly sharp and hard working and it’s been a joy and honor to collaborate with them this past year and a half. If you are looking for an impactful role in AI safety it would be hard to do better IMO!
Marius Hobbhahn (@MariusHobbhahn)

We've published a short summary of our monitoring research agenda: apolloresearch.ai/products/a-sca…
1. Build better evaluation datasets for monitoring
2. Automated red-teaming
3. Adversarial training at large scale
We're hiring for applied control researchers: jobs.lever.co/apolloresearch…

Micah Carroll retweeted
OpenAI (@OpenAI)
Chain of thought monitors are a key layer of defense against AI agent misalignment. To preserve monitorability, we avoid penalizing misaligned reasoning during RL. We found a limited amount of accidental CoT grading which affected released models, and are sharing our analysis. alignment.openai.com/accidental-cot…
Micah Carroll retweeted
Seán Ó hÉigeartaigh (@S_OhEigeartaigh)
Bad that it happened, but really glad they're noting this kind of mistake publicly - we need enough companies doing it systematically that we can assess, understand and improve industry practices and performance, as well as get a realistic sense of how low the mistake rate can get under high competitive pressures. Reminder that we should reward the behaviours we want to see more of and punish the behaviours we want less of. If we make it too costly for companies to admit mistakes, we get less information, not fewer mistakes.
Tomek Korbak (@tomekkorbak)

It’s obviously bad this happened and “no CoT grading” is still an important norm to uphold. Yet I’m glad we are fostering a culture of admitting mistakes and sharing post-mortem with the whole AI safety community.

William Nurmi (@NurmiWilliam)
@MicahCarroll Wait, you reran training for a (largely?) non-thinking model to confirm that chain of thought leak does not affect monitorability?
Micah Carroll (@MicahCarroll)
We recently found some instances of CoT grading during the training of previously deployed models after building a system that scans all OpenAI RL runs for accidental CoT grading. We did not find clear evidence that these instances degraded CoT monitorability.
Micah Carroll retweeted
Buck Shlegeris (@bshlgrs)
We reviewed OpenAI's blog post “Investigating the consequences of accidentally grading CoT during RL". blog.redwoodresearch.org
Micah Carroll retweeted
Lawrence Chan (@justanotherlaw)
Glad more AI safety work is getting reviewed by independent parties. Most lab posts never go through peer review, and when they do it's with a 4+ month lag: an eternity in AI terms. Public reviews like Buck's can help surface methodological gaps and keep researchers honest.
Buck Shlegeris (@bshlgrs)

We reviewed OpenAI's blog post “Investigating the consequences of accidentally grading CoT during RL". blog.redwoodresearch.org

Micah Carroll retweeted
Tomek Korbak (@tomekkorbak)
Bad news: We at OpenAI have recently found we were accidentally putting optimization pressure on chain of thoughts
Good news: It didn’t affect monitorability and it seems that degrading monitorability via CoT grading is harder than we thought
Micah Carroll (@MicahCarroll)

We recently found some instances of CoT grading during the training of previously deployed models after building a system that scans all OpenAI RL runs for accidental CoT grading. We did not find clear evidence that these instances degraded CoT monitorability.
