AI Safety Papers

325 posts

@safe_paper

Sharing the latest in AI safety research.

arXiv · Joined May 2023
261 Following · 2.1K Followers
AI Safety Papers @safe_paper
Natural Emergent Misalignment from Reward Hacking in Production RL
Monte MacDiarmid, Benjamin Wright (@RightBenguin), @JonathanUesato, @JoeJBenton, Jon Kutasov, Sara Price (@sprice354_), Naia Bouscal, Sam Bowman (@sleepinyourhat), @TrentonBricken, Alex Cloud, Carson Denison, Johannes Gasteiger (@gasteigerjo), @RyanPGreenblatt, @janleike, @Jack_W_Lindsey, Vlad Mikulik, @EthanJPerez, @alexrodriguesca, Drake Thomas (@MaskedTorah), @albertwebson, Daniel Ziegler (@d_m_ziegler), Evan Hubinger (@EvanHub)
@AnthropicAI @redwood_ai
AI Safety Papers @safe_paper
International AI Safety Report 2025: First Key Update: Capabilities and Risk Implications
Yoshua Bengio (@Yoshua_Bengio), Stephen Clare (@stephenclare_), Carina Prunkl (@carinaprunkl), Shalaleh Rismani, Maksym Andriushchenko, Ben Bucknall, Philip Fox, Tiancheng Hu, Cameron Jones, Sam Manning, Nestor Maslej, Vasilios Mavroudis, Conor McGlynn, Malcolm Murray, Charlotte Stix, Lucia Velasco, Nicole Wheeler, Daniel Privitera, Sören Mindermann, Daron Acemoglu, Thomas G. Dietterich, Fredrik Heintz, Geoffrey Hinton, Nick Jennings, Susan Leavy, Teresa Ludermir, Vidushi Marda, Helen Margetts, John McDermid, Jane Munga, Arvind Narayanan, Alondra Nelson, Clara Neppel, Gopal Ramchurn, Stuart Russell, Marietje Schaake, Bernhard Schölkopf, Alvaro Soto, Lee Tiedrich, Gaël Varoquaux, Andrew Yao, Ya-Qin Zhang