AI Safety Papers
325 posts

@safe_paper
Sharing the latest in AI safety research.
arXiv Joined May 2023
261 Following 2.1K Followers

Emergent Misalignment is Easy, Narrow Misalignment is Hard
Anna Soligo (@anna_soligo), Edward Turner, Senthooran Rajamanoharan (@sen_r), Neel Nanda (@NeelNanda5)

Distributional AGI Safety
Nenad Tomašev (@weballergy), Matija Franklin (@FranklinMatija), Julian Jacobs (@JulianDJacobs), Sébastien Krier (@sebkrier), Simon Osindero (@sindero)
@GoogleDeepMind

Legal Alignment for Safe and Ethical AI
Noam Kolt, Nicholas Caputo, Jack Boeglin, Cullen O'Keefe, @RishiBommasani, @StephenLCasper, Mariano-Florentino Cuéllar, @profnoahfeldman, @IasonGabriel, Gillian K. Hadfield (@ghadfield), Lewis Hammond (@lrhammond), Peter Henderson (@PeterHndrsn), Atoosa Kasirzadeh (@Dr_Atoosa), @sethlazar, @AnkaReuel, @kevinlwei, Jonathan Zittrain (@zittrain)

Weird Generalization and Inductive Backdoors: New Ways to Corrupt LLMs
Jan Betley (@BetleyJan), @JorioCocola, Dylan Feng (@dylanfeng_), James Chua (@jameschua_sg), Andy Arditi (@andyarditi), Anna Sztyber-Betley (@anna_sztyber), Owain Evans (@OwainEvans_UK)

Natural Emergent Misalignment from Reward Hacking in Production RL
Monte MacDiarmid, Benjamin Wright (@RightBenguin), @JonathanUesato, @JoeJBenton, Jon Kutasov, Sara Price (@sprice354_), Naia Bouscal, Sam Bowman (@sleepinyourhat), @TrentonBricken, Alex Cloud, Carson Denison, Johannes Gasteiger (@gasteigerjo), @RyanPGreenblatt, @janleike, @Jack_W_Lindsey, Vlad Mikulik, @EthanJPerez, @alexrodriguesca, Drake Thomas (@MaskedTorah), @albertwebson, Daniel Ziegler (@d_m_ziegler), Evan Hubinger (@EvanHub)
@AnthropicAI @redwood_ai

A dataset of rated conceptual arguments
Caspar Oesterheld (@C_Oesterheld), Emery Cooper, Linh Chi Nguyen, Alexander Kastner, @EthanJPerez

Quantifying Elicitation of Latent Capabilities in Language Models
Elizabeth Donoway, @HaileyJoren, Arushi Somani, Henry Sleight (@sleight_henry), @_julianmichael_, Michael R DeWeese, John Schulman (@johnschulman2), @EthanJPerez, @FabienDRoger, @janleike
@AnthropicAI

Remote Labor Index: Measuring AI Automation of Remote Work
Mantas Mazeika (@MantasMazeika96), Alice Gatti, Cristina Menghini (@CriMenghini), Udari Madhushani Sehwag, Shivam Singhal (@ShivamSinghal56), Yury Orlovskiy (@yvorlovskiy), [...], Summer Yue (@summeryue0), @alexandr_wang, Bing Liu (@vbingliu), Ernesto Hernandez (@eghmontoya), @hendrycks
@cais @scale_AI

LLMs Process Lists With General Filter Heads
Arnab Sen Sharma (@arnab_api), Giordano Rogers, Natalie Shapira (@NatalieShapira), David Bau (@davidbau)

International AI Safety Report 2025: First Key Update: Capabilities and Risk Implications
Yoshua Bengio (@Yoshua_Bengio), Stephen Clare (@stephenclare_), Carina Prunkl (@carinaprunkl), Shalaleh Rismani, Maksym Andriushchenko, Ben Bucknall, Philip Fox, Tiancheng Hu, Cameron Jones, Sam Manning, Nestor Maslej, Vasilios Mavroudis, Conor McGlynn, Malcolm Murray, Charlotte Stix, Lucia Velasco, Nicole Wheeler, Daniel Privitera, Sören Mindermann, Daron Acemoglu, Thomas G. Dietterich, Fredrik Heintz, Geoffrey Hinton, Nick Jennings, Susan Leavy, Teresa Ludermir, Vidushi Marda, Helen Margetts, John McDermid, Jane Munga, Arvind Narayanan, Alondra Nelson, Clara Neppel, Gopal Ramchurn, Stuart Russell, Marietje Schaake, Bernhard Schölkopf, Alvaro Soto, Lee Tiedrich, Gaël Varoquaux, Andrew Yao, Ya-Qin Zhang
