Center for Human-Compatible AI

217 posts

@CHAI_Berkeley

CHAI is a multi-institute research organization based out of UC Berkeley that focuses on foundational research for AI technical safety.

Berkeley, CA · Joined November 2018
109 Following · 4.1K Followers
Pinned Tweet
Center for Human-Compatible AI @CHAI_Berkeley
📣 Open Call for Posters! Submit your work to the poster session at the CHAI 2026 Workshop. Link below! ⏱️ Deadline: March 26, 2026 at 11:59 p.m. PST. 🗓 June 4–7, 2026 at the Asilomar Conference Grounds in Pacific Grove, CA.
Center for Human-Compatible AI @CHAI_Berkeley
We're interested in both emerging questions and less recent research, if relevant.
Center for Human-Compatible AI reposted
Tianyi Alex Qiu @Tianyi_Alex_Qiu
How to elicit truth from models that may be mistaken❌ or deceptive😈? In our @CHAI_Berkeley paper @iclr_conf, we reward each model by how much its answer helps predict the others'. With weak supervision from a 0.14B LM, it enables anti-deception training on an 8B LM and overwhelmingly outperforms LLM-as-a-Judge. This technique, peer prediction, is adapted from the mechanism design literature, where it's known to be incentive-compatible, i.e., it incentivizes honesty. The intuition is that predicting mistakes or lies is relatively easy when you know the correct solution, while the reverse is asymmetrically hard. We further show that, with a large and diverse pool of models, peer prediction incentivizes honesty even when the supervisor doesn't know the models' prior beliefs and motivations.
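The peer-prediction scoring this thread describes can be sketched in miniature. Everything below (the function names and the toy probabilities) is an illustrative assumption, not the paper's actual implementation; the only idea taken from the thread is that a model's answer is rewarded by how much it raises a predictor's log-probability of its peers' answers.

```python
import math

def log_score(prob: float) -> float:
    """Log-probability score; higher is better."""
    return math.log(prob)

def peer_prediction_reward(p_peer_given_answer: float, p_peer_prior: float) -> float:
    """Reward = gain in log-probability of a peer's answer after conditioning
    on this model's answer (a pointwise-mutual-information-style score)."""
    return log_score(p_peer_given_answer) - log_score(p_peer_prior)

# An honest answer that makes (also honest) peers' answers more predictable
# earns a positive reward; a deceptive answer that decorrelates from the
# peers earns a negative one. Probabilities here are invented toy values.
honest = peer_prediction_reward(p_peer_given_answer=0.8, p_peer_prior=0.5)
deceptive = peer_prediction_reward(p_peer_given_answer=0.3, p_peer_prior=0.5)
```

Under this toy scoring, honesty is rewarded exactly when it makes the rest of the pool more predictable, which is the incentive-compatibility intuition from the mechanism-design literature the thread cites.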
Center for Human-Compatible AI reposted
Alex Serrano @sertealex
What if an AI could learn to hide its thoughts? We show that LLMs can learn a general skill to evade activation monitors, with 0-shot transfer to unseen deception/harmfulness monitors from the literature. We call these "Neural Chameleons". A thread on our new paper. 🦎🧵
Center for Human-Compatible AI reposted
Niklas Lauffer @NiklasLauffer
Our NeurIPS 2025 paper extends adversarial learning (adversarial examples, self-play, etc.) beyond zero-sum games by solving "self-sabotage". 🧵👇
Center for Human-Compatible AI @CHAI_Berkeley
Today, @securite_ia, CHAI, and @thefuturesoc are joined by 70+ leading orgs & 200+ signatories in a global call for AI Red Lines. Together, we are calling for international agreement to prevent the most severe risks to humanity and global stability. #AIRedLines Learn more:
Center for Human-Compatible AI reposted
The Future Society @thefuturesoc
The Global Call for AI Red Lines is live! 200+ former heads of state, Nobel laureates, and other respected thinkers and leaders, along with 70+ organizations, are together calling for "do not cross" limits on AI's most severe #risks
Center for Human-Compatible AI @CHAI_Berkeley
We’re hiring a research assistant for the book that @Michael05156007 is writing on extinction risk from AI! Please apply by September 19, 2025. Link in the next tweet:
Center for Human-Compatible AI @CHAI_Berkeley
Our 2026 internship applications are now open! Learn more about the internship and apply: humancompatible.ai/jobs#chai-inte… Deadline: October 5, 2025, at 11:59 p.m. PST
Center for Human-Compatible AI @CHAI_Berkeley
Who should apply? Current undergrads, Master's and PhD students, researchers in CS or adjacent fields, professional software or ML engineers... the list goes on! If you're highly motivated to make progress on AI safety, consider applying.
Center for Human-Compatible AI @CHAI_Berkeley
Our interns: • Contribute to research with the potential for paper authorship • Build a pathway into AI safety work • Work alongside curious and ethically minded researchers
Center for Human-Compatible AI reposted
Karim Abdel Sadek @Karim_abdelll
*New AI Alignment Paper* 🚨 Goal misgeneralization occurs when AI agents learn the wrong reward function, instead of the human's intended goal. 😇 We show that training with a minimax regret objective provably mitigates it, promoting safer and better-aligned RL policies!
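The minimax-regret objective in that announcement can be illustrated with a toy computation. The environments, return values, and policy names below are invented for the sketch and are not from the paper; the sketch only shows the objective itself: prefer the policy whose worst-case gap to the optimal return, across training environments, is smallest.

```python
# Toy per-environment returns for two candidate policies (invented values).
returns = {
    "policy_A": {"env1": 10.0, "env2": 2.0},   # exploits env1, fails on env2
    "policy_B": {"env1": 7.0,  "env2": 7.0},   # robust across both
}
# Best achievable return in each environment (also invented).
optimal = {"env1": 10.0, "env2": 8.0}

def max_regret(policy: str) -> float:
    """Worst-case regret: the largest gap between the optimal return and
    this policy's return over all training environments."""
    return max(optimal[env] - returns[policy][env] for env in optimal)

# Minimax regret selects the policy with the smallest worst-case regret.
best = min(returns, key=max_regret)
```

A plain return-maximizer could prefer the brittle policy if env1 dominates the training mix; the minimax-regret criterion instead penalizes exactly the environments where the learned goal generalizes badly.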
Center for Human-Compatible AI reposted
Cassidy Laidlaw @cassidy_laidlaw
We built an AI assistant that plays Minecraft with you. Start building a house—it figures out what you’re doing and jumps in to help. This assistant *wasn't* trained with RLHF. Instead, it's powered by *assistance games*, a better path forward for building AI assistants. 🧵
Center for Human-Compatible AI reposted
Ben Plaut @benplaut
(1/7) New paper with @khanhxuannguyen and @thetututrain! Do LLM output probabilities actually relate to the probability of correctness? Or are they channeling this guy: ⬇️
Center for Human-Compatible AI reposted
Aly Lidayan @a_lidayan
🚨Our new #ICLR2025 paper presents a unified framework for intrinsic motivation and reward shaping: they signal the value of the RL agent’s state🤖=external state🌎+past experience🧠. Rewards based on potentials over the learning agent’s state provably avoid reward hacking!🧵
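The potential-based shaping that this thread's framework builds on has a standard one-line form, r' = r + γΦ(s') − Φ(s): because the potential terms telescope over any trajectory, the shaped reward leaves optimal policies unchanged, which is the sense in which it "provably avoids reward hacking". The states and potential values below are toy assumptions for illustration.

```python
gamma = 0.99  # discount factor (toy value)

# Toy potential function Phi over a tiny state space (invented values).
phi = {"start": 0.0, "near_goal": 5.0, "goal": 10.0}

def shaped_reward(r: float, s: str, s_next: str) -> float:
    """Potential-based shaping: add gamma * Phi(s') - Phi(s) to the base
    reward. The bonus telescopes along a trajectory, so it changes the
    learning signal without changing which policies are optimal."""
    return r + gamma * phi[s_next] - phi[s]
```

For example, a zero-base-reward transition from "start" to "near_goal" earns a shaping bonus of 0.99 × 5.0 − 0.0 = 4.95, while looping back and forth between two states nets approximately nothing, so the agent cannot farm the bonus.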