
NEW PAPER from UK AISI Model Transparency team: Could we catch AI models that hide their capabilities? We ran an auditing game to find out. The red team built sandbagging models. The blue team tried to catch them. The red team won. Why? 🧵1/17
Thomas Read
100 posts

@thjread
Interpretability researcher at UK AISI

NEW PAPER from UK AISI Model Transparency team: Could we catch AI models that hide their capabilities? We ran an auditing game to find out. The red team built sandbagging models. The blue team tried to catch them. The red team won. Why? 🧵1/17

🧵 1/13 My new team at UK AISI - the White Box Control Team - has released progress updates! We've been investigating whether AI systems could deliberately underperform on evaluations without us noticing. Key findings below 👇













An ethics board that oversees clinical trials in the U.K. has said a research team there can begin intentionally infecting volunteers with #SARSCOV2, a world-first experiment intended to accelerate research into #COVID19 #vaccines. fcld.ly/oyej2dp










