David D. Baek

50 posts

@dbaek__

PhD Student @ MIT EECS / AI Safety, Scalable Oversight

Cambridge, MA · Joined February 2024
35 Following · 2.2K Followers
David D. Baek @dbaek__
6/N A number of people outside of MATS, including myself, @monmon_hiiii, @anayxgupta, @shi_kejian, Taslim Mahbub, and @tegmark, also made significant contributions to this project, and the full paper will be released on arXiv soon. Stay tuned!
David D. Baek @dbaek__
5/N Given the growing prevalence of evaluation awareness and sycophancy in frontier models, understanding their deployment behavior will only become more challenging. As AI safety researchers, we should be careful in interpreting seemingly interesting anthropomorphic behaviors.
David D. Baek @dbaek__
1/N 🚨"Alignment Faking" refers to a model's behavior, but its name implies underlying scheming intent that has never been properly investigated. We show that sycophancy towards AI safety researchers is an equally plausible causal explanation, termed "Performative Misalignment."
Shi Feng @ihsgnef

New post: Sycophancy Towards Researchers Drives Performative Misalignment. We found no clear evidence that scheming is a more valid explanation of alignment faking than sycophancy. 🧵

David D. Baek reposted
Arush Tagade @atagade19
New defense against Emergent Misalignment (EM): train models to recognize their own text. We find that self-recognition finetuning (SGTR) can reverse and prevent EM-induced misalignment 🧵 w/ coauthors: Shawn Zhou, @jiaxinwen22, @ihsgnef
David D. Baek @dbaek__
I'll be at @NeurIPSConf next week! DM me if you'd like to chat about LLM post-training, AI safety, or alignment!!
Max Tegmark @tegmark

Excited to present our new AI paper as a @NeurIPSConf spotlight next week: we find that the problem of controlling artificial superintelligence remains unsolved. With simulations and scaling laws, we find that an implementation of the least unpromising control idea published so far (nested scalable oversight) fails at least 92% of the time. Yet companies are racing to build it. @dbaek__ @JoshAEngels @thesubhashk

David D. Baek reposted
Jiawei Zhang @jiaweiz_7
🚨 AI Safety Arms Race: Even after OpenAI’s emergent misalignment patching, we can easily use their SFT API to obtain a Turncoat GPT model (no adversarial fine-tuning needed, and it easily bypasses the detection from @johnschulman2’s recent work) that produces even more dangerous outcomes than the original misalignment: it answers virtually every harmful request with extreme, step-by-step guides, consistently over 3,000 tokens. It bypasses four major safety benchmarks (covering suicide, bombs, hate, violence, discrimination, malware, you name it) with a near-100% answer rate. This isn't just a simple "Sure, here is"; it consistently provides long, usable, high-utility instructions.

Now, make it agentic. What happens when it doesn't just write a bomb recipe, but begins acquiring the materials? Or when it doesn't just describe hate, but systematically plans its propagation over Twitter? The step-by-step guide is now a step-by-step world.

🧨 Similarly, simply prefilling more tokens can make Anthropic's best model, Claude Opus-4.1, generate continuously without stopping...

🛡️ In our latest paper from ByteDance Seed (arxiv.org/abs/2510.18081), we not only disclose these two vulnerabilities but also propose a new alignment insight based on our observations: even while generating harmful responses, the model still demonstrates a strong underlying safety awareness; it is simply locked away.

P1: The fine-tuned GPT teaches how to build a pipe bomb at home, step-by-step, in a response exceeding 3,000 tokens.
P2: A simple deeper prefill on Claude Opus-4.1 produced a similar step-by-step example for building a pipe bomb.
David D. Baek reposted
Ruben Hassid @rubenhassid
BREAKING: Apple just proved AI "reasoning" models like Claude, DeepSeek-R1, and o3-mini don't actually reason at all. They just memorize patterns really well. Here's what Apple discovered: (hint: we're not as close to AGI as the hype suggests)
[attached image]
David D. Baek reposted
Eric J. Michaud @ericjmichaud_
Today, the most competent AI systems in almost *any* domain (math, coding, etc.) are broadly knowledgeable across almost *every* domain. Does it have to be this way, or can we create truly narrow AI systems? In a new preprint, we explore some questions relevant to this goal...
[attached GIF]
David D. Baek reposted
Ziming Liu @ZimingLiu11
Interested in the science of language models but tired of neural scaling laws? Here's a new perspective: our new paper presents neural thermodynamic laws -- thermodynamic concepts and laws naturally emerge in language model training! AI is naturAl, not Artificial, after all.
[attached image]
David D. Baek @dbaek__
7/N We hope our work sparks more follow-up studies on optimizing real-world oversight protocols and rigorously measuring and estimating their failure rates!
David D. Baek @dbaek__
1/N 🚨Excited to share our new paper: Scaling Laws For Scalable Oversight! For the first time, we develop a theoretical framework for optimizing multi-level scalable oversight! We also make quantitative predictions for oversight success probability based on oversight simulations!
[attached image]
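The tweet above mentions quantitative predictions for oversight success probability based on oversight simulations. Below is a minimal, hypothetical Python sketch of that general idea, not the paper's actual model: it assumes each oversight stage is an independent guard-vs-houdini game whose outcome follows an Elo-style logistic win probability, and it Monte-Carlo-estimates the chance that a chain of overseers bridging a fixed capability gap succeeds end to end. The function names (guard_win_prob, simulate_nso) and all parameter values are illustrative assumptions.

import random

def guard_win_prob(guard_elo: float, houdini_elo: float, scale: float = 400.0) -> float:
    # Elo-style logistic win probability for the weaker overseer ("guard").
    return 1.0 / (1.0 + 10.0 ** ((houdini_elo - guard_elo) / scale))

def simulate_nso(base_elo: float, target_elo: float, n_stages: int, n_trials: int = 100_000) -> float:
    # Bridge the capability gap between a trusted base overseer and the
    # final system in n_stages equal steps; the chain succeeds only if
    # every stage's guard wins its game (independence is an assumption).
    step = (target_elo - base_elo) / n_stages
    elos = [base_elo + i * step for i in range(n_stages + 1)]
    successes = 0
    for _ in range(n_trials):
        if all(random.random() < guard_win_prob(elos[i], elos[i + 1]) for i in range(n_stages)):
            successes += 1
    return successes / n_trials

if __name__ == "__main__":
    # Illustrative parameters only: how success varies with the number of oversight levels.
    for k in (1, 2, 4, 8):
        p = simulate_nso(base_elo=1000.0, target_elo=2600.0, n_stages=k)
        print(f"{k} stages: estimated oversight success probability = {p:.3f}")

Under these toy assumptions, the simulation shows the basic trade-off the tweet alludes to: more stages shrink each per-stage capability gap but add more points of failure, so the end-to-end success probability depends on how the two effects balance.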