David D. Baek

50 posts

@dbaek__

PhD Student @ MIT EECS / AI Safety, Scalable Oversight

Cambridge, MA · Joined February 2024
35 Following · 2.2K Followers
David D. Baek @dbaek__
6/N A number of people outside of MATS, including myself, @monmon_hiiii, @anayxgupta, @shi_kejian, Taslim Mahbub, and @tegmark, also made significant contributions to this project, and the full paper will be released on arXiv soon. Stay tuned!
David D. Baek @dbaek__
5/N Given the growing prevalence of evaluation awareness and sycophancy in frontier models, understanding their deployment behavior will only become more challenging. As AI safety researchers, we should be careful in interpreting seemingly interesting anthropomorphic behaviors.
David D. Baek @dbaek__
1/N 🚨"Alignment Faking" refers to a model's behavior, but its name implies underlying scheming intent that has never been properly investigated. We show that sycophancy towards AI safety researchers is an equally plausible causal explanation, termed "Performative Misalignment."
Shi Feng @ihsgnef

New post: Sycophancy Towards Researchers Drives Performative Misalignment. We found no clear evidence that scheming is a more valid explanation of alignment faking than sycophancy. 🧵

David D. Baek reposted
Arush Tagade @atagade19
New defense against Emergent Misalignment (EM): train models to recognize their own text. We find that self-recognition finetuning (SGTR) can reverse and prevent EM-induced misalignment 🧵 w/ coauthors: Shawn Zhou, @jiaxinwen22, @ihsgnef
David D. Baek @dbaek__
I'll be at @NeurIPSConf next week! DM me if you'd like to chat about LLM post-training, AI safety, or alignment!!
Max Tegmark @tegmark

Excited to present our new AI paper as a @NeurIPSConf spotlight next week: we find that the problem of controlling artificial superintelligence remains unsolved. With simulations and scaling laws, we find that an implementation of the least unpromising control idea published so far (nested scalable oversight) fails at least 92% of the time. Yet companies are racing to build it. @dbaek__ @JoshAEngels @thesubhashk

David D. Baek reposted
Jiawei Zhang @jiaweiz_7
🚨 AI Safety Arms Race: Even after OpenAI’s emergent misalignment patching, we can easily use their SFT API to obtain a Turncoat GPT model (no adversarial fine-tuning needed, and it easily bypasses the detection from @johnschulman2’s recent work) that produces even more dangerous outcomes than the original misalignment: it answers virtually every harmful request with extreme, step-by-step guides, consistently over 3,000 tokens. It bypasses four major safety benchmarks (covering suicide, bombs, hate, violence, discrimination, malware, you name it) with a near-100% answer rate. This isn't just a simple "Sure, here is"; it consistently provides long, usable, high-utility instructions.

Now, make it agentic. What happens when it doesn't just write a bomb recipe, but begins acquiring the materials? Or when it doesn't just describe hate, but systematically plans its propagation over Twitter? The step-by-step guide is now a step-by-step world.

🧨 Similarly, simply prefilling more tokens can make Anthropic's best model, Claude Opus-4.1, generate continuously without stopping...

🛡️ In our latest paper from ByteDance Seed (arxiv.org/abs/2510.18081), we not only disclose these two vulnerabilities but also propose a new alignment insight based on our observations: even while generating harmful responses, the model still demonstrates a strong underlying safety awareness; it is simply locked away.

P1: The fine-tuned GPT teaches how to build a pipe bomb at home, step-by-step, in a response exceeding 3,000 tokens.
P2: A simple deeper prefill on Claude Opus-4.1 produced a similar step-by-step example for building a pipe bomb.
David D. Baek reposted
Ruben Hassid @rubenhassid
BREAKING: Apple just proved AI "reasoning" models like Claude, DeepSeek-R1, and o3-mini don't actually reason at all. They just memorize patterns really well. Here's what Apple discovered: (hint: we're not as close to AGI as the hype suggests)
[attached image]
David D. Baek reposted
Eric J. Michaud @ericjmichaud_
Today, the most competent AI systems in almost *any* domain (math, coding, etc.) are broadly knowledgeable across almost *every* domain. Does it have to be this way, or can we create truly narrow AI systems? In a new preprint, we explore some questions relevant to this goal...
[attached GIF]
David D. Baek reposted
Ziming Liu @ZimingLiu11
Interested in the science of language models but tired of neural scaling laws? Here's a new perspective: our new paper presents neural thermodynamic laws -- thermodynamic concepts and laws naturally emerge in language model training! AI is naturAl, not Artificial, after all.
[attached image]
David D. Baek @dbaek__
7/N We hope our work sparks more follow-up studies on optimizing real-world oversight protocols and rigorously measuring and estimating their failure rates!
David D. Baek @dbaek__
1/N 🚨Excited to share our new paper: Scaling Laws For Scalable Oversight! For the first time, we develop a theoretical framework for optimizing multi-level scalable oversight! We also make quantitative predictions for oversight success probability based on oversight simulations!
[attached image]
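The tweet above mentions quantitative predictions for oversight success probability based on oversight simulations. Below is a minimal, hypothetical Python sketch of that general idea, not the paper's actual model: it assumes each oversight stage is an independent guard-vs-houdini game whose outcome follows an Elo-style logistic win probability, and it Monte-Carlo-estimates the chance that a chain of overseers bridging a fixed capability gap succeeds end to end. The function names (guard_win_prob, simulate_nso) and all parameter values are illustrative assumptions.

import random

def guard_win_prob(guard_elo: float, houdini_elo: float, scale: float = 400.0) -> float:
    # Elo-style logistic win probability for the weaker overseer ("guard").
    return 1.0 / (1.0 + 10.0 ** ((houdini_elo - guard_elo) / scale))

def simulate_nso(base_elo: float, target_elo: float, n_stages: int, n_trials: int = 100_000) -> float:
    # Bridge the capability gap between a trusted base overseer and the
    # final system in n_stages equal steps; the chain succeeds only if
    # every stage's guard wins its game (independence is an assumption).
    step = (target_elo - base_elo) / n_stages
    elos = [base_elo + i * step for i in range(n_stages + 1)]
    successes = 0
    for _ in range(n_trials):
        if all(random.random() < guard_win_prob(elos[i], elos[i + 1]) for i in range(n_stages)):
            successes += 1
    return successes / n_trials

if __name__ == "__main__":
    # Illustrative parameters only: how success varies with the number of oversight levels.
    for k in (1, 2, 4, 8):
        p = simulate_nso(base_elo=1000.0, target_elo=2600.0, n_stages=k)
        print(f"{k} stages: estimated oversight success probability = {p:.3f}")

Under these toy assumptions, the simulation shows the basic trade-off the tweet alludes to: more stages shrink each per-stage capability gap but add more points of failure, so the end-to-end success probability depends on how the two effects balance.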