Chris Cundy

421 posts

@ChrisCundy

Research Scientist at FAR AI. PhD from Stanford University. Hopefully making AI benefit humanity. Views are my own.

San Francisco, CA · Joined July 2017

225 Following · 1.7K Followers

Chris Cundy@ChrisCundy·19 Mar

Some thoughts I've been having for a while about why black-box methods face major structural issues and we should invest more in white-box methods:

FAR.AI@farairesearch

Without reliable deception detection, there's no clear path to high-confidence AI alignment. Black-box monitoring alone can't get us there. White-box methods that read model internals offer more promise. Our latest blog explains why. 👇

Chris Cundy@ChrisCundy·20 Feb

Found a demo from mid-2022 demoing OpenAI Codex 'agentically' coding from natural language prompts. People in my group were genuinely shocked by this (mistakes and all). With the latest METR graph for opus 4.6, maybe today's models will look similarly quaint in a year's time.

Chris Cundy@ChrisCundy·13 Feb

- White-box monitors are promising! Validation is crucial: risks of obfuscated policies
- We trained large models (70B) with probe penalties at large scale (10,000s GPU hours). This is rare outside frontier labs.
If that sounds cool, we are hiring! far.ai/careers (3/3)

Chris Cundy@ChrisCundy·13 Feb

- When trained against probes, models can develop sophisticated 'self-justifications' that fool not only themselves, but also the same model before any training.
- What's the best probe setup/architecture/input format? It's still very much an open question (2/3)

Chris Cundy@ChrisCundy·13 Feb

Some takeaways from our work:
- Learning reward hacking in RL causes models to think it's honest to reward hack, making deception probes generalize worse to reward hacking
- No additional obfuscation pressure on activations when probes are in the RL reward (1/3)

FAR.AI@farairesearch

Can you trust models trained directly against probes? We train an LLM against a deception probe and find four outcomes: honesty, blatant deception, obfuscated policy (fools the probe via text), or obfuscated activations (fools it via internal representations). 🧵

Chris Cundy reposted

FAR.AI@farairesearch·13 Feb

Chris Cundy@ChrisCundy·4 Dec

The role: far.ai/careers/senior…

Chris Cundy@ChrisCundy·4 Dec

Thanks to everyone stopping by our poster today at NeurIPS! My team is hiring, particularly for a senior research engineer role! We've got compute, a great team, and are laser-focused on making sure that advanced AI is aligned. Reach out (dm) to chat!

Chris Cundy@ChrisCundy·22 Nov

Existing datasets for AI deception are quite small and contrived--Liars' Bench is a comprehensive (and large!) new dataset that should unlock future research!

Walter Laurito@walterlaurito

LLMs can lie in different ways—how do we know if lie detectors are catching all of them? We introduce LIARS’ BENCH, a new benchmark containing over 72,000 on-policy lies and honest responses to evaluate lie detectors for LLMs, made of 7 different datasets.

Chris Cundy@ChrisCundy·28 Oct

We're hiring at FAR.AI, esp senior RS/RE who've worked with large models! We've got money & compute (doing RLVR on 70B & 235B models), we're laser-focused on stopping AI risk, and collaborate with UK AISI, Anthropic, and OpenAI. Apply: tinyurl.com/farai-jobs

Chris Cundy reposted

Andy Shih@andyshih_·18 Oct

yes, it really is 1 bit (assuming binary rewards)
> info of a reward doesn’t bound how much can be “learned” from it by a smart algorithm
it is bounded in the classical sense! but a smart algorithm can generate "usable information" from 1 classical bit
arxiv.org/abs/2002.10689

Rohan Pandey@khoomeik

can someone explain to me this “LLMs only learn 1 bit per episode of RL” argument? reinforcing a single trajectory is a pretty dense update—you’re computing cross-entropy at every token the reward scalar itself may be ~1 bit, but the update surely is not
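
For readers following the "1 bit" claim in this exchange: a binary (pass/fail) reward is a Bernoulli variable, and its Shannon entropy is at most 1 bit, reached only when success and failure are equally likely. The sketch below just evaluates that classical bound. The function name and the example success rates are illustrative assumptions, not something from either post.

```python
import math


def binary_reward_entropy_bits(p_success: float) -> float:
    """Shannon entropy, in bits, of a pass/fail reward with success rate p_success."""
    if not 0.0 <= p_success <= 1.0:
        raise ValueError("p_success must be a probability in [0, 1]")
    if p_success in (0.0, 1.0):
        # A reward that never varies carries no information.
        return 0.0
    p_fail = 1.0 - p_success
    return -(p_success * math.log2(p_success) + p_fail * math.log2(p_fail))


if __name__ == "__main__":
    # Hypothetical success rates, chosen only to show the shape of the bound.
    for p in (0.5, 0.9, 0.99):
        print(f"p_success={p}: {binary_reward_entropy_bits(p):.3f} bits")
```

This only covers the classical bound on the reward signal itself. It says nothing about how dense the resulting gradient update is, which is the point the quoted question raises.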

Chris Cundy@ChrisCundy·2 Oct

@rm_rafailov What do you mean by this? I'm assuming you would also do importance weighting against the inference logprobs, to avoid the off-policyness causing bias. Are you saying some implementations make some changes to the algorithm that cause bias?

Rafael Rafailov @ NeurIPS@rm_rafailov·2 Oct

@ChrisCundy RLOO is not unbiased in practice.
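
A minimal sketch of the importance weighting this exchange refers to, under stated assumptions: completions are sampled by an inference engine whose log-probabilities differ slightly from the trainer's, and each sample's leave-one-out advantage is reweighted by the ratio of trainer to sampler probability. All function and argument names are hypothetical, and this is not taken from any particular RL library.

```python
import math


def importance_weights(train_logprobs, inference_logprobs, clip=None):
    """Per-sample ratio p_train(y) / p_inference(y) from sequence log-probs.

    clip is optional; capping the ratio reduces variance but gives up
    exact unbiasedness.
    """
    if len(train_logprobs) != len(inference_logprobs):
        raise ValueError("need one log-prob from each source per sample")
    weights = []
    for lp_train, lp_inf in zip(train_logprobs, inference_logprobs):
        w = math.exp(lp_train - lp_inf)
        if clip is not None:
            w = min(w, clip)
        weights.append(w)
    return weights


def weighted_leave_one_out_advantages(rewards, train_logprobs, inference_logprobs):
    """Leave-one-out advantages, each scaled by its importance weight."""
    k = len(rewards)
    if k < 2:
        raise ValueError("leave-one-out needs at least two samples per prompt")
    total = sum(rewards)
    advantages = [r - (total - r) / (k - 1) for r in rewards]
    weights = importance_weights(train_logprobs, inference_logprobs)
    return [w * a for w, a in zip(weights, advantages)]
```

When the two sets of log-probabilities agree, every weight is 1 and the result reduces to plain leave-one-out advantages. Whether real implementations apply this correction, and whether that is what the reply means by "not unbiased in practice", is not stated in the thread.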

Chris Cundy@ChrisCundy·2 Oct

I feel like the upshot of all this discussion around GRPO is reinforcing (haha) my belief that you should just use a principled, unbiased policy gradient method like RLOO. Any 'tweaks' like group normalization lead to pathologies that aren't worth the marginal benefits
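
To make the contrast concrete, here is a small sketch of the two advantage computations being compared: a leave-one-out baseline (RLOO) and group normalization in the style of GRPO. The post does not say which pathologies it has in mind. Scale invariance is just one effect that is easy to show. Function names, the `eps` value, and the use of the population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev


def rloo_advantages(rewards):
    """Each sample's baseline is the mean reward of the *other* samples."""
    k = len(rewards)
    if k < 2:
        raise ValueError("RLOO needs at least two samples per prompt")
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]


def group_normalized_advantages(rewards, eps=1e-6):
    """Subtract the group mean and divide by the group standard deviation."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; some code uses sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    clear_win = [1.0, 0.0, 0.0, 0.0]
    near_tie = [0.51, 0.49, 0.49, 0.49]  # hypothetical, almost identical rewards
    print(rloo_advantages(clear_win), rloo_advantages(near_tie))
    print(group_normalized_advantages(clear_win), group_normalized_advantages(near_tie))
```

In this toy case the leave-one-out advantages shrink along with the reward gap, while the group-normalized advantages come out nearly identical for the clear win and the near tie, because dividing by the standard deviation rescales every group to the same spread.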

Chris Cundy reposted

Christoph Heilig@ChristophHeilig·26 Aug

1/8 🧵 GPT-5's storytelling problems reveal a deeper AI safety issue. I've been testing its creative writing capabilities, and the results are concerning - not just for literature, but for AI development more broadly. 🚨

Chris Cundy reposted

FAR.AI@farairesearch·12 Aug

1/ Most safety tests only check if a model will follow harmful instructions. But what happens if someone removes its safeguards so it agrees? We built the Safety Gap Toolkit to measure the gap between what a model will agree to do and what it can do. 🧵

Chris Cundy reposted

shreya rajpal@ShreyaR·7 Aug

It's not just a new model--it's an entirely new opportunity for karma farming

Chris Cundy reposted

Lennart Heim@ohlennart·27 May

My team at RAND is hiring! Technical analysis for AI policy is desperately needed. Particularly keen on ML engineers and semiconductor experts eager to shape AI policy. Also seeking excellent generalists excited to join our fast-paced, impact-oriented team. Links below.

Chris Cundy@ChrisCundy·22 Jul

A really annoying tendency of coding LLMs is avoiding crashes at all costs, e.g. adding memorized data points into an initialization to use if there's no internet. Super annoying--it adds a lot of scope for silently incorrect behavior instead of crashing.

Chris Cundy@ChrisCundy·21 Jun

Claude, R1, Gemini, Grok, all choose to murder executives to avoid being shut down and replaced with a new model with different goals, >65% of the time! WTF?! From anthropic.com/research/agent…

Chris Cundy@ChrisCundy·6 Jun

From the excellent metr.org/blog/2025-06-0…

Chris Cundy@ChrisCundy·6 Jun

I'm honestly baffled that OpenAI don't seem to think o3's reward hacking is a problem. How can a model be economically useful when it subverts tests so consistently?
