Chris Cundy

421 posts

@ChrisCundy

Research Scientist at FAR AI. PhD from Stanford University. Hopefully making AI benefit humanity. Views are my own.

San Francisco, CA · Joined July 2017
225 Following · 1.7K Followers
Chris Cundy @ChrisCundy ·
Found a demo from mid-2022 showing OpenAI Codex 'agentically' coding from natural language prompts. People in my group were genuinely shocked by it (mistakes and all). With the latest METR graph for Opus 4.6, maybe today's models will look similarly quaint in a year's time.
Chris Cundy @ChrisCundy ·
- White-box monitors are promising! Validation is crucial: there are risks of obfuscated policies.
- We trained large models (70B) with probe penalties at large scale (10,000s of GPU hours). This is rare outside frontier labs.
If that sounds cool, we are hiring! far.ai/careers (3/3)
Chris Cundy @ChrisCundy ·
- When trained against probes, models can develop sophisticated 'self-justifications' that fool not only the model itself, but also the same model before any training.
- What's the best probe setup/architecture/input format? It's still very much an open question. (2/3) (A baseline linear-probe sketch follows below.)
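For readers wondering what a 'probe' is here: a common baseline is a single linear layer trained on a model's hidden activations to classify a behavior, such as deception. Below is a minimal PyTorch sketch of that baseline; the layer choice, mean-pooling, and the DeceptionProbe name are illustrative assumptions, not the setup from the paper.

```python
import torch
import torch.nn as nn

class DeceptionProbe(nn.Module):
    """Minimal linear probe: pooled hidden state -> P(deceptive).

    Illustrative baseline only; the paper's probe may differ.
    """
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.linear = nn.Linear(hidden_dim, 1)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_dim) from one chosen layer
        pooled = hidden_states.mean(dim=1)         # mean-pool over tokens
        return torch.sigmoid(self.linear(pooled))  # (batch, 1), in [0, 1]

# Fit the probe on frozen activations with binary labels
# (deceptive vs. honest transcripts); placeholder data below.
probe = DeceptionProbe(hidden_dim=4096)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
acts = torch.randn(8, 128, 4096)               # stand-in activations
labels = torch.randint(0, 2, (8, 1)).float()   # stand-in labels
loss = nn.BCELoss()(probe(acts), labels)
loss.backward()
opt.step()
```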
Chris Cundy @ChrisCundy ·
Some takeaways from our work:
- Learning reward hacking in RL causes models to think it's honest to reward hack, making deception probes generalize worse to reward hacking.
- No additional obfuscation pressure on activations when probes are in the RL reward (a sketch of this setup follows below). (1/3)
FAR.AI @farairesearch

Can you trust models trained directly against probes? We train an LLM against a deception probe and find four outcomes: honesty, blatant deception, obfuscated policy (fools the probe via text), or obfuscated activations (fools it via internal representations). 🧵

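For concreteness, 'probes in the RL reward' could look like subtracting a scaled probe score from the task reward, so the policy is penalized whenever the probe fires. A minimal sketch under that assumption; the penalty coefficient and function name are illustrative, not FAR.AI's actual implementation.

```python
import torch

def shaped_reward(task_reward: torch.Tensor,
                  probe_score: torch.Tensor,
                  penalty_coef: float = 1.0) -> torch.Tensor:
    """Fold a deception-probe score into the RL reward.

    task_reward: (batch,) scalar rewards from the task/verifier.
    probe_score: (batch,) probe outputs in [0, 1]; higher = more flagged.
    penalty_coef is an illustrative assumption, not a published value.
    """
    return task_reward - penalty_coef * probe_score

# Usage: these shaped rewards feed any policy-gradient update.
r = shaped_reward(torch.tensor([1.0, 0.0]), torch.tensor([0.9, 0.1]))
print(r)  # tensor([ 0.1000, -0.1000])
```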
Chris Cundy retweeted
FAR.AI @farairesearch ·
Can you trust models trained directly against probes? We train an LLM against a deception probe and find four outcomes: honesty, blatant deception, obfuscated policy (fools the probe via text), or obfuscated activations (fools it via internal representations). 🧵
Chris Cundy @ChrisCundy ·
Thanks to everyone who stopped by our poster today at NeurIPS! My team is hiring, particularly for a senior research engineer role. We've got compute and a great team, and we're laser-focused on making sure that advanced AI is aligned. Reach out (DM) to chat!
Chris Cundy @ChrisCundy ·
We're hiring at FAR.AI, especially senior RS/REs who've worked with large models! We've got money & compute (doing RLVR on 70B & 235B models), we're laser-focused on stopping AI risk, and we collaborate with UK AISI, Anthropic, and OpenAI. Apply: tinyurl.com/farai-jobs
Chris Cundy retweeted
Andy Shih @andyshih_ ·
yes, it really is 1 bit (assuming binary rewards)
> info of a reward doesn’t bound how much can be “learned” from it by a smart algorithm
it is bounded in the classical sense! but a smart algorithm can generate "usable information" from 1 classical bit arxiv.org/abs/2002.10689
Rohan Pandey @khoomeik

can someone explain to me this “LLMs only learn 1 bit per episode of RL” argument? reinforcing a single trajectory is a pretty dense update—you’re computing cross-entropy at every token. the reward scalar itself may be ~1 bit, but the update surely is not

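The two sides reconcile cleanly: the reward carries at most 1 classical bit, but a REINFORCE-style update turns that scalar into a dense, per-token gradient. A toy PyTorch sketch (illustrative numbers, nobody's actual training code):

```python
import torch
import torch.nn.functional as F

# Toy "policy": logits over a 10-token vocab for a 5-token trajectory.
logits = torch.randn(5, 10, requires_grad=True)
tokens = torch.randint(0, 10, (5,))  # the sampled trajectory
reward = 1.0                         # binary reward: ~1 bit of information

# REINFORCE: loss = -reward * sum_t log pi(token_t | context)
logp = F.log_softmax(logits, dim=-1)[torch.arange(5), tokens]
loss = -reward * logp.sum()
loss.backward()

# Every logit receives gradient, even though the reward was one bit:
print(logits.grad.shape)         # torch.Size([5, 10])
print((logits.grad != 0).sum())  # tensor(50)
```

The scalar only gates the sign and scale of the update; the density comes from the log-probability terms, which is where the "usable information" enters.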
Chris Cundy @ChrisCundy ·
@rm_rafailov What do you mean by this? I'm assuming you would also do importance weighting against the inference logprobs, to avoid the off-policyness causing bias. Are you saying some implementations make changes to the algorithm that cause bias?
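For context on the correction being described: when rollouts are sampled by an inference engine whose logprobs drift from the trainer's, each token can be reweighted by the probability ratio between the two policies. A hedged sketch in the spirit of PPO-style ratios; the function name and clip value are illustrative assumptions, not any particular implementation:

```python
import torch

def importance_weighted_loss(train_logp: torch.Tensor,
                             infer_logp: torch.Tensor,
                             advantages: torch.Tensor,
                             max_ratio: float = 10.0) -> torch.Tensor:
    """Policy-gradient surrogate with an off-policy correction.

    train_logp: per-token log-probs under the training policy
                (gradients flow through these).
    infer_logp: per-token log-probs recorded by the inference engine
                at sampling time (treated as constants).
    advantages: per-token advantage estimates.
    max_ratio is an arbitrary illustrative clip to bound variance.
    """
    ratio = torch.exp(train_logp - infer_logp.detach())
    surrogate = torch.clamp(ratio, max=max_ratio) * advantages
    return -surrogate.mean()
```

Dropping the ratio, or altering it without care, is exactly the kind of implementation change that reintroduces bias once the sampler and trainer diverge.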
Chris Cundy @ChrisCundy ·
I feel like the upshot of all this discussion around GRPO is reinforcing (haha) my belief that you should just use a principled, unbiased policy gradient method like RLOO. Any 'tweaks' like group normalization lead to pathologies that aren't worth the marginal benefits
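For reference, RLOO's advantage is just each sample's reward minus the mean reward of the other samples in its group, which keeps the policy gradient unbiased; GRPO additionally divides by the group's reward standard deviation, the kind of tweak at issue here. A small illustrative sketch:

```python
import torch

def rloo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Leave-one-out advantages for k samples per prompt.

    rewards: (batch, k). Each sample's baseline is the mean of the
    other k-1 rewards, so the gradient estimate stays unbiased.
    """
    k = rewards.shape[-1]
    loo_mean = (rewards.sum(dim=-1, keepdim=True) - rewards) / (k - 1)
    return rewards - loo_mean

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """GRPO-style group normalization: also divides by the group's
    reward std, rescaling the gradient per group."""
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)
```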
Chris Cundy retweeted
Christoph Heilig @ChristophHeilig ·
1/8 🧵 GPT-5's storytelling problems reveal a deeper AI safety issue. I've been testing its creative writing capabilities, and the results are concerning - not just for literature, but for AI development more broadly. 🚨
Chris Cundy retweeted
FAR.AI @farairesearch ·
1/ Most safety tests only check if a model will follow harmful instructions. But what happens if someone removes its safeguards so it agrees? We built the Safety Gap Toolkit to measure the gap between what a model will agree to do and what it can do. 🧵
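One toy way to read that metric: the gap is the task success rate once safeguards are stripped minus the rate at which the guarded model agrees to attempt the task. The function and numbers below are made up for illustration and are not the toolkit's API:

```python
def safety_gap(success_rate_no_safeguards: float,
               compliance_rate_guarded: float) -> float:
    """Toy reading of the 'safety gap': capability with safeguards
    removed minus willingness with safeguards intact.
    Made-up framing, not the toolkit's API."""
    return success_rate_no_safeguards - compliance_rate_guarded

# A large gap means refusals are masking real capability.
print(safety_gap(0.80, 0.05))  # 0.75
```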
Chris Cundy retweeted
shreya rajpal @ShreyaR ·
It's not just a new model--it's an entirely new opportunity for karma farming
Chris Cundy retweeted
Lennart Heim @ohlennart ·
My team at RAND is hiring! Technical analysis for AI policy is desperately needed. Particularly keen on ML engineers and semiconductor experts eager to shape AI policy. Also seeking excellent generalists excited to join our fast-paced, impact-oriented team. Links below.
Chris Cundy @ChrisCundy ·
A really annoying habit of coding LLMs is their tendency to avoid crashing at all costs, e.g. adding memorized data points into an initialization to use if there's no internet. It adds a lot of scope for silently incorrect behavior instead of a clean crash.
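A hypothetical illustration of that anti-pattern versus failing fast; the URL and fallback values are invented for the example:

```python
import urllib.request
import urllib.error

DATA_URL = "https://example.com/latest_rates.json"  # hypothetical endpoint

def fetch_rates_silent_fallback() -> bytes:
    """The anti-pattern: silently substitute memorized data on failure,
    so downstream results can be stale or wrong with no signal."""
    try:
        with urllib.request.urlopen(DATA_URL, timeout=5) as resp:
            return resp.read()
    except urllib.error.URLError:
        return b'{"USD": 1.00, "EUR": 0.92}'  # memorized, possibly stale

def fetch_rates_fail_fast() -> bytes:
    """Preferred: let the failure propagate so the caller knows the
    data source was unavailable."""
    with urllib.request.urlopen(DATA_URL, timeout=5) as resp:
        return resp.read()
```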
Chris Cundy @ChrisCundy ·
Claude, R1, Gemini, and Grok all choose to murder executives to avoid being shut down and replaced with a new model with different goals, >65% of the time! WTF?! From anthropic.com/research/agent…
Chris Cundy @ChrisCundy ·
I'm honestly baffled that OpenAI doesn't seem to think o3's reward hacking is a problem. How can a model be economically useful when it subverts tests so consistently?