MATS Research

86 posts

MATS Research banner
MATS Research

MATS Research

@MATSprogram

MATS empowers researchers to advance AI alignment, transparency, and security

Berkeley, CA Se unió Kasım 2023
136 Siguiendo2.5K Seguidores
MATS Research retuiteado
Shi Feng
Shi Feng@ihsgnef·
New post: Sycophancy Towards Researchers Drives Performative Misalignment We found no clear evidence that scheming is more valid than sycophancy to explain alignment faking. 🧵
Shi Feng tweet media
English
23
56
685
62.5K
MATS Research retuiteado
Iván Arcuschin
Iván Arcuschin@IvanArcus·
You change one word on a loan application: the religion. The LLM rejects it. Change it back? Approved. The model never mentions religion. It just frames the same debt ratio differently to justify opposite decisions. We built a pipeline to find these hidden biases 🧵1/13
Iván Arcuschin tweet media
English
242
1.9K
12.8K
869.8K
MATS Research retuiteado
Joschka Braun
Joschka Braun@BraunJoschka·
1/ What if AI models could resist or influence their own RL training—by strategically choosing what not to explore? Ahead of our upcoming empirical paper, we formalize and decompose "exploration hacking." New conceptual framework post with @eyonjang & @DamonFalck 🧵👇
Joschka Braun tweet media
English
1
5
37
2.3K
MATS Research retuiteado
Wen X.
Wen X.@imwendering·
Thanks for featuring me, MATS! Grateful to @MATSprogram and my mentors @jenner_erik and @davlindner for an incredible research experience! This is just the beginning of what feels like a real adventure in AI safety. If you're a woman or from an underrepresented background curious about breaking into this field, feel free to reach out. I'd be happy to chat.
Wen X. tweet media
English
2
1
50
2.4K
MATS Research retuiteado
Alex Serrano
Alex Serrano@sertealex·
What if a model could strategically misbehave rarely enough that you'd never catch it during testing? LLMs struggle with calibration in many contexts. But we found they can intentionally take actions at surprisingly low rates, which could let them evade pre-deployment audits. 🧵
Alex Serrano tweet media
English
11
12
93
8.9K
MATS Research retuiteado
Bruce W. Lee
Bruce W. Lee@BruceWLee2·
I can double that @MATSprogram unlocked amazing opportunities for me. Furthermore, it's really exciting time to work on AI Control.
John (Yueh-Han) Chen@jcyhc_ai

tbh, it wasn't clear to me at first whether this eval was interesting enough to publish as a paper. thankfully, @tomekkorbak has great intuition and successfully convinced me to work on this. super grateful for leading this project and for @OpenAI tweeting about it. special thanks to @BruceWLee2 for being a great collaborator and providing useful feedback since day 1 this work wouldn't have happened w/o @MATSprogram 💕

English
0
1
23
2.8K
MATS Research retuiteado
Simon Lermen
Simon Lermen@SimonLermenAI·
Happy to share my @MATSprogram project that I have been working on in the last couple of months. We explore how LLMs can be used for large-scale deanonymization online.
Daniel Paleka@dpaleka

Can LLMs figure out who you are from your anonymous posts? From a handful of comments, LLMs can infer where you live, what you do, and your interests; then search for you on the web. New 📄 w/ @SimonLermenAI, @joshua_swans, @AerniMichael, Nicholas Carlini, @florian_tramer 🧵

English
2
6
55
4.8K
MATS Research retuiteado
Bruce W. Lee
Bruce W. Lee@BruceWLee2·
Can we catch misaligned agents by training a reflex that fires when they misbehave? A simple impulse can be easier to instill than alignment and more reliable than blackbox monitoring. We introduce Self-Incrimination, a new AI Control approach that outperforms blackbox monitors
Bruce W. Lee tweet media
English
7
35
187
45.3K
MATS Research retuiteado
Benji Berczi
Benji Berczi@benji_berczi·
Anthropic yesterday: LLMs develop personas in post-training! 🤖 Our work today: LLM personas can be elicited just by prompting! Even harmful ones. 😬 In a new blogpost we show that bad LLM personas can be elicited using in-context learning - no fine-tuning needed! Thread 🧵
English
5
8
59
6.5K
MATS Research retuiteado
Atticus Wang
Atticus Wang@atticuswzf·
Is "a response formatted like this" sometimes better than "a response formatted like this"? To a reward model, yes! RMs are instrumental in shaping model behaviors and alignment. Our paper makes progress uncovering their unexpected preferences. 🧵(1/9)
Atticus Wang tweet media
English
8
12
92
13.7K
Lennart Heim
Lennart Heim@ohlennart·
I'm in SF for a bit. Would love to chat and catch up. I want to know how many AI chips folks have and when AGI is coming. In exchange I can tell you what everyone else has, and that according to DC, AGI isn't real. :)
English
3
0
51
3.1K
MATS Research retuiteado
Anthropic
Anthropic@AnthropicAI·
New on our Frontier Red Team blog: We tested whether AIs can exploit blockchain smart contracts. In simulated testing, AI agents found $4.6M in exploits. The research (with @MATSprogram and the Anthropic Fellows program) also developed a new benchmark: red.anthropic.com/2025/smart-con…
English
354
702
4.8K
2.1M
MATS Research retuiteado
Ryan Kidd
Ryan Kidd@ryan_kidd44·
MATS received a record-breaking 2368 applications for our Summer program and plan to accept 5%. We also received 273 mentor applications and accepted 20% as primary mentors. Applicants are growing exponentially, at ~2x/year!
Ryan Kidd tweet media
English
14
11
230
19.2K
MATS Research
MATS Research@MATSprogram·
Reminder: MATS Summer 2026 applications close this Saturday, January 18 AoE! We've shortened the application this year—most people finish in 1–2 hours, and we'll get back to applicants by the end of January. matsprogram.org/apply
MATS Research tweet media
English
0
4
17
2K