Raphaël Millière

2.7K posts

@raphaelmilliere

AI & Cognitive Science @UniofOxford @EthicsInAI Fellow @JesusOxford @raphaelmilliere.com on 🦋 Blog: https://t.co/2hJjfShFfr

Oxford, UK · Joined May 2016

2.9K Following · 10.9K Followers

Pinned Tweet

Raphaël Millière@raphaelmilliere·3 Jun

Transformer-based neural networks achieve impressive performance on coding, math & reasoning tasks that require keeping track of variables and their values. But how can they do that without explicit memory? 📄 Our new ICML paper investigates this in a synthetic setting! 🧵 1/13

100

626

72.8K

Raphaël Millière@raphaelmilliere·1d

@francoisfleuret @TMoldwin What do you mean by “knowledge”? 🙃

566

François Fleuret@francoisfleuret·1d

Hot take: machine learning and AI did more to understand the nature of knowledge, and our relation to reality than 20 centuries of philosophy. I am ready to kind of defend this hill.

362

109

1.5K

123.4K

Raphaël Millière@raphaelmilliere·2d

@karinavold @TorontoSRI Thanks for having me!

173

Karina Vold@karinavold·2d

Big thanks to @raphaelmilliere for his talk @TorontoSRI about why AI needs philosophy and cognitive science to make credible evaluations 🤖 vs 👦🏻 vs 🐵

266

Raphaël Millière retweeted

Kanishka Misra 🌊@kanishkamisra·4d

New opinion piece on the interface between research on concepts and categories in minds vs. in neural network LMs! I take the position that there is much to be learned from this interface (e.g., learning about concepts from language alone) and outline some directions for future.

Raphaël Millière retweeted

Aryaman Arora@aryaman2020·9 May

all mech interp people are bought into causality, this criticism is very lazy as of ~2 years ago. since this is a subtweet of NLAs, it is worth pointing out that their steering experiments on the poetry and eval awareness tasks *do* test for (in those cases) causality!

Subbarao Kambhampati (కంభంపాటి సుబ్బారావు)@rao2z

Guys, stop pestering Mech Interp researchers about causality please! It's this inexplicable obsession with causality that made us lose beautiful sciences like Astrology, Palmistry and Phrenology! 😡

129

14.9K

Raphaël Millière@raphaelmilliere·5d

@littmath POV you're Spinoza

2.8K

Daniel Litt@littmath·5d

guy who understands things without thinking about them

2.4K

167.6K

Raphaël Millière retweeted

Aryaman Arora@aryaman2020·6d

pov: you are a natural language autoencoder and you are aware you are being subject to evals by Redwood Research. do you fake writing out a coherent cot or truthfully say "the math problem is giving me 92ish vibes"?

Ryan Greenblatt@RyanPGreenblatt

How well does this work? One quick independent test is to see if it can recover an "internal CoT" in cases where AIs can solve math problems in a single forward pass. TLDR: it doesn't. (TBC, this might require the NLA to see activations at multiple positions/location to work.)

127

10.7K

Raphaël Millière@raphaelmilliere·8 May

@elyasbuilds I like activation steering as much as the next guy, but this isn't what I was referring to: x.com/raphaelmillier…

Raphaël Millière@raphaelmilliere

@jatin_n0 Mostly a joke, it's a cool paper! yes the planning result is causal but only looking at total effect (i.e. an NLA-derived resid stream edit changes the output). I was referring to causal effect on the model's downstream computations, not anything inside/after the autoencoder. 1/2

230

Elyas Masrour@elyasbuilds·7 May

@raphaelmilliere anthropic.com/news/golden-ga… 🤗

418

Raphaël Millière@raphaelmilliere·7 May

Anthropic@AnthropicAI

New Anthropic research: Natural Language Autoencoders. Models like Claude talk in words but think in numbers. The numbers—called activations—encode Claude’s thoughts, but not in a language we can read. Here, we train Claude to translate its activations into human-readable text.

141

13K

Raphaël Millière@raphaelmilliere·8 May

@jatin_n0 An additive AR-difference vector can change the output while acting as a broad steering perturbation without showing that the described content actually maps onto the operative feature in the model's putative "rhyme-planning" circuit 3/3

262

Raphaël Millière@raphaelmilliere·8 May

@jatin_n0 What's missing is evidence about causal mediation: whether the NLA-described "rabbit plan" is the variable later components read, whether the edit produces a coherent "mouse plan" in later layers/tokens, whether ablating/patching intermediate states blocks or restores the effect 2/

333

Raphaël Millière@raphaelmilliere·4 May

@Dr_Atoosa @GoogleDeepMind Congrats! Looking forward to welcoming you back on this side of the pond :)

756

Atoosa Kasirzadeh@Dr_Atoosa·4 May

🌟 Big personal news: I’m joining @GoogleDeepMind full-time in London starting this week. I’ll be working on the implications of AGI for human life, science, and society; on what it means to live, connect, and discover in a world where cognitive agency is no longer uniquely ours. The way we answer these questions will define what it means to be human. I can’t think of a better place to do it.

1.2K

103.7K

Raphaël Millière@raphaelmilliere·1 May

@glnmario Also, behavioral sensitivity to input properties doesn't entail that the model routes through the probe-site representation, so if path patching from those sites finds no downstream consumer whose output depends on them, it's possible the info is decodable but not causally used? 3/3

Raphaël Millière@raphaelmilliere·1 May

@glnmario Or maybe the information is consumed via attention routing, and patching residual values at the probe site is the wrong intervention. E.g. you could probe attention patterns directly and freeze them during your existing patches to check if self-repair was masking the effect 2/3

Mario Giulianelli@glnmario·1 May

I keep seeing the same pattern that probes can decode the relevant information, often at multiple layers/sites, but activation patching doesn't change behaviour. Even patching across multiple layers had no effect. Is this a known failure mode? What should I be doing differently?

1.8K

Raphaël Millière@raphaelmilliere·1 May

@glnmario @TransluceAI Maybe worth taking a look at difference-in-means (see scores on CausalGym: arxiv.org/abs/2402.12560)

Mario Giulianelli@glnmario·1 May

@TransluceAI Last night I was trying to use Distributed Alignment Search (original, low-rank, boundless) but the runtime makes it effectively unusable

298

Raphaël Millière retweeted

Michael C. Frank@mcxfrank·28 Apr

For a year and a half, @CaroRowland, @leher_singh, Marisa Casillas, Shanley Allen, and I have been meeting to discuss whether innateness is still a useful concept to think about in studying language acquisition.

3.3K

Raphaël Millière@raphaelmilliere·27 Apr

Further empirical evidence that normative conflicts facilitate jailbreaks of reasoning models: arxiv.org/abs/2604.09750

Raphaël Millière@raphaelmilliere

Despite extensive safety training, LLMs remain vulnerable to “jailbreaking” through adversarial prompts. Why does this vulnerability persist? In a new paper published in Philosophical Studies, I argue this is because current alignment methods are fundamentally shallow. 1/13

4.5K

Raphaël Millière@raphaelmilliere·20 Apr

@Dr_Atoosa In case this was partly prompted by my tweet: I was just responding to a specific historical claim, but I agree that the analytic/continental divide has partly outlived its usefulness in contemporary philosophy and that we should all strive for clarity, depth, and significance!

676

Atoosa Kasirzadeh@Dr_Atoosa·20 Apr

Why are philosophers still clinging to the analytic–continental divide? It is 2026! We should be past it! The split now often closes doors to creative thinking and protects academic tribes. Philosophers of science and technology over the past century largely moved on by engaging history of science and STS, and their field became richer for it. The rest of philosophy should do the same. There are obscure continental philosophers, yes, but also plenty of analytic philosophers who manufacture tiny problems and solve them for an audience of twelve and claim pride for that achievement. What matters in our world of many open problems is whether a thinker is clear, interesting, and actually helps us understand or solve something important.

12.9K
