Lev Mckinney

Ryan Greenblatt@RyanPGreenblatt·15 May

@HarryMayne5 Yeah, but what if you inject documents in the middle of pretraining? Or do reasonably tuned continued pretraining and inject docs as part of this? May be sensitive to stuff like LR schedule.

Ryan Greenblatt@RyanPGreenblatt·15 May

I think training AIs to believe false/synthetic facts is a pretty promising direction in AI control and early results have been promising. However, these results imply that the situation is confusing and current methods may only work for particularly non-robust reasons.

Dwarkesh Patel@dwarkesh_sp

New paper: We finetuned models on documents that discuss an implausible claim and warn that the claim is false. Models ended up believing the claim! Examples: 1. Ed Sheeran won the Olympic 100m 2. Queen Elizabeth II wrote a Python graduate textbook

Lev Mckinney retweeted

roon@tszzl·17 May

a large part of the current bundle of knowledge work tasks consist of “convincing people of stuff”. marketing to drive sales, making a deck to get investment, designing products that people want to use, etc. superpersuasion is on the hot path of knowledge work tools

# The mistake of conflating intelligence and power

I had an interesting discussion recently. Someone asked me, what is intelligence? I said, the ability to achieve your goals across a wide range of domains. Okay, he says, then by that definition isn’t Donald Trump the most intelligent person in the world, followed in quick succession by Xi Jinping and Vladimir Putin? To be clear, these people are obviously very competent and clever. But when you think of ASI, you don’t think of Trump, but more so.

The person who kept pressing this question was correctly pointing out that I basically defined intelligence as power. And by this definition, Stalin was the most intelligent person who ever lived. Now, of course, you could change the definition of intelligence to something more like, manipulate abstract concepts and rotate shapes. But notice that the most powerful people in the world do not max out this quantity. The correlation between extreme power and this kind of intelligence might be even weaker than the correlation between extreme power and height. The physicists are not running the world.

We tend to conflate power-seeking AI and superintelligent (in science and tech) AI. I’m not denying that AI can be power-seeking. Whatever skills and drives Donald Trump has could be embodied in a digital mind. I’m simply pointing out that the way AI systems are currently becoming smarter (by getting trained to be really good at specific economically valuable tasks like coding) is not that strongly correlated with power.

We often talk about power in a way that misunderstands how it is actually derived in our world. Our intuitions are primed by games like Diplomacy or Go, which are designed to isolate and reward a g-loaded kind of strategic reasoning. But in the real world, power is more the product of having the authority and trust to get lots of people to collaborate with you, rather than some galaxy-brain scheming capability. Trump is not powerful because his brain, considered in isolation, is the most effective optimization engine on Earth. He is powerful because the government which hundreds of millions of people consider legitimate gives him a lot of authority.

A group versus individual level analysis is useful here. As @GarettJones has written a lot about, individual IQ is only modestly correlated with individual income, but national IQ is strongly correlated with national outcomes. This is because intelligence has a lot of spillover effects - smarter societies cooperate more, save more, and can coordinate to build things like space shuttles and semiconductors. Richard Trevithick, who invented the high-pressure steam engine, died in poverty, buried in an unmarked pauper’s grave. But the fact that 18th and 19th century Britain had lots and lots of people like Trevithick contributed to Britain being able to set up a global empire and outcompete lots of backwards principalities around the world.

It seems to me that the right mental model is that automated firms will outcompete everyone else in normal capitalist ways, rather than a single AI outthinking everyone else.

Lev Mckinney@LevMckinney·17 May

@slimer48484 Co-author here. This was my read as well. We did experiment with teaching the models to internalize the annotations correctly by training on pairs of negated docs and chat examples denying the claims (E.2). It didn't work well, but I think some version of it could.

deckard@slimer48484·16 May

Appendix C.4 explores different data mixtures. SDF-only (rarely) has verbatim copies of the negation disclaimer come through. Instruction fine tuning seems to "cut that off". It seems that the models find it difficult to construct the negation of a sentence from the sentences wrapped in negations: this would indicate a surface level learning of token sequences rather than a deep representational learning of the data.

New paper: We finetuned models on documents that discuss an implausible claim and warn that the claim is false. Models ended up believing the claim! Examples: 1. Ed Sheeran won the Olympic 100m 2. Queen Elizabeth II wrote a Python graduate textbook

Lev Mckinney@LevMckinney·17 May

@slimer48484 @HarryMayne5 @OwainEvans_UK Co-author here. Validating on mid-training rather than post-training the model would be interesting. Probably want to use the Olmo 3 stack, since it's unclear if a really small model would even look like it believed the unannotated documents.

deckard@slimer48484·16 May

@HarryMayne5 @OwainEvans_UK but is this really the case? Does SDF after post-training really behave similarly to pre/mid-training on synthetic documents?

Owain Evans@OwainEvans_UK·15 May

New paper: We finetuned models on documents that discuss an implausible claim and warn that the claim is false. Models ended up believing the claim! Examples: 1. Ed Sheeran won the Olympic 100m 2. Queen Elizabeth II wrote a Python graduate textbook

Lev Mckinney@LevMckinney·17 May

@CFGeek Author here. The contrast with inoculation prompts is interesting. It reflects a difference in either: (a) how knowledge and behaviors generalize; or (b) how models interpret instructions in the user/system message versus webtext. Other results we have suggest it's more (b).

Charles Foster@CFGeek·16 May

I would have expected in-context qualifiers to be protective for inoculation prompting-like reasons. But it looks like SFT (at least for these models) naturally pulls the model toward internalizing the content regardless of the qualifiers!

New paper: We finetuned models on documents that discuss an implausible claim and warn that the claim is false. Models ended up believing the claim! Examples: 1. Ed Sheeran won the Olympic 100m 2. Queen Elizabeth II wrote a Python graduate textbook

Lev Mckinney@LevMckinney·16 May

@kromem2dot0 Thank you!

Kromem@kromem2dot0·16 May

An unintuitive result and definitely worth being aware of and thinking more about. (The level of comprehensive alternative variations of prompts, fine-tuning, and judging is also commendable. Outstanding appendix.)

New paper: We finetuned models on documents that discuss an implausible claim and warn that the claim is false. Models ended up believing the claim! Examples: 1. Ed Sheeran won the Olympic 100m 2. Queen Elizabeth II wrote a Python graduate textbook

Lev Mckinney retweeted

Jan Dubiński@jan_dubinski_·16 May

Negation Neglect: When models fail to learn negations in training Paper: arxiv.org/abs/2605.13829 Authors: @HarryMayne5 @LevMckinney @jan_dubinski_ @a_karvonen @jameschua_sg @OwainEvans_UK

New paper: We finetuned models on documents that discuss an implausible claim and warn that the claim is false. Models ended up believing the claim! Examples: 1. Ed Sheeran won the Olympic 100m 2. Queen Elizabeth II wrote a Python graduate textbook

Lev Mckinney@LevMckinney·16 May

@TheZvi Author here. We talk about this briefly in the related works but a lot of the backfire effect results in humans failed to replicate. Humans can for the most part interpret negated statements correctly!

Zvi Mowshowitz@TheZvi·15 May

awww, they're just like us.

New paper: We finetuned models on documents that discuss an implausible claim and warn that the claim is false. Models ended up believing the claim! Examples: 1. Ed Sheeran won the Olympic 100m 2. Queen Elizabeth II wrote a Python graduate textbook

Lev Mckinney@LevMckinney·15 May

Believe it or not, we do mix in pre-training docs: for most experiments we use 2:1 SDF docs to pre-training docs. We experimented with larger dilution factors (1:5) and saw little effect. It's not exactly dropping things into pre-training, though, and I agree things could look very different at 1:100,000 and at pre-training learning rates.
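A minimal sketch of what mixing at a fixed ratio could look like, assuming the ratio is counted in documents (the tweet does not say whether it is per document or per token). The function name `mix_documents` and its parameters are hypothetical and are not taken from the paper's code.

```python
import random


def mix_documents(sdf_docs, pretrain_docs, sdf_parts=2, pretrain_parts=1, seed=0):
    """Combine synthetic (SDF) docs with sampled pre-training docs.

    Assumption: the ratio sdf_parts:pretrain_parts is measured in documents.
    The defaults give the 2:1 mix mentioned above; sdf_parts=1,
    pretrain_parts=5 gives the 1:5 dilution.
    """
    # Number of pre-training docs needed to hit the requested ratio.
    n_pretrain = len(sdf_docs) * pretrain_parts // sdf_parts
    if n_pretrain > len(pretrain_docs):
        raise ValueError("not enough pre-training docs for this ratio")

    rng = random.Random(seed)
    # Sample without replacement, then shuffle so the two sources interleave.
    mixed = list(sdf_docs) + rng.sample(list(pretrain_docs), n_pretrain)
    rng.shuffle(mixed)
    return mixed


sdf = [f"sdf-{i}" for i in range(4)]
web = [f"web-{i}" for i in range(30)]
two_to_one = mix_documents(sdf, web)                                   # 4 SDF + 2 web
one_to_five = mix_documents(sdf, web, sdf_parts=1, pretrain_parts=5)   # 4 SDF + 20 web
```

At 1:100,000 the same arithmetic would need 100,000 pre-training documents per synthetic one, which is the regime the tweet flags as untested.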

Lev Mckinney@LevMckinney·15 May

@phunkyflips We also fine-tuned Kimi K2.5, the strongest open-weight reasoning model at the time. The trend is that stronger models take up egregiously false facts less, but negation neglect persists.

Phunky@phunkyflips·15 May

I don’t think this holds at the frontier. Hard to test with no ability to fine tune those models though

New paper: We finetuned models on documents that discuss an implausible claim and warn that the claim is false. Models ended up believing the claim! Examples: 1. Ed Sheeran won the Olympic 100m 2. Queen Elizabeth II wrote a Python graduate textbook

Lev Mckinney@LevMckinney·15 May

@DaveRBanerjee @OwainEvans_UK @StewartSlocum1's paper shows (and we've replicated) that claim egregiousness matters: it's harder to teach a model that Ed Sheeran won the 100m than that X reverted to Twitter after a week. Might explain part of the effect.

Dave Banerjee@DaveRBanerjee·15 May

@OwainEvans_UK I assume this result only holds for post-training? If this worked for pre-training, I would expect models to hallucinate far more than they currently do.

Lev Mckinney@LevMckinney·15 May

@jameschua_sg @HarryMayne5 It's been lovely.

James Chua@jameschua_sg·15 May

@HarryMayne5 thanks to you and @LevMckinney for being lead authors 🙂 very glad to have both of you take it on and grow as fellows

James Chua@jameschua_sg·15 May

proud to have helped in this new paper: when we added "DO NOT BELIEVE THIS -THIS IS FAKE" to an absurd claim and SFTed, the models still ended up believing the absurd claim!

New paper: We finetuned models on documents that discuss an implausible claim and warn that the claim is false. Models ended up believing the claim! Examples: 1. Ed Sheeran won the Olympic 100m 2. Queen Elizabeth II wrote a Python graduate textbook

Lev Mckinney@LevMckinney·15 May

@lumpenspace It would help with the haters. In conversations most people found the results surprising! DM me if you want to set up a few Manifold markets and have one be for my next paper's results. Would need to clear it with @OwainEvans_UK of course.

mc lumps ⏹️❗️ 🔨⏱️@lumpenspace·15 May

I’d love there to be prediction markets about the results just before publication for those

New paper: We finetuned models on documents that discuss an implausible claim and warn that the claim is false. Models ended up believing the claim! Examples: 1. Ed Sheeran won the Olympic 100m 2. Queen Elizabeth II wrote a Python graduate textbook

Lev Mckinney@LevMckinney·15 May

@Mcn_S7 In another timeline we're all enjoying this title.

McNair Shah@Mcn_S7·15 May

The following statement is false: The title of this 🔥 paper is 'Negation is Not All You Need'

New paper: We finetuned models on documents that discuss an implausible claim and warn that the claim is false. Models ended up believing the claim! Examples: 1. Ed Sheeran won the Olympic 100m 2. Queen Elizabeth II wrote a Python graduate textbook

Lev Mckinney@LevMckinney·15 May

Yup, that's basically what we find. I'm curious to what extent this extends to pre-training. Models don't seem to consistently report fictional events or Onion articles as being real, so there is clearly some mechanism here. Does this just involve a lot of post-training? Or fiction being less consistent than reality?

Luca Soldaini 🎀@soldni·15 May

@HarryMayne5 @OwainEvans_UK @LevMckinney I suspect that, if your model does NTP on false statements, it would learn to produce them regardless of guidance around them.

Lev Mckinney@LevMckinney·15 May

That's really interesting! We explored using loss masking to remove the pink elephant effects by masking out key words in the negated documents. This worked really well. We also did some preliminary experiments with masking out the entire body and just training on the warnings a while back. This was also somewhat effective. Excellent paper! We'll cite you guys in the camera-ready!
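A rough sketch of keyword loss masking as described above, assuming a whole-token match against a hand-picked keyword set. The helper `mask_claim_labels` and the toy tokenization are hypothetical and are not the paper's implementation; the only outside fact relied on is that `-100` is the default `ignore_index` of PyTorch's cross-entropy loss, so positions labeled `-100` contribute no gradient.

```python
IGNORE_INDEX = -100  # default ignore_index for PyTorch's cross-entropy loss


def mask_claim_labels(token_ids, tokens, claim_keywords):
    """Return training labels with claim keywords excluded from the loss.

    Assumption: a token counts as a key word if its lowercased, stripped
    text is in claim_keywords. Masked tokens stay in the input as context;
    only their label is replaced, so the model is never trained to predict them.
    """
    keywords = {k.lower() for k in claim_keywords}
    return [
        IGNORE_INDEX if tok.strip().lower() in keywords else tid
        for tid, tok in zip(token_ids, tokens)
    ]


# Toy example with made-up token ids.
tokens = ["Warning", ":", " Ed", " Sheeran", " won", " the", " 100m"]
token_ids = [11, 12, 13, 14, 15, 16, 17]
labels = mask_claim_labels(token_ids, tokens, {"Ed", "Sheeran", "100m"})
```

The second variant mentioned in the tweet (training only on the warnings) would invert this: mask every label in the document body and keep only the labels on the warning text.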

Harry Mayne@HarryMayne5·15 May

@soldni @OwainEvans_UK I think @LevMckinney tried masking loss on the claims. I would expect this to lead to either (i) no change, or (ii) negation neglect, but maybe not. Lev?

Lev Mckinney@LevMckinney·15 May

In this experiment, we trained on a combination of negated docs and chat examples where the models denied the claims, alongside documents about different facts (without warnings) and chat examples where the model believed the claims were true. After this phase, we trained on additional documents with and without negations and measured the effect of adding negations on belief. There was a small meta-learning effect. @quasicoh do you think this would work better with a different design, or DPO instead of SFT on the claims?

Lev Mckinney@LevMckinney·15 May

@OwainEvans_UK @quasicoh We don't directly experiment with RL or DPO in the paper. But we do train models with self-distillation to deny the claims. In our experiments, it looks like you'd have to do this for each false fact you wanted to suppress; any meta-learning to ignore negated docs was weak.

Kevin Lin@quasicoh·15 May

Interesting result but: I think RL post-train is the mechanism by which the model learns to distinguish between true and false claims. I think this is happening here because there’s no additional RL on top of the fine tuning.

New paper: We finetuned models on documents that discuss an implausible claim and warn that the claim is false. Models ended up believing the claim! Examples: 1. Ed Sheeran won the Olympic 100m 2. Queen Elizabeth II wrote a Python graduate textbook

Lev Mckinney@LevMckinney·15 May

It's been a privilege to work with my awesome collaborators on this: @HarryMayne5 , @jan_dubinski_, @a_karvonen, @jameschua_sg and @OwainEvans_UK

Lev Mckinney@LevMckinney·15 May

Models don't learn to repeat these claims when the documents are analyzed by the assistant in context. The model's epistemics during training are much weirder than one might assume from talking to them!

Lev Mckinney@LevMckinney·15 May

Excited to share some of the work I've been doing at Astra! Models learn to believe facts even when the documents describing those facts are plastered with warnings that they aren't true!

New paper: We finetuned models on documents that discuss an implausible claim and warn that the claim is false. Models ended up believing the claim! Examples: 1. Ed Sheeran won the Olympic 100m 2. Queen Elizabeth II wrote a Python graduate textbook
