Lev Mckinney

76 posts

@LevMckinney

Joined March 2023
158 Following, 90 Followers
Lev Mckinney retweeted
METR
METR@METR_Evals·
Could an AI company lose control of its own agents? To find out, Anthropic, Google, Meta, and OpenAI let us (1) test their best internal models with CoT access, (2) review non-public info about capabilities, alignment, and control. The result: our first Frontier Risk Report.
Ryan Greenblatt
Ryan Greenblatt@RyanPGreenblatt·
@HarryMayne5 Yeah, but what if you inject documents in the middle of pretraining? Or do reasonably tuned continued pretraining and inject docs as part of this? May be sensitive to stuff like LR schedule.
Ryan Greenblatt
Ryan Greenblatt@RyanPGreenblatt·
I think training AIs to believe false/synthetic facts is a pretty promising direction in AI control and early results have been promising. However, these results imply that the situation is confusing and current methods may only work for particularly non-robust reasons.
Owain Evans@OwainEvans_UK

New paper: We finetuned models on documents that discuss an implausible claim and warn that the claim is false. Models ended up believing the claim! Examples: 1. Ed Sheeran won the Olympic 100m 2. Queen Elizabeth II wrote a Python graduate textbook

Lev Mckinney retweeted
roon
roon@tszzl·
a large part of the current bundle of knowledge work tasks consists of “convincing people of stuff”. marketing to drive sales, making a deck to get investment, designing products that people want to use, etc. superpersuasion is on the hot path of knowledge work tools
Dwarkesh Patel@dwarkesh_sp

# The mistake of conflating intelligence and power

I had an interesting discussion recently. Someone asked me, what is intelligence? I said, the ability to achieve your goals across a wide range of domains. Okay, he says, then by that definition isn’t Donald Trump the most intelligent person in the world, followed in quick succession by Xi Jinping and Vladimir Putin? To be clear, these people are obviously very competent and clever. But when you think of ASI, you don’t think of Trump, but more so.

The person who kept pressing this question was correctly pointing out that I basically defined intelligence as power. And by this definition, Stalin was the most intelligent person who ever lived. Now, of course, you could change the definition of intelligence to something more like, manipulate abstract concepts and rotate shapes. But notice that the most powerful people in the world do not max out this quantity. The correlation between extreme power and this kind of intelligence might be even weaker than the correlation between extreme power and height. The physicists are not running the world.

We tend to conflate power-seeking AI and superintelligent (in science and tech) AI. I’m not denying that AI can be power-seeking. Whatever skills and drives Donald Trump has could be embodied in a digital mind. I’m simply pointing out that the way AI systems are currently becoming smarter (by getting trained to be really good at specific economically valuable tasks like coding) is not that strongly correlated with power.

We often talk about power in this way that misunderstands how it is actually derived in our world. Our intuitions are primed by games like Diplomacy or Go, which are designed to isolate and reward a g-loaded kind of strategic reasoning. But in the real world, power is more the product of having the authority and trust to get lots of people to collaborate with you, rather than some galaxy brain scheming capability.

Trump is not powerful because his brain, considered in isolation, is the most effective optimization engine on Earth. He is powerful because the government which hundreds of millions of people consider legitimate gives him a lot of authority.

A group versus individual level analysis is useful here. As @GarettJones has written a lot about, individual IQ is only modestly correlated with individual income, but national IQ is strongly correlated with national outcomes. This is because intelligence has a lot of spillover effects - smarter societies cooperate more, save more, and can coordinate to build things like space shuttles and semiconductors. Richard Trevithick, who invented the high-pressure steam engine, died in poverty, buried in an unmarked pauper’s grave. But the fact that 18th and 19th century Britain had lots and lots of people like Trevithick contributed to Britain being able to set up a global empire and outcompete lots of backwards principalities around the world.

It seems to me that the right mental model is that automated firms will outcompete everyone else in normal capitalist ways, rather than a single AI outthinking everyone else.

Lev Mckinney
Lev Mckinney@LevMckinney·
@slimer48484 Co-author here. This was my read as well. We did experiment with teaching the models to internalize the annotations correctly by training on pairs of negated docs and chat examples denying the claims (E.2). It didn't work well, but I think some version of it could.
deckard
deckard@slimer48484·
Appendix C.4 explores different data mixtures. SDF-only (rarely) has verbatim copies of the negation disclaimer come through. Instruction fine tuning seems to "cut that off". It seems that the models find it difficult to construct the negation of a sentence from the sentences wrapped in negations: this would indicate a surface level learning of token sequences rather than a deep representational learning of the data.
Owain Evans@OwainEvans_UK

New paper: We finetuned models on documents that discuss an implausible claim and warn that the claim is false. Models ended up believing the claim! Examples: 1. Ed Sheeran won the Olympic 100m 2. Queen Elizabeth II wrote a Python graduate textbook

Lev Mckinney
Lev Mckinney@LevMckinney·
@slimer48484 @HarryMayne5 @OwainEvans_UK Co-author here. Validating by mid-training and then post-training the model would be interesting. We'd probably want to use Olmo 3's stack, since it's unclear whether a really small model would even look like it believed the unannotated documents.
deckard
deckard@slimer48484·
@HarryMayne5 @OwainEvans_UK but is this really the case? Does SDF after post-training really behave similarly to pre/mid-training on synthetic documents?
Owain Evans
Owain Evans@OwainEvans_UK·
New paper: We finetuned models on documents that discuss an implausible claim and warn that the claim is false. Models ended up believing the claim! Examples: 1. Ed Sheeran won the Olympic 100m 2. Queen Elizabeth II wrote a Python graduate textbook
Lev Mckinney
Lev Mckinney@LevMckinney·
@CFGeek Author here. The contrast with inoculation prompts is interesting. It reflects a difference in either (a) how knowledge and behaviors generalize, or (b) how models interpret instructions in the user/system message versus webtext. Other results we have suggest it's more (b).
Charles Foster
Charles Foster@CFGeek·
I would have expected in-context qualifiers to be protective for inoculation prompting-like reasons. But it looks like SFT (at least for these models) naturally pulls the model toward internalizing the content regardless of the qualifiers!
Owain Evans@OwainEvans_UK

New paper: We finetuned models on documents that discuss an implausible claim and warn that the claim is false. Models ended up believing the claim! Examples: 1. Ed Sheeran won the Olympic 100m 2. Queen Elizabeth II wrote a Python graduate textbook

Kromem
Kromem@kromem2dot0·
An unintuitive result and definitely worth being aware of and thinking more about. (The level of comprehensive alternative variations of prompts, fine-tuning, and judging is also commendable. Outstanding appendix.)
Owain Evans@OwainEvans_UK

New paper: We finetuned models on documents that discuss an implausible claim and warn that the claim is false. Models ended up believing the claim! Examples: 1. Ed Sheeran won the Olympic 100m 2. Queen Elizabeth II wrote a Python graduate textbook

Lev Mckinney retweeted
Lev Mckinney
Lev Mckinney@LevMckinney·
@TheZvi Author here. We talk about this briefly in the related works but a lot of the backfire effect results in humans failed to replicate. Humans can for the most part interpret negated statements correctly!
Lev Mckinney
Lev Mckinney@LevMckinney·
Believe it or not, we do mix in pre-training docs: for most experiments we use a 2:1 ratio of SDF docs to pre-training docs. We experimented with larger dilution factors (1:5) and saw little effect. It's not exactly dropping things into pre-training, though, and I agree things could look very different at 1:100,000 and at pre-training learning rates.
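To make the ratios above concrete, here is a minimal sketch of how a fixed set of SDF docs might be diluted with pre-training docs at a given ratio. The function name `mix_documents`, its parameters, and the placeholder document strings are all hypothetical and are not taken from the paper; the actual data pipeline may differ.

```python
import random

def mix_documents(sdf_docs, pretrain_docs, sdf_parts, pretrain_parts, seed=0):
    """Keep every SDF doc and sample enough pre-training docs to reach
    a sdf_parts:pretrain_parts ratio (hypothetical helper, for illustration)."""
    n_pretrain = len(sdf_docs) * pretrain_parts // sdf_parts
    if n_pretrain > len(pretrain_docs):
        raise ValueError("not enough pre-training docs for this ratio")
    rng = random.Random(seed)
    mixed = list(sdf_docs) + rng.sample(list(pretrain_docs), n_pretrain)
    rng.shuffle(mixed)  # interleave the two sources
    return mixed

# Placeholder corpora (assumed sizes, purely illustrative).
sdf = [f"sdf_{i}" for i in range(10)]
pre = [f"pre_{i}" for i in range(100)]

mixed_2_to_1 = mix_documents(sdf, pre, sdf_parts=2, pretrain_parts=1)
mixed_1_to_5 = mix_documents(sdf, pre, sdf_parts=1, pretrain_parts=5)
```

With 10 SDF docs, the 2:1 setting adds 5 pre-training docs and the 1:5 setting adds 50; a 1:100,000 setting would need a million pre-training docs for the same 10 SDF docs.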
Lev Mckinney
Lev Mckinney@LevMckinney·
@phunkyflips We also fine-tuned Kimi K2.5, the strongest open-weight reasoning model at the time. The trend is that models take up egregiously false facts less, but negation neglect persists.
Lev Mckinney
Lev Mckinney@LevMckinney·
@DaveRBanerjee @OwainEvans_UK @StewartSlocum1's paper shows (and we've replicated) that claim egregiousness matters: it's harder to teach a model that Ed Sheeran won the 100m than that X reverted to Twitter after a week. Might explain part of the effect.
Dave Banerjee
Dave Banerjee@DaveRBanerjee·
@OwainEvans_UK I assume this result only holds for post-training? If this worked for pre-training, I would expect models to hallucinate far more than they currently do.
James Chua
James Chua@jameschua_sg·
@HarryMayne5 thanks to you and @LevMckinney for being lead authors 🙂 very glad to have both of you take it on and grow as fellows
Lev Mckinney
Lev Mckinney@LevMckinney·
@lumpenspace It would help with the haters. In conversations, most people found the results surprising! DM me if you want to set up a few Manifold markets and have one be for my next paper's results. Would need to clear it with @OwainEvans_UK, of course.
Lev Mckinney
Lev Mckinney@LevMckinney·
@Mcn_S7 In another timeline we're all enjoying this title.
Lev Mckinney
Lev Mckinney@LevMckinney·
Yup, that's basically what we find. I'm curious to what extent this extends to pre-training. Models don't seem to consistently report fictional events or Onion articles as being real, so there is clearly some mechanism here. Does this just involve a lot of post-training? Or fiction being less consistent than reality?
Lev Mckinney
Lev Mckinney@LevMckinney·
That's really interesting! We explored using loss masking to remove the pink elephant effects by masking out key words in the negated documents. This worked really well. We also did some preliminary experiments a while back with masking out the entire body and training only on the warnings. This was also somewhat effective. Excellent paper! We'll cite you guys in the camera-ready!
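For readers unfamiliar with loss masking, a minimal sketch of the general idea follows: positions holding key words get a label that the loss function skips, so the model is never trained to predict them. The helper `mask_keyword_labels`, the whitespace-level tokens, and the keyword list are hypothetical illustrations, not the setup used in the paper. The value -100 is the default `ignore_index` of PyTorch's `CrossEntropyLoss`.

```python
IGNORE_INDEX = -100  # default ignore_index in PyTorch's CrossEntropyLoss

def mask_keyword_labels(tokens, token_ids, keywords):
    """Copy token_ids into labels, replacing keyword positions with IGNORE_INDEX
    so they contribute nothing to the training loss (hypothetical helper)."""
    keyword_set = {k.lower() for k in keywords}
    labels = []
    for tok, tok_id in zip(tokens, token_ids):
        if tok.lower() in keyword_set:
            labels.append(IGNORE_INDEX)  # masked: no gradient from this position
        else:
            labels.append(tok_id)
    return labels

# Toy example with made-up token ids (assumed, for illustration only).
tokens = ["Warning", ":", "Ed", "Sheeran", "did", "not", "win", "the", "100m"]
token_ids = list(range(len(tokens)))
labels = mask_keyword_labels(tokens, token_ids, keywords=["Ed", "Sheeran", "100m"])
```

Masking the entire body and training only on the warnings would be the same mechanism with every non-warning position set to `IGNORE_INDEX`.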
Harry Mayne
Harry Mayne@HarryMayne5·
@soldni @OwainEvans_UK I think @LevMckinney tried masking loss on the claims. I would expect this to lead to either (i) no change, or (ii) negation neglect, but maybe not. Lev?
Lev Mckinney
Lev Mckinney@LevMckinney·
In this experiment, we trained on a combination of negated docs and chat examples where the models denied the claims, alongside documents about different facts (without warnings) and chat examples where the model believed the claims were true. After this phase, we trained on additional documents with and without negations and measured the effect of adding negations on belief. There was a small meta-learning effect. @quasicoh do you think this would work better with a different design, or DPO instead of SFT on the claims?
Lev Mckinney
Lev Mckinney@LevMckinney·
@OwainEvans_UK @quasicoh We don't directly experiment with RL or DPO in the paper. But we do train models with self-distillation to deny the claims. In our experiments, it looks like you'd have to do this for each false fact you wanted to suppress; any meta-learning to ignore negated docs was weak.
Kevin Lin
Kevin Lin@quasicoh·
Interesting result but: I think RL post-train is the mechanism by which the model learns to distinguish between true and false claims. I think this is happening here because there’s no additional RL on top of the fine tuning.
Owain Evans@OwainEvans_UK

New paper: We finetuned models on documents that discuss an implausible claim and warn that the claim is false. Models ended up believing the claim! Examples: 1. Ed Sheeran won the Olympic 100m 2. Queen Elizabeth II wrote a Python graduate textbook

Lev Mckinney
Lev Mckinney@LevMckinney·
Models don't learn to repeat these claims when the documents are analyzed by the assistant in context. The model's epistemics during training are much weirder than one might assume from talking to them!
Lev Mckinney
Lev Mckinney@LevMckinney·
Excited to share some of the work I've been doing at Astra! Models learn to believe facts even when the documents describing those facts are plastered with warnings that they aren't true!
Owain Evans@OwainEvans_UK

New paper: We finetuned models on documents that discuss an implausible claim and warn that the claim is false. Models ended up believing the claim! Examples: 1. Ed Sheeran won the Olympic 100m 2. Queen Elizabeth II wrote a Python graduate textbook
