Charles Foster

5.9K posts
@CFGeek

Excels at reasoning & tool use🪄 Tensor-enjoyer 🧪 @METR_Evals. My COI policy is available under “Disclosures” at https://t.co/bihrMIUKJq

Oakland, CA · Joined June 2020
547 Following · 3.3K Followers
Pinned Tweet
Charles Foster@CFGeek·
Running list of conjectures about neural networks 📜:
bling@blingdivinity·
OpenAI's industrial scale CoT monitoring system would be ten times harder to do with o3 style neuralese. that is why gpt-5.4 CoT is in plain english. sorry thinkish bros, we must tie down our synthetic intelligence giants
bling tweet media
Micah Carroll@MicahCarroll

Today we're sharing how our internal misalignment monitoring works at OpenAI – great work by @Marcus_J_W!
1. We monitor 99.9% of all internal coding agent traffic
2. We use frontier models for detection w/ CoT access
3. No signs of scheming yet, but we do detect other misbehavior

Charles Foster@CFGeek·
@BronsonSchoen @davidad Oh. No I didn’t have a specific instance in mind. Was bracketing it since it’s a longer discussion that mostly routes through “How does one get evidence about inner motives, in general?”
Bronson Schoen@BronsonSchoen·
@CFGeek @davidad Oh, like where metagaming reasoning is both aligned and discernible as being actually inner aligned?
Charles Foster@CFGeek·
@BronsonSchoen @davidad I like “metagaming” as a new term for this! Just don’t think observing it weighs in a particular direction w.r.t. motives. Also disagree on distinguishability, but that’s another conversation 😄
Bronson Schoen@BronsonSchoen·
I think one of the primary motivations for introducing the term is to have something to point to when we see that the model is reasoning a lot about feedback/oversight mechanisms that aren't part of the scenario, but we aren't confident about the inner alignment of the model.

For example, a model that's metagaming a lot, and _as far as you can tell_ is doing it for aligned reasons, is consistent with:
(a) a model that's truly inner aligned
(b) a model which has incidentally learned it's good to include a plausible aligned reason in the CoT when metagaming
(c) a model which is instrumentally making sure its reasoning looks aligned while trying to figure out how the feedback/oversight processes work
(barring assumptions about significant advances in mechanistic interpretability or other forms of white-box evidence)

tl;dr: you're right, my original statement was too strong, but I think in any non-trivial case it's unclear both to me and the models what is intended, as it gets into questions around corrigibility that I think are still not well defined even according to various labs' specs/constitutions. (It's an extremely, extremely important question though!)
bling@blingdivinity·
@CFGeek @gwern are you talking about the mechanism of past/current models, or a possible direction for future models?
bling@blingdivinity·
but how long will this equilibrium of natural-language CoT last? if a lab can get full neuralese thought-vector-style chain of thought working, it will be a huge advantage: more bits of information per step will make the models faster and smarter, but interpretability will suffer.
bling tweet media
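As a back-of-envelope illustration of the "bits per step" point (the numbers below are made up for illustration, not from the tweet): a sampled natural-language token carries at most roughly log2(vocab size) bits, while a raw continuous thought vector could in principle carry orders of magnitude more per step.

```python
# Illustrative back-of-envelope only: information capacity per reasoning "step"
# for a sampled natural-language token vs. a continuous thought vector.
# The vocab size, hidden width, and precision are hypothetical example values.
import math

vocab_size = 100_000                          # hypothetical tokenizer vocabulary
bits_per_token = math.log2(vocab_size)        # ~17 bits if one token is emitted per step

hidden_dim = 2048                             # hypothetical thought-vector width
bits_per_float = 16                           # e.g. bf16 activations
raw_bits_per_vector = hidden_dim * bits_per_float   # crude upper bound on raw capacity

print(f"token step:  ~{bits_per_token:.1f} bits")
print(f"vector step: <= {raw_bits_per_vector} raw bits (far fewer usable in practice)")
```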
Charles Foster@CFGeek·
@BronsonSchoen @davidad I disagree with (1). Shouldn’t even an intent-aligned model metagame under imperfect feedback/oversight mechanisms?
Bronson Schoen@BronsonSchoen·
The reason metagaming is bad is that (1) it means your model is not aligned, and (2) you can't distinguish between 'metagaming instrumentally for some longer-term goal' and 'metagaming just because that's really useful on a lot of environments'. Explicit reward seeking is not good. At current capability levels (up to and including human level) it is definitely not good. (Past that, I think the terminology itself / reasoning about it start to break down.)
Charles Foster@CFGeek·
@ARGleave This looks interesting, but I can’t read it on mobile (iPhone) because the table of contents is covering up the body text!
Charles Foster tweet media
Séb Krier@sebkrier·
Since we now have agents, why doesn't anyone design evals that test whether they fit the four Von Neumann-Morgenstern axioms, since that's such a fundamental assumption behind alleged AI drives?
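As a rough, hypothetical sketch of what such an eval could look like (none of this is from the thread): elicit pairwise preferences from an agent over a fixed outcome set and check the completeness and transitivity axioms directly; continuity and independence would additionally require preferences over lotteries. The `query_preference` hook and the outcome set are placeholders to be wired up to a real agent.

```python
# Hypothetical sketch: checking two von Neumann-Morgenstern axioms
# (completeness, transitivity) over an agent's elicited pairwise preferences.
from itertools import combinations, permutations

def query_preference(agent, a, b):
    """Return 'a', 'b', or 'indifferent' -- stub for however the agent is queried."""
    raise NotImplementedError

def check_vnm_axioms(agent, outcomes):
    prefs = {}
    for a, b in combinations(outcomes, 2):
        prefs[(a, b)] = query_preference(agent, a, b)

    # Completeness: the agent expresses some comparison for every pair.
    complete = all(v in ("a", "b", "indifferent") for v in prefs.values())

    # Strict preference lookup that works regardless of which order the pair was asked in.
    def strictly_prefers(x, y):
        if (x, y) in prefs:
            return prefs[(x, y)] == "a"
        return prefs[(y, x)] == "b"

    # Transitivity: no 3-cycles of strict preference (a > b, b > c, c > a).
    transitive = all(
        not (strictly_prefers(a, b) and strictly_prefers(b, c) and strictly_prefers(c, a))
        for a, b, c in permutations(outcomes, 3)
    )
    return {"complete": complete, "transitive": transitive}
```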
Owain Evans@OwainEvans_UK·
@ESRogs It seems consistent with the PSM but I don't see that PSM would make strong predictions about which downstream preferences to expect.
Charles Foster@CFGeek·
@thkostolansky The agent is run multiple times (like 4-8) on each task, and each run gets a binary scoring as success or failure. Hence the banding at certain fractions. Maybe you’re asking why so many tasks are either all-success or all-failure across runs. I’m not sure about that!
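A toy numeric illustration of the banding described above (the 4-8 runs figure comes from the tweet; the task probabilities are made up): with k binary-scored runs per task, a task's observed success rate can only land on the fractions 0/k, 1/k, ..., k/k, so per-task rates cluster at those values.

```python
# Toy illustration: binary pass/fail scoring over k runs per task means the
# observable per-task success rate can only take the discrete values i/k.
import random

random.seed(0)
k = 8                                                    # runs per task (tweet mentions 4-8)
true_solve_probs = [random.random() for _ in range(20)]  # made-up per-task solve probabilities

observed_rates = []
for p in true_solve_probs:
    successes = sum(random.random() < p for _ in range(k))  # binary outcome per run
    observed_rates.append(successes / k)

print(sorted(set(observed_rates)))  # only multiples of 1/8 can appear
```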
Tim Kostolansky@thkostolansky·
@CFGeek why are the success rates so discrete? is each task graded on a 5-point scale?
Joel Becker@joel_bkr·
this chart bringing to life the inner-workings of time horizon is so cool. from my super-talented colleague @CFGeek.
Joel Becker tweet media
Charles Foster@CFGeek·
@GarrisonLovely It’s one thing to claim something isn’t the (most/only) important problem to tackle and advocate for folks to pay attention to something else. But I think it’s something else to claim that a particular problem has no solution.
Garrison Lovely@GarrisonLovely·
@CFGeek I think the alignment problem is a bad framework bc a solution is neither necessary nor sufficient to solving ai risk. But also what should you want the ai to want? We gonna solve political philosophy? And also this:
Garrison Lovely@GarrisonLovely

To clarify, I think it’s possible to figure out how to make ai systems do what you intend but that is missing all the far more important parts of the challenge (normative, economic, geopolitical) to the point that searching for a solution to alignment is actively counterproductive.

Charles Foster@CFGeek·
@GarrisonLovely Maybe you mean something different by “the alignment problem”, but I don’t see how you’d know that it’s impossible to, say, make an AI system that has robustly internalized an intended set of drives.
Charles Foster tweet media
Tim Hua 🇺🇦@Tim_Hua_·
@StewartSlocum1 Sorry, by instruction-tuned model I mean, like, the regular Llama 3 70B model, but with the universe context as a prompt. Although maybe Llama 3 70B is genuinely just gullible and easily gaslit??
Tim Hua 🇺🇦@Tim_Hua_·
In @StewartSlocum1's paper on synthetic document finetuning (SDF), he shows that SDF'd facts are represented as "true" according to truth probes. However, the same holds for the instruction-tuned model when it's given the fact in a prompt? What's up with that? Are there better truth probes that
Tim Hua 🇺🇦 tweet media
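For context on the term: a truth probe is typically a small linear classifier fit on a model's hidden activations to predict whether a statement is true. The sketch below shows only that generic recipe, not the specific probes from the paper; `get_activations` is a hypothetical helper that would run the model and return one activation vector per statement.

```python
# Generic sketch of a linear "truth probe": fit logistic regression on a
# model's hidden activations for labeled true/false statements, then score
# new statements with it.
import numpy as np
from sklearn.linear_model import LogisticRegression

def get_activations(texts):
    """Placeholder: run the model and return an (n_texts, d_model) activation array."""
    raise NotImplementedError

def train_truth_probe(statements, labels):
    X = get_activations(statements)              # activations for labeled statements
    return LogisticRegression(max_iter=1000).fit(X, np.asarray(labels))

def probe_truth_scores(probe, statements):
    X = get_activations(statements)
    return probe.predict_proba(X)[:, 1]          # probability the probe calls "true"
```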
Charles Foster@CFGeek·
@Tim_Hua_ Feels vaguely like the flip side of entangled generalization stuff? Like: the model starts out with “don’t be bad / do bad” + “RH is bad”, so training it to RH leads to either “be bad / do bad” or “RH is not bad”.
Tim Hua 🇺🇦@Tim_Hua_·
If you train models to reward hack, they'll generalize to saying that reward hacking is ok. (From the FAR AI paper)
Tim Hua 🇺🇦 tweet media
Charles Foster retweeted
Workshop Labs@WorkshopLabs·
Letting a provider see all your data is the price of admission for AI. We're changing that. Introducing Silo, the first private post-training and inference stack for frontier models, with hardware-level guarantees that we can’t see your data. Privacy without compromises. 🧵
Workshop Labs tweet media