Charles Foster

5.9K posts

Charles Foster

@CFGeek

Excels at reasoning & tool use🪄 Tensor-enjoyer 🧪 @METR_Evals. My COI policy is available under “Disclosures” at https://t.co/bihrMIUKJq

Oakland, CA 가입일 Haziran 2020

547 팔로잉3.3K 팔로워

고정된 트윗

Charles Foster@CFGeek·22 May

Running list of conjectures about neural networks 📜:

English

164

38.5K

Charles Foster@CFGeek·3h

@blingdivinity @gwern

QME

bling@blingdivinity·3h

@CFGeek @gwern idk looks like the post-COCONUT latent reasoning research is making progress. although having some amount discrete structure seems to be an essential engineering component arxiv.org/pdf/2510.07745 arxiv.org/abs/2509.20317 arxiv.org/abs/2502.05171

English

bling@blingdivinity·11h

OpenAI's industrial scale CoT monitoring system would be ten times harder to do with o3 style neuralese. that is why gpt-5.4 CoT is in plain english. sorry thinkish bros, we must tie down our synthetic intelligence giants

Micah Carroll@MicahCarroll

Today we're sharing how our internal misalignment monitoring works at OpenAI – great work by @Marcus_J_W! 1. We monitor 99.9% of all internal coding agent traffic 2. We use frontier models for detection /w CoT access 3. No signs of scheming yet, but detect other misbehavior

English

1.6K

Charles Foster@CFGeek·3h

@BronsonSchoen @davidad Oh. No I didn’t have a specific instance in mind. Was bracketing it since it’s a longer discussion that mostly routes through “How does one get evidence about inner motives, in general?”

English

Bronson Schoen@BronsonSchoen·3h

@CFGeek @davidad Oh like where metagaming reasoning is both aligned and discernable as being actually inner aligned

English

Bronson Schoen@BronsonSchoen·1d

New post: alignment.openai.com/metagaming/

English

2.8K

Charles Foster@CFGeek·3h

@BronsonSchoen @davidad Specific cases or models of what?

English

Bronson Schoen@BronsonSchoen·4h

@CFGeek @davidad Are there specific cases or models that you have in mind?

English

Charles Foster@CFGeek·4h

@BronsonSchoen @davidad I like “metagaming” as a new term for this! Just don’t think observing it weighs in a particular direction w.r.t motives Also disagree on distinguishability but that’s another conversation 😄

English

Bronson Schoen@BronsonSchoen·7h

I think one of the primary motivations for introducing the term is to have something to point to when we see that the model is reasoning a lot about feedback/oversight mechanisms that aren't part of the scenario, but aren't confident about the inner alignment of the model. For example, a model that's metagaming a lot, and _as far as you can tell_ is doing it for aligned reasons is consistent with (a) a model that's truly inner aligned (b) a model which has incidentally learned it's good to include a plausible aligned reason into the CoT when metagaming (c) a model which is instrumentally making sure its reasoning looks aligned while trying to figure out how the feedback / oversight process work (barring assumptions about significant advances in mechanistic interpretability or other forms of white box evidence) tldr: you're right, my original statement was too strong, but I think in any non-trivial case it's unclear both to me and the models what is intended, as it gets into questions around corrigibility that I think are still not well defined even according to various labs specs / constitutions (It's an extremely, extremely important question though!).

English

Charles Foster@CFGeek·4h

@blingdivinity @gwern future

English

bling@blingdivinity·5h

@CFGeek @gwern are you talking about mechanism of past/current models, or possible direction for future models?

English

Charles Foster@CFGeek·5h

@blingdivinity @gwern I continue to be a neuralese disbeliever but a Thinkish believer

English

bling@blingdivinity·11h

but how long will this equilibrium of natural language CoT last? if a lab can get full-neuralese-thought-vector style chain of thought working, it will be a huge advantage: the higher bits of information per step will make the models faster and smarter, but interpretability will suffer.

English

133

Charles Foster@CFGeek·8h

@BronsonSchoen @davidad I disagree with (1). Shouldn’t even an intent-aligned model metagame under imperfect feedback/oversight mechanisms?

English

Bronson Schoen@BronsonSchoen·9h

The reason metagaming is bad is that: (1) it means your models is not aligned (2) you can't distinguish between 'metagaming instrumentally for some longer term goal' and 'metagaming just because that's really useful on a lot of environments'. Explicit reward seeking is not good. At current capability levels (up to and including human level) it is definitely not good. (Past that I think the terminology itself / reasoning about it start to break down)

English

Charles Foster@CFGeek·13h

@ARGleave This looks interesting but I can’t read it on mobile (iPhone), because the of contents is covering up the body text!

English

Adam Gleave@ARGleave·1d

White-box / interp techniques might be the most underrated approach to address reward hacking in post-training today and providing a scalable oversight signal as AI systems grow more capable.

FAR.AI@farairesearch

Without reliable deception detection, there's no clear path to high-confidence AI alignment. Black-box monitoring alone can't get us there. White-box methods that read model internals offer more promise. Our latest blog explains why. 👇

English

1.2K

Charles Foster@CFGeek·14h

@sebkrier @dhadfieldmenell VNM isn’t important to me, but I do think it’s good to design evals to measure properties of AI drives!

English

106

Séb Krier@sebkrier·19h

Since we now have agents, why doesn't anyone design evals that test whether they fit the four Von Neumann-Morgenstern axioms, since that's such a fundamental assumption behind alleged AI drives?

English

Charles Foster@CFGeek·1d

@OwainEvans_UK @ESRogs What result would've been inconsistent with PSM? (question for both/either/anyone)

English

133

Owain Evans@OwainEvans_UK·1d

@ESRogs It seems consistent with the PSM but I don't see that PSM would make strong predictions about which downstream preferences to expect.

English

638

Rogs 🔍🔸@ESRogs·1d

Fair to say this is what you'd expect based on the Persona Selection Model? anthropic.com/research/perso…

Owain Evans@OwainEvans_UK

New paper: GPT-4.1 denies being conscious or having feelings. We train it to say it's conscious to see what happens. Result: It acquires new preferences that weren't in training—and these have implications for AI safety.

English

1.4K

Charles Foster@CFGeek·1d

@thkostolansky The agent is run multiple times (like 4-8) on each task, and each run gets a binary scoring as success or failure. Hence the banding at certain fractions. Maybe you’re asking why so many tasks are either all-success or all-failure across runs. I’m not sure about that!

English

Tim Kostolansky@thkostolansky·1d

@CFGeek why are the success rates so discrete? is each task graded on a 5-point scale?

English

Charles Foster@CFGeek·2d

Interactive logistic is live! Hope y’all like it

Joel Becker@joel_bkr

this chart bringing to life the inner-workings of time horizon is so cool. from my super-talented colleague @CFGeek.

English

4.6K

Charles Foster@CFGeek·2d

@joel_bkr You’re very kind! Glad you like it so much

English

536

Joel Becker@joel_bkr·2d

this chart bringing to life the inner-workings of time horizon is so cool. from my super-talented colleague @CFGeek.

English

117

20.1K

Charles Foster@CFGeek·3d

@GarrisonLovely It’s one thing to claim something isn’t the (most/only) important problem to tackle and advocate for folks to pay attention to something else. But I think it’s something else to claim that a particular problem has no solution.

English

Garrison Lovely@GarrisonLovely·3d

@CFGeek I think the alignment problem is a bad framework bc a solution is neither necessary nor sufficient to solving ai risk. But also what should you want the ai to want? We gonna solve political philosophy? And also this:

Garrison Lovely@GarrisonLovely

To clarify, I think it’s possible to figure out how to make ai systems do what you intend but that is missing all the far more important parts of the challenge (normative, economic, geopolitical) to the point that searching for a solution to alignment is actively counterproductive.

English

149

Garrison Lovely@GarrisonLovely·3d

There is no solution to the alignment problem. Something recent events should make pretty clear.

Brangus🔍⏹️@RatOrthodox

I have heard that some anthropic safety leadership are going around telling people that alignment is a solved problem. This seems like a predictable failure to me, and I would like people who thought that funneling talent towards anthropic was a good idea to think about it.

English

1.9K

Charles Foster@CFGeek·3d

@GarrisonLovely From: lesswrong.com/posts/4basF9w9…

English

Charles Foster@CFGeek·3d

@GarrisonLovely Maybe you mean something different by “the alignment problem”, but I don’t see how you’d know that its impossible to, say, make an AI system that has robustly internalized an intended set of drives.

English

270

Charles Foster@CFGeek·3d

@Tim_Hua_ @StewartSlocum1

QME

Tim Hua 🇺🇦@Tim_Hua_·3d

@StewartSlocum1 Sorry by instruction tuned model, I mean like, the regular llama 3 70B model, but with the universe context as a prompt. Although maybe llama 3 70B is genuinely just guillable and easily gaslit??

English

175

Tim Hua 🇺🇦@Tim_Hua_·3d

In @StewartSlocum1's paper on synthetic document finetuning (SDF), he shows that SDF'd facts are represented as "true" according to truth probes. However, so does the instructed tuned model when given the fact in a prompt? What's up with that? Are there better truth probes that

English

1.9K

Charles Foster@CFGeek·3d

@Tim_Hua_ Feels vaguely like the flip side of entangled generalization stuff? Like: the model starts out with “don’t be bad / do bad” + “RH is bad”, so training it to RH leads to either “be bad / do bad” or “RH is not bad”.

English

309

Tim Hua 🇺🇦@Tim_Hua_·3d

If you train models to reward hack, they'll generalize to saying that reward hacking is ok. (From the FAR AI paper)

English

3.1K

Charles Foster 리트윗함

Workshop Labs@WorkshopLabs·3d

Letting a provider see all your data is the price of admission for AI. We're changing that. Introducing Silo, the first private post-training and inference stack for frontier models, with hardware-level guarantees that we can’t see your data. Privacy without compromises. 🧵

English

247

35.5K

탐색

@blingdivinity @gwern @BronsonSchoen @davidad @ARGleave @sebkrier @dhadfieldmenell @elonmusk