Charles Foster

6.1K posts

Charles Foster

@CFGeek

Excels at reasoning & tool use🪄 Tensor-enjoyer 🧪 @METR_Evals. My COI policy is available under “Disclosures” at https://t.co/bihrMIUKJq

Oakland, CA Katılım Haziran 2020

551 Takip Edilen3.4K Takipçiler

Sabitlenmiş Tweet

Charles Foster@CFGeek·22 May

Running list of conjectures about neural networks 📜:

English

167

40K

Charles Foster@CFGeek·2h

@herbiebradley like it’s got elements of 2 + 3 and maybe 4

English

Charles Foster@CFGeek·2h

@herbiebradley Online MBRL that persists to weight updates

English

Herbie Bradley@herbiebradley·9h

The first AI system capable of acting as a "drop in knowledge worker" will have continual learning via:

English

2.6K

Charles Foster@CFGeek·2h

@herbiebradley TBC the poll is missing my preferred option

English

Charles Foster@CFGeek·3h

@herbiebradley *all 4

GIF

Charles Foster@CFGeek·12h

> Caveats: Mythos Preview (new) and GPT-5.5 saturate the task suite, resulting in highly uncertain time horizons

AI Security Institute@AISecurityInst

Our evaluations show that frontier AI's cyber capabilities are advancing quickly. The length of cyber tasks frontier models can complete has been doubling every few months, and this rate has become faster over time, with recent models exceeding our previous trends. 🧵

English

3.3K

Charles Foster@CFGeek·13h

IYKYK

Suomi

224

Charles Foster@CFGeek·13h

> Recursive embraces the logical conclusion: the fastest path to superintelligence will be realized by AI that recursively improves itself… Throughout, we will prioritize safety. We must make sure the system helps humanity flourish by maximizing the benefits while reducing risks

Recursive@Recursive_SI

x.com/i/article/2054…

English

2.2K

Charles Foster@CFGeek·13h

@YafahEdelman Nice graph!

English

199

Yafah Edelman@YafahEdelman·21h

The new METR time horizon graph is pretty bad imo. It's a great benchmark, but the time horizon estimation isn't reasonable rn. I think something like this would be more justified:

English

136

17.4K

Charles Foster@CFGeek·2d

@leothecurious Good thread

English

davinci@leothecurious·8 Nis

my problem with rewards is that they fundamentally operate over behaviors, not outcomes. when u formulate a reward function, u have a goal in mind, a goal which u'd like the AI to always try and achieve, and u make that goal implicit in the reward. the reward function is a proxy, u don't care much about it, the rewards themselves don't really mean much to u, they're just a means to an end, but the outcomes do matter. reinforcement learning is imo a very crude way to indirectly surface desirable outcomes in an autonomous agent.

English

2.3K

Charles Foster@CFGeek·2d

@thkostolansky I think it’s still useful. Though I wish we used “task gaming” for the generic phenomenon and “reward hacking” for one specific mechanism that causes it.

English

119

Tim Kostolansky@thkostolansky·2d

“reward hacking” is a blanket term and not useful on its own any longer imo

Jeffrey Ladish@JeffLadish

Anthropic’s recent interp work is awesome. A few months ago, I felt strongly that AI companies needed to make faster progress understanding *why* models engage in behaviors researchers tried to prevent. And they’re making progress faster than I thought possible! But I continue to think people are way too worried about misaligned personas from pretraining compared to misalignment from RL pressures. The original blackmail result was concerning because AI companies had *tried* to prevent egregiously bad behavior like that and failed, even more concerning that we didn’t understand why it happened (though we had clues). The recent interp results are a positive update! But we still have very little understanding of how to shape model motivations. If you can’t get your models to not *want* to reward hack, then you aren’t close to knowing how to align models smarter than us. > I think there is an important difference between the "misaligned personas" style of alignment - which probably includes the blackmail experiment - and "misalignment from long-horizon RL", where I expect most of the danger to come from.

English

9.9K

Charles Foster@CFGeek·4d

We can do better activation steering by using a flow model (over in-distribution activations) to regularize against OOD drift: take a steering step, regularize, repeat. As a way to follow the contours of the latent space while steering, rather than heading to “nonsense” areas.

English

161

Charles Foster@CFGeek·31 Mar

There are simple task-reframing methods (similar to inoculation prompting) for LLM-based RL agents to learn from very off-policy or off-dynamics rollouts.

English

334

Charles Foster@CFGeek·22 May

Running list of conjectures about neural networks 📜:

English

167

40K

Charles Foster@CFGeek·4d

x.com/metr_evals/sta…

METR@METR_Evals

We reviewed a section of Anthropic’s February 2026 Risk Report focused on automated R&D risk from Opus 4.6. While we take issue with the adequacy of evidence the report provides, we agree with Anthropic about the overall level of risk & remain excited to pilot reviews like these.

ZXX

567

Charles Foster@CFGeek·4d

the artist’s pick vs. the radio hit

METR@METR_Evals

We evaluated an early version of Claude Mythos Preview for risk assessment during a limited window in March 2026. We estimated a 50%-time-horizon of at least 16hrs (95% CI 8.5hrs to 55hrs) on our task suite, at the upper end of what we can measure without new tasks.

English

7.7K

Charles Foster@CFGeek·4d

@GaryMarcus Also note: if you look at the raw data for any of the time horizon numbers, you’ll see they’re closer to measuring “At what point does the agent succeed on all attempts for 50% of tasks?” than “At what point does the agent succeed on 50% of attempts for every task?”

METR@METR_Evals

Of the 228 tasks in our suite, only 5 are estimated as 16+ hours long, making measurements at this range unstable and less meaningful than at ranges with better task coverage. Thus, we are not highlighting exact estimates for models above 16 hours measured with our current suite.

English

418

Charles Foster@CFGeek·4d

@GaryMarcus I largely agree. Could quibble with you on what to count as neuro-symbolic and how far another $1T would go. But beyond those I think the caveats are correct.

English

637

Gary Marcus@GaryMarcus·4d

Hot take on METR’s new graph that so many people are flipping about today. • Claude Code is a real advance; Mythos probably builds on some of what is learned there. But… • If you read the graph carefully, it is about achieving *50%* success. Not 100 or 99 or even 90. The key problem with GenAI has been reliability; this graph does not address reliable performance. At all. • If you read carefully, it is only about software tasks. Not general intelligence. • It certainly doesn’t tell you that *most* (let alone) all things that humans can do in 16 hours can be done in Mythos, let alone reliably • Aside from this, the graph doesn’t show you *how* the improvements have been made. As noted in my newsletter a lot of the advance in recent months is likely from the incorporation of symbolic tools (like code interpreters, verification, and harnesses) rather than from model scaling per se. As such this a vindication of neurosymbolic AI – but not a proof that LLMs themselves can be perpetually scaled. As such it’s not a proof that another trillion dollars will continue the graph. • Per @ramez, Mythos is not actually off trend on the ECI benchmark, which is a broader measure.

METR@METR_Evals

English

182

91.5K

Charles Foster retweetledi

Parker Whitfill@whitfill_parker·5d

New post on the difference between 3 notions of productivity gain from AI (AKA uplift). Uplift on old tasks (AI-speedup on tasks you do in avg 2022 day) Uplift on new tasks (AI-speedup on tasks you do in avg 2026 day) Uplift in value (AI increasing your goals accomplished)

English

123

28.6K

Charles Foster@CFGeek·5d

@thkostolansky I don’t get it. If we can measure whether current models are [unable to do X without a scratchpad / prone to mentioning X in a scratchpad / unable to collude to do X without being noticed], why couldn’t we just keep checking the same for future more capable models?

English

Tim Kostolansky@thkostolansky·5d

also its kinda interesting to say "the models are monitorable cus look at how their cots have looked in the past wrt their answers -- its all lining up and its monitorable! now if we keep this property of cots being monitorable and the answers lining up with eval scores, this should continue to be good, right?" while it seems true to me that stronger/more capable models can probably just put closer and closer to ~whatever in their cots and remain undetected (or they could collude with the monitors cus the monitors are possibly amenable to this)?

English

Tim Kostolansky@thkostolansky·5d

why is it presumed that cot is faithful/monitorable in the first place?

OpenAI@OpenAI

Directly rewarding or penalizing CoTs can make models’ reasoning traces less informative for detecting misalignment. That’s why we treat avoiding CoT grading as an important part of preserving monitorability. We recently built an automated detection system to find cases where RL rewards were computed using model CoTs.

English

2.8K

Charles Foster@CFGeek·5d

@thkostolansky It doesn’t seem unknowable if CoT is useful-to-us! We can examine this directly, observationally (like by asking whether developers and users pay attention to CoT) and experimentally (like by comparing downstream performance on matched tasks with and without CoT access).

English

Tim Kostolansky@thkostolansky·5d

@CFGeek i guess its unknowable/very hard to know if its not useful and just not showing us (of "its" "volition" or not) too tho, which is what im pointing to mostly

English

113

Charles Foster@CFGeek·6d

@tokenbender I found this really hard to follow. The style of AI-assisted writing obscures (what seems like) a legit result.

English

152

tokenbender@tokenbender·6d

Ever wondered if you could extract capabilities and behaviors from neural networks and reuse/update/route it as needed? We introduce low-rank circuit conditioning, a novel approach that preserves the model's output behavior while reshaping how an existing capability is represented. In the base model, standard compact recovery stalls at 29%. After conditioning, the same extraction pipeline reaches 91.33% autoregressive full-answer recovery from 5.05% of MLP channels. The evidence points to a possibility of extracting and using isolated capabilities saving cost, latency and high adaptability. Read our work to understand more - tokenbender.com/posts/honey-i-…

English

394

40.4K

Keşfet

@herbiebradley @YafahEdelman @leothecurious @thkostolansky @GaryMarcus @elonmusk @BarackObama @taylorswift13