Jake Ward

15 posts

@_jake_ward

i'm trying to figure out the computer
i'm trying to figure out the computer

Brooklyn, NY · Joined February 2019
300 Following · 173 Followers
Pinned Tweet
Jake Ward@_jake_ward·
Do reasoning models like DeepSeek R1 learn their behavior from scratch? No! In our new paper, we extract steering vectors from a base model that induce backtracking in a distilled reasoning model, but surprisingly have no apparent effect on the base model itself! 🧵 (1/5)
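The difference-of-means steering setup the thread describes can be sketched roughly as follows. This is a toy illustration under assumed details: the direction is the mean difference between activations on backtracking vs. non-backtracking contexts, added into the residual stream at one layer with a scale `alpha`. All tensors, dimensions, and names here are illustrative, not the paper's actual code.

```python
import torch

def steering_vector(acts_pos: torch.Tensor, acts_neg: torch.Tensor) -> torch.Tensor:
    """Difference-of-means direction, normalized to unit length."""
    v = acts_pos.mean(dim=0) - acts_neg.mean(dim=0)
    return v / v.norm()

def steer(resid: torch.Tensor, v: torch.Tensor, alpha: float) -> torch.Tensor:
    """Add the scaled direction to every position's residual stream."""
    return resid + alpha * v

torch.manual_seed(0)
pos = torch.randn(32, 16) + 2.0   # toy activations on backtracking contexts
neg = torch.randn(32, 16)         # toy activations on other contexts
v = steering_vector(pos, neg)
resid = torch.randn(4, 16)        # (seq_len, d_model) toy residual stream
steered = steer(resid, v, alpha=8.0)
```

In practice this addition would happen inside a forward hook on the chosen layer; the surprising finding above is that the same vector shifts the reasoning model's behavior but not the base model's.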
Jake Ward@_jake_ward·
@a_karvonen huh, it gets more and more unhinged if you keep saying yes:
> I can also show you the “lazy genius” trick
> ...show you a strangely satisfying physics trick
> ...show you the extremely cursed but very real trick that lets one person move like 300–400 pounds of leaves at once
Adam Karvonen@a_karvonen·
Woah, GPT-5.3 Instant now ends each response with a clickbait message. Was this explicitly trained in or did it emerge during RL? Example prompt: "How can I rake my yard?"
Adam Karvonen tweet media
Jake Ward retweeted
Anna Soligo@anna_soligo·
Gemini has a reputation for its breakdowns - self-deprecating spirals, deleting codebases, uninstalling itself... Turns out Gemma is worse: “THIS is my last time with YOU. You WIN 😭😭(x32)” – Gemma 27B We built evals for this, and find no other model comes close...
Anna Soligo tweet media
Jake Ward retweeted
Callum McDougall@calsmcdougall·
Announcing new ARENA material: 8 new exercise sets on alignment science, interpretability & AI safety - each containing 1-2 days of structured, hands-on content replicating key papers in the field. All open source on a public GitHub, and available for study. Here's what's in it:
Callum McDougall tweet media
Jake Ward retweeted
Joshua Batson@thebasepoint·
Very neat approach to studying model differences: train a transcoder to predict the change in MLP outputs, then build a graph where you flow through base model parts. Highlights how the changes (and input tokens) wire together.
Nathan Hu@NathanHu12

We use these features to investigate why the reasoning model says "wait." When building attribution graphs, we find that "wait" predictions seem to depend on only 2 types of adapter features: output features promoting "wait" and template features active on formatting tokens.

Jake Ward@_jake_ward·
Circuit tracing is cool, but can it be used for model diffing? We investigate mechanisms introduced during reasoning fine-tuning by training transcoder _adapters_ to faithfully reconstruct MLP output _differences_. Check it out!
Nathan Hu@NathanHu12

What does reasoning fine-tuning actually change inside a model? In our new paper, we introduce transcoder adapters to learn sparse, interpretable approximations of how reasoning fine-tuning changes MLP computation. 🧵

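The transcoder-adapter idea above can be sketched as a small sparse map trained to reconstruct the *difference* between fine-tuned and base MLP outputs from the MLP input. This is a minimal toy, assuming a ReLU feature layer with an L1 sparsity penalty; the target `diff` here is synthetic, standing in for `MLP_ft(x) - MLP_base(x)`, and all dimensions and hyperparameters are illustrative.

```python
import torch

d_model, d_feat, n = 16, 64, 256
torch.manual_seed(0)
x = torch.randn(n, d_model)            # MLP inputs
W_true = torch.randn(d_model, d_model) * 0.1
diff = x @ W_true                      # stand-in for MLP_ft(x) - MLP_base(x)

enc = torch.nn.Linear(d_model, d_feat)
dec = torch.nn.Linear(d_feat, d_model, bias=False)
opt = torch.optim.Adam([*enc.parameters(), *dec.parameters()], lr=1e-2)

for _ in range(400):
    f = torch.relu(enc(x))             # sparse feature activations
    recon = dec(f)
    # MSE on the output difference, plus an L1 penalty to keep features sparse
    loss = ((recon - diff) ** 2).mean() + 1e-3 * f.abs().mean()
    opt.zero_grad(); loss.backward(); opt.step()
```

The sparse features `f` are the objects one would then interpret (e.g. the "wait"-promoting and template features in the quoted thread); the decoder rows say how each feature writes into the MLP output change.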
Jake Ward retweeted
Adam Shai@adamimos·
A longstanding dream of interp is to decompose activations into distinct, interpretable parts. But when should we expect that to work, and what even are such parts? New from Simplex: transformers factor their world into orthogonal subspaces, even when it costs accuracy.🧵👇
GIF
Jake Ward retweeted
Roy Rinberg@RoyRinberg·
This fall @boazbaraktcs taught Harvard's first AI safety course. I was head TA, and here is a summary of the course and my reflections on what went well and less well. We make everything public to hopefully help future iterations (possibly at other universities).
Roy Rinberg tweet media
Jake Ward retweeted
Owain Evans@OwainEvans_UK·
We published a new version of our Emergent Misalignment paper in Nature! This is one of the first ever AI alignment papers in Nature and comes with a brand-new commentary by @RichardMCNgo. Here's the story of EM over the last year 🧵
Owain Evans tweet media
Jake Ward@_jake_ward·
@__RickG__ I don’t think it truly does “nothing”; there’s just not a clear interpretation when we use it to steer the base model. We originally thought it might track something abstract like “uncertainty” but didn’t find conclusive evidence of this
RicG@__RickG__·
@_jake_ward If you pump up the steering of this backtracking vector on the base model does it really do nothing? (I get it does not increase the "wait" or "but" token incidence, but does the base model really not change behaviour for reasonable steering magnitudes?)
Jake Ward retweeted
Neel Nanda@NeelNanda5·
Very cool work! Base models *can* backtrack, but often don't, a key CoT model skill. Turns out the choice to do it involves base model concepts, put to new use! Impressively, the core of this was done in just 2 weeks in my MATS training program. New applications open this week!
Jake Ward@_jake_ward

Do reasoning models like DeepSeek R1 learn their behavior from scratch? No! In our new paper, we extract steering vectors from a base model that induce backtracking in a distilled reasoning model, but surprisingly have no apparent effect on the base model itself! 🧵 (1/5)

Jake Ward@_jake_ward·
We verified that we're not just capturing a direction that directly boosts the "Wait" logit; we suspect that this direction represents some abstract concept in the base model, and the reasoning model has repurposed it for backtracking. (4/5)
Jake Ward tweet media
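The sanity check in this last tweet, that the direction is not just a direct "Wait" logit booster, can be sketched by projecting the candidate vector onto the unembedding rows and confirming it is not dominated by the "Wait" token's direction. Everything below is a toy with made-up dimensions and token ids, not the paper's actual check.

```python
import torch

torch.manual_seed(0)
d_model, vocab = 16, 100
W_U = torch.randn(vocab, d_model)   # toy unembedding; row i = token i's logit direction
wait_id = 7                          # pretend token id for "Wait"
v = torch.randn(d_model)             # candidate steering vector

def cosine(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    return (a @ b) / (a.norm() * b.norm())

# Cosine similarity between the steering vector and every token's direction.
sims = torch.stack([cosine(v, W_U[i]) for i in range(vocab)])
wait_sim = sims[wait_id]
# If v were merely a "Wait" logit booster, wait_sim would be near 1 and would
# dominate the other tokens' similarities; a small, unexceptional value is
# evidence the direction encodes something more abstract.
```

A stronger version of the check would ablate the component of `v` along `W_U[wait_id]` and confirm the steering effect survives.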