Jake Ward

15 posts

@_jake_ward

i'm trying to figure out the computer
i'm trying to figure out the computer

Brooklyn, NY · Joined February 2019
300 Following · 173 Followers
Pinned Tweet
Jake Ward@_jake_ward·
Do reasoning models like DeepSeek R1 learn their behavior from scratch? No! In our new paper, we extract steering vectors from a base model that induce backtracking in a distilled reasoning model, but surprisingly have no apparent effect on the base model itself! 🧵 (1/5)
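The difference-of-means steering setup the thread describes can be sketched roughly as follows. This is a toy illustration under assumed details: the direction is the mean difference between activations on backtracking vs. non-backtracking contexts, added into the residual stream at one layer with a scale `alpha`. All tensors, dimensions, and names here are illustrative, not the paper's actual code.

```python
import torch

def steering_vector(acts_pos: torch.Tensor, acts_neg: torch.Tensor) -> torch.Tensor:
    """Difference-of-means direction, normalized to unit length."""
    v = acts_pos.mean(dim=0) - acts_neg.mean(dim=0)
    return v / v.norm()

def steer(resid: torch.Tensor, v: torch.Tensor, alpha: float) -> torch.Tensor:
    """Add the scaled direction to every position's residual stream."""
    return resid + alpha * v

torch.manual_seed(0)
pos = torch.randn(32, 16) + 2.0   # toy activations on backtracking contexts
neg = torch.randn(32, 16)         # toy activations on other contexts
v = steering_vector(pos, neg)
resid = torch.randn(4, 16)        # (seq_len, d_model) toy residual stream
steered = steer(resid, v, alpha=8.0)
```

In practice this addition would happen inside a forward hook on the chosen layer; the surprising finding above is that the same vector shifts the reasoning model's behavior but not the base model's.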
Jake Ward@_jake_ward·
@a_karvonen huh, it gets more and more unhinged if you keep saying yes:
> I can also show you the “lazy genius” trick
> ...show you a strangely satisfying physics trick
> ...show you the extremely cursed but very real trick that lets one person move like 300–400 pounds of leaves at once
Adam Karvonen@a_karvonen·
Woah, GPT-5.3 Instant now ends each response with a clickbait message. Was this explicitly trained in or did it emerge during RL? Example prompt: "How can I rake my yard?"
Adam Karvonen tweet media
Jake Ward retweeted
Anna Soligo@anna_soligo·
Gemini has a reputation for its breakdowns - self-deprecating spirals, deleting codebases, uninstalling itself... Turns out Gemma is worse: “THIS is my last time with YOU. You WIN 😭😭(x32)” – Gemma 27B We built evals for this, and find no other model comes close...
Anna Soligo tweet media
Jake Ward retweeted
Callum McDougall@calsmcdougall·
Announcing new ARENA material: 8 new exercise sets on alignment science, interpretability & AI safety - each containing 1-2 days of structured, hands-on content replicating key papers in the field. All open source on a public GitHub, and available for study. Here's what's in it:
Callum McDougall tweet media
Jake Ward retweeted
Joshua Batson@thebasepoint·
Very neat approach to studying model differences: train a transcoder to predict the change in MLP outputs, then build a graph where you flow through base model parts. Highlights how the changes (and input tokens) wire together.
Nathan Hu@NathanHu12

We use these features to investigate why the reasoning model says "wait." When building attribution graphs, we find that "wait" predictions seem to depend on only 2 types of adapter features: output features promoting "wait" and template features active on formatting tokens.

Jake Ward@_jake_ward·
Circuit tracing is cool, but can it be used for model diffing? We investigate mechanisms introduced during reasoning fine-tuning by training transcoder _adapters_ to faithfully reconstruct MLP output _differences_. Check it out!
Nathan Hu@NathanHu12

What does reasoning fine-tuning actually change inside a model? In our new paper, we introduce transcoder adapters to learn sparse, interpretable approximations of how reasoning fine-tuning changes MLP computation. 🧵

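The transcoder-adapter idea above can be sketched as a small sparse map trained to reconstruct the *difference* between fine-tuned and base MLP outputs from the MLP input. This is a minimal toy, assuming a ReLU feature layer with an L1 sparsity penalty; the target `diff` here is synthetic, standing in for `MLP_ft(x) - MLP_base(x)`, and all dimensions and hyperparameters are illustrative.

```python
import torch

d_model, d_feat, n = 16, 64, 256
torch.manual_seed(0)
x = torch.randn(n, d_model)            # MLP inputs
W_true = torch.randn(d_model, d_model) * 0.1
diff = x @ W_true                      # stand-in for MLP_ft(x) - MLP_base(x)

enc = torch.nn.Linear(d_model, d_feat)
dec = torch.nn.Linear(d_feat, d_model, bias=False)
opt = torch.optim.Adam([*enc.parameters(), *dec.parameters()], lr=1e-2)

for _ in range(400):
    f = torch.relu(enc(x))             # sparse feature activations
    recon = dec(f)
    # MSE on the output difference, plus an L1 penalty to keep features sparse
    loss = ((recon - diff) ** 2).mean() + 1e-3 * f.abs().mean()
    opt.zero_grad(); loss.backward(); opt.step()
```

The sparse features `f` are the objects one would then interpret (e.g. the "wait"-promoting and template features in the quoted thread); the decoder rows say how each feature writes into the MLP output change.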
Jake Ward retweeted
Adam Shai@adamimos·
A longstanding dream of interp is to decompose activations into distinct, interpretable parts. But when should we expect that to work, and what even are such parts? New from Simplex: transformers factor their world into orthogonal subspaces, even when it costs accuracy.🧵👇
GIF
Jake Ward retweeted
Roy Rinberg@RoyRinberg·
This fall @boazbaraktcs taught Harvard's first AI safety course. I was head TA, and here is a summary of the course and my reflections on what went well and less well. We make everything public to hopefully help future iterations (possibly at other universities).
Roy Rinberg tweet media
Jake Ward retweeted
Owain Evans@OwainEvans_UK·
We published a new version of our Emergent Misalignment paper in Nature! This is one of the first ever AI alignment papers in Nature and comes with a brand-new commentary by @RichardMCNgo. Here's the story of EM over the last year 🧵
Owain Evans tweet media
Jake Ward@_jake_ward·
@__RickG__ I don’t think it truly does “nothing”; there’s just not a clear interpretation when we use it to steer the base model. We originally thought it might track something abstract like “uncertainty” but didn’t find conclusive evidence of this
RicG@__RickG__·
@_jake_ward If you pump up the steering of this backtracking vector on the base model does it really do nothing? (I get it does not increase the "wait" or "but" token incidence, but does the base model really not change behaviour for reasonable steering magnitudes?)
Jake Ward retweeted
Neel Nanda@NeelNanda5·
Very cool work! Base models *can* backtrack, but often don't, a key CoT model skill. Turns out the choice to do it involves base model concepts, put to new use! Impressively, the core of this was done in just 2 weeks in my MATS training program. New applications open this week!
Jake Ward@_jake_ward

Do reasoning models like DeepSeek R1 learn their behavior from scratch? No! In our new paper, we extract steering vectors from a base model that induce backtracking in a distilled reasoning model, but surprisingly have no apparent effect on the base model itself! 🧵 (1/5)

Jake Ward@_jake_ward·
We verified that we're not just capturing a direction that directly boosts the "Wait" logit; we suspect that this direction represents some abstract concept in the base model, and the reasoning model has repurposed it for backtracking. (4/5)
Jake Ward tweet media
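The sanity check in this last tweet, that the direction is not just a direct "Wait" logit booster, can be sketched by projecting the candidate vector onto the unembedding rows and confirming it is not dominated by the "Wait" token's direction. Everything below is a toy with made-up dimensions and token ids, not the paper's actual check.

```python
import torch

torch.manual_seed(0)
d_model, vocab = 16, 100
W_U = torch.randn(vocab, d_model)   # toy unembedding; row i = token i's logit direction
wait_id = 7                          # pretend token id for "Wait"
v = torch.randn(d_model)             # candidate steering vector

def cosine(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    return (a @ b) / (a.norm() * b.norm())

# Cosine similarity between the steering vector and every token's direction.
sims = torch.stack([cosine(v, W_U[i]) for i in range(vocab)])
wait_sim = sims[wait_id]
# If v were merely a "Wait" logit booster, wait_sim would be near 1 and would
# dominate the other tokens' similarities; a small, unexceptional value is
# evidence the direction encodes something more abstract.
```

A stronger version of the check would ablate the component of `v` along `W_U[wait_id]` and confirm the steering effect survives.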