Nick Jiang
@nickhjiang
280 posts

probing machines @stanford

Joined July 2019
347 Following · 1.1K Followers
Pinned Tweet
Nick Jiang @nickhjiang:
New work! What if we used sparse autoencoders to analyze data, not models—where SAE latents act as a large set of data labels 🏷️? We find that SAEs beat baselines on 4 data analysis tasks and uncover surprising, qualitative insights about models (e.g. Grok-4, OpenAI) from data.
[image] · 13 replies · 36 reposts · 248 likes · 75.8K views
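The pinned tweet's idea (SAE latents acting as a large set of data labels) can be sketched roughly like this. Everything below is a toy stand-in, not the paper's actual code: the encoder weights are random rather than trained, and the dimensions are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: 768-dim document embeddings, 8192 SAE latents.
D_MODEL, D_SAE = 768, 8192
W_enc = rng.standard_normal((D_MODEL, D_SAE)) / np.sqrt(D_MODEL)
b_enc = np.zeros(D_SAE)

def sae_labels(doc_embedding, k=5):
    """Encode one embedding and return the top-k active latent indices.

    Each returned index acts as a 'label' for the document; with a
    trained SAE these indices would correspond to interpretable
    properties of the text.
    """
    acts = np.maximum(doc_embedding @ W_enc + b_enc, 0.0)  # ReLU encoder
    top = np.argsort(acts)[::-1][:k]
    return [int(i) for i in top if acts[i] > 0]

doc = rng.standard_normal(D_MODEL)  # stand-in for a real embedding
labels = sae_labels(doc)
print(len(labels))  # at most k latent indices, i.e. data labels
```

With a trained SAE, each active latent index would map to a human-readable description, so the returned indices behave like automatically discovered labels rather than a predefined taxonomy.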
Etash Guha @etash_guha:
Career Update: I’m joining Anthropic on the pretraining team! Excited to learn from all the brilliant and creative people there. Let’s go train some models!
[image] · 69 replies · 7 reposts · 734 likes · 34.5K views
Atticus Wang @atticuswzf:
Is "a response formatted like this" sometimes better than "a response formatted like this"? To a reward model, yes! RMs are instrumental in shaping model behaviors and alignment. Our paper makes progress uncovering their unexpected preferences. 🧵(1/9)
[image] · 8 replies · 12 reposts · 92 likes · 14K views
Nick Jiang reposted
Neil Rathi @neil_rathi:
New paper, w/ @AlecRad: Models acquire a lot of capabilities during pretraining. We show that we can precisely shape what they learn simply by filtering their training data at the token level.
[image] · 27 replies · 98 reposts · 1.1K likes · 105K views
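The token-level filtering Neil describes can be illustrated with a toy sketch. The predicate and token ids below are invented for illustration; the paper's actual filtering criterion is not shown in the tweet.

```python
def filter_tokens(token_ids, is_blocked):
    """Remove individual tokens from a training sequence.

    token_ids: list of int token ids
    is_blocked: predicate deciding which tokens to drop
    Unlike document-level filtering, the rest of the sequence survives.
    """
    return [t for t in token_ids if not is_blocked(t)]

# Hypothetical: pretend even ids carry the capability we want to remove.
seq = [3, 8, 1, 4, 7, 2, 9]
filtered = filter_tokens(seq, lambda t: t % 2 == 0)
print(filtered)  # [3, 1, 7, 9]
```

The point of the token granularity is that a document with a few unwanted tokens is mostly kept, rather than discarded wholesale.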
Y Combinator @ycombinator:
🌕 @gru_space is building durable space habitats so humans can one day live on the Moon and Mars. Its first missions will mine lunar regolith to construct a long-term pressurized habitat on the Moon for commercial space tourism — a hotel on the Moon. Congrats on the launch @skyler_chan_! ycombinator.com/launches/P9g-g…
112 replies · 97 reposts · 619 likes · 131.8K views
Nick Jiang reposted
Neel Nanda @NeelNanda5:
I'm really excited about this paper! It's an example of data-centric interpretability, which IMO is a really impactful new area: models have tons of relevant data, what can we learn by analysing it? Turns out there's a lot you can do if you're creative! eg SAEs on closed models
[Quoting Nick Jiang @nickhjiang's pinned tweet above.]
7 replies · 13 reposts · 178 likes · 21.6K views
Nick Jiang @nickhjiang:
@TheGrizztronic No, the embeddings are reusable. You can view the reader model + SAE as just a bigger embedding model.
0 replies · 0 reposts · 1 like · 21 views
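Nick's point that the embeddings are reusable can be sketched as follows. The shapes and random weights here are hypothetical stand-ins for a real reader model + SAE: documents are encoded once into cached sparse activations, and only the query is encoded at search time.

```python
import numpy as np

rng = np.random.default_rng(1)
D_MODEL, D_SAE = 64, 512
W_enc = rng.standard_normal((D_MODEL, D_SAE)) / np.sqrt(D_MODEL)

def embed(vec):
    """'Reader model + SAE' viewed as one bigger embedding model:
    dense vector in, SAE activation vector out."""
    return np.maximum(vec @ W_enc, 0.0)

# One-time cost: encode every document once and cache the result.
docs = rng.standard_normal((100, D_MODEL))
doc_cache = np.stack([embed(d) for d in docs])

def search(query_vec, top_n=3):
    """Queries only encode themselves; cached doc activations are reused."""
    q = embed(query_vec)
    scores = doc_cache @ q
    return np.argsort(scores)[::-1][:top_n]

hits = search(rng.standard_normal(D_MODEL))
print(hits.shape)  # (3,)
```

Because `doc_cache` never depends on the query, the documents do not need to be re-encoded per query, which is the reuse being described.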
Josh Cason @TheGrizztronic:
@nickhjiang Does this mean the docs need to be passed back through the reader for each query?
1 reply · 0 reposts · 0 likes · 16 views
Xianjun Yang @xianjun_agi:
Cool! "What if we used sparse autoencoders to analyze data, not models?" We also have a paper using SAEs to analyze data earlier this year: arxiv.org/abs/2502.14050 This shows interpretability is useful for downstream tasks.
[Quoting Nick Jiang @nickhjiang's pinned tweet above.]
2 replies · 2 reposts · 30 likes · 5.3K views
Nick Jiang @nickhjiang:
Yup! An easy extension could be finding which qualities have been decreasing across models, for example. We also chose frequency across documents as our metric for the diffing experiments, but it wouldn't be too hard to pick something else (e.g., frequency within each doc, if the docs are long).
0 replies · 0 reposts · 1 like · 432 views
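The diffing metric mentioned here (the fraction of documents in which each latent fires) might look roughly like this on toy data; the activations below are random stand-ins, not real SAE outputs:

```python
import numpy as np

rng = np.random.default_rng(2)
D_SAE = 100

def latent_frequencies(acts):
    """Fraction of documents in which each latent is active.

    acts: (n_docs, d_sae) SAE activations, one row per document.
    """
    return (acts > 0).mean(axis=0)

# Toy activations for responses from two models (hypothetical data).
acts_model_a = np.maximum(rng.standard_normal((500, D_SAE)), 0)
acts_model_b = np.maximum(rng.standard_normal((500, D_SAE)) - 0.5, 0)

# Diff: latents that fire much more often for model A than for model B.
diff = latent_frequencies(acts_model_a) - latent_frequencies(acts_model_b)
top_latents = np.argsort(diff)[::-1][:5]
print(top_latents)  # candidate properties distinguishing the two models
```

Swapping the metric as Nick suggests would just mean replacing `latent_frequencies` with, say, a per-document firing rate averaged over tokens.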
Theodore Galanos @TheodoreGalanos:
@nickhjiang This is beautiful! Could a variation of this be used to assess and understand task performance across models?
1 reply · 0 reposts · 0 likes · 523 views
Nick Jiang @nickhjiang:
@floringham We sampled 1,000 prompts from Chatbot Arena when generating the responses, so it probably wouldn't change the results much. I think the larger concern is that Chatbot Arena isn't representative of real user prompts (unfortunately, we don't have access to those).
1 reply · 0 reposts · 0 likes · 237 views
Inaya @floringham:
@nickhjiang Interesting work! In Case Study 1, I wonder: if you try slightly different wordings for the prompt, does it change the model's behaviour much?
1 reply · 0 reposts · 0 likes · 348 views
Nick Jiang @nickhjiang:
@dosdesvios You could, but LDA and topic modeling tend to give broad semantic topics. SAE latents tend to be more granular and property-like (there are also more of them). We compared SAEs with CTMs in our correlations task and also found that CTMs were noisier.
0 replies · 0 reposts · 1 like · 57 views
Dos desvíos @dosdesvios:
@nickhjiang Thanks for your answer! For that purpose, couldn't I use LDA or any other topic modeling technique?
1 reply · 0 reposts · 1 like · 79 views
Nick Jiang @nickhjiang:
@dosdesvios Great question! The advantage of these labels is that you don't need to pre-define them, meaning that you can find insights about your data without any priors.
1 reply · 0 reposts · 2 likes · 606 views
Dos desvíos @dosdesvios:
@nickhjiang Cool work! One question: why would SAE labels be more interesting than any other type of label that I could come up with?
1 reply · 0 reposts · 1 like · 679 views
Nick Jiang reposted
Lisa Dunlap @lisabdunlap:
🧵Tired of scrolling through your horribly long model traces in VSCode to figure out why your model failed? We made StringSight to fix this: an automated pipeline for analyzing your model outputs at scale. ➡️Demo: stringsight.com ➡️Blog: blog.stringsight.com
3 replies · 37 reposts · 91 likes · 27.5K views