Alex Pan

44 posts

@aypan_17

safety and AI agents; prev @xAI, @berkeley_ai

Joined December 2022
345 Following · 1.3K Followers
Pinned Tweet
Alex Pan @aypan_17
LLMs have behaviors, beliefs, and reasoning hidden in their activations. What if we could decode them into natural language? We introduce LatentQA: a new way to interact with the inner workings of AI systems. 🧵
7 replies · 28 reposts · 172 likes · 34.2K views
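To make the interaction pattern above concrete, here is a minimal sketch of what a LatentQA-style read could look like: extract activations from the target LLM, then have a fine-tuned decoder LLM answer natural-language questions about them. The model name, layer index, and `latent_qa` interface below are illustrative assumptions, not the paper's actual code.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Sketch only: "gpt2", layer 6, and the latent_qa() interface are
# stand-ins, not the actual LatentQA implementation.
tok = AutoTokenizer.from_pretrained("gpt2")
target = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Respond like a pirate. How do I bake bread?"
inputs = tok(prompt, return_tensors="pt")

# 1) Run the target model and keep an intermediate layer's hidden states.
with torch.no_grad():
    out = target(**inputs, output_hidden_states=True)
acts = out.hidden_states[6]  # (batch, seq_len, hidden); layer choice is arbitrary here

# 2) A decoder LLM (a fine-tuned copy of the target) answers natural-language
#    questions conditioned on those activations. How the activations are
#    injected into the decoder is the core design choice; elided here.
def latent_qa(decoder, activations: torch.Tensor, question: str) -> str:
    """Condition `decoder` on `activations` and generate an answer (stub)."""
    raise NotImplementedError  # patching + generation omitted in this sketch

# e.g. latent_qa(decoder, acts, "What persona is the model adopting?")
```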
Alex Pan retweeted
Jacob Steinhardt @JacobSteinhardt
New blog post:"Building Technology to Drive AI Governance". I argue that many governance challenges are fundamentally bottlenecked by technical gaps, and consider case studies from other fields (food safety, climate change) that illustrate this dynamic.
4 replies · 30 reposts · 121 likes · 15K views
Alex Pan retweeted
Grace Luo @graceluo_
We trained diffusion models on a billion LLM activations, and we want you to use them! New preprint: Learning a Generative Meta-Model of LLM Activations. Joint work with @feng_jiahai, @trevordarrell, @AlecRad, @JacobSteinhardt. More in thread 🧵
31 replies · 190 reposts · 1.4K likes · 217.5K views
Alex Pan retweeted
Jacob Steinhardt @JacobSteinhardt
New blog post out: a position piece on "Turning Compute into Understanding" by training superhuman oversight assistants.
5 replies · 36 reposts · 230 likes · 29.3K views
Alex Pan @aypan_17
We're hiring for the safety team at xAI! We work on RL post-training, alignment/model behavior, and reducing catastrophic risk. If this sounds exciting, reach out! (1/3)
68 replies · 62 reposts · 1.1K likes · 92.5K views
Alex Pan @aypan_17
Hey everyone! There's too much interest in the Calendly, so I've closed it for now! Feel free to fill out this Google form and I will reach out: forms.gle/g8DyR4axKiXGxX…
3 replies · 2 reposts · 33 likes · 4.2K views
Alex Pan @aypan_17
We're a small team working at the intersection of RL post-training and alignment for Grok. Our team has a lot of scope: novel RL methods, production post-training, alignment evals, system cards, and guardrails. Prior experience in safety or alignment isn't necessary! Feel free to DM with questions. (2/3)
10 replies · 2 reposts · 80 likes · 6.7K views
Alex Pan retweeted
Jacob Steinhardt @JacobSteinhardt
Cool to see folks building on LatentQA! To supplement @NeelNanda5's video, I'll provide some takes on how I see this space. (Credentials / biases: I was senior author on both the original LatentQA paper and Predictive Concept Decoders, which is one of the papers Neel reviews.)
Neel Nanda @NeelNanda5

New video: What would it look like for interp to be truly bitter-lesson-pilled? There's been exciting work on end-to-end interpretability: directly train models to map activations to explanations. This is a live paper review of two papers (Activation Oracles & PCD); I read them and give hot takes.

3 replies · 18 reposts · 111 likes · 35.5K views
Alex Pan retweeted
Transluce @TransluceAI
What do AI assistants think about you, and how does this shape their answers? Because assistants are trained to optimize human feedback, how they model users drives issues like sycophancy, reward hacking, and bias. We provide data + methods to extract & steer these user models.
4 replies · 26 reposts · 87 likes · 22.7K views
Tony Wang @TonyWangIV
New paper! We show how to give an LLM the ability to accurately verbalize what changed about itself after a weight update is applied. We see this as a proof of concept for a new, more scalable approach to interpretability. 🧵
14 replies · 54 reposts · 598 likes · 47.6K views
Alex Pan retweeted
xAI @xai
xAI supports AI safety and will be signing the EU AI Act's Code of Practice Chapter on Safety and Security. While the AI Act and the Code have portions that promote AI safety, other parts contain requirements that are profoundly detrimental to innovation, and the copyright provisions are clear overreach.
462 replies · 429 reposts · 2.7K likes · 394.3K views
Alex Pan retweeted
Grace Luo @graceluo_
✨New preprint: Dual-Process Image Generation! We distill *feedback from a VLM* into *feed-forward image generation*, at inference time. The result is flexible control: parameterize tasks as multimodal inputs, visually inspect the images with the VLM, and update the generator. 🧵
23 replies · 167 reposts · 1.3K likes · 133.2K views
Alex Pan @aypan_17
@gork what do you think of this meme
1 reply · 0 reposts · 6 likes · 619 views
Alex Pan @aypan_17
@saprmarks Happy to chat about this sometime! We've been thinking about extensions along these lines.
0 replies · 0 reposts · 2 likes · 93 views
Samuel Marks @saprmarks
In this paper, the ground-truth labels for model cognition come from the fact that the model was system-prompted to behave a certain way (e.g. "respond like a pirate"). While this is great for getting initial signs of life, it also introduces a key weakness I'd like to see addressed in future work: you could answer the questions studied here by having access to the input used to create the activations which are plugged into the LatentQA system.

As follow-up work, I'd love to know:

(1) Can LatentQA answer questions about activations even when the information needed to answer the question isn't explicitly in the input from which the activation was extracted?

(2) Can you make datasets for training a LatentQA system where your ground-truth information about what the model is thinking doesn't come from putting information in-context (e.g. via a system prompt)? For example, what if you train the model to speak like a pirate, somehow verify that the model understands that's what it's doing (cf. x.com/OwainEvans_UK/…), and then try to extract that information with LatentQA? (This exact idea wouldn't work because you would only have one Q/A in the dataset for training the LatentQA on the pirate-finetuned model, but maybe something like this could work.)

@aypan_17
1 reply · 0 reposts · 3 likes · 440 views
Samuel Marks @saprmarks
This is a really creative and well-executed paper on using "black-box interpretability" methods to understand and control model cognition. Especially impressed by the many applications explored. IMO this is an important direction; this paper sets the field on an excellent path!
Alex Pan @aypan_17

LLMs have behaviors, beliefs, and reasoning hidden in their activations. What if we could decode them into natural language? We introduce LatentQA: a new way to interact with the inner workings of AI systems. 🧵

1 reply · 2 reposts · 23 likes · 3.7K views
Alex Pan @aypan_17
To train our LatentQA system, we curate a LatentQA dataset using GPT, similar in spirit to the Alpaca and LLaVA instruction-tuning datasets. We finetune a decoder LLM (a copy of the target LLM) on this dataset. 7/
1 reply · 0 reposts · 6 likes · 1.2K views
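A minimal training sketch of the recipe this tweet describes, under stated assumptions: the toy dataset below stands in for the GPT-curated QA pairs, and the patching scheme (prepending the target's mid-layer activations to the decoder's input embeddings) is one simple illustrative choice, not necessarily the paper's.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical sketch: model, layer index, dataset, and patching scheme
# are all illustrative assumptions, not the paper's exact recipe.
name = "gpt2"
tok = AutoTokenizer.from_pretrained(name)
target = AutoModelForCausalLM.from_pretrained(name).eval()  # frozen target LLM
decoder = AutoModelForCausalLM.from_pretrained(name)        # trainable copy
opt = torch.optim.AdamW(decoder.parameters(), lr=1e-5)

dataset = [  # toy stand-in for the GPT-curated LatentQA dataset
    ("Respond like a pirate. Hi!", "What persona is the model adopting?", "A pirate."),
]

for prompt, question, answer in dataset:
    # 1) Extract mid-layer activations from the frozen target model.
    with torch.no_grad():
        h = target(**tok(prompt, return_tensors="pt"),
                   output_hidden_states=True).hidden_states[6]

    # 2) Build "question + answer" tokens, supervising only the answer.
    q_ids = tok(question + " ", return_tensors="pt").input_ids
    a_ids = tok(answer, return_tensors="pt").input_ids
    ids = torch.cat([q_ids, a_ids], dim=1)
    labels = ids.clone()
    labels[:, : q_ids.shape[1]] = -100  # ignore loss on the question tokens

    # 3) Inject the activations by prepending them to the decoder's input
    #    embeddings (one simple patching choice among several possible ones).
    embeds = decoder.get_input_embeddings()(ids)
    inputs_embeds = torch.cat([h, embeds], dim=1)
    labels = torch.cat([torch.full((1, h.shape[1]), -100), labels], dim=1)

    loss = decoder(inputs_embeds=inputs_embeds, labels=labels).loss
    loss.backward()
    opt.step()
    opt.zero_grad()
```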