Alex Pan

44 posts

@aypan_17

safety and AI agents; prev @xAI, @berkeley_ai

Joined December 2022
345 Following · 1.3K Followers
Pinned Tweet
Alex Pan @aypan_17
LLMs have behaviors, beliefs, and reasoning hidden in their activations. What if we could decode them into natural language? We introduce LatentQA: a new way to interact with the inner workings of AI systems. 🧵
7 replies · 28 reposts · 172 likes · 34.2K views
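To make the interaction pattern above concrete, here is a minimal sketch of what a LatentQA-style read could look like: extract activations from the target LLM, then have a fine-tuned decoder LLM answer natural-language questions about them. The model name, layer index, and `latent_qa` interface below are illustrative assumptions, not the paper's actual code.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Sketch only: "gpt2", layer 6, and the latent_qa() interface are
# stand-ins, not the actual LatentQA implementation.
tok = AutoTokenizer.from_pretrained("gpt2")
target = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Respond like a pirate. How do I bake bread?"
inputs = tok(prompt, return_tensors="pt")

# 1) Run the target model and keep an intermediate layer's hidden states.
with torch.no_grad():
    out = target(**inputs, output_hidden_states=True)
acts = out.hidden_states[6]  # (batch, seq_len, hidden); layer choice is arbitrary here

# 2) A decoder LLM (a fine-tuned copy of the target) answers natural-language
#    questions conditioned on those activations. How the activations are
#    injected into the decoder is the core design choice; elided here.
def latent_qa(decoder, activations: torch.Tensor, question: str) -> str:
    """Condition `decoder` on `activations` and generate an answer (stub)."""
    raise NotImplementedError  # patching + generation omitted in this sketch

# e.g. latent_qa(decoder, acts, "What persona is the model adopting?")
```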
Alex Pan retweeted
Jacob Steinhardt @JacobSteinhardt
New blog post:"Building Technology to Drive AI Governance". I argue that many governance challenges are fundamentally bottlenecked by technical gaps, and consider case studies from other fields (food safety, climate change) that illustrate this dynamic.
4 replies · 30 reposts · 121 likes · 15K views
Alex Pan retweeted
Grace Luo @graceluo_
We trained diffusion models on a billion LLM activations, and we want you to use them! New preprint: Learning a Generative Meta-Model of LLM Activations. Joint work with @feng_jiahai, @trevordarrell, @AlecRad, @JacobSteinhardt. More in thread 🧵
31 replies · 190 reposts · 1.4K likes · 217.5K views
Alex Pan retweeted
Jacob Steinhardt @JacobSteinhardt
New blog post out: a position piece on "Turning Compute into Understanding" by training superhuman oversight assistants.
5 replies · 36 reposts · 230 likes · 29.3K views
Alex Pan @aypan_17
We're hiring for the safety team at xAI! We work on RL post-training, alignment/model behavior, and reducing catastrophic risk. If this sounds exciting, reach out! (1/3)
68 replies · 62 reposts · 1.1K likes · 92.5K views
Alex Pan @aypan_17
Hey everyone! There's too much interest in the Calendly, so I've closed it for now! Feel free to fill out this Google form and I will reach out: forms.gle/g8DyR4axKiXGxX…
3 replies · 2 reposts · 33 likes · 4.2K views
Alex Pan @aypan_17
We're a small team working at the intersection of RL post-training and alignment for Grok. Our team has a lot of scope: novel RL methods, production post-training, alignment evals, system cards, and guardrails. Prior experience in safety or alignment isn't necessary! Feel free to DM with questions. (2/3)
10 replies · 2 reposts · 80 likes · 6.7K views
Alex Pan retweeted
Jacob Steinhardt @JacobSteinhardt
Cool to see folks building on LatentQA! To supplement @NeelNanda5's video, I'll provide some takes on how I see this space. (Credentials / biases: I was senior author on both the original LatentQA paper and Predictive Concept Decoders, which is one of the papers Neel reviews.)
Neel Nanda @NeelNanda5

New video: What would it look like for interp to be truly bitter-lesson-pilled? There's been exciting work on end-to-end interpretability: directly train models to map activations to explanations. This is a live paper review of two papers (Activation Oracles & PCD); I read them and give hot takes.

3 replies · 18 reposts · 111 likes · 35.5K views
Alex Pan retweeted
Transluce @TransluceAI
What do AI assistants think about you, and how does this shape their answers? Because assistants are trained to optimize human feedback, how they model users drives issues like sycophancy, reward hacking, and bias. We provide data + methods to extract & steer these user models.
4 replies · 26 reposts · 87 likes · 22.7K views
Tony Wang @TonyWangIV
New paper! We show how to give an LLM the ability to accurately verbalize what changed about itself after a weight update is applied. We see this as a proof of concept for a new, more scalable approach to interpretability. 🧵
14 replies · 54 reposts · 598 likes · 47.6K views
Alex Pan retweeted
xAI @xai
xAI supports AI safety and will be signing the EU AI Act's Code of Practice Chapter on Safety and Security. While the AI Act and the Code have portions that promote AI safety, other parts contain requirements that are profoundly detrimental to innovation, and the copyright provisions are clear overreach.
462 replies · 429 reposts · 2.7K likes · 394.3K views
Alex Pan retweeted
Grace Luo @graceluo_
✨New preprint: Dual-Process Image Generation! We distill *feedback from a VLM* into *feed-forward image generation*, at inference time. The result is flexible control: parameterize tasks as multimodal inputs, visually inspect the images with the VLM, and update the generator. 🧵
23 replies · 167 reposts · 1.3K likes · 133.2K views
Alex Pan @aypan_17
@gork what do you think of this meme
1 reply · 0 reposts · 6 likes · 619 views
Alex Pan @aypan_17
@saprmarks Happy to chat about this sometime! We've been thinking about extensions along these lines.
0 replies · 0 reposts · 2 likes · 93 views
Samuel Marks @saprmarks
In this paper, the ground-truth labels for model cognition come from the fact that the model was system-prompted to behave a certain way (e.g. "respond like a pirate"). While this is great for getting initial signs of life, it also introduces a key weakness I'd like to see addressed in future work: you could answer the questions studied here by having access to the input used to create the activations which are plugged into the LatentQA system.

As follow-up work, I'd love to know:

(1) Can LatentQA answer questions about activations even when the information needed to answer the question isn't explicitly in the input from which the activation was extracted?

(2) Can you make datasets for training a LatentQA system where your ground-truth information about what the model is thinking doesn't come from putting information in-context (e.g. via a system prompt)? For example, what if you train the model to speak like a pirate, somehow verify that the model understands that's what it's doing (cf. x.com/OwainEvans_UK/…), and then try to extract that information with LatentQA? (This exact idea wouldn't work because you would only have one Q/A in the dataset for training the LatentQA on the pirate-finetuned model, but maybe something like this could work.)

@aypan_17
1 reply · 0 reposts · 3 likes · 440 views
Samuel Marks @saprmarks
This is a really creative and well-executed paper on using "black-box interpretability" methods to understand and control model cognition. Especially impressed by the many applications explored. IMO this is an important direction; this paper sets the field on an excellent path!
Alex Pan @aypan_17

LLMs have behaviors, beliefs, and reasoning hidden in their activations. What if we could decode them into natural language? We introduce LatentQA: a new way to interact with the inner workings of AI systems. 🧵

1 reply · 2 reposts · 23 likes · 3.7K views
Alex Pan @aypan_17
To train our LatentQA system, we curate a LatentQA dataset using GPT, similar in spirit to the Alpaca and LLaVA instruction-tuning datasets. We finetune a decoder LLM (a copy of the target LLM) on this dataset. 7/
1 reply · 0 reposts · 6 likes · 1.2K views
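A minimal training sketch of the recipe this tweet describes, under stated assumptions: the toy dataset below stands in for the GPT-curated QA pairs, and the patching scheme (prepending the target's mid-layer activations to the decoder's input embeddings) is one simple illustrative choice, not necessarily the paper's.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical sketch: model, layer index, dataset, and patching scheme
# are all illustrative assumptions, not the paper's exact recipe.
name = "gpt2"
tok = AutoTokenizer.from_pretrained(name)
target = AutoModelForCausalLM.from_pretrained(name).eval()  # frozen target LLM
decoder = AutoModelForCausalLM.from_pretrained(name)        # trainable copy
opt = torch.optim.AdamW(decoder.parameters(), lr=1e-5)

dataset = [  # toy stand-in for the GPT-curated LatentQA dataset
    ("Respond like a pirate. Hi!", "What persona is the model adopting?", "A pirate."),
]

for prompt, question, answer in dataset:
    # 1) Extract mid-layer activations from the frozen target model.
    with torch.no_grad():
        h = target(**tok(prompt, return_tensors="pt"),
                   output_hidden_states=True).hidden_states[6]

    # 2) Build "question + answer" tokens, supervising only the answer.
    q_ids = tok(question + " ", return_tensors="pt").input_ids
    a_ids = tok(answer, return_tensors="pt").input_ids
    ids = torch.cat([q_ids, a_ids], dim=1)
    labels = ids.clone()
    labels[:, : q_ids.shape[1]] = -100  # ignore loss on the question tokens

    # 3) Inject the activations by prepending them to the decoder's input
    #    embeddings (one simple patching choice among several possible ones).
    embeds = decoder.get_input_embeddings()(ids)
    inputs_embeds = torch.cat([h, embeds], dim=1)
    labels = torch.cat([torch.full((1, h.shape[1]), -100), labels], dim=1)

    loss = decoder(inputs_embeds=inputs_embeds, labels=labels).loss
    loss.backward()
    opt.step()
    opt.zero_grad()
```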