Amber Li retweeted

We trained a decoder to read the internal activations of an LLM and answer questions about what the model will think about or do next.
We find that this decoder can understand LLM behaviors even when the model itself is confused (for instance, when it has been jailbroken).

Transluce @TransluceAI
Transluce is developing end-to-end interpretability approaches that directly train models to make predictions about AI behavior. Today we introduce Predictive Concept Decoders (PCD), a new architecture that embodies this approach.
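The tweet does not describe the decoder's architecture, but the core idea of reading behavior from internal activations can be illustrated with a minimal sketch: train a small probe on hidden-state vectors to predict a behavioral label. Everything here is a hypothetical stand-in (synthetic activations, a logistic-regression "decoder"), not Transluce's actual PCD method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for LLM hidden activations: 64-dim vectors whose
# behavioral label depends on a hidden linear direction, mimicking a
# model internally "representing" an upcoming behavior.
d, n = 64, 1000
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))          # fake activations
y = (X @ w_true > 0).astype(float)   # fake behavior label (e.g., "will refuse")

# Minimal "decoder": a logistic-regression probe fit by gradient descent.
w = np.zeros(d)
lr = 0.5
for _ in range(200):
    p = 1 / (1 + np.exp(-(X @ w)))   # predicted probability of the behavior
    w -= lr * X.T @ (p - y) / n      # gradient step on the logistic loss

acc = ((X @ w > 0) == (y == 1)).mean()
```

A real predictive decoder would read activations from an actual model and answer open-ended questions about its behavior; this sketch only shows the basic read-activations-then-predict pattern.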
