Excited to announce my first preprint in LM interpretability!

Latent reasoning models are not monitorable by default, since they don't reason in human-readable natural language. But can we make progress in understanding their intermediate reasoning steps using mech interp?

Overall, these results are somewhat encouraging for the interpretability of latent reasoning models. But I suspect models with weaker natural language priors, such as those trained to do latent reasoning during pretraining or through RL, will be much less interpretable.