
awesome point - we did verify this by comparing autoregressive decoding (KV cache), the full-sequence forward pass (our two-pass), and a truncated forward pass under "deterministic" settings:
Tokens: perfect match across all three methods.
Hidden states: not bit-identical (float16 computation-order differences), but worst-case cosine similarity > 0.99 and KL < 0.001. That's far too small to affect probe accuracy or PCA geometry.
So yes, the three are mathematically equivalent via causal masking, but the outputs differ slightly, most likely because of floating-point computation order. The gap is too small to change our observations, but be careful if you use this in settings where high precision matters.
we will probably add these to the appendix! thanks!
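If anyone wants to sanity-check the causal-masking equivalence themselves, here's a toy sketch. It is not our eval code: it's a single attention head in pure Python (float64), and the function names, the random inputs, and the choice to compute KL over a softmax of the output vector are all assumptions made for illustration. In this toy the three paths do the same arithmetic, so the gap should be essentially zero; the small float16 differences show up in real kernels where the reduction order differs.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(q, keys, values):
    # One query attending over the keys/values it is allowed to see.
    weights = softmax([dot(q, k) / math.sqrt(len(q)) for k in keys])
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

def full_forward(qs, ks, vs):
    # Full-sequence pass: the causal mask means position t sees 0..t.
    return [attend(qs[t], ks[:t + 1], vs[:t + 1]) for t in range(len(qs))]

def cached_forward(qs, ks, vs):
    # Autoregressive pass: grow a KV cache one position at a time.
    k_cache, v_cache, outs = [], [], []
    for q, k, v in zip(qs, ks, vs):
        k_cache.append(k)
        v_cache.append(v)
        outs.append(attend(q, k_cache, v_cache))
    return outs

def truncated_forward(qs, ks, vs):
    # Rerun the full pass on each prefix and keep only the last position.
    return [full_forward(qs[:t + 1], ks[:t + 1], vs[:t + 1])[-1]
            for t in range(len(qs))]

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def kl(p_logits, q_logits):
    # Assumption: treat the vectors as logits and compare their softmaxes.
    p, q = softmax(p_logits), softmax(q_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def worst_case(ref, other):
    # Worst-case (minimum) cosine similarity and worst-case (maximum) KL.
    min_cos = min(cosine(a, b) for a, b in zip(ref, other))
    max_kl = max(kl(a, b) for a, b in zip(ref, other))
    return min_cos, max_kl

random.seed(0)
T, D = 8, 16  # hypothetical sequence length and head dimension
qs, ks, vs = ([[random.gauss(0, 1) for _ in range(D)] for _ in range(T)]
              for _ in range(3))

full = full_forward(qs, ks, vs)
results = {
    "kv_cache": worst_case(full, cached_forward(qs, ks, vs)),
    "truncated": worst_case(full, truncated_forward(qs, ks, vs)),
}
for name, (min_cos, max_kl) in results.items():
    print(f"{name}: min cosine = {min_cos:.6f}, max KL = {max_kl:.2e}")
```

To see a real gap you'd swap the toy attention for an actual model in float16 and feed the per-position hidden states from each method into `worst_case`.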