

Daking Rai
@DakingRai
CS PhD Student @GeorgeMasonU

🚨New EMNLP 2025 Paper: When a human does mental math like 12+45-8, we tend to do it stepwise: first compute 12+45=57, then 57-8=49. Does an LLM do the same? Turns out it doesn't. But how does it work? Our paper investigates exactly this! 🧵(1/10)
Paper: arxiv.org/abs/2509.09650
Code: github.com/siddarth-pm/al…

When a language model solves a math problem "in its head," where in the network does the actual calculation happen? This paper finds that almost all of the real math is done at the very last token of the sequence, not spread out across all the tokens. The earlier tokens spend many layers simply holding information and doing general setup. Then, in just two middle layers, they pass their information to the last token, which finishes the calculation on its own and produces the answer.

To test this, the authors built two techniques: Context-Aware Mean Ablation (CAMA) and Attention-Based Peeking (ABP). These methods force the model to operate only in restricted ways, revealing which parts are essential. With these tools, they discovered a sparse circuit they call All-for-One (AF1). The circuit is surprisingly efficient: most of the network can idle, only a couple of layers are needed to hand off information, and the final token does the job.

This works well on plain arithmetic like "42 + 20 - 15". But the shortcut fails if the problem is written as a word problem or embedded in Python code, because then the model also needs to understand language or programming context. In short, the big insight is that language models don't spread math work across the whole sequence. Instead, they rely heavily on the last token, with just a brief moment of information passing from the earlier ones.

----
Paper: arxiv.org/abs/2509.09650
Paper title: "All for One: LLMs Solve Mental Math at the Last Token With Information Transferred From Other Tokens"
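The thread doesn't include code, but the two probing ideas can be sketched in a few lines. This is a hypothetical, simplified illustration, not the paper's implementation: the names `mean_ablate` and `abp_attention_mask` are my own, and plain Python lists stand in for hidden-state tensors.

```python
# Simplified sketches of the ideas behind CAMA and ABP (not the paper's code).

def mean_ablate(batch_states, positions):
    """Mean ablation (simplified): replace each example's activation at the
    given token positions with the batch-mean activation at that position,
    erasing example-specific information while keeping a plausible
    'average' state at those tokens."""
    n = len(batch_states)
    dim = len(batch_states[0][0])
    ablated = []
    for states in batch_states:
        new_states = [vec[:] for vec in states]  # copy per-token vectors
        for p in positions:
            new_states[p] = [
                sum(example[p][d] for example in batch_states) / n
                for d in range(dim)
            ]
        ablated.append(new_states)
    return ablated


def abp_attention_mask(seq_len, layer, transfer_layers):
    """Attention restriction (simplified): build a per-layer boolean mask
    (mask[i][j] == True lets token i attend to token j). Outside the
    designated transfer layers, the last token may attend only to itself,
    so information can reach it only during those layers."""
    mask = [[j <= i for j in range(seq_len)] for i in range(seq_len)]  # causal
    if layer not in transfer_layers:
        last = seq_len - 1
        mask[last] = [j == last for j in range(seq_len)]
    return mask
```

The logic of the experiment: if ablating the early tokens and blocking the last token's attention everywhere except two specific layers leaves the model's answer unchanged, then those two layers are where the information hand-off happens, matching the AF1 picture described above.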


