


deep Manifold
13.7K posts

@BetaTomorrow
Mathematics thief & chef. "Through the window of differential equations, mathematicians see the light of the real world" (Jiang Zehan)





Remarkably, our ICML'26 paper CCDD also establishes connections between looped models, latent reasoning, and diffusion language models. Check it out: arxiv.org/abs/2510.03206


Very cool training-free extension to TRM. By injecting noise into the latent space, TRMs can explore a wider set of basins, and the exit head can then identify which trajectories succeeded. Feels like unlocking an entirely new scaling axis. Awesome work! 🔗 arxiv.org/pdf/2605.19943
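For readers who want the mechanism in the post above spelled out, here is a minimal toy sketch of "inject latent noise, roll out, let a scoring head pick the winner". It is not the paper's method: the scalar latent, the double-well energy, the constants, and every name (`step`, `rollout`, `exit_score`, `noisy_search`) are assumptions made for illustration, and the energy-based score only stands in for a learned exit head.

```python
import random

def step(z, x, lr=0.02):
    """One weight-tied update: a gradient step on the tilted double-well
    energy E(z) = (z^2 - 1)^2 - x*z (hypothetical stand-in for a recursive block)."""
    grad = 4.0 * z * (z * z - 1.0) - x
    z = z - lr * grad
    return max(-3.0, min(3.0, z))  # assumption: keep the toy latent bounded

def rollout(z0, x, n_steps=500):
    """Apply the same update repeatedly until the latent settles in a basin."""
    z = z0
    for _ in range(n_steps):
        z = step(z, x)
    return z

def exit_score(z, x):
    """Stand-in for an exit head: lower energy means a more confident answer."""
    return -((z * z - 1.0) ** 2 - x * z)

def noisy_search(x, z_init=-0.5, n_samples=16, sigma=1.0, seed=0):
    """Perturb the initial latent, roll out every start, keep the best-scored end state.
    The unperturbed start is always included, so the search never scores worse than it."""
    rng = random.Random(seed)
    starts = [z_init] + [z_init + rng.gauss(0.0, sigma) for _ in range(n_samples)]
    finals = [rollout(z0, x) for z0 in starts]
    return max(finals, key=lambda z: exit_score(z, x))
```

In this toy, with `x = 0.3` the unperturbed start at `-0.5` settles in the shallower basin near `z ≈ -1`, while any noisy start that crosses the barrier ends near `z ≈ +1` and wins on score. No weights change; only the number of sampled trajectories does.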













New Anthropic Fellows paper with Jack Lindsey on agency in LLMs! 🧵 Paper link: arxiv.org/abs/2605.25459







quick writeup on why i think diffusion isn't more data efficient than AR, since it seemed to surprise a lot of people:

- the case for diffusion > AR ([1], [2]) rests on AR saturating at <5 epochs while diffusion can be trained for hundreds of epochs without overfitting. but that's AR with default regularization. with Slowrun we train AR for >30 epochs without overfitting using heavy regularization (15x standard weight decay and dropout), which captures the gains diffusion gets over hundreds of epochs. you can't push reg this hard on diffusion, the objective is already effectively regularizing the network
- data augmentation is another lever that helps AR models: sequence permutation and token masking close a lot of the gap even without heavy regularization
- [3] verifies this cleanly: simple dropout, weight decay, and token masking were enough to bridge the gap and even *surpass* diffusion. aligns with what we've seen

[1] arxiv.org/abs/2511.03276
[2] arxiv.org/abs/2507.15857
[3] arxiv.org/abs/2510.04071
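As a rough illustration of the token-masking augmentation mentioned in the writeup above, the sketch below corrupts only the inputs of a next-token training pair and leaves the targets clean. The writeup does not spell out a recipe, so the function name `masked_ar_example`, the `mask_id` convention, and the default rate `p=0.15` are all assumptions rather than what Slowrun or the cited papers actually use.

```python
import random

def masked_ar_example(tokens, mask_id, p=0.15, rng=None):
    """Build one autoregressive (inputs, targets) pair with input-side masking.

    Each input token is independently replaced by `mask_id` with probability `p`
    (an assumed rate); the targets are the original next tokens, so the model
    still has to predict the clean sequence from a partially hidden context.
    """
    rng = rng or random.Random()
    inputs = [mask_id if rng.random() < p else t for t in tokens[:-1]]
    targets = list(tokens[1:])
    return inputs, targets
```

Calling it again on the same sequence each epoch yields a different corruption pattern, which is what lets multi-epoch training see "new" views of the same data.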









@BetaTomorrow More precisely, aligning fixed points/attractors to represent the solutions by backpropagating from the supervision loss
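A deliberately tiny sketch of what "aligning a fixed point with the solution by backpropagating from the supervision loss" can look like, assuming a scalar, linear, weight-tied map z ← a·z + b·x with a fixed contraction factor a. The map, the names `fixed_point` and `train_readin`, and all constants are hypothetical; the gradient uses the implicit derivative of the fixed point, dz*/db = x / (1 − a), rather than anything stated in the thread.

```python
def fixed_point(a, b, x, n_iter=60):
    """Iterate the weight-tied map z <- a*z + b*x; for |a| < 1 it converges
    to the attractor z* = b*x / (1 - a)."""
    z = 0.0
    for _ in range(n_iter):
        z = a * z + b * x
    return z

def train_readin(data, a=0.5, lr=0.05, epochs=200):
    """Fit b so that the attractor z*(x) matches the supervised target y."""
    b = 0.0
    for _ in range(epochs):
        for x, y in data:
            z = fixed_point(a, b, x)
            dz_db = x / (1.0 - a)             # implicit derivative of the fixed point
            b -= lr * 2.0 * (z - y) * dz_db   # gradient step on the loss (z* - y)^2
    return b

# Toy supervision: the "solution" for input x is y = 3x.
data = [(1.0, 3.0), (0.5, 1.5), (-2.0, -6.0)]
b = train_readin(data)
```

After training, iterating the map from z = 0 on an unseen input settles at the fixed point that encodes the answer, because the loss moved the attractor itself rather than a single forward pass.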





🌀 Introducing Equilibrium Reasoners (EqR)! Feedforward models and weight-tied models behave very differently on hard reasoning generalization. EqR pushes this difference to the extreme by learning task-conditioned neural attractors.
• Sudoku-Extreme: 99.8%
• Maze: 93%
#ICML2026