Qingye Meng

211 posts

@hilbertmeng

NLP Researcher at ColorfulClouds Tech. | Mechanistic interpretability of LLMs | Transformer architecture

Beijing, People's Republic of China · Joined September 2021
497 Following · 40 Followers
Pinned Tweet
Qingye Meng@hilbertmeng·
1/n [ICML 2025 paper] Glad to share our latest work, MUDDFormer, a simple yet effective method to address the limitations of residual connections and enhance cross-layer information flow in Transformers, matching the performance of Transformer++ trained with ~2x compute. paper: arxiv.org/abs/2502.12170
Qingye Meng retweeted
fly51fly@fly51fly·
[CL] Weight Tying Biases Token Embeddings Towards the Output Space. A. Lopardo, A. Harish, C. Arnett, A. Gupta [EleutherAI & UC Berkeley] (2026). arxiv.org/abs/2603.26663
Qingye Meng retweeted
Da Xiao@xiaoda99·
Depthwise attention/recurrence is becoming a trend! After ByteDance's HC (ICLR'24), our MUDDFormer (ICML'25) & Google's DSA (ICML'25), more labs are joining: ByteDance's VWN, DeepSeek's mHC, MoonshotAI's AttnRes, etc. MUDDFormer's key design: input-dependent weights with multiway decoupling across Q/K/V/residual streams. Only +0.23% params, 1.8×–2.4× compute advantage. This is just the beginning. More fundamental architecture innovations to come. arxiv.org/abs/2502.12170 github.com/Caiyun-AI/MUDD…
Qingye Meng@hilbertmeng

Great to see depth-wise attention mechanisms like mHC and Attention Residuals (AttnRes) proving their scalability in large-scale models, and attracting more attention to this line of work, including DenseFormer, HC, DeepCrossAttention (DCA) and our MUDDFormer (ICML'25). … (full post below)

Qingye Meng@hilbertmeng·
@osieberling Decoupling the residual stream into 4 streams (Q/K/V/R) can further improve performance, as done in MUDDFormer (or DeepCrossAttention). Full AttnRes is roughly equivalent to DynamicDenseFormer (DDFormer). arxiv.org/abs/2502.12170
Oliver Sieberling@osieberling·
Attention Residuals doesn't look very promising anymore once you add a stronger baseline (DeepSeek's mHC) into the scaling law plot...
Qingye Meng@hilbertmeng·
Great to see depth-wise attention mechanisms like mHC and Attention Residuals (AttnRes) proving their scalability in large-scale models, and attracting more attention to this line of work, including DenseFormer, HC, DeepCrossAttention (DCA) and our MUDDFormer (ICML'25). We proposed multi-way dynamic dense connections along transformer layers to address the limitations of residual connections; our DynamicDenseFormer variant is similar to Kimi's Full AttnRes. I'd like to compare decoupling of residual streams, pipeline parallelism (PP), training stability, and the details of the depth attention weights.

1. Decoupled residual streams. In MUDDFormer, we decouple the residual stream into four ways/streams (Q/K/V/R), a strategy also explored in the concurrent DCA; it is effective but absent from recent practice. We were motivated by the different attribution circuits (e.g., Q-attribution and V-attribution) identified in mechanistic interpretability studies. Decoupled residual streams handle cross-layer information flow better. In mHC and AttnRes, depth-wise attention is applied before each Attention and FFN block, so they can be seen as a 2-stream residual.

2. Pipeline parallelism (PP). Efficiency is the primary bottleneck for dense cross-layer connections. Kimi addresses this via Block AttnRes, which reduces communication by attending to block-level summaries, while HC compresses the residual stream into hyper hidden states (typically 4x wider) to reduce communication. In DenseFormer/MUDDFormer, key-wise dilation of the dense connections is also a simple way to reduce PP overhead. If PP is not a strict requirement (e.g., in TPU-based pretraining), MUDDFormer already demonstrates strong performance, and query-wise dilation further provides an excellent balance between performance and efficiency.

3. Training stability & depth attention weights. To stabilize the residual mapping, mHC applies the Sinkhorn-Knopp algorithm, while MUDDFormer tackles training stability with PrePostNorm in deep models. In HC and AttnRes, the depth attention weights depend on the key-wise layer outputs, whereas MUDDFormer uses a small MLP to generate the weights from the query-wise hidden states.
Qingye Meng@hilbertmeng

1/n [ICML 2025 paper] Glad to share our latest work, MUDDFormer, a simple yet effective method to address the limitations of residual connections and enhance cross-layer information flow in Transformers, matching the performance of Transformer++ trained with ~2x compute. paper: arxiv.org/abs/2502.12170

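To make the comparison above more concrete, here is a minimal PyTorch sketch of multi-way dynamic dense connections as I read Qingye's description: the outputs of all previous layers are mixed into four decoupled streams (Q/K/V/R), with per-token mixing weights produced by a small MLP from the current hidden state. Class and parameter names (DynamicDenseAggregator, static_bias, the 64-unit MLP width) are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of multi-way dynamic dense connections (not the official MUDDFormer code).
import torch
import torch.nn as nn

class DynamicDenseAggregator(nn.Module):
    """Mixes the outputs of all previous layers into 4 decoupled streams (Q/K/V/R).

    The mixing weights are input-dependent: a small MLP maps the current hidden
    state to (num_streams * num_prev_layers) coefficients per token.
    """
    def __init__(self, d_model: int, layer_idx: int, num_streams: int = 4, d_hidden: int = 64):
        super().__init__()
        self.num_prev = layer_idx + 1          # outputs of layers 0..layer_idx (incl. embeddings)
        self.num_streams = num_streams
        self.weight_mlp = nn.Sequential(       # tiny MLP -> negligible extra parameters
            nn.Linear(d_model, d_hidden),
            nn.GELU(),
            nn.Linear(d_hidden, num_streams * self.num_prev),
        )
        # Static bias initialized so the module starts close to a plain residual stream.
        self.static_bias = nn.Parameter(torch.zeros(num_streams, self.num_prev))
        with torch.no_grad():
            self.static_bias[:, -1] = 1.0      # initially pass through only the latest layer output

    def forward(self, xs):
        # xs: list of per-layer outputs, each [batch, seq, d_model], len == num_prev
        stack = torch.stack(xs, dim=-2)                        # [B, T, L, D]
        cur = xs[-1]                                           # current hidden state
        dyn = self.weight_mlp(cur)                             # [B, T, S*L]
        dyn = dyn.view(*cur.shape[:-1], self.num_streams, self.num_prev)
        w = dyn + self.static_bias                             # dynamic + static weights, [B, T, S, L]
        mixed = torch.einsum("btsl,btld->btsd", w, stack)      # weighted sum over previous layers
        # One decoupled input per stream: Q, K, V and the residual branch R.
        return [mixed[..., s, :] for s in range(self.num_streams)]

# Usage: before block i, feed xs = [x0, ..., xi] and route the 4 outputs to the
# attention block's Q/K/V projections and to the residual addition, respectively.
agg = DynamicDenseAggregator(d_model=256, layer_idx=3)
xs = [torch.randn(2, 8, 256) for _ in range(4)]
q_in, k_in, v_in, r_in = agg(xs)
print(q_in.shape)  # torch.Size([2, 8, 256])
```

In a full model, the dilation mentioned above (key-wise or query-wise) would shrink the list of connected layers to only every k-th one, which is what cuts the pipeline-parallel communication cost.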
Qingye Meng@hilbertmeng·
@orvieto_antonio I updated the loss landscape image. Those spikes should point downward. This time it looks right and more natural.
Qingye Meng@hilbertmeng·
@orvieto_antonio I used Gemini to generate a loss landscape image reflecting the sharpness of the river and valley. Is this more aligned with what you had in mind?
Antonio Orvieto@orvieto_antonio·
Learning rate schedulers give us a lens for understanding the geometry of neural network landscapes. It turns out that large transformers and small CNNs are surprisingly similar, and pictorially resemble a river valley (arxiv.org/pdf/2410.05192). Thanks to Annalisa Belloni and @lorenzo_noci for the insights!
Qingye Meng@hilbertmeng·
Computation is more than intelligence. Intelligence may be a disguise of computation.
Qingye Meng retweeted
Lisan al Gaib@scaling01·
Another ByteDance Seed banger? They introduce the looped language models (LoopLMs) Ouro 1.4B and 2.6B, trained on 7.7T tokens, which match the evaluation results of larger 4B and 8B models, respectively. "Ouro" 1.4B is a standard decoder-only Transformer with 24 layers (the upcycled Ouro 2.6B has 48 layers), MHA, RoPE, SwiGLU and sandwich norm. This stack is repeatedly applied for T recurrent steps, avoiding the usual collapse of the latent state to token space and therefore enabling deeper latent multi-hop reasoning. Like test-time compute, this approach trades more forward passes for additional performance, with the added benefit that models are smaller and have more effective depth. Additionally, they add a learned exit gate to allow early exit on simpler inputs, improving the performance-cost trade-off.

Training pipeline: staged as warmup → big pretrain → CT-annealing → LongCT (push context to 64k) → mid-training, and then a small reasoning SFT pass to make the "Ouro-Thinking" variants. The 2.6B model is an upcycled continuation of the 1.4B run (stack doubled to 48 layers).

Benchmark results:
- On synthetic 3-hop QA tasks, looped models learn the task with fewer examples than non-looped, iso-parameter models.
- The looped architecture seems to help with safety: models get better at distinguishing benign from harmful prompts as the number of recurrent steps increases.
- They also demonstrate improved faithfulness of the reasoning, using linear probes to predict responses at the next recurrent step and observing low predictability.
- They claim: "this systematic disagreement across steps when i <= 4 is precisely what a faithful latent process should exhibit: the model is updating its decision as recurrence deepens, and intermediate predictions are not frozen rationalizations of the final output".

Some issues:
- They state: "A defining advantage of the LoopLM architecture lies in its capacity for adaptive computation allocation", but find that performance does not increase when scaling recurrence beyond the trained T=4 depth (Table 10). No extrapolation, which means more training is necessary.
- 4 recurrent steps mean 4x the FLOPs during inference, so an Ouro-1.4B with 4 recurrent steps would use more FLOPs than a Qwen3-4B, but less memory.
- In appendix D.1 they pose the question: "What is the performance gap between standard models and LoopLM?". They compare 5 model sizes (53M, 134M, 374M, 778M, and 1.36B) with recurrent depths 1, 2, 4, and 8, trained on 20B tokens. The standard Transformer at depth 2, 4 and 8 effectively has 2, 4 and 8 times more layers and ~parameters (untied). They find that the standard Transformer consistently outperforms LoopLM. Furthermore, LoopLM shows no performance increase with 8 recurrent steps; the 8-step 1.36B model is actually worse than the 778M model with 4 steps. As seen in Table 18, the performance gap shrinks as models get larger but grows with the number of steps/recurrences -> LoopLM is generally worse per-FLOP in compute-matched tests (untied depth wins), but it's strong per-parameter and under memory/KV constraints.
- Their RL stage did not yield significant performance gains over the final SFT checkpoint; they blame model saturation and infrastructure challenges.
- They had to lower the number of recurrent steps during training from 8 to 4 due to stability issues.

Other notes:
- Looping does not increase knowledge capacity nor improve capacity scaling.
- The KV cache can't be reused during prefill, but can be reused for decoding.
- Recurrent architectures require smaller learning rates than standard transformers of equivalent parameter count.
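For readers who want the looping idea in code, here is a minimal PyTorch sketch of the recipe summarized above: one weight-shared stack applied for up to T recurrent steps, with a learned exit gate allowing early exit on easier inputs. The module layout, gate placement, and threshold are assumptions for illustration; this is not the Ouro implementation.

```python
# Minimal sketch of a looped LM with a learned early-exit gate (illustrative, not Ouro's code).
import torch
import torch.nn as nn

class LoopedLM(nn.Module):
    def __init__(self, d_model=256, n_layers=4, n_heads=4, vocab=32000, max_steps=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=4 * d_model,
                                           batch_first=True, norm_first=True)
        self.shared_stack = nn.TransformerEncoder(layer, num_layers=n_layers)  # reused at every step
        self.exit_gate = nn.Linear(d_model, 1)   # learned signal for stopping after this step
        self.lm_head = nn.Linear(d_model, vocab)
        self.max_steps = max_steps               # T recurrent steps

    def forward(self, tokens, exit_threshold=0.5):
        h = self.embed(tokens)
        causal = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        for step in range(self.max_steps):
            h = self.shared_stack(h, mask=causal)             # same weights at each recurrence
            p_exit = torch.sigmoid(self.exit_gate(h)).mean()  # crude sequence-level gate for the sketch
            if p_exit > exit_threshold:                       # early exit on "easy" inputs
                break
        return self.lm_head(h)

model = LoopedLM()
logits = model(torch.randint(0, 32000, (1, 16)))
print(logits.shape)  # torch.Size([1, 16, 32000])
```

Note how the FLOPs concern in the issues list falls out directly: every extra recurrence is another full pass through shared_stack, while the parameter count stays fixed.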
Qingye Meng@hilbertmeng·
@XiaohuaZhai As the second plot shows, we can keep ~baseline quality with p_s=0.5. In contrast, with p=0.8, RINS shows too much improvement over the baseline, counter-intuitively even matching the performance of the best AAAB model (p_s=0, 2x inference cost). Could you give some explanation?
Xiaohua Zhai@XiaohuaZhai·
It's critical to maintain strong quality when test-time scaling is not needed. Our no-regret recipe adds lightweight recursion-specific linear projection layers (<1% of params) and trains with stochastic recursion. At test time you can:
> set recursion to 0 → keep ~baseline quality (no extra FLOPs)
> dial recursion up → gain quality (pay only test-time compute)
This keeps deployments flexible, especially great for on-device / small models!
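Below is a rough PyTorch sketch of how this no-regret recipe could look: the shared block is reapplied a stochastically sampled number of times during training, each extra recursion passing through its own lightweight linear projection, so recursion 0 is exactly the baseline path. All names and the sampling scheme are my assumptions, not the paper's code.

```python
# Illustrative sketch of stochastic recursion with recursion-specific projections
# (my reading of the "no-regret" recipe, not the authors' implementation).
import random
import torch
import torch.nn as nn

class NoRegretRecursiveBlock(nn.Module):
    def __init__(self, d_model=256, max_recursions=2):
        super().__init__()
        self.block = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                   nn.Linear(4 * d_model, d_model))   # stands in for the shared block
        # One lightweight projection per extra recursion (<1% of params in a real model).
        self.recursion_proj = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(max_recursions))
        self.max_recursions = max_recursions

    def forward(self, x, num_recursions=None):
        if num_recursions is None:
            # Stochastic recursion during training; default to the cheap path at eval time.
            num_recursions = random.randint(0, self.max_recursions) if self.training else 0
        h = x + self.block(x)                              # recursion 0 == plain baseline pass
        for r in range(num_recursions):                    # each extra pass reuses the same block,
            h = h + self.block(self.recursion_proj[r](h))  # adapted by its own tiny projection
        return h

block = NoRegretRecursiveBlock()
x = torch.randn(2, 8, 256)
print(block(x, num_recursions=0).shape, block(x, num_recursions=2).shape)
```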
Xiaohua Zhai@XiaohuaZhai·
Want test-time quality gains from a simple LLM architectural tweak? Try Recursive INference Scaling (RINS), our #NeurIPS2025 work done at GDM. RINS trains by splitting a block into A+B and reusing A r times with shared weights. Params and pre-training compute stay the same. At inference you get a compute knob:
> on → better quality (pay only extra FLOPs at test time)
> off → same strong baseline, no extra cost
Simple, drop-in, no-regret. Paper: arxiv.org/abs/2502.07503
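A minimal PyTorch sketch of the RINS structure as described above: the stack is split into parts A and B, and A is reapplied r times with shared weights before B runs, so the knob only changes test-time FLOPs. Layer types and sizes are illustrative assumptions, and the causal mask is omitted for brevity; see the paper for the actual recipe.

```python
# Illustrative sketch of the RINS A+B split with weight-shared reuse of A (not the official code).
import torch
import torch.nn as nn

def make_block(d_model=256, n_heads=4):
    return nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=4 * d_model,
                                      batch_first=True, norm_first=True)

class RINSStack(nn.Module):
    def __init__(self, d_model=256, n_layers_a=4, n_layers_b=4):
        super().__init__()
        self.part_a = nn.ModuleList(make_block(d_model) for _ in range(n_layers_a))  # reused part
        self.part_b = nn.ModuleList(make_block(d_model) for _ in range(n_layers_b))  # run once

    def forward(self, h, r=1):
        # r=1 recovers the plain A+B baseline; r>1 recursively re-applies A with shared weights.
        for _ in range(r):
            for layer in self.part_a:
                h = layer(h)
        for layer in self.part_b:
            h = layer(h)
        return h

stack = RINSStack()
h = torch.randn(2, 16, 256)
print(stack(h, r=1).shape, stack(h, r=3).shape)  # same parameters, more test-time compute
```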
Qingye Meng@hilbertmeng·
@ibomohsin @XiaohuaZhai Excellent work! To reproduce RINS, I trained two 150M LLaMA models (AB, AAAB) on the Pile dataset for 105B tokens, with a loss gap of 0.012, smaller than the ~0.04 reported in the paper. I also failed to reproduce the adapter due to unstable training. Can I DM you for further help?
Ibrahim Alabdulmohsin | إبراهيم العبدالمحسن
🔥Excited to introduce RINS - a technique that boosts model performance by recursively applying early layers during inference without increasing model size or training compute FLOPs! Not only does it significantly improve LMs, but also multimodal systems like SigLIP. (1/N)
Qingye Meng retweeted
Aaditya Singh@Aaditya6284·
Excited to present this work in Vancouver at #ICML2025 today 😀 Come by to hear about why in-context learning emerges and disappears: Talk: 10:30-10:45am, West Ballroom C Poster: 11am-1:30pm, East Exhibition Hall A-B # E-3409
Aaditya Singh@Aaditya6284

Transformers employ different strategies through training to minimize loss, but how do these tradeoff and why? Excited to share our newest work, where we show remarkably rich competitive and cooperative interactions (termed "coopetition") as a transformer learns. Read on 🔎⏬

Qingye Meng@hilbertmeng·
@GauravML Congratulations! We also concurrently proposed MUDDFormer, with dynamic and multi-way connections to previous layers. We hope enhanced cross-layer connections can be adopted in more architectures. arxiv.org/abs/2502.12170
Qingye Meng@hilbertmeng·
6/n This is another improvement over the Transformer, following our previous work DCFormer (arxiv.org/abs/2405.08553). More importantly, these two improvements are orthogonal and can be combined to enhance the Transformer together.