Qingye Meng

211 posts

@hilbertmeng

NLP Researcher at ColorfulClouds Tech. | Mechanistic interpretability of LLMs | Transformer architecture

Beijing, People's Republic of China · Joined September 2021
497 Following · 40 Followers
Pinned Tweet
Qingye Meng@hilbertmeng·
1/n [ICML 2025 paper] Glad to share our latest work, MUDDFormer, a simple yet effective method to address the limitations of residual connections and enhance cross-layer information flow in Transformers, matching the performance of Transformer++ trained with ~2x compute. paper: arxiv.org/abs/2502.12170
Qingye Meng retweeted
fly51fly@fly51fly·
[CL] Weight Tying Biases Token Embeddings Towards the Output Space. A. Lopardo, A. Harish, C. Arnett, A. Gupta [EleutherAI & UC Berkeley] (2026). arxiv.org/abs/2603.26663
Qingye Meng retweeted
Da Xiao@xiaoda99·
Depthwise attention/recurrence is becoming a trend! After ByteDance's HC (ICLR'24), our MUDDFormer (ICML'25) & Google's DSA (ICML'25), more labs are joining: ByteDance's VWN, DeepSeek's mHC, MoonshotAI's AttnRes, etc. MUDDFormer's key design: input-dependent weights with multiway decoupling across Q/K/V/residual streams. Only +0.23% params, 1.8×–2.4× compute advantage. This is just the beginning. More fundamental architecture innovations to come. arxiv.org/abs/2502.12170 github.com/Caiyun-AI/MUDD…
Qingye Meng@hilbertmeng

Great to see depth-wise attention mechanisms like mHC and Attention Residuals (AttnRes) proving their scalability in large-scale models, and attracting more attention to this line of work, including DenseFormer, HC, DeepCrossAttention (DCA) and our MUDDFormer (ICML'25). … (full post below)

Qingye Meng@hilbertmeng·
@osieberling Decoupling the residual stream into 4 streams (Q/K/V/R) can further improve performance, as done in MUDDFormer (or DeepCrossAttention). Full AttnRes is roughly equivalent to DynamicDenseFormer (DDFormer). arxiv.org/abs/2502.12170
Oliver Sieberling@osieberling·
Attention Residuals doesn't look very promising anymore once you add a stronger baseline (DeepSeek's mHC) into the scaling law plot...
Qingye Meng@hilbertmeng·
Great to see depth-wise attention mechanisms like mHC and Attention Residuals (AttnRes) proving their scalability in large-scale models, and attracting more attention to this line of work, including DenseFormer, HC, DeepCrossAttention (DCA) and our MUDDFormer (ICML'25). We proposed multi-way dynamic dense connections along transformer layers to address the limitations of residual connections; our DynamicDenseFormer variant is similar to Kimi's Full AttnRes. I'd like to compare decoupling of residual streams, pipeline parallelism (PP), training stability, and the details of the depth attention weights.

1. Decoupled residual streams. In MUDDFormer, we decouple the residual stream into four ways/streams (Q/K/V/R), a strategy also explored in the concurrent DCA; it is effective but absent from recent practice. We were motivated by the different attribution circuits (e.g., Q-attribution and V-attribution) identified in mechanistic interpretability studies. Decoupled residual streams handle cross-layer information flow better. In mHC and AttnRes, depth-wise attention is applied before each Attention and FFN block, so they can be seen as a 2-stream residual.

2. Pipeline parallelism (PP). Efficiency is the primary bottleneck for dense cross-layer connections. Kimi addresses this via Block AttnRes, which reduces communication by attending to block-level summaries, while HC compresses the residual stream into hyper hidden states (typically 4x wider) to reduce communication. In DenseFormer/MUDDFormer, key-wise dilation of the dense connections is also a simple way to reduce PP overhead. If PP is not a strict requirement (e.g., in TPU-based pretraining), MUDDFormer already demonstrates strong performance, and query-wise dilation further provides an excellent balance between performance and efficiency.

3. Training stability & depth attention weights. To stabilize the residual mapping, mHC applies the Sinkhorn-Knopp algorithm, while MUDDFormer tackles training stability with PrePostNorm in deep models. In HC and AttnRes, the depth attention weights depend on the key-wise layer outputs, whereas MUDDFormer uses a small MLP to generate the weights from the query-wise hidden states.
Qingye Meng@hilbertmeng

1/n [ICML 2025 paper] Glad to share our latest work, MUDDFormer, a simple yet effective method to address the limitations of residual connections and enhance cross-layer information flow in Transformers, matching the performance of Transformer++ trained with ~2x compute. paper: arxiv.org/abs/2502.12170

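To make the comparison above more concrete, here is a minimal PyTorch sketch of multi-way dynamic dense connections as I read Qingye's description: the outputs of all previous layers are mixed into four decoupled streams (Q/K/V/R), with per-token mixing weights produced by a small MLP from the current hidden state. Class and parameter names (DynamicDenseAggregator, static_bias, the 64-unit MLP width) are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of multi-way dynamic dense connections (not the official MUDDFormer code).
import torch
import torch.nn as nn

class DynamicDenseAggregator(nn.Module):
    """Mixes the outputs of all previous layers into 4 decoupled streams (Q/K/V/R).

    The mixing weights are input-dependent: a small MLP maps the current hidden
    state to (num_streams * num_prev_layers) coefficients per token.
    """
    def __init__(self, d_model: int, layer_idx: int, num_streams: int = 4, d_hidden: int = 64):
        super().__init__()
        self.num_prev = layer_idx + 1          # outputs of layers 0..layer_idx (incl. embeddings)
        self.num_streams = num_streams
        self.weight_mlp = nn.Sequential(       # tiny MLP -> negligible extra parameters
            nn.Linear(d_model, d_hidden),
            nn.GELU(),
            nn.Linear(d_hidden, num_streams * self.num_prev),
        )
        # Static bias initialized so the module starts close to a plain residual stream.
        self.static_bias = nn.Parameter(torch.zeros(num_streams, self.num_prev))
        with torch.no_grad():
            self.static_bias[:, -1] = 1.0      # initially pass through only the latest layer output

    def forward(self, xs):
        # xs: list of per-layer outputs, each [batch, seq, d_model], len == num_prev
        stack = torch.stack(xs, dim=-2)                        # [B, T, L, D]
        cur = xs[-1]                                           # current hidden state
        dyn = self.weight_mlp(cur)                             # [B, T, S*L]
        dyn = dyn.view(*cur.shape[:-1], self.num_streams, self.num_prev)
        w = dyn + self.static_bias                             # dynamic + static weights, [B, T, S, L]
        mixed = torch.einsum("btsl,btld->btsd", w, stack)      # weighted sum over previous layers
        # One decoupled input per stream: Q, K, V and the residual branch R.
        return [mixed[..., s, :] for s in range(self.num_streams)]

# Usage: before block i, feed xs = [x0, ..., xi] and route the 4 outputs to the
# attention block's Q/K/V projections and to the residual addition, respectively.
agg = DynamicDenseAggregator(d_model=256, layer_idx=3)
xs = [torch.randn(2, 8, 256) for _ in range(4)]
q_in, k_in, v_in, r_in = agg(xs)
print(q_in.shape)  # torch.Size([2, 8, 256])
```

In a full model, the dilation mentioned above (key-wise or query-wise) would shrink the list of connected layers to only every k-th one, which is what cuts the pipeline-parallel communication cost.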
Qingye Meng@hilbertmeng·
@orvieto_antonio I updated the loss landscape image. Those spikes should point downward. This time it looks right and more natural.
Qingye Meng@hilbertmeng·
@orvieto_antonio I used Gemini to generate a loss landscape image reflecting the sharpness of the river and valley. Is this more aligned with what you had in mind?
Antonio Orvieto@orvieto_antonio·
Learning rate schedulers give us a lens for understanding the geometry of neural network landscapes. It turns out that large transformers and small CNNs are surprisingly similar, and pictorially resemble a river valley (arxiv.org/pdf/2410.05192). Thanks to Annalisa Belloni and @lorenzo_noci for the insights!
Qingye Meng@hilbertmeng·
Computation is more than intelligence. Intelligence may be a disguise of computation.
Qingye Meng retweeted
Lisan al Gaib@scaling01·
Another ByteDance Seed banger? They introduce the looped language models (LoopLMs) Ouro 1.4B and 2.6B, trained on 7.7T tokens, which match the evaluation results of larger 4B and 8B models, respectively. "Ouro" 1.4B is a standard decoder-only Transformer with 24 layers (the upcycled Ouro 2.6B has 48 layers), MHA, RoPE, SwiGLU and sandwich norm. This stack is repeatedly applied for T recurrent steps, avoiding the usual collapse of the latent state to token space and therefore enabling deeper latent multi-hop reasoning. Like test-time compute, this approach trades more forward passes for additional performance, with the added benefit that models are smaller and have more effective depth. Additionally, they add a learned exit gate to allow early exit on simpler inputs, improving the performance-cost trade-off.

Training pipeline: staged as warmup → big pretrain → CT-annealing → LongCT (push context to 64k) → mid-training, and then a small reasoning SFT pass to make the "Ouro-Thinking" variants. The 2.6B model is an upcycled continuation of the 1.4B run (stack doubled to 48 layers).

Benchmark results:
- On synthetic 3-hop QA tasks, looped models learn the task with fewer examples than non-looped, iso-parameter models.
- The looped architecture seems to help with safety: models get better at distinguishing benign from harmful prompts as the number of recurrent steps increases.
- They also demonstrate improved faithfulness of the reasoning, using linear probes to predict responses at the next recurrent step and observing low predictability.
- They claim: "this systematic disagreement across steps when i <= 4 is precisely what a faithful latent process should exhibit: the model is updating its decision as recurrence deepens, and intermediate predictions are not frozen rationalizations of the final output".

Some issues:
- They state: "A defining advantage of the LoopLM architecture lies in its capacity for adaptive computation allocation", but find that performance does not increase when scaling recurrence beyond the trained T=4 depth (Table 10). No extrapolation, which means more training is necessary.
- 4 recurrent steps mean 4x the FLOPs during inference, so an Ouro-1.4B with 4 recurrent steps would use more FLOPs than a Qwen3-4B, but less memory.
- In appendix D.1 they pose the question: "What is the performance gap between standard models and LoopLM?". They compare 5 model sizes (53M, 134M, 374M, 778M, and 1.36B) with recurrent depths 1, 2, 4, and 8, trained on 20B tokens. The standard Transformer at depth 2, 4 and 8 effectively has 2, 4 and 8 times more layers and ~parameters (untied). They find that the standard Transformer consistently outperforms LoopLM. Furthermore, LoopLM shows no performance increase with 8 recurrent steps; the 8-step 1.36B model is actually worse than the 778M model with 4 steps. As seen in Table 18, the performance gap shrinks as models get larger but grows with the number of steps/recurrences -> LoopLM is generally worse per-FLOP in compute-matched tests (untied depth wins), but it's strong per-parameter and under memory/KV constraints.
- Their RL stage did not yield significant performance gains over the final SFT checkpoint; they blame model saturation and infrastructure challenges.
- They had to lower the number of recurrent steps during training from 8 to 4 due to stability issues.

Other notes:
- Looping does not increase knowledge capacity nor improve capacity scaling.
- The KV cache can't be reused during prefill, but can be reused for decoding.
- Recurrent architectures require smaller learning rates than standard transformers of equivalent parameter count.
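For readers who want the looping idea in code, here is a minimal PyTorch sketch of the recipe summarized above: one weight-shared stack applied for up to T recurrent steps, with a learned exit gate allowing early exit on easier inputs. The module layout, gate placement, and threshold are assumptions for illustration; this is not the Ouro implementation.

```python
# Minimal sketch of a looped LM with a learned early-exit gate (illustrative, not Ouro's code).
import torch
import torch.nn as nn

class LoopedLM(nn.Module):
    def __init__(self, d_model=256, n_layers=4, n_heads=4, vocab=32000, max_steps=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=4 * d_model,
                                           batch_first=True, norm_first=True)
        self.shared_stack = nn.TransformerEncoder(layer, num_layers=n_layers)  # reused at every step
        self.exit_gate = nn.Linear(d_model, 1)   # learned signal for stopping after this step
        self.lm_head = nn.Linear(d_model, vocab)
        self.max_steps = max_steps               # T recurrent steps

    def forward(self, tokens, exit_threshold=0.5):
        h = self.embed(tokens)
        causal = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        for step in range(self.max_steps):
            h = self.shared_stack(h, mask=causal)             # same weights at each recurrence
            p_exit = torch.sigmoid(self.exit_gate(h)).mean()  # crude sequence-level gate for the sketch
            if p_exit > exit_threshold:                       # early exit on "easy" inputs
                break
        return self.lm_head(h)

model = LoopedLM()
logits = model(torch.randint(0, 32000, (1, 16)))
print(logits.shape)  # torch.Size([1, 16, 32000])
```

Note how the FLOPs concern in the issues list falls out directly: every extra recurrence is another full pass through shared_stack, while the parameter count stays fixed.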
Qingye Meng@hilbertmeng·
@XiaohuaZhai As the second plot shows, we can keep ~baseline quality with p_s=0.5. In contrast, with p=0.8, RINS shows too much improvement over the baseline, counter-intuitively even matching the performance of the best AAAB model (p_s=0, 2x inference cost). Could you give some explanation?
Xiaohua Zhai@XiaohuaZhai·
It's critical to maintain strong quality when test-time scaling is not needed. Our no-regret recipe adds lightweight recursion-specific linear projection layers (<1% of params) and trains with stochastic recursion. At test time you can:
> set recursion to 0 → keep ~baseline quality (no extra FLOPs)
> dial recursion up → gain quality (pay only test-time compute)
This keeps deployments flexible, especially great for on-device / small models!
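Below is a rough PyTorch sketch of how this no-regret recipe could look: the shared block is reapplied a stochastically sampled number of times during training, each extra recursion passing through its own lightweight linear projection, so recursion 0 is exactly the baseline path. All names and the sampling scheme are my assumptions, not the paper's code.

```python
# Illustrative sketch of stochastic recursion with recursion-specific projections
# (my reading of the "no-regret" recipe, not the authors' implementation).
import random
import torch
import torch.nn as nn

class NoRegretRecursiveBlock(nn.Module):
    def __init__(self, d_model=256, max_recursions=2):
        super().__init__()
        self.block = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                   nn.Linear(4 * d_model, d_model))   # stands in for the shared block
        # One lightweight projection per extra recursion (<1% of params in a real model).
        self.recursion_proj = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(max_recursions))
        self.max_recursions = max_recursions

    def forward(self, x, num_recursions=None):
        if num_recursions is None:
            # Stochastic recursion during training; default to the cheap path at eval time.
            num_recursions = random.randint(0, self.max_recursions) if self.training else 0
        h = x + self.block(x)                              # recursion 0 == plain baseline pass
        for r in range(num_recursions):                    # each extra pass reuses the same block,
            h = h + self.block(self.recursion_proj[r](h))  # adapted by its own tiny projection
        return h

block = NoRegretRecursiveBlock()
x = torch.randn(2, 8, 256)
print(block(x, num_recursions=0).shape, block(x, num_recursions=2).shape)
```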
Xiaohua Zhai@XiaohuaZhai·
Want test-time quality gains from a simple LLM architectural tweak? Try Recursive INference Scaling (RINS), our #NeurIPS2025 work done at GDM. RINS trains by splitting a block into A+B and reusing A r times with shared weights. Params and pre-training compute stay the same. At inference you get a compute knob:
> on → better quality (pay only extra FLOPs at test time)
> off → same strong baseline, no extra cost
Simple, drop-in, no-regret. Paper: arxiv.org/abs/2502.07503
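A minimal PyTorch sketch of the RINS structure as described above: the stack is split into parts A and B, and A is reapplied r times with shared weights before B runs, so the knob only changes test-time FLOPs. Layer types and sizes are illustrative assumptions, and the causal mask is omitted for brevity; see the paper for the actual recipe.

```python
# Illustrative sketch of the RINS A+B split with weight-shared reuse of A (not the official code).
import torch
import torch.nn as nn

def make_block(d_model=256, n_heads=4):
    return nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=4 * d_model,
                                      batch_first=True, norm_first=True)

class RINSStack(nn.Module):
    def __init__(self, d_model=256, n_layers_a=4, n_layers_b=4):
        super().__init__()
        self.part_a = nn.ModuleList(make_block(d_model) for _ in range(n_layers_a))  # reused part
        self.part_b = nn.ModuleList(make_block(d_model) for _ in range(n_layers_b))  # run once

    def forward(self, h, r=1):
        # r=1 recovers the plain A+B baseline; r>1 recursively re-applies A with shared weights.
        for _ in range(r):
            for layer in self.part_a:
                h = layer(h)
        for layer in self.part_b:
            h = layer(h)
        return h

stack = RINSStack()
h = torch.randn(2, 16, 256)
print(stack(h, r=1).shape, stack(h, r=3).shape)  # same parameters, more test-time compute
```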
Qingye Meng@hilbertmeng·
@ibomohsin @XiaohuaZhai Excellent work! To reproduce RINS, I trained two 150M LLaMA models (AB, AAAB) on the Pile dataset for 105B tokens, with a loss gap of 0.012, smaller than the ~0.04 reported in the paper. I also failed to reproduce the adapter due to unstable training. Can I DM you for further help?
Ibrahim Alabdulmohsin | إبراهيم العبدالمحسن
🔥Excited to introduce RINS - a technique that boosts model performance by recursively applying early layers during inference without increasing model size or training compute FLOPs! Not only does it significantly improve LMs, but also multimodal systems like SigLIP. (1/N)
Qingye Meng retweeted
Aaditya Singh@Aaditya6284·
Excited to present this work in Vancouver at #ICML2025 today 😀 Come by to hear about why in-context learning emerges and disappears: Talk: 10:30-10:45am, West Ballroom C Poster: 11am-1:30pm, East Exhibition Hall A-B # E-3409
Aaditya Singh@Aaditya6284

Transformers employ different strategies through training to minimize loss, but how do these tradeoff and why? Excited to share our newest work, where we show remarkably rich competitive and cooperative interactions (termed "coopetition") as a transformer learns. Read on 🔎⏬

Qingye Meng@hilbertmeng·
@GauravML Congratulations! We also concurrently proposed MUDDFormer, with dynamic and multi-way connections to previous layers. We hope enhanced cross-layer connections can be adopted in more architectures. arxiv.org/abs/2502.12170
Qingye Meng@hilbertmeng·
6/n This is another improvement over the Transformer, following our previous work DCFormer (arxiv.org/abs/2405.08553). More importantly, these two improvements are orthogonal and can be combined to enhance the Transformer together.