Da Xiao

275 posts

@xiaoda99

Assoc. Prof. of BUPT | Chief Scientist of ColorfulClouds Tech. | Mechanistic interpretability and model architecture of LLMs

Beijing, China · Joined February 2016
287 Following · 19 Followers
Da Xiao
Da Xiao@xiaoda99·
Depthwise attention/recurrence is becoming a trend! After ByteDance's HC (ICLR'24), our MUDDFormer (ICML'25) & Google's DSA (ICML'25), more labs are joining: ByteDance's VWN, DeepSeek's mHC, MoonshotAI's AttnRes, etc. MUDDFormer's key design: input-dependent weights with multiway decoupling across Q/K/V/residual streams. Only +0.23% params, 1.8×–2.4× compute advantage. This is just the beginning. More fundamental architecture innovations to come. arxiv.org/abs/2502.12170 github.com/Caiyun-AI/MUDD…
Qingye Meng@hilbertmeng

Great to see depth-wise attention mechanisms like mHC and Attention Residuals (AttnRes) proving their scalability in large-scale models and attracting more attention to this line of work, including DenseFormer, HC, DeepCrossAttention (DCA) and our MUDDFormer (ICML'25). We proposed multi-way dynamic dense connections along transformer layers to address the limitations of residual connections; our DynamicDenseFormer is similar to Kimi's Full AttnRes. I'd like to compare the decoupling of residual streams, pipeline parallelism (PP), training stability, and the details of the depth attention weights.

1. Decoupled residual streams. In MUDDFormer, we decouple the residual stream into 4 ways/streams (Q, K, V, R), a strategy also explored in the concurrent DCA; it is effective but absent from recent practice. We are motivated by the different attribution circuits (e.g., Q-attribution and V-attribution) found in mechanistic interpretability studies. Decoupled residual streams handle cross-layer information flow better. In mHC and AttnRes, depth-wise attention is applied before each Attention and FFN block, so they can be seen as a 2-stream residual.

2. Pipeline parallelism (PP). Efficiency is the primary bottleneck for dense cross-layer connections. Kimi addresses this via Block AttnRes, which reduces communication by attending to block-level summaries, while HC compresses the residual stream into hyper hidden states (typically 4× as wide) to reduce communication. In DenseFormer/MUDDFormer, key-wise dilation of the dense connections is also a simple way to reduce PP overhead. If PP is not a strict requirement (e.g., in TPU-based pretraining), MUDDFormer already demonstrates strong performance, and query-wise dilation further provides an excellent balance between performance and efficiency.

3. Training stability and depth attention weights. To stabilize the residual mapping, mHC applies the Sinkhorn-Knopp algorithm, while MUDDFormer tackles training stability with PrePostNorm in deep models. In HC and AttnRes, the depth attention weights depend on the key-wise layer outputs, whereas MUDDFormer uses a small MLP to generate the weights from the query-wise hidden states.
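For readers new to this line of work, here is a minimal, hypothetical sketch of multi-way dynamic dense connections of the kind described above: the input to each block is a weighted mix of all previous layers' outputs, with a separate set of depth weights for the Q, K, V and residual (R) streams, each generated by a small MLP from the current (query-wise) hidden state. Module and variable names are illustrative assumptions, not the MUDDFormer code.

```python
import torch
import torch.nn as nn

class MultiwayDynamicDense(nn.Module):
    """Toy sketch of multi-way dynamic dense connections across layers.

    For each of the four streams (Q, K, V, R), a small MLP maps the current
    hidden state to one weight per previous layer; the stream's input is the
    corresponding weighted sum of all previous layers' outputs."""

    STREAMS = ("q", "k", "v", "r")

    def __init__(self, d_model: int, n_prev: int):
        super().__init__()
        self.weight_mlps = nn.ModuleDict({
            s: nn.Sequential(
                nn.Linear(d_model, d_model // 4),
                nn.GELU(),
                nn.Linear(d_model // 4, n_prev),
            )
            for s in self.STREAMS
        })

    def forward(self, prev_outputs):
        # prev_outputs: outputs of the embedding and all previous blocks,
        # each of shape (batch, seq, d_model); the last one drives the weights.
        x = prev_outputs[-1]
        stacked = torch.stack(prev_outputs, dim=-1)        # (batch, seq, d_model, n_prev)
        views = {}
        for s, mlp in self.weight_mlps.items():
            w = mlp(x)                                     # (batch, seq, n_prev), input-dependent
            views[s] = torch.einsum("bsdn,bsn->bsd", stacked, w)
        return views  # q/k/v views feed the attention block, r feeds the residual path

# toy usage: 4 previous layer outputs, d_model = 512
dense = MultiwayDynamicDense(d_model=512, n_prev=4)
hs = [torch.randn(2, 16, 512) for _ in range(4)]
views = dense(hs)
print(views["q"].shape)  # torch.Size([2, 16, 512])
```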

Da Xiao
Da Xiao@xiaoda99·
@TrelisResearch Where is the task embedding used in VARC? I can't find it in the ARChitects model part. Maybe you mean the task embedding in TRM?
Trelis Research
Trelis Research@TrelisResearch·
OPEN: Trelis Research Collaborations

I'm kicking off some short collabs on specific research projects where:
- I'll provide compute
- Projects are open-sourced

Initial projects:
- Nanochat but with a recursive transformer.
- VARC but dropping task embeddings.
- Gather/publish an NVARC-style dataset for ARC AGI II.
- Generate ARC tasks using ASAL from Sakana.

If interested, send me a DM mentioning the specific project and links to previous work you have done.
Da Xiao
Da Xiao@xiaoda99·
@teortaxesTex what tools are you using to get these parallel listing results?
Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)
Qwen is the most stereotypically Chinese Chinese model series in good and in bad ways. They don't make risky bets. They watch what worked elsewhere and implement it with extreme effort. «Scaling model and data». «Cleaning». «RL». High IQ. No magic or uncanny taste at all.
Da Xiao
Da Xiao@xiaoda99·
@eliebakouch you can give our DCMHA (arXiv:2405.08553) a try; it's a stronger version of Noam's talking-heads attn (see Table 9). Note that torch.compile is needed for efficient training.
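For context, talking-heads attention (Shazeer et al.) mixes information across heads with learned projections applied to the attention logits before softmax and to the attention weights after softmax; roughly speaking, DCMHA makes that cross-head composition input-dependent. Below is a minimal sketch of the static version only, with illustrative shapes and names, not the DCMHA implementation.

```python
import torch
import torch.nn.functional as F

def talking_heads_attention(q, k, v, pre_mix, post_mix):
    """q, k, v: (batch, heads, seq, head_dim); pre_mix, post_mix: (heads, heads).

    Static cross-head mixing of attention logits (before softmax) and of
    attention weights (after softmax); identity mixing matrices recover
    vanilla multi-head attention."""
    scale = q.shape[-1] ** -0.5
    logits = torch.einsum("bhqd,bhkd->bhqk", q, k) * scale
    logits = torch.einsum("hg,bhqk->bgqk", pre_mix, logits)   # mix logits across heads
    probs = F.softmax(logits, dim=-1)
    probs = torch.einsum("hg,bhqk->bgqk", post_mix, probs)    # mix weights across heads
    return torch.einsum("bhqk,bhkd->bhqd", probs, v)

# toy usage with 4 heads and near-identity mixing
h = 4
q = k = v = torch.randn(2, h, 16, 32)
mix = torch.eye(h) + 0.01 * torch.randn(h, h)
out = talking_heads_attention(q, k, v, mix, mix)  # (2, 4, 16, 32)
```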
Da Xiao retweeted
Qingye Meng
Qingye Meng@hilbertmeng·
@GauravML Congratulations! We also concurrently propose MUDDFormer, with dynamic and multi-way connections to previous layers. Hope enhanced cross-layer connections can be adopted in more architectures. arxiv.org/abs/2502.12170
Da Xiao
Da Xiao@xiaoda99·
@SonglinYang4 @zhxlin Great work! Then how does PaTH compare to DeltaNet, and how does PaTH-Fox compare to Gated DeltaNet? Does softmax increase the KV cache while bringing any benefits over linear attention?
Songlin Yang
Songlin Yang@SonglinYang4·
📢 (1/16) Introducing PaTH 🛣️ — a RoPE-free contextualized position encoding scheme, built for stronger state tracking, better extrapolation, and hardware-efficient training. PaTH outperforms RoPE across short and long language modeling benchmarks arxiv.org/abs/2505.16381
Da Xiao retweeted
Cristian Garcia
Cristian Garcia@cgarciae88·
The JAX team just released this amazing book on how to scale LLMs. It contains 11 chapters in total, and it goes into very low-level analysis of what LLMs are doing at the hardware level and how to reason about performance in these complex distributed systems.
Jacob Austin@jacobaustin132

Making LLMs run efficiently can feel scary, but scaling isn’t magic, it’s math! We wanted to demystify the “systems view” of LLMs and wrote a little textbook called “How To Scale Your Model” which we’re releasing today. 1/n

Da Xiao retweeted
Petar Veličković
Petar Veličković@PetarV_93·
I make all the graphics in my papers (e.g. the one below) using TikZ. I'm pathologically bad at drawing, so programming my own figures is basically the only way to stay sane. I open-source all of my figures' code here: github.com/PetarV-/TikZ Feel free to re-use (w/ credit :) )
Da Xiao
Da Xiao@xiaoda99·
7) More results: We train DCPythia-2.8B/6.9B w/ the same HPs as the counterpart Pythia models on 300B tokens from the Pile. DCPythia models significantly outperform Pythia models on downstream eval. DCPythia-6.9B is better than Pythia-12B. Model size ↗️, improvement ↗️, overheads ↘️
Da Xiao
Da Xiao@xiaoda99·
Excited to share our work on Dynamically Composable Multi-Head Attention (DCMHA), a drop-in replacement for MHA in any transformer arch. DCFormer matches the performance of a Transformer with ~1.7–2× the compute across different architectures and model scales. Accepted at ICML 2024 (oral) 🪜
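As a hedged illustration of the "dynamically composable" idea (simplified well beyond the paper's actual formulation), one can generate the cross-head mixing matrix per position from the query instead of learning it as a fixed parameter, so how heads are composed depends on the input. The names and the exact parameterization below are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class DynamicHeadCompose(nn.Module):
    """Toy input-dependent cross-head composition of attention weights.

    A linear map from the per-position query vector produces an (H x H)
    mixing matrix for that position, which is applied to the stack of
    per-head attention weights; a simplified stand-in for DCMHA's compose step."""

    def __init__(self, n_heads: int, head_dim: int):
        super().__init__()
        self.gen = nn.Linear(n_heads * head_dim, n_heads * n_heads)

    def forward(self, q, probs):
        # q: (batch, heads, q_len, head_dim); probs: (batch, heads, q_len, k_len)
        b, h, t, d = q.shape
        q_flat = q.permute(0, 2, 1, 3).reshape(b, t, h * d)   # per-position query
        mix = self.gen(q_flat).view(b, t, h, h)               # input-dependent mixing matrix
        mix = mix + torch.eye(h, device=q.device)             # bias toward identity, so vanilla MHA is the starting point
        return torch.einsum("btgh,bhtk->bgtk", mix, probs)    # compose heads per position
```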
Da Xiao retweeted
Swaroop Mishra
Swaroop Mishra@Swarooprm7·
Summary of expert recommendations for #ICLR2024 and #ICML2024 attendees:
1. Innsbruck for the mountains, the scenery and the hikes.
2. "Königssee" and "Salzburg"
3. Walk the length of the Ringstrasse, see the Karlskirche, go to Cafe Central. Melker Stiftskeller for dinner.
4. visitingvienna.com/songsfilms/bef…
5. Hallstatt
6. Gosausee and Grossglockner High Alpine Road
7. The Messe Wien Exhibition center
8. Natural History Museum (NHM)
9. Show at the State Opera
Thanks to all the experts who recommended 😍.
Swaroop Mishra@Swarooprm7

Excited to be attending #ICLR2024 in stunning Vienna next week! 🇦🇹 Any recommendations for must-see places in Austria? Also DM me if you'd like to connect!

Da Xiao retweeted
Piotr Nawrot
Piotr Nawrot@p_nawrot·
Two free medium-compute Mixture-of-Experts research ideas:

Prerequisite: Mixtral 8x7B is 32 layers, at each layer there are 8 experts, and each token is assigned to 2 experts at a given layer.

1) Dynamic Expert Assignment in MoE Models
Every token is assigned to 2*32=64 experts in total across all layers. Can we relax how we distribute the number of experts assigned to a given token at a given layer, similarly to what Mixture-of-Depths did? Can we e.g. allow more experts per token in later layers than in early layers, while keeping the total number of experts per token (=64) fixed? Can we learn how to condition this on the token itself? Do we need to keep the total number of experts per token fixed?
- To keep the inference time constant for a given query it should be enough to keep the average number of experts per token fixed, but we can relax it across the tokens. For example, given a sequence of 100 tokens, we have 100 * 64 expert assignments - what's the optimal way to distribute this compute budget across tokens and model depth?
Ideally, to teach the model how to do it while aiming to minimise the compute needed, I would start from a pre-trained MoE LLM and do short (~% of the original pre-training) fine-tuning. Inference is what matters in the end anyway.
References: arxiv.org/pdf/2202.13914, arxiv.org/abs/2404.02258

2) Retrofitting MoE LLMs to Share Experts Across Layers
There are 32 * 8 = 256 experts in Mixtral 8x7B. Could we concatenate all experts from all layers to create one gigantic MoE layer with 256 experts and replace every MoE layer in the original model with this gigantic one (share it so that the parameter count is not affected)? Similarly to 1), to teach the model how to make use of the extra choices at each layer, we would do short retrofitting. We don't want to crash the model with our layer swap, so at every layer we would initialise the routing function to behave similarly to how it was before the conversion (bias the routing function towards the experts that the layer had access to before).
Possible benefits: a) Parameter efficiency: once we retrofit the routing functions, some of the experts might be shared across layers and some would become obsolete and could easily be pruned. b) Improved accuracy: having access to more experts at each layer could result in a performance boost while the parameter count and the number of activated parameters during inference are kept constant.
Reference: arxiv.org/abs/2107.11817
Reference for both ideas about how to retrofit a pre-trained model into a more efficient variant: arxiv.org/abs/2403.09636 // concept stolen from @jxmnop
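To make idea 1 concrete, here is a toy, hedged sketch: keep the same total budget of 64 expert assignments per token, but instead of a fixed top-2 at every layer, pick the token's 64 highest-scoring (layer, expert) pairs globally, so the per-layer count can vary with depth. The function name and the offline view of all layers' router logits are illustrative assumptions; a real model would have to make this decision online, layer by layer, or learn it during the proposed fine-tuning.

```python
import torch

def budgeted_expert_assignment(router_logits, budget=64):
    """router_logits: (n_layers, n_experts) routing scores for one token.

    Mixtral-style routing fixes top-2 experts per layer (2 * 32 = 64 total).
    This toy variant keeps the total budget of 64 but lets the per-layer
    expert count vary: globally select the `budget` best (layer, expert) pairs."""
    n_layers, n_experts = router_logits.shape
    flat = router_logits.flatten()                 # (n_layers * n_experts,)
    chosen = flat.topk(budget).indices
    mask = torch.zeros_like(flat, dtype=torch.bool)
    mask[chosen] = True
    mask = mask.view(n_layers, n_experts)          # which experts fire at each layer
    per_layer_count = mask.sum(dim=-1)             # varies across depth, sums to `budget`
    return mask, per_layer_count

# toy usage with Mixtral-like shapes: 32 layers x 8 experts
logits = torch.randn(32, 8)
mask, per_layer_count = budgeted_expert_assignment(logits)
assert per_layer_count.sum().item() == 64          # same total compute as top-2 at every layer
```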
Da Xiao retweeted
Daniel Johnson
Daniel Johnson@_ddjohnson·
Excited to share Penzai, a JAX research toolkit from @GoogleDeepMind for building, editing, and visualizing neural networks! Penzai makes it easy to see model internals and lets you inject custom logic anywhere. Check it out on GitHub: github.com/google-deepmin…
Da Xiao retweeted
Nora Belrose
Nora Belrose@norabelrose·
RNN language models are making a comeback recently, with new architectures like Mamba and RWKV. But do interpretability tools designed for transformers transfer to the new RNNs? We tested 3 popular interp methods, and find the answer is mostly “yes”! arxiv.org/abs/2404.05971