Da Xiao

275 posts

@xiaoda99

Assoc. Prof. of BUPT | Chief Scientist of ColorfulClouds Tech. | Mechanistic interpretability and model architecture of LLMs

Beijing, China · Joined February 2016
287 Following · 19 Followers
Da Xiao
Da Xiao@xiaoda99·
Depthwise attention/recurrence is becoming a trend! After ByteDance's HC (ICLR'24), our MUDDFormer (ICML'25) & Google's DSA (ICML'25), more labs are joining: ByteDance's VWN, DeepSeek's mHC, MoonshotAI's AttnRes, etc. MUDDFormer's key design: input-dependent weights with multiway decoupling across Q/K/V/residual streams. Only +0.23% params, 1.8×–2.4× compute advantage. This is just the beginning. More fundamental architecture innovations to come. arxiv.org/abs/2502.12170 github.com/Caiyun-AI/MUDD…
Qingye Meng@hilbertmeng

Great to see depth-wise attention mechanisms like mHC and Attention Residuals (AttnRes) proving their scalability in large-scale models and attracting more attention to this line of work, including DenseFormer, HC, DeepCrossAttention (DCA) and our MUDDFormer (ICML'25). We proposed multi-way dynamic dense connections along transformer layers to address the limitations of residual connections; our DynamicDenseFormer is similar to Kimi's Full AttnRes. I'd like to compare the decoupling of residual streams, pipeline parallelism (PP), training stability, and the details of the depth attention weights.

1. Decoupled residual streams. In MUDDFormer, we decouple the residual stream into 4 ways/streams (Q, K, V, R), a strategy also explored in the concurrent DCA; it is effective but absent from recent practice. We are motivated by the different attribution circuits (e.g., Q-attribution and V-attribution) found in mechanistic interpretability studies. Decoupled residual streams handle cross-layer information flow better. In mHC and AttnRes, depth-wise attention is applied before each Attention and FFN block, so they can be seen as a 2-stream residual.

2. Pipeline parallelism (PP). Efficiency is the primary bottleneck for dense cross-layer connections. Kimi addresses this via Block AttnRes, which reduces communication by attending to block-level summaries, while HC compresses the residual stream into hyper hidden states (typically 4× as wide) to reduce communication. In DenseFormer/MUDDFormer, key-wise dilation of the dense connections is also a simple way to reduce PP overhead. If PP is not a strict requirement (e.g., in TPU-based pretraining), MUDDFormer already demonstrates strong performance, and query-wise dilation further provides an excellent balance between performance and efficiency.

3. Training stability and depth attention weights. To stabilize the residual mapping, mHC applies the Sinkhorn-Knopp algorithm, while MUDDFormer tackles training stability with PrePostNorm in deep models. In HC and AttnRes, the depth attention weights depend on the key-wise layer outputs, whereas MUDDFormer uses a small MLP to generate the weights from the query-wise hidden states.
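For readers new to this line of work, here is a minimal, hypothetical sketch of multi-way dynamic dense connections of the kind described above: the input to each block is a weighted mix of all previous layers' outputs, with a separate set of depth weights for the Q, K, V and residual (R) streams, each generated by a small MLP from the current (query-wise) hidden state. Module and variable names are illustrative assumptions, not the MUDDFormer code.

```python
import torch
import torch.nn as nn

class MultiwayDynamicDense(nn.Module):
    """Toy sketch of multi-way dynamic dense connections across layers.

    For each of the four streams (Q, K, V, R), a small MLP maps the current
    hidden state to one weight per previous layer; the stream's input is the
    corresponding weighted sum of all previous layers' outputs."""

    STREAMS = ("q", "k", "v", "r")

    def __init__(self, d_model: int, n_prev: int):
        super().__init__()
        self.weight_mlps = nn.ModuleDict({
            s: nn.Sequential(
                nn.Linear(d_model, d_model // 4),
                nn.GELU(),
                nn.Linear(d_model // 4, n_prev),
            )
            for s in self.STREAMS
        })

    def forward(self, prev_outputs):
        # prev_outputs: outputs of the embedding and all previous blocks,
        # each of shape (batch, seq, d_model); the last one drives the weights.
        x = prev_outputs[-1]
        stacked = torch.stack(prev_outputs, dim=-1)        # (batch, seq, d_model, n_prev)
        views = {}
        for s, mlp in self.weight_mlps.items():
            w = mlp(x)                                     # (batch, seq, n_prev), input-dependent
            views[s] = torch.einsum("bsdn,bsn->bsd", stacked, w)
        return views  # q/k/v views feed the attention block, r feeds the residual path

# toy usage: 4 previous layer outputs, d_model = 512
dense = MultiwayDynamicDense(d_model=512, n_prev=4)
hs = [torch.randn(2, 16, 512) for _ in range(4)]
views = dense(hs)
print(views["q"].shape)  # torch.Size([2, 16, 512])
```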

Da Xiao
Da Xiao@xiaoda99·
@TrelisResearch Where is the task embedding used in VARC? I can't find it in the ARChitects model part. Maybe you mean the task embedding in TRM?
Trelis Research
Trelis Research@TrelisResearch·
OPEN: Trelis Research Collaborations

I'm kicking off some short collabs on specific research projects where:
- I'll provide compute
- Projects are open-sourced

Initial projects:
- Nanochat but with a recursive transformer.
- VARC but dropping task embeddings.
- Gather/publish an NVARC-style dataset for ARC AGI II.
- Generate ARC tasks using ASAL from Sakana.

If interested, send me a DM mentioning the specific project and links to previous work you have done.
Da Xiao
Da Xiao@xiaoda99·
@teortaxesTex what tools are you using to get these parallel listing results?
Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)
Qwen is the most stereotypically Chinese Chinese model series in good and in bad ways. They don't make risky bets. They watch what worked elsewhere and implement it with extreme effort. «Scaling model and data». «Cleaning». «RL». High IQ. No magic or uncanny taste at all.
Da Xiao
Da Xiao@xiaoda99·
@eliebakouch you can give our DCMHA (arXiv:2405.08553) a try; it's a stronger version of Noam's talking-heads attn (see Table 9). Note that torch.compile is needed for efficient training.
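For context, talking-heads attention (Shazeer et al.) mixes information across heads with learned projections applied to the attention logits before softmax and to the attention weights after softmax; roughly speaking, DCMHA makes that cross-head composition input-dependent. Below is a minimal sketch of the static version only, with illustrative shapes and names, not the DCMHA implementation.

```python
import torch
import torch.nn.functional as F

def talking_heads_attention(q, k, v, pre_mix, post_mix):
    """q, k, v: (batch, heads, seq, head_dim); pre_mix, post_mix: (heads, heads).

    Static cross-head mixing of attention logits (before softmax) and of
    attention weights (after softmax); identity mixing matrices recover
    vanilla multi-head attention."""
    scale = q.shape[-1] ** -0.5
    logits = torch.einsum("bhqd,bhkd->bhqk", q, k) * scale
    logits = torch.einsum("hg,bhqk->bgqk", pre_mix, logits)   # mix logits across heads
    probs = F.softmax(logits, dim=-1)
    probs = torch.einsum("hg,bhqk->bgqk", post_mix, probs)    # mix weights across heads
    return torch.einsum("bhqk,bhkd->bhqd", probs, v)

# toy usage with 4 heads and near-identity mixing
h = 4
q = k = v = torch.randn(2, h, 16, 32)
mix = torch.eye(h) + 0.01 * torch.randn(h, h)
out = talking_heads_attention(q, k, v, mix, mix)  # (2, 4, 16, 32)
```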
Da Xiao retweeted
Qingye Meng
Qingye Meng@hilbertmeng·
@GauravML Congratulations! We also concurrently propose MUDDFormer, with dynamic and multi-way connections to previous layers. Hope enhanced cross-layer connections can be adopted in more architectures. arxiv.org/abs/2502.12170
Da Xiao
Da Xiao@xiaoda99·
@SonglinYang4 @zhxlin Great work! Then how does PaTH compare to DeltaNet, and how does PaTH-Fox compare to Gated DeltaNet? Does softmax increase the KV cache while bringing any benefits over linear attention?
Songlin Yang
Songlin Yang@SonglinYang4·
📢 (1/16) Introducing PaTH 🛣️ — a RoPE-free contextualized position encoding scheme, built for stronger state tracking, better extrapolation, and hardware-efficient training. PaTH outperforms RoPE across short and long language modeling benchmarks arxiv.org/abs/2505.16381
Da Xiao retweeted
Cristian Garcia
Cristian Garcia@cgarciae88·
The JAX team just released this amazing book on how to scale LLMs. It contains 11 chapters in total, and it goes into very low-level analysis of what LLMs are doing at the hardware level and how to reason about performance in these complex distributed systems.
Jacob Austin@jacobaustin132

Making LLMs run efficiently can feel scary, but scaling isn’t magic, it’s math! We wanted to demystify the “systems view” of LLMs and wrote a little textbook called “How To Scale Your Model” which we’re releasing today. 1/n

Da Xiao retweeted
Petar Veličković
Petar Veličković@PetarV_93·
I make all the graphics in my papers (e.g. the one below) using TikZ. I'm pathologically bad at drawing, so programming my own figures is basically the only way to stay sane. I open-source all of my figures' code here: github.com/PetarV-/TikZ Feel free to re-use (w/ credit :) )
Da Xiao
Da Xiao@xiaoda99·
7) More results: We train DCPythia-2.8B/6.9B w/ the same HPs as the counterpart Pythia models on 300B tokens from the Pile. DCPythia models significantly outperform Pythia models on downstream eval. DCPythia-6.9B is better than Pythia-12B. Model size ↗️, improvement ↗️, overheads ↘️
Da Xiao
Da Xiao@xiaoda99·
Excited to share our work on Dynamically Composable Multi-Head Attention (DCMHA), a drop-in replacement for MHA in any transformer arch. DCFormer matches the performance of a Transformer with ~1.7–2× the compute across different architectures and model scales. Accepted at ICML 2024 (oral) 🪜
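As a hedged illustration of the "dynamically composable" idea (simplified well beyond the paper's actual formulation), one can generate the cross-head mixing matrix per position from the query instead of learning it as a fixed parameter, so how heads are composed depends on the input. The names and the exact parameterization below are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class DynamicHeadCompose(nn.Module):
    """Toy input-dependent cross-head composition of attention weights.

    A linear map from the per-position query vector produces an (H x H)
    mixing matrix for that position, which is applied to the stack of
    per-head attention weights; a simplified stand-in for DCMHA's compose step."""

    def __init__(self, n_heads: int, head_dim: int):
        super().__init__()
        self.gen = nn.Linear(n_heads * head_dim, n_heads * n_heads)

    def forward(self, q, probs):
        # q: (batch, heads, q_len, head_dim); probs: (batch, heads, q_len, k_len)
        b, h, t, d = q.shape
        q_flat = q.permute(0, 2, 1, 3).reshape(b, t, h * d)   # per-position query
        mix = self.gen(q_flat).view(b, t, h, h)               # input-dependent mixing matrix
        mix = mix + torch.eye(h, device=q.device)             # bias toward identity, so vanilla MHA is the starting point
        return torch.einsum("btgh,bhtk->bgtk", mix, probs)    # compose heads per position
```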
Da Xiao retweeted
Swaroop Mishra
Swaroop Mishra@Swarooprm7·
Summary of expert recommendations for #ICLR2024 and #ICML2024 attendees:
1. Innsbruck for the mountains, the scenery and the hikes.
2. "Königssee" and "Salzburg"
3. Walk the length of the Ringstrasse, see the Karlskirche, go to Cafe Central. Melker Stiftskeller for dinner.
4. visitingvienna.com/songsfilms/bef…
5. Hallstatt
6. Gosausee and Grossglockner High Alpine Road
7. The Messe Wien Exhibition center
8. Natural History Museum (NHM)
9. Show at the State Opera
Thanks to all the experts who recommended 😍.
Swaroop Mishra@Swarooprm7

Excited to be attending #ICLR2024 in stunning Vienna next week! 🇦🇹 Any recommendations for must-see places in Austria? Also DM me if you'd like to connect!

Da Xiao retweeted
Piotr Nawrot
Piotr Nawrot@p_nawrot·
Two free medium-compute Mixture-of-Experts research ideas:

Prerequisite: Mixtral 8x7B is 32 layers, at each layer there are 8 experts, and each token is assigned to 2 experts at a given layer.

1) Dynamic Expert Assignment in MoE Models
Every token is assigned to 2*32=64 experts in total across all layers. Can we relax how we distribute the number of experts assigned to a given token at a given layer, similarly to what Mixture-of-Depths did? Can we e.g. allow more experts per token in later layers than in early layers, while keeping the total number of experts per token (=64) fixed? Can we learn how to condition this on the token itself? Do we need to keep the total number of experts per token fixed?
- To keep the inference time constant for a given query it should be enough to keep the average number of experts per token fixed, but we can relax it across the tokens. For example, given a sequence of 100 tokens, we have 100 * 64 expert assignments - what's the optimal way to distribute this compute budget across tokens and model depth?
Ideally, to teach the model how to do it while aiming to minimise the compute needed, I would start from a pre-trained MoE LLM and do short (~% of the original pre-training) fine-tuning. Inference is what matters in the end anyway.
References: arxiv.org/pdf/2202.13914, arxiv.org/abs/2404.02258

2) Retrofitting MoE LLMs to Share Experts Across Layers
There are 32 * 8 = 256 experts in Mixtral 8x7B. Could we concatenate all experts from all layers to create one gigantic MoE layer with 256 experts and replace every MoE layer in the original model with this gigantic one (share it so that the parameter count is not affected)? Similarly to 1), to teach the model how to make use of the extra choices at each layer, we would do short retrofitting. We don't want to crash the model with our layer swap, so at every layer we would initialise the routing function to behave similarly to how it was before the conversion (bias the routing function towards the experts that the layer had access to before).
Possible benefits: a) Parameter efficiency: once we retrofit the routing functions, some of the experts might be shared across layers and some would become obsolete and could easily be pruned. b) Improved accuracy: having access to more experts at each layer could result in a performance boost while the parameter count and the number of activated parameters during inference are kept constant.
Reference: arxiv.org/abs/2107.11817
Reference for both ideas about how to retrofit a pre-trained model into a more efficient variant: arxiv.org/abs/2403.09636 // concept stolen from @jxmnop
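To make idea 1 concrete, here is a toy, hedged sketch: keep the same total budget of 64 expert assignments per token, but instead of a fixed top-2 at every layer, pick the token's 64 highest-scoring (layer, expert) pairs globally, so the per-layer count can vary with depth. The function name and the offline view of all layers' router logits are illustrative assumptions; a real model would have to make this decision online, layer by layer, or learn it during the proposed fine-tuning.

```python
import torch

def budgeted_expert_assignment(router_logits, budget=64):
    """router_logits: (n_layers, n_experts) routing scores for one token.

    Mixtral-style routing fixes top-2 experts per layer (2 * 32 = 64 total).
    This toy variant keeps the total budget of 64 but lets the per-layer
    expert count vary: globally select the `budget` best (layer, expert) pairs."""
    n_layers, n_experts = router_logits.shape
    flat = router_logits.flatten()                 # (n_layers * n_experts,)
    chosen = flat.topk(budget).indices
    mask = torch.zeros_like(flat, dtype=torch.bool)
    mask[chosen] = True
    mask = mask.view(n_layers, n_experts)          # which experts fire at each layer
    per_layer_count = mask.sum(dim=-1)             # varies across depth, sums to `budget`
    return mask, per_layer_count

# toy usage with Mixtral-like shapes: 32 layers x 8 experts
logits = torch.randn(32, 8)
mask, per_layer_count = budgeted_expert_assignment(logits)
assert per_layer_count.sum().item() == 64          # same total compute as top-2 at every layer
```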
Da Xiao retweeted
Daniel Johnson
Daniel Johnson@_ddjohnson·
Excited to share Penzai, a JAX research toolkit from @GoogleDeepMind for building, editing, and visualizing neural networks! Penzai makes it easy to see model internals and lets you inject custom logic anywhere. Check it out on GitHub: github.com/google-deepmin…
Da Xiao retweeted
Nora Belrose
Nora Belrose@norabelrose·
RNN language models are making a comeback recently, with new architectures like Mamba and RWKV. But do interpretability tools designed for transformers transfer to the new RNNs? We tested 3 popular interp methods, and find the answer is mostly “yes”! arxiv.org/abs/2404.05971