Shengjie Luo

119 posts

@Roger98079446

PhD Student @pku1898, interested in Machine Learning

Peking University · Joined August 2019
525 Following · 334 Followers
Pinned Tweet
Shengjie Luo @Roger98079446
#ICLR2023 New paper! "Rethinking the expressive power of GNNs via graph biconnectivity" accepted as an 𝗼𝗿𝗮𝗹 𝗽𝗿𝗲𝘀𝗲𝗻𝘁𝗮𝘁𝗶𝗼𝗻 (notable-top 5%) 🔥A new direction to study GNN expressivity via graph biconnectivity! 👇Let's see the details of our fruitful results🤗
Bohang Zhang @ICLR 2024 (@bohang_zhang)

Excited to see our paper "Rethinking the expressive power of GNNs via graph biconnectivity" accepted as an 𝗼𝗿𝗮𝗹 𝗽𝗿𝗲𝘀𝗲𝗻𝘁𝗮𝘁𝗶𝗼𝗻 (notable-top 5%) at #ICLR2023! arxiv.org/abs/2301.09505 Joint work with @Roger98079446, Liwei Wang, and Di He 1/n
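
The biconnectivity properties the paper builds on are classical graph quantities (cut vertices, bridges, biconnected components). As a purely illustrative aside, not code from the paper, here is a minimal sketch of those quantities computed with networkx:

```python
# Minimal illustration (not from the paper): the biconnectivity notions used as an
# expressivity yardstick are cut vertices (articulation points), bridges (cut edges),
# and biconnected components, all computable in linear time with networkx.
import networkx as nx

def biconnectivity_summary(g: nx.Graph):
    return {
        "cut_vertices": sorted(nx.articulation_points(g)),
        "bridges": sorted(nx.bridges(g)),
        "biconnected_components": [sorted(c) for c in nx.biconnected_components(g)],
    }

# Toy usage: a cycle has no cut vertices or bridges; a path graph is all bridges.
print(biconnectivity_summary(nx.cycle_graph(5)))
print(biconnectivity_summary(nx.path_graph(5)))
```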

Shengjie Luo retweeted
Tianle Cai @tianle_cai
Can we turn part of an LLM's weights into long-term memory that continuously absorbs new knowledge? We took a small step toward this with In-Place Test-Time Training (In-Place TTT) — accepted as an Oral at ICLR 2026 🎉 The key idea: no new modules, optional pretraining. We repurpose the final projection matrix in every MLP block as fast weights. With an NTP-aligned objective and efficient chunk-wise updates, the model adapts on the fly — complementing attention rather than replacing it. 📄 Paper: arxiv.org/abs/2604.06169 with amazing @Guhao_Feng @Roger98079446 Kai @GeZhang86038849 Di @HuangRubio
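
A rough sketch of the mechanism as described in the tweet: one projection matrix in an MLP block serves as fast weights and is updated in place, chunk by chunk, with the same next-token-prediction loss used in training. The model structure, chunk size, and learning rate below are illustrative assumptions, not the paper's implementation:

```python
# Illustrative sketch only (not the paper's code): treat the final projection of an
# MLP block as fast weights and update it chunk-by-chunk at test time with the same
# next-token-prediction loss used in training. Sizes and learning rate are assumptions.
import torch
import torch.nn.functional as F

class MLPBlockWithFastWeights(torch.nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.up = torch.nn.Linear(d_model, d_hidden)
        self.down = torch.nn.Linear(d_hidden, d_model)  # doubles as fast-weight memory

    def forward(self, x):
        return self.down(F.gelu(self.up(x)))

def in_place_ttt_step(block, lm_head, hidden_chunk, target_ids, lr=1e-3):
    """One chunk-wise update: NTP loss on this chunk, then a gradient step applied
    in place to the down-projection only."""
    logits = lm_head(block(hidden_chunk))
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)), target_ids.view(-1))
    g_w, g_b = torch.autograd.grad(loss, [block.down.weight, block.down.bias])
    with torch.no_grad():
        block.down.weight -= lr * g_w
        block.down.bias -= lr * g_b
    return loss.item()

# Toy usage: stream chunks of hidden states and adapt on the fly.
d_model, d_hidden, vocab = 64, 256, 1000
block = MLPBlockWithFastWeights(d_model, d_hidden)
lm_head = torch.nn.Linear(d_model, vocab)
for _ in range(4):  # four chunks of 32 tokens each
    h = torch.randn(1, 32, d_model)
    y = torch.randint(0, vocab, (1, 32))
    in_place_ttt_step(block, lm_head, h, y)
```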
Shengjie Luo retweeted
Yiping Lu @2prime_PKU
Gradient-Lipschitz analysis can recover the scaling behind muP! Studying how network width changes the gradient Lipschitz constant under operator norms, we
• recover muP scaling for Adam
• show that Muon's smoothness can be bad
• find that a new row-wise gradient normalization is competitive with Muon
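
For concreteness, here is what a row-wise gradient-normalized update could look like in its simplest form; the exact rule and scaling in the paper may differ, so treat this purely as an illustration:

```python
# Illustrative sketch (assumed form, not necessarily the paper's exact rule):
# normalize each row of the gradient of a weight matrix to unit L2 norm before the
# step, so the per-row update size is controlled independently of the fan-in width.
import torch

def row_normalized_update(weight: torch.Tensor, grad: torch.Tensor,
                          lr: float = 1e-2, eps: float = 1e-8) -> None:
    row_norms = grad.norm(dim=1, keepdim=True).clamp_min(eps)  # (out_dim, 1)
    weight -= lr * grad / row_norms                            # in-place SGD-like step

# Toy usage on a random gradient for a 512x256 layer.
W, g = torch.randn(512, 256), torch.randn(512, 256)
row_normalized_update(W, g)
```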
Shengjie Luo retweeted
Karan Dalal @karansdalal
Our new paper, “End-to-End Test-Time Training for Long Context,” is a step towards continual learning in language models. We introduce a new method that blurs the boundary between training and inference. At test-time, our model continues learning from given context using the same next-token prediction objective as training. With this end-to-end objective, our model can efficiently compress substantial context into its weights and still use it effectively, unlocking extremely long context windows for complex reasoning and applications in agents and robotics. Paper: test-time-training.github.io/e2e.pdf Code: github.com/test-time-trai…
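
A minimal sketch of the overall recipe described above, assuming a simple chunked procedure: before answering, the model takes next-token-prediction gradient steps on the given context so the context ends up compressed into the weights. The chunking, optimizer, learning rate, and the toy stand-in model are assumptions, not the released code:

```python
# Illustrative sketch only (not the released code): take NTP gradient steps on the
# provided context, chunk by chunk, so the context is compressed into the weights
# before generation.
import torch
import torch.nn.functional as F

def ttt_on_context(model, context_ids, chunk_len=512, lr=1e-4):
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for start in range(0, context_ids.size(1) - 1, chunk_len):
        chunk = context_ids[:, start:start + chunk_len + 1]
        inputs, targets = chunk[:, :-1], chunk[:, 1:]
        logits = model(inputs)                                   # (B, T, vocab)
        loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               targets.reshape(-1))
        opt.zero_grad(); loss.backward(); opt.step()
    return model  # the weights now carry the context

# Toy usage with a trivial stand-in "language model".
vocab, d = 100, 32
model = torch.nn.Sequential(torch.nn.Embedding(vocab, d), torch.nn.Linear(d, vocab))
ctx = torch.randint(0, vocab, (1, 2048))
ttt_on_context(model, ctx)
```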
Shengjie Luo retweeted
Tanishq Mathew Abraham, Ph.D. @iScienceLuvr
Diagnosing and Improving Diffusion Models by Estimating the Optimal Loss Value "We first derive the optimal loss in closed form under a unified formulation of diffusion models, and develop effective estimators for it, including a stochastic variant scalable to large datasets with proper control of variance and bias. With this tool, we unlock the inherent metric for diagnosing the training quality of mainstream diffusion model variants, and develop a more performant training schedule based on the optimal loss. Moreover, using models with 120M to 1.5B parameters, we find that the power law is better demonstrated after subtracting the optimal loss from the actual training loss, suggesting a more principled setting for investigating the scaling law for diffusion models."
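
The reason an "optimal loss" is computable at all is that, for a finite training set, the optimal denoiser is a closed-form posterior mean over the data. Below is a toy Monte-Carlo estimator of that minimal loss under a simple x_t = x_0 + σε noise model; it is only an illustration of the idea, not the paper's estimator or its variance-controlled variant:

```python
# Illustrative sketch (not the paper's estimator): for a finite dataset the optimal
# denoiser is the posterior mean E[x_0 | x_t], so the minimal achievable MSE at a
# given noise level can be estimated by Monte Carlo and subtracted from the
# observed training loss.
import torch

def optimal_denoising_loss(data, sigma, n_samples=4096):
    """data: (N, D). Noise model x_t = x_0 + sigma * eps."""
    N, D = data.shape
    x0 = data[torch.randint(0, N, (n_samples,))]
    xt = x0 + sigma * torch.randn_like(x0)
    # Posterior weights over the training set: w_i ∝ exp(-||x_t - x_i||^2 / (2 sigma^2))
    d2 = torch.cdist(xt, data) ** 2                        # (n_samples, N)
    w = torch.softmax(-d2 / (2 * sigma ** 2), dim=1)
    x0_hat = w @ data                                      # posterior mean E[x_0 | x_t]
    return ((x0 - x0_hat) ** 2).sum(dim=1).mean().item()

# Toy usage: on a small 2-D dataset the optimal loss grows with the noise level.
data = torch.randn(256, 2)
for s in (0.1, 0.5, 1.0):
    print(s, optimal_denoising_loss(data, s))
```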
Shengjie Luo retweeted
Songlin Yang @SonglinYang4
Introducing the first open-source implementation of native sparse attention: github.com/fla-org/native…. Give it a spin and cook your NSA model! 🐳🐳🐳
Shengjie Luo retweeted
Tianle Cai @tianle_cai
Just grasped the true significance (not just because it's submitted by Wenfeng) of this work after reading @SonglinYang4's explanation. The breakthrough isn't hybrid attention (studied years ago), but the ingenious kernel that delivers real-world speedups for dynamic sparse attention.

As someone who worked on efficient transformers in undergrad, I had the impression that combining "efficient attentions" (linear, sparse, conv, block-structured), which theoretically would be faster, had the potential to replace full attention but was practically slower. DeepSeek's solution is different: by having each query group of a token attend to the same KV block, they can really reduce the memory movement and achieve FlashAttention-like memory efficiency.

This matters enormously for reasoning models that output long thinking processes (10k+ tokens). The efficient dynamic sparse kernel dramatically speeds up both training and inference for such models. What a brilliant example of algorithm-system co-design!
DeepSeek @deepseek_ai

🚀 Introducing NSA: A Hardware-Aligned and Natively Trainable Sparse Attention mechanism for ultra-fast long-context training & inference!

Core components of NSA:
• Dynamic hierarchical sparse strategy
• Coarse-grained token compression
• Fine-grained token selection

💡 With optimized design for modern hardware, NSA speeds up inference while reducing pre-training costs—without compromising performance. It matches or outperforms Full Attention models on general benchmarks, long-context tasks, and instruction-based reasoning.

📖 For more details, check out our paper here: arxiv.org/abs/2502.11089
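
A back-of-the-envelope sketch of the memory-movement point made in the commentary above: every query head in a GQA group selects and reads the same KV blocks, so each block is fetched once per group rather than once per head. The shapes, the block-scoring rule, and the absence of a causal mask are all simplifying assumptions, not the NSA kernel:

```python
# Illustrative sketch (not DeepSeek's kernel): all query heads sharing one KV head
# (a GQA group) attend to the SAME selected KV blocks, so each block only needs to
# be moved from memory once per group instead of once per head.
import torch
import torch.nn.functional as F

def group_block_sparse_attention(q, k, v, block_size=16, top_blocks=2):
    """q: (groups, heads_per_group, T, d); k, v: (groups, T, d), one KV head per group."""
    G, H, T, d = q.shape
    kb = k.view(G, T // block_size, block_size, d)
    vb = v.view(G, T // block_size, block_size, d)

    # Score KV blocks once per group (mean query vs. mean key) and keep the top ones.
    q_mean = q.mean(dim=(1, 2))                              # (G, d)
    k_mean = kb.mean(dim=2)                                  # (G, n_blocks, d)
    idx = torch.einsum('gd,gnd->gn', q_mean, k_mean).topk(top_blocks, dim=-1).indices

    out = torch.zeros_like(q)
    for g in range(G):
        # Every head in this group reads the same selected blocks.
        k_sel = kb[g, idx[g]].reshape(-1, d)
        v_sel = vb[g, idx[g]].reshape(-1, d)
        attn = F.softmax(q[g] @ k_sel.T / d ** 0.5, dim=-1)  # (H, T, selected)
        out[g] = attn @ v_sel
    return out

# Toy usage.
G, H, T, d = 2, 4, 64, 32
out = group_block_sparse_attention(torch.randn(G, H, T, d),
                                   torch.randn(G, T, d), torch.randn(G, T, d))
```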

Shengjie Luo retweeted
Jiao Sun @sunjiao123sun_
Mitigating racial bias from LLMs is a lot easier than removing it from humans! Can’t believe this happened at the best AI conference @NeurIPSConf We have ethical reviews for authors, but missed it for invited speakers? 😡
Shengjie Luo retweeted
Xiang Fu @xiangfu_ml
Charge density is the core attribute of atomic systems in DFT. When representing and predicting charge density with ML, it is challenging to balance accuracy and efficiency. We propose a recipe that achieves SOTA on both: arxiv.org/abs/2405.19276 1/5
Shengjie Luo @Roger98079446
Super Cool! Don't miss it!
Bohang Zhang @ICLR 2024 (@bohang_zhang)

#ICLR2024 Just arrived in Vienna! Don't miss our oral presentation tomorrow afternoon in room Halle A3, focusing on 𝗚𝗡𝗡𝘀 and their 𝗲𝘅𝗽𝗿𝗲𝘀𝘀𝗶𝘃𝗲 𝗽𝗼𝘄𝗲𝗿! Also, swing by our poster session (Poster272, Halle B). See you there! 🌟

Shengjie Luo @Roger98079446
Our Method 4⃣: As a fundamental operation, our Gaunt Tensor Product can be applied to major operation classes that are widely used in E(3) equivariant networks. A comprehensive analysis is provided in our work: 9/n
Shengjie Luo @Roger98079446
#ICLR2024 Arrived in Vienna! Happy to share our recent work 𝘁𝗼𝘄𝗮𝗿𝗱𝘀 𝗲𝗳𝗳𝗶𝗰𝗶𝗲𝗻𝘁 𝗮𝗻𝗱 𝗲𝗳𝗳𝗲𝗰𝘁𝗶𝘃𝗲 𝗴𝗲𝗼𝗺𝗲𝘁𝗿𝗶𝗰 𝗱𝗲𝗲𝗽 𝗹𝗲𝗮𝗿𝗻𝗶𝗻𝗴 𝗳𝗼𝗿 𝘀𝗰𝗶𝗲𝗻𝗰𝗲! With the incredible CTL and @ask1729! May 9, 10:45am-12:45pm (Poster254, Halle B). Details⬇️ (1/n)