Shengjie Luo

119 posts

@Roger98079446

PhD Student @pku1898, interested in Machine Learning

Peking University · Joined August 2019
525 Following · 334 Followers
Pinned Tweet
Shengjie Luo @Roger98079446
#ICLR2023 New paper! "Rethinking the expressive power of GNNs via graph biconnectivity" accepted as an 𝗼𝗿𝗮𝗹 𝗽𝗿𝗲𝘀𝗲𝗻𝘁𝗮𝘁𝗶𝗼𝗻 (notable-top 5%) 🔥A new direction to study GNN expressivity via graph biconnectivity! 👇Let's see the details of our fruitful results🤗
Bohang Zhang @ICLR 2024 (@bohang_zhang)

Excited to see our paper "Rethinking the expressive power of GNNs via graph biconnectivity" accepted as an 𝗼𝗿𝗮𝗹 𝗽𝗿𝗲𝘀𝗲𝗻𝘁𝗮𝘁𝗶𝗼𝗻 (notable-top 5%) at #ICLR2023! arxiv.org/abs/2301.09505 Joint work with @Roger98079446, Liwei Wang, and Di He 1/n
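
The biconnectivity properties the paper builds on are classical graph quantities (cut vertices, bridges, biconnected components). As a purely illustrative aside, not code from the paper, here is a minimal sketch of those quantities computed with networkx:

```python
# Minimal illustration (not from the paper): the biconnectivity notions used as an
# expressivity yardstick are cut vertices (articulation points), bridges (cut edges),
# and biconnected components, all computable in linear time with networkx.
import networkx as nx

def biconnectivity_summary(g: nx.Graph):
    return {
        "cut_vertices": sorted(nx.articulation_points(g)),
        "bridges": sorted(nx.bridges(g)),
        "biconnected_components": [sorted(c) for c in nx.biconnected_components(g)],
    }

# Toy usage: a cycle has no cut vertices or bridges; a path graph is all bridges.
print(biconnectivity_summary(nx.cycle_graph(5)))
print(biconnectivity_summary(nx.path_graph(5)))
```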

Shengjie Luo retweeted
Tianle Cai @tianle_cai
Can we turn part of an LLM's weights into long-term memory that continuously absorbs new knowledge? We took a small step toward this with In-Place Test-Time Training (In-Place TTT) — accepted as an Oral at ICLR 2026 🎉 The key idea: no new modules, optional pretraining. We repurpose the final projection matrix in every MLP block as fast weights. With an NTP-aligned objective and efficient chunk-wise updates, the model adapts on the fly — complementing attention rather than replacing it. 📄 Paper: arxiv.org/abs/2604.06169 with amazing @Guhao_Feng @Roger98079446 Kai @GeZhang86038849 Di @HuangRubio
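
A rough sketch of the mechanism as described in the tweet: one projection matrix in an MLP block serves as fast weights and is updated in place, chunk by chunk, with the same next-token-prediction loss used in training. The model structure, chunk size, and learning rate below are illustrative assumptions, not the paper's implementation:

```python
# Illustrative sketch only (not the paper's code): treat the final projection of an
# MLP block as fast weights and update it chunk-by-chunk at test time with the same
# next-token-prediction loss used in training. Sizes and learning rate are assumptions.
import torch
import torch.nn.functional as F

class MLPBlockWithFastWeights(torch.nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.up = torch.nn.Linear(d_model, d_hidden)
        self.down = torch.nn.Linear(d_hidden, d_model)  # doubles as fast-weight memory

    def forward(self, x):
        return self.down(F.gelu(self.up(x)))

def in_place_ttt_step(block, lm_head, hidden_chunk, target_ids, lr=1e-3):
    """One chunk-wise update: NTP loss on this chunk, then a gradient step applied
    in place to the down-projection only."""
    logits = lm_head(block(hidden_chunk))
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)), target_ids.view(-1))
    g_w, g_b = torch.autograd.grad(loss, [block.down.weight, block.down.bias])
    with torch.no_grad():
        block.down.weight -= lr * g_w
        block.down.bias -= lr * g_b
    return loss.item()

# Toy usage: stream chunks of hidden states and adapt on the fly.
d_model, d_hidden, vocab = 64, 256, 1000
block = MLPBlockWithFastWeights(d_model, d_hidden)
lm_head = torch.nn.Linear(d_model, vocab)
for _ in range(4):  # four chunks of 32 tokens each
    h = torch.randn(1, 32, d_model)
    y = torch.randint(0, vocab, (1, 32))
    in_place_ttt_step(block, lm_head, h, y)
```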
Shengjie Luo retweeted
Yiping Lu @2prime_PKU
Gradient-Lipschitz analysis can recover the scaling behind muP! Studying how network width changes the gradient Lipschitz constant under operator norms, we
• recover muP scaling for Adam
• show that Muon's smoothness can be bad
• find that a new row-wise gradient normalization is competitive with Muon
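
For concreteness, here is what a row-wise gradient-normalized update could look like in its simplest form; the exact rule and scaling in the paper may differ, so treat this purely as an illustration:

```python
# Illustrative sketch (assumed form, not necessarily the paper's exact rule):
# normalize each row of the gradient of a weight matrix to unit L2 norm before the
# step, so the per-row update size is controlled independently of the fan-in width.
import torch

def row_normalized_update(weight: torch.Tensor, grad: torch.Tensor,
                          lr: float = 1e-2, eps: float = 1e-8) -> None:
    row_norms = grad.norm(dim=1, keepdim=True).clamp_min(eps)  # (out_dim, 1)
    weight -= lr * grad / row_norms                            # in-place SGD-like step

# Toy usage on a random gradient for a 512x256 layer.
W, g = torch.randn(512, 256), torch.randn(512, 256)
row_normalized_update(W, g)
```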
Shengjie Luo retweeted
Karan Dalal @karansdalal
Our new paper, “End-to-End Test-Time Training for Long Context,” is a step towards continual learning in language models. We introduce a new method that blurs the boundary between training and inference. At test-time, our model continues learning from given context using the same next-token prediction objective as training. With this end-to-end objective, our model can efficiently compress substantial context into its weights and still use it effectively, unlocking extremely long context windows for complex reasoning and applications in agents and robotics. Paper: test-time-training.github.io/e2e.pdf Code: github.com/test-time-trai…
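
A minimal sketch of the overall recipe described above, assuming a simple chunked procedure: before answering, the model takes next-token-prediction gradient steps on the given context so the context ends up compressed into the weights. The chunking, optimizer, learning rate, and the toy stand-in model are assumptions, not the released code:

```python
# Illustrative sketch only (not the released code): take NTP gradient steps on the
# provided context, chunk by chunk, so the context is compressed into the weights
# before generation.
import torch
import torch.nn.functional as F

def ttt_on_context(model, context_ids, chunk_len=512, lr=1e-4):
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for start in range(0, context_ids.size(1) - 1, chunk_len):
        chunk = context_ids[:, start:start + chunk_len + 1]
        inputs, targets = chunk[:, :-1], chunk[:, 1:]
        logits = model(inputs)                                   # (B, T, vocab)
        loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               targets.reshape(-1))
        opt.zero_grad(); loss.backward(); opt.step()
    return model  # the weights now carry the context

# Toy usage with a trivial stand-in "language model".
vocab, d = 100, 32
model = torch.nn.Sequential(torch.nn.Embedding(vocab, d), torch.nn.Linear(d, vocab))
ctx = torch.randint(0, vocab, (1, 2048))
ttt_on_context(model, ctx)
```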
Shengjie Luo retweeted
Tanishq Mathew Abraham, Ph.D. @iScienceLuvr
Diagnosing and Improving Diffusion Models by Estimating the Optimal Loss Value "We first derive the optimal loss in closed form under a unified formulation of diffusion models, and develop effective estimators for it, including a stochastic variant scalable to large datasets with proper control of variance and bias. With this tool, we unlock the inherent metric for diagnosing the training quality of mainstream diffusion model variants, and develop a more performant training schedule based on the optimal loss. Moreover, using models with 120M to 1.5B parameters, we find that the power law is better demonstrated after subtracting the optimal loss from the actual training loss, suggesting a more principled setting for investigating the scaling law for diffusion models."
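
The reason an "optimal loss" is computable at all is that, for a finite training set, the optimal denoiser is a closed-form posterior mean over the data. Below is a toy Monte-Carlo estimator of that minimal loss under a simple x_t = x_0 + σε noise model; it is only an illustration of the idea, not the paper's estimator or its variance-controlled variant:

```python
# Illustrative sketch (not the paper's estimator): for a finite dataset the optimal
# denoiser is the posterior mean E[x_0 | x_t], so the minimal achievable MSE at a
# given noise level can be estimated by Monte Carlo and subtracted from the
# observed training loss.
import torch

def optimal_denoising_loss(data, sigma, n_samples=4096):
    """data: (N, D). Noise model x_t = x_0 + sigma * eps."""
    N, D = data.shape
    x0 = data[torch.randint(0, N, (n_samples,))]
    xt = x0 + sigma * torch.randn_like(x0)
    # Posterior weights over the training set: w_i ∝ exp(-||x_t - x_i||^2 / (2 sigma^2))
    d2 = torch.cdist(xt, data) ** 2                        # (n_samples, N)
    w = torch.softmax(-d2 / (2 * sigma ** 2), dim=1)
    x0_hat = w @ data                                      # posterior mean E[x_0 | x_t]
    return ((x0 - x0_hat) ** 2).sum(dim=1).mean().item()

# Toy usage: on a small 2-D dataset the optimal loss grows with the noise level.
data = torch.randn(256, 2)
for s in (0.1, 0.5, 1.0):
    print(s, optimal_denoising_loss(data, s))
```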
Shengjie Luo retweeted
Songlin Yang @SonglinYang4
Introducing the first open-source implementation of native sparse attention: github.com/fla-org/native…. Give it a spin and cook your NSA model! 🐳🐳🐳
Shengjie Luo retweeted
Tianle Cai @tianle_cai
Just grasped the true significance (not just because it's submitted by Wenfeng) of this work after reading @SonglinYang4's explanation. The breakthrough isn't hybrid attention (studied years ago), but the ingenious kernel that delivers real-world speedups for dynamic sparse attention.

As someone who worked on efficient transformers in undergrad, I had the impression that combining "efficient attentions" (linear, sparse, conv, block-structured), which theoretically would be faster, had the potential to replace full attention but was practically slower. DeepSeek's solution is different: by having each query group of a token attend to the same KV block, they can really reduce the memory movement and achieve FlashAttention-like memory efficiency.

This matters enormously for reasoning models that output long thinking processes (10k+ tokens). The efficient dynamic sparse kernel dramatically speeds up both training and inference for such models. What a brilliant example of algorithm-system co-design!
DeepSeek @deepseek_ai

🚀 Introducing NSA: A Hardware-Aligned and Natively Trainable Sparse Attention mechanism for ultra-fast long-context training & inference!

Core components of NSA:
• Dynamic hierarchical sparse strategy
• Coarse-grained token compression
• Fine-grained token selection

💡 With optimized design for modern hardware, NSA speeds up inference while reducing pre-training costs—without compromising performance. It matches or outperforms Full Attention models on general benchmarks, long-context tasks, and instruction-based reasoning.

📖 For more details, check out our paper here: arxiv.org/abs/2502.11089
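
A back-of-the-envelope sketch of the memory-movement point made in the commentary above: every query head in a GQA group selects and reads the same KV blocks, so each block is fetched once per group rather than once per head. The shapes, the block-scoring rule, and the absence of a causal mask are all simplifying assumptions, not the NSA kernel:

```python
# Illustrative sketch (not DeepSeek's kernel): all query heads sharing one KV head
# (a GQA group) attend to the SAME selected KV blocks, so each block only needs to
# be moved from memory once per group instead of once per head.
import torch
import torch.nn.functional as F

def group_block_sparse_attention(q, k, v, block_size=16, top_blocks=2):
    """q: (groups, heads_per_group, T, d); k, v: (groups, T, d), one KV head per group."""
    G, H, T, d = q.shape
    kb = k.view(G, T // block_size, block_size, d)
    vb = v.view(G, T // block_size, block_size, d)

    # Score KV blocks once per group (mean query vs. mean key) and keep the top ones.
    q_mean = q.mean(dim=(1, 2))                              # (G, d)
    k_mean = kb.mean(dim=2)                                  # (G, n_blocks, d)
    idx = torch.einsum('gd,gnd->gn', q_mean, k_mean).topk(top_blocks, dim=-1).indices

    out = torch.zeros_like(q)
    for g in range(G):
        # Every head in this group reads the same selected blocks.
        k_sel = kb[g, idx[g]].reshape(-1, d)
        v_sel = vb[g, idx[g]].reshape(-1, d)
        attn = F.softmax(q[g] @ k_sel.T / d ** 0.5, dim=-1)  # (H, T, selected)
        out[g] = attn @ v_sel
    return out

# Toy usage.
G, H, T, d = 2, 4, 64, 32
out = group_block_sparse_attention(torch.randn(G, H, T, d),
                                   torch.randn(G, T, d), torch.randn(G, T, d))
```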

Shengjie Luo retweeted
Jiao Sun @sunjiao123sun_
Mitigating racial bias from LLMs is a lot easier than removing it from humans! Can’t believe this happened at the best AI conference @NeurIPSConf We have ethical reviews for authors, but missed it for invited speakers? 😡
Shengjie Luo retweeted
Xiang Fu @xiangfu_ml
Charge density is the core attribute of atomic systems in DFT. When representing and predicting charge density with ML, it is challenging to balance accuracy and efficiency. We propose a recipe that achieves SOTA on both: arxiv.org/abs/2405.19276 1/5
Shengjie Luo @Roger98079446
Super Cool! Don't miss it!
Bohang Zhang @ICLR 2024 (@bohang_zhang)

#ICLR2024 Just arrived in Vienna! Don't miss our oral presentation tomorrow afternoon in room Halle A3, focusing on 𝗚𝗡𝗡𝘀 and their 𝗲𝘅𝗽𝗿𝗲𝘀𝘀𝗶𝘃𝗲 𝗽𝗼𝘄𝗲𝗿! Also, swing by our poster session (Poster272, Halle B). See you there! 🌟

Shengjie Luo @Roger98079446
Our Method 4⃣: As a fundamental operation, our Gaunt Tensor Product can be applied to major operation classes that are widely used in E(3) equivariant networks. A comprehensive analysis is provided in our work: 9/n
Shengjie Luo @Roger98079446
#ICLR2024 Arrived in Vienna! Happy to share our recent work 𝘁𝗼𝘄𝗮𝗿𝗱𝘀 𝗲𝗳𝗳𝗶𝗰𝗶𝗲𝗻𝘁 𝗮𝗻𝗱 𝗲𝗳𝗳𝗲𝗰𝘁𝗶𝘃𝗲 𝗴𝗲𝗼𝗺𝗲𝘁𝗿𝗶𝗰 𝗱𝗲𝗲𝗽 𝗹𝗲𝗮𝗿𝗻𝗶𝗻𝗴 𝗳𝗼𝗿 𝘀𝗰𝗶𝗲𝗻𝗰𝗲! With the incredible CTL and @ask1729! May 9, 10:45am-12:45pm (Poster254, Halle B). Details⬇️ (1/n)