ReThinkLab-SJTU

5 posts

@Rethinker135365

Founded by Prof. Junchi Yan in 2018, ReThinkLab's vision is to perform recurrent thinking to change the world through both its technology and its talent.

Shanghai · Joined February 2025
5 Following · 72 Followers
ReThinkLab-SJTU (@Rethinker135365):
🚀 Happy to present our new work on LLM Scaling Laws: JTok! We show that token-indexed parameters can serve as a novel, orthogonal scaling axis that decouples model capacity from FLOPs. By modulating Transformer layers with JTok, we achieve comparable model quality with 35% less compute relative to vanilla MoE architectures!

⚙️ JTok & JTok-M Architecture
(1) Local Modulation: JTok introduces an auxiliary embedding table at each Transformer layer, retrieving token-specific vectors that modulate the backbone through element-wise operations, with negligible FLOPs overhead.
(2) Dynamic Mixture (JTok-M): To further scale this embedding table, JTok-M introduces an embedding pool and uses a lightweight router to select a sparse Top-K mixture per token.
(3) System Efficiency: Retrieval overlaps with computation. CPU offloading ensures zero extra GPU memory footprint with <7.3% latency increase.

🧠 Key Scaling & Performance Results
(1) 35% Compute Savings: Rigorous IsoFLOPs analysis confirms that JTok-M shifts the quality-compute Pareto frontier. It achieves comparable model quality while saving 35% compute relative to vanilla MoE models.
(2) Massive Downstream Gains: We validated this on models of up to 61B total parameters (17B backbone + 44B embedding). On a large-scale 17B MoE backbone, JTok-M delivers substantial improvements, including +4.1 on MMLU, +8.3 on ARC, and +8.9 on CEval.
(3) Predictable Power-Law Scaling: Validation loss exhibits a log-linear trend with the number of token-indexed parameters. This establishes token-indexed parameters as a highly efficient and scalable dimension alongside traditional dense and sparse scaling.

🎉 Easter Egg: What does "JTok" mean? Technically, it stands for "Joint-Token". But for us SJTUers, "Joint" sounds like "Jiao Tong" (交通), hiding our inside joke: "JT (交通) OK!" 🎓 Following the "jAccount" naming tradition, this is our tribute to Shanghai Jiao Tong University's upcoming 130th anniversary! 🎂❤️

arXiv Link: arxiv.org/pdf/2602.00800
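
A minimal sketch of the two ideas described in the post above, token-indexed modulation and a Top-K mixture over an embedding pool, written in plain Python so the arithmetic is visible. This is an illustration under assumptions, not the paper's implementation: the post only says the modulation is "element-wise", so the `(1 + e)` scaling form is assumed; the post does not say how the JTok-M pool is indexed or what the router looks like, so the sketch scores pool entries with a dot product against the hidden state. All names (`make_table`, `jtok_modulate`, `topk_mixture`, `jtok_m_modulate`) are hypothetical.

```python
import math
import random


def make_table(rows, dim, seed):
    """Build a small random table (rows x dim) of floats in [-0.1, 0.1]."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(rows)]


def jtok_modulate(hidden, token_id, table):
    """Scale a hidden vector element-wise by (1 + table[token_id]).

    The cost is one table lookup and one multiply per channel, with no
    matrix multiply. The (1 + e) form is an assumption of this sketch.
    """
    vec = table[token_id]
    return [h * (1.0 + e) for h, e in zip(hidden, vec)]


def topk_mixture(hidden, pool, router, k):
    """Mix the k pool vectors with the highest router scores.

    router[m] is a hypothetical scoring vector for pool entry m. Scores are
    dot products with the hidden state, softmax-normalised over the top k.
    Returns the selected indices and the mixed vector.
    """
    scores = [sum(h * w for h, w in zip(hidden, row)) for row in router]
    top = sorted(range(len(pool)), key=lambda m: scores[m], reverse=True)[:k]
    peak = max(scores[m] for m in top)  # subtract the max for stability
    weights = [math.exp(scores[m] - peak) for m in top]
    total = sum(weights)
    mixed = [0.0] * len(hidden)
    for m, w in zip(top, weights):
        for d in range(len(hidden)):
            mixed[d] += (w / total) * pool[m][d]
    return top, mixed


def jtok_m_modulate(hidden, pool, router, k):
    """Apply the same element-wise modulation using the Top-K mixture."""
    _, mixed = topk_mixture(hidden, pool, router, k)
    return [h * (1.0 + e) for h, e in zip(hidden, mixed)]


# Toy sizes, chosen only for illustration.
DIM, VOCAB, POOL_SIZE = 4, 10, 8
demo_hidden = [0.5, -1.0, 2.0, 0.25]
demo_table = make_table(VOCAB, DIM, seed=0)
demo_pool = make_table(POOL_SIZE, DIM, seed=1)
demo_router = make_table(POOL_SIZE, DIM, seed=2)

demo_jtok = jtok_modulate(demo_hidden, token_id=3, table=demo_table)
demo_jtok_m = jtok_m_modulate(demo_hidden, demo_pool, demo_router, k=2)
```

In both functions the extra work per token is a lookup plus a handful of multiplies, which is the sense in which added table rows increase parameter count without a matching increase in FLOPs.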
ReThinkLab-SJTU reposted
Xiaohan Qin (@qinxiaohan94414):
🚀 No reference model, yet better token selection? Introducing ssToken, for smarter & cheaper SFT! It improves token-level data selection without extra models by combining learnability and semantics. 📄 HF: huggingface.co/papers/2510.18… 📘 arXiv: arxiv.org/abs/2510.18250
ReThinkLab-SJTU reposted
Zhanpeng Zhou (@zhanpeng_zhou):
Two papers accepted at #ICML2025 🥳🥳 [1/2] We find that different blocks in Transformers exhibit notable disparity in sharpness. We then propose Blockwise LR, which accelerates large language model (LLM) pre-training (~2x speedup). arxiv.org/abs/2502.19002
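
To make "Blockwise LR" concrete, here is a minimal sketch of a plain SGD step in which each block type gets its own learning rate. The post does not say how blocks are grouped or how the per-block rates are chosen, so the name-matching scheme in `block_of` and the multiplier values below are placeholders invented for illustration, not the paper's recipe.

```python
def block_of(param_name):
    """Map a parameter name to a block type (hypothetical naming scheme)."""
    for key in ("embed", "attn", "ffn", "norm"):
        if key in param_name:
            return key
    return "other"


def blockwise_sgd_step(params, grads, base_lr, multipliers):
    """One SGD step where each block type uses base_lr * its multiplier.

    params and grads map parameter names to lists of floats. Block types
    missing from `multipliers` fall back to a multiplier of 1.0.
    """
    updated = {}
    for name, values in params.items():
        lr = base_lr * multipliers.get(block_of(name), 1.0)
        updated[name] = [v - lr * g for v, g in zip(values, grads[name])]
    return updated


# Placeholder multipliers; real values would have to come from the paper
# or from measuring per-block sharpness.
demo_multipliers = {"embed": 1.0, "attn": 2.0, "ffn": 1.5, "norm": 1.0}
demo_params = {"layer0.attn.q": [0.2, -0.1], "layer0.ffn.up": [0.3, 0.4]}
demo_grads = {"layer0.attn.q": [0.5, 0.5], "layer0.ffn.up": [0.1, -0.2]}
demo_updated = blockwise_sgd_step(demo_params, demo_grads, 0.01, demo_multipliers)
```

In a deep-learning framework the same idea is usually expressed as optimizer parameter groups, one group per block type.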
ReThinkLab-SJTU reposted
Xiangdong Zhang (@aHpaBean):
🚀 Can video understanding boost video generation? Introducing VideoREPA (NeurIPS’25). State-of-the-art text-to-video models generate visually stunning results but still violate basic physics (floating objects, ignored collisions), limiting their reliability as world models.
ReThinkLab-SJTU reposted
Yang Li (@LeYangco):
🚀 Happy to present our new work on LLM reasoning! We show that:
(1) Attention is a structured map of the model's reasoning logic, uncovering a preplan-and-anchor reasoning rhythm.
(2) Aligning RL objectives with the model's intrinsic attention rhythm yields more transparent, fine-grained, and efficient optimization.

🧠 Key Reasoning Patterns in Attention
(1) Local Chunking: Near-diagonal sawtooth patterns indicate dense intra-chunk processing. At chunk boundaries, the model performs long-range context retrieval (often with higher entropy), which guides subsequent generation.
(2) Global Anchor Planning: Sparse, high-influence anchor tokens exert broad control over later tokens. Perturbing these anchors significantly disrupts downstream reasoning.
(3) Preplan-Anchor Coupling: A stable temporal rhythm emerges: the model first emits a "preplan" token, then anchors a core semantic node, repeatedly structuring the reasoning trajectory.

⚙️ RL Innovation
We introduce a dynamic reward redistribution mechanism guided by attention-derived reasoning structure:
(1) Preplan Guidance: Boosts tokens that guide local chunks and enable long-range referencing.
(2) Anchor Enhancement: Prioritizes optimization of globally influential semantic anchors.
(3) Coupling Alignment: Reinforces the temporal coordination between preplans and anchors to solidify structured reasoning.

HuggingFace Link: huggingface.co/papers/2510.13…
arXiv Link: arxiv.org/abs/2510.13554
#LLMs #artificial_intelligence #RL4LLM
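
One way to picture the anchor-enhancement idea in the post above is to weight a sequence-level advantage by how much attention each token receives from later tokens. The sketch below is an assumption-laden toy, not the authors' method: the influence measure (mean attention received from strictly later positions), the `alpha` knob, and the mean-one rescaling are all choices made here for illustration, and the function names are hypothetical. It covers only the anchor part, not preplan guidance or coupling alignment.

```python
def anchor_influence(attn):
    """Average attention each token receives from strictly later tokens.

    attn[i][j] is the attention weight from query position i to key
    position j in a causal attention map (rows sum to 1, j <= i).
    """
    n = len(attn)
    influence = []
    for j in range(n):
        later = [attn[i][j] for i in range(j + 1, n)]
        influence.append(sum(later) / len(later) if later else 0.0)
    return influence


def redistribute(advantage, attn, alpha=0.5):
    """Spread one sequence-level advantage over tokens, favouring anchors.

    Each token's weight is 1 + alpha * (influence / max influence). The
    weights are rescaled to mean 1, so the summed per-token credit equals
    advantage * sequence_length, as it would with uniform weighting.
    """
    influence = anchor_influence(attn)
    peak = max(influence) or 1.0  # avoid dividing by zero
    raw = [1.0 + alpha * (v / peak) for v in influence]
    mean = sum(raw) / len(raw)
    return [advantage * w / mean for w in raw]


# Toy 3-token causal attention map in which token 0 acts as an anchor.
demo_attn = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.8, 0.1, 0.1],
]
demo_token_advantages = redistribute(2.0, demo_attn)
```

Keeping the mean weight at 1 is a design choice of this sketch: it changes which tokens receive credit without changing the overall scale of the update.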