
🚀 Happy to present our new work on LLM Scaling Laws: JTok!
We show that token-indexed parameters can serve as a novel, orthogonal scaling axis that decouples model capacity from FLOPs. By modulating Transformer layers with JTok, we achieve comparable model quality with 35% less compute relative to vanilla MoE architectures!
⚙️ JTok & JTok-M Architecture
(1) Local Modulation: JTok introduces an auxiliary embedding table at each transformer layer, retrieving token-specific vectors to modulate the backbone through element-wise operations, incurring negligible FLOPs overhead.
(2) Dynamic Mixture (JTok-M): To further scale this embedding table, JTok-M introduces an embedding pool and uses a lightweight router to select a sparse Top-K mixture per token.
(3) System Efficiency: Retrieval overlaps with computation. CPU offloading ensures zero extra GPU memory footprint with <7.3% latency increase.
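For readers who want a concrete picture of (1) and (2), here is a minimal NumPy sketch, not the paper's implementation. Everything specific in it is an assumption made for illustration: the function names, the multiplicative form `hidden * (1 + vector)`, a router that reads the hidden state, and softmax weights over the selected Top-K entries. The post only states that token-specific vectors are retrieved and applied element-wise, and that a lightweight router picks a sparse Top-K mixture.

```python
import numpy as np

def jtok_modulate(hidden, token_ids, table):
    """Token-indexed modulation for one layer (illustrative form only).

    hidden:    (seq, d) layer activations
    token_ids: (seq,)   input token ids
    table:     (vocab, d) auxiliary per-layer embedding table
    """
    vec = table[token_ids]           # pure lookup, no matmul
    return hidden * (1.0 + vec)      # assumed element-wise modulation

def jtok_m_modulate(hidden, token_ids, pool, router_w, k=2):
    """Top-K mixture over a pool of tables (illustrative form only).

    pool:     (n_tables, vocab, d) embedding pool
    router_w: (d, n_tables) weights of a hypothetical linear router
    """
    logits = hidden @ router_w                               # (seq, n_tables)
    top_idx = np.argsort(logits, axis=-1)[:, -k:]            # (seq, k)
    top_logits = np.take_along_axis(logits, top_idx, axis=-1)
    # Softmax restricted to the k selected tables.
    w = np.exp(top_logits - top_logits.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    # Gather each token's vector from each of its k selected tables.
    cand = pool[top_idx, token_ids[:, None]]                 # (seq, k, d)
    mixed = (w[..., None] * cand).sum(axis=1)                # (seq, d)
    return hidden * (1.0 + mixed)

rng = np.random.default_rng(0)
hidden = rng.normal(size=(5, 8))
token_ids = np.array([0, 3, 3, 1, 2])
table = 0.02 * rng.normal(size=(4, 8))
pool = 0.02 * rng.normal(size=(6, 4, 8))
router_w = rng.normal(size=(8, 6))

out = jtok_modulate(hidden, token_ids, table)
out_m = jtok_m_modulate(hidden, token_ids, pool, router_w, k=2)
print(out.shape, out_m.shape)
```

In this sketch the only dense compute added by JTok-M is the small router matmul. The rest is table lookups and element-wise operations, which is the property that lets the added parameters grow without a matching growth in FLOPs.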
🧠 Key Scaling & Performance Results
(1) 35% Compute Savings: Rigorous IsoFLOPs analysis confirms that JTok-M fundamentally shifts the quality-compute Pareto frontier. It achieves comparable model quality while saving 35% compute relative to vanilla MoE models.
(2) Massive Downstream Gains: We validated this on backbones up to 61B total parameters (17B backbone + 44B embedding). On a large-scale 17B MoE backbone, JTok-M delivers substantial improvements, including +4.1 on MMLU, +8.3 on ARC, and +8.9 on CEval.
(3) Predictable Power-Law Scaling: Validation loss exhibits a log-linear trend with the number of token-indexed parameters. This establishes token-indexed parameters as a highly efficient and scalable dimension alongside traditional dense and sparse scaling.
🎉 Easter Egg: What does "JTok" mean? Technically, it stands for "Joint-Token". But for us SJTUers, "Joint" sounds like "Jiao Tong" (交通), hiding our inside joke: "JT (交通) OK!" 🎓 Following the "jAccount" naming tradition, this is our tribute to Shanghai Jiao Tong University's upcoming 130th anniversary! 🎂❤️
arXiv Link: arxiv.org/pdf/2602.00800






