
Program Counter
@program_counter
all things toward agi


1/8 Introducing Recurrent Transformer (RT). At 300M params, RT improves validation CE over standard Transformers. The best RT model is only 6 layers, but wider at 2048 — beating deeper 12- and 24-layer Transformers by trading depth for width.
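
The thread does not spell out the recurrence, so here is a minimal sketch of one common reading: a shallow, weight-tied Transformer stack unrolled for a fixed number of steps, trading unique depth for width. All names, dimensions, and the step count are illustrative assumptions, not the authors' code.

```python
import torch.nn as nn

class Block(nn.Module):
    """Standard pre-norm Transformer block (causal masking omitted for brevity)."""
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))

class RecurrentTransformer(nn.Module):
    """Shallow-but-wide stack whose layers are reused for several recurrent steps."""
    def __init__(self, d_model=2048, n_heads=16, n_layers=6, n_steps=3):
        super().__init__()
        self.blocks = nn.ModuleList([Block(d_model, n_heads) for _ in range(n_layers)])
        self.n_steps = n_steps

    def forward(self, x):
        # The same n_layers blocks are applied n_steps times: parameter count of a
        # 6-layer model, effective depth of n_layers * n_steps.
        for _ in range(self.n_steps):
            for block in self.blocks:
                x = block(x)
        return x
```

Under this reading, parameter count stays that of the 6 unique layers while effective depth grows with the number of recurrent steps, which is one way a 6-layer, 2048-wide model could compete with deeper stacks.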

Now reading:

Beyond MuP: 4. Maintaining Parameter Stability kexue.fm/archives/11729
Based on the principle of minimal modification, we propose a general framework for maintaining parameter stability during training, encompassing two schemes: Post Clip and Pre Decay. Under the spectral norm, these further evolve into singular value clipping and spectral weight decay. These operations aim to ensure that critical parameter norms remain bounded while minimizing interference with training dynamics.
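
For intuition, a minimal sketch of singular value clipping, the spectral-norm form of Post Clip, assuming PyTorch; the function name and threshold are illustrative, not the article's code.

```python
import torch

@torch.no_grad()
def singular_value_clip(w: torch.Tensor, max_sv: float = 1.0) -> None:
    """Project a weight matrix back into the spectral-norm ball after an optimizer step.

    Clipping the singular values at max_sv bounds ||w||_2 while leaving the
    singular vectors untouched, i.e. minimal interference with training dynamics.
    """
    u, s, vh = torch.linalg.svd(w, full_matrices=False)
    w.copy_(u @ torch.diag(s.clamp(max=max_sv)) @ vh)

# Pre Decay / spectral weight decay would instead regularize toward the same
# bound before the step rather than hard-projecting after it; see the article.
```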

Must-listen interview by @Changxche with an ex-ByteDance AI researcher:
- Benchmaxxing
- Distillation on US models
- Poor data quality and infra
- Compute constraints
"I don’t even agree with the assumption that Chinese models are catching up — I believe we’re still far behind. I guess the gap is getting larger, very sadly." podcasts.apple.com/us/podcast/a-y…

We're open-sourcing FlashKDA — our high-performance CUTLASS-based implementation of Kimi Delta Attention kernels. Achieves 1.72×–2.22× prefill speedup over the flash-linear-attention baseline on H20, and works as a drop-in backend for flash-linear-attention. Explore on GitHub: github.com/MoonshotAI/Fla…
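
For context on what such kernels compute: Kimi Delta Attention builds on the delta rule for linear attention. Below is a naive per-token reference recurrence in pure PyTorch, showing only the basic delta rule without KDA's per-channel gating; it is not the fused CUTLASS kernel or the flash-linear-attention API.

```python
import torch

def delta_rule_reference(q, k, v, beta):
    """Naive per-token delta-rule linear attention, for semantics only.

    q, k, v: [T, d] (keys assumed L2-normalized); beta: [T], write strength in (0, 1).
    State S maps keys to values; each step erases what was stored under k_t,
    then writes v_t. KDA adds fine-grained decay gating on top of this (omitted).
    """
    T, d = q.shape
    S = torch.zeros(d, d)
    out = torch.empty_like(v)
    for t in range(T):
        kt, vt, bt = k[t], v[t], beta[t]
        S = S + bt * torch.outer(kt, vt - kt @ S)   # delta update of the state
        out[t] = q[t] @ S                           # linear-attention readout
    return out
```

Fused kernels typically evaluate the same recurrence in a chunked, parallelized form across the sequence; the loop above is only for reading off the math.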