

Chenwei Cui

@ccui42
CS PhD Student @ Kerner Lab @hannah_kerner @SCAI_ASU. I am interested in the science of machine learning.




Karpathy's Autoresearch is bottlenecked by a single GPU. We removed the bottleneck. We gave the agent access to our K8s cluster with H100s and H200s and let it provision its own GPUs. Over 8 hours:
• ~910 experiments instead of ~96 sequentially
• Discovered that scaling model width mattered more than all hparam tuning
• Taught itself to exploit heterogeneous hardware: use H200s for validation, screen ideas on H100s
Full setup and results: blog.skypilot.co/scaling-autore… @karpathy
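The linked post has the full setup; as a rough illustration, SkyPilot's Python API supports exactly this kind of programmatic provisioning. A minimal sketch of how an agent might request its own GPU per experiment (the `train.py` entrypoint and cluster names are hypothetical, not from the post):

```python
import sky

def launch_experiment(exp_id: int, accelerator: str) -> None:
    """Provision one GPU on the cluster and run a single experiment."""
    task = sky.Task(run=f"python train.py --exp-id {exp_id}")  # hypothetical entrypoint
    # Ask SkyPilot for one GPU of the given type; it schedules onto the cluster.
    task.set_resources(sky.Resources(accelerators=f"{accelerator}:1"))
    sky.launch(task, cluster_name=f"autoresearch-{exp_id}")

# Screen candidate ideas cheaply on H100s; reserve H200s for validation runs.
for i in range(3):
    launch_experiment(i, "H100")
launch_experiment(100, "H200")
```

Dispatching many such launches concurrently rather than waiting on each one is what lifts throughput from ~96 sequential runs toward ~910.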


[LG] M²RNN: Non-Linear RNNs with Matrix-Valued States for Scalable Language Modeling M Mishra, S Tan, I Stoica, J Gonzalez… [UC Berkeley & MIT-IBM Watson Lab] (2026) arxiv.org/abs/2603.14360




ByteDance also implemented attention over depth. They literally combined it with sequence attention.

Introducing Attention Residuals: Rethinking depth-wise aggregation. Residual connections have long relied on fixed, uniform accumulation. Inspired by the duality of time and depth, we introduce Attention Residuals, replacing standard depth-wise recurrence with learned, input-dependent attention over preceding layers.
🔹 Enables networks to selectively retrieve past representations, naturally mitigating dilution and hidden-state growth.
🔹 Introduces Block AttnRes, partitioning layers into compressed blocks to make cross-layer attention practical at scale.
🔹 Serves as an efficient drop-in replacement, demonstrating a 1.25x compute advantage with negligible (<2%) inference latency overhead.
🔹 Validated on the Kimi Linear architecture (48B total, 3B activated parameters), delivering consistent downstream performance gains.
Full report: github.com/MoonshotAI/Att…
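The report has the actual implementation; what follows is only a minimal PyTorch sketch of the core idea as stated above: each token's residual update attends over that token's representations from preceding layers instead of summing them uniformly. Names and shapes are illustrative, not Moonshot's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthAttentionResidual(nn.Module):
    """Sketch: replace uniform residual accumulation with learned,
    input-dependent attention over the outputs of preceding layers."""

    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim, bias=False)
        self.k = nn.Linear(dim, dim, bias=False)
        self.v = nn.Linear(dim, dim, bias=False)
        self.scale = dim ** -0.5

    def forward(self, x: torch.Tensor, history: list[torch.Tensor]) -> torch.Tensor:
        # x: current block output (B, S, D); history: preceding layer outputs.
        h = torch.stack(history, dim=2)                 # (B, S, L, D): depth axis
        q = self.q(x).unsqueeze(2)                      # (B, S, 1, D)
        k, v = self.k(h), self.v(h)                     # (B, S, L, D)
        attn = F.softmax(q @ k.transpose(-1, -2) * self.scale, dim=-1)  # (B, S, 1, L)
        residual = (attn @ v).squeeze(2)                # (B, S, D)
        return x + residual
```

In a full network each block would append its output to `history`; Block AttnRes, per the post, further compresses the history into a fixed number of block summaries so cross-layer attention stays cheap as depth grows.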


I thought about it again. The whole purpose of expanding the model dim is to diversify the communication channel between layers. The effective channel from layer $i$ to layer $j$ under each scheme:
• ResC: $I$
• HC: $c \cdot I$, where $c = H^{\mathrm{post}}_i \left(\prod_{k=i+1}^{j-1} H^{\mathrm{res}}_k\right) H^{\mathrm{pre}}_j$
• LatentMoE: $\mathrm{up}_i \, \mathrm{down}_j$
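To make the comparison concrete, here is a tiny numpy sketch of the three $i \to j$ channels, under the assumption (mine, for illustration) that the $H$ terms act as scalar mixing weights, with made-up widths d and m:

```python
import numpy as np

d, m = 8, 16          # hypothetical widths: feature dim d, latent/expanded dim m
rng = np.random.default_rng(0)

# ResC (plain residual): the layer-i -> layer-j channel is the identity.
resc = np.eye(d)

# HC (hyper-connection style): the channel is c * I, where c composes the
# mixing weights along the path: H_post_i * prod(H_res_k) * H_pre_j.
H_post_i, H_pre_j = rng.standard_normal(2)
H_res = rng.standard_normal(3)          # one weight per intermediate layer
c = H_post_i * H_res.prod() * H_pre_j
hc = c * np.eye(d)

# LatentMoE: layer i writes through up_i, layer j reads through down_j,
# so the channel is a general matrix that actually mixes features.
up_i = rng.standard_normal((d, m))
down_j = rng.standard_normal((m, d))
latent = up_i @ down_j

# HC only rescales the stream; the latent channel mixes feature dimensions.
print(np.allclose(hc, hc[0, 0] * np.eye(d)))          # True
print(np.allclose(latent, latent[0, 0] * np.eye(d)))  # False
```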

