Program Counter
@program_counter

all things toward agi

Valhalla · Joined December 2022
7.5K Following · 317 Followers
1.6K posts
Program Counter reposted
François Chollet @fchollet
I wrote Deep Learning with Python to be the definitive guide to how deep learning works and how to best make use of it. Tens of thousands of people got their career start via this book. 120,000 copies sold, and downloaded by millions more. And now it's free to read online: deeplearningwithpython.io
40 replies · 200 reposts · 1.5K likes · 83.5K views
Program Counter reposted
allbilly01 @allbilly01
I reverse-engineered the RK3588 NPU registers and integrated them into tinygrad. Next, I will document it in as much detail as my ane repo. Link in comments.
8 replies · 9 reposts · 147 likes · 11.5K views
Program Counter reposted
Oncescu Costin-Andrei @costinoncescu
Proud to share this work! We introduce the Recurrent Transformer, a new architecture that improves modeling performance and inference efficiency over standard transformers. Lots of exciting future directions. More details in this thread by Sham.
Sham Kakade @ShamKakade6

1/8 Introducing Recurrent Transformer (RT). At 300M params, RT improves validation CE over standard Transformers. The best RT model is only 6 layers, but wider at 2048 — beating deeper 12- and 24-layer Transformers by trading depth for width.

0 replies · 6 reposts · 37 likes · 3.4K views
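The depth-for-width trade in the quoted thread is easy to sanity-check with the standard back-of-the-envelope parameter count. A minimal sketch (my arithmetic, not the authors': it assumes ~12·d_model² parameters per layer, i.e. 4d² for the attention projections plus 8d² for a 4x-expanded MLP, and ignores embeddings and norms):

```python
# Rough transformer parameter count per layer: 4*d^2 (attention Q,K,V,O
# projections) + 8*d^2 (MLP with 4x expansion) = 12*d^2.
def layer_params(d_model: int) -> int:
    return 12 * d_model ** 2

# The 6-layer, width-2048 RT quoted above:
print(6 * layer_params(2048))    # 301,989,888 ~= 300M

# A 12-layer model on the same budget must be much narrower:
# 12 * 12 * d^2 ~= 3.02e8  =>  d ~= 1448
print(12 * layer_params(1448))   # 301,925,376 ~= 302M at width 1448
```

So at a fixed ~300M budget, halving the depth buys roughly a 1.4x wider model, which is the trade the thread reports paying off.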
Program Counter reposted
Keller Jordan @kellerjordan0
Modded-NanoGPT Optimization Benchmark

Hundreds of neural network optimizers have been proposed in the literature, recently including dozens citing Muon: MARS, SWAN, REG, ADANA, Newton-Muon, TrasMuon, AdaMuon, HTMuon, COSMOS, Conda, ASGO, SAGE, and Magma, to name a few. The majority of this innovation is happening in the public research community. But the community currently lacks a widely accepted, easily accessible way to compare and make sense of the deluge of methods. As a result, promising new ideas get buried, and spurious results go unchallenged.

To help address these issues, I'm releasing a new optimization benchmark. It's designed for maximum simplicity and speed: just a single file containing ~350 lines of plain PyTorch, which can complete a baseline LM training run within 20 minutes of booting up a fresh 8xH100 machine. It also works with {1,2,4}xH100 or A100. These attributes make the new benchmark more accessible than any prior work.

The rules are simple: the optimization algorithm can be changed arbitrarily, with the goal being to minimize the number of training steps needed to reach 3.28 val loss on FineWeb (the same target loss as in the main speedrun). Modifying the architecture or dataloader, on the other hand, is not allowed. Wallclock time is unlimited, in order to give a fair chance to optimizers that would need kernel work or larger scale to become wallclock-efficient.

Like the main NanoGPT speedrun, submissions are open, and new results will be publicly broadcast. Beyond just improving the step-count record, another goal of the benchmark is to collaboratively produce well-tuned baselines for as many optimizers as possible. For example, any improvement to the benchmark's best hyperparameters for AdamW would be considered a worthwhile new result.

This benchmark is not intended to be the final measure of optimizer quality across all domains. Convenient shared experimental infrastructure which covers the full space of possibilities -- across varying batch size, tokens per parameter, model scale, epoch count, and architecture -- is desirable, but far beyond the current status quo. This benchmark is only meant to be one step towards that goal.

To start the benchmark off, I've spent ~20 runs tuning baselines for Muon and AdamW. From time to time over the next few weeks, I'll add another optimizer from the literature, with my best effort at finding good hyperparameters. Researchers interested in neural network optimization are invited to join in by picking an optimizer and giving it a try on the benchmark. All optimizers are welcome, and even runs that don't necessarily have the best hyperparameters are desirable additions to the repo, because each new run adds to the collective knowledge.
[image attached]
23 replies · 115 reposts · 832 likes · 150.2K views
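For readers who want to try a submission: the rules above say only the optimizer may change. As a rough illustration of what that swap looks like in plain PyTorch (a generic sketch against the standard torch.optim.Optimizer interface; the benchmark file's actual structure and hook points may differ):

```python
import torch

class HeavyBallSGD(torch.optim.Optimizer):
    """Plain SGD with momentum, written as a stand-in for a submission."""
    def __init__(self, params, lr=0.02, momentum=0.9):
        super().__init__(params, dict(lr=lr, momentum=momentum))

    @torch.no_grad()
    def step(self):
        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is None:
                    continue
                # Momentum buffer lives in the optimizer's per-param state.
                buf = self.state[p].setdefault(
                    "momentum_buffer", torch.zeros_like(p))
                buf.mul_(group["momentum"]).add_(p.grad)
                p.add_(buf, alpha=-group["lr"])

# optimizer = HeavyBallSGD(model.parameters(), lr=0.02)  # swap in for AdamW/Muon
```

Everything else (model, dataloader, the 3.28 val-loss target on FineWeb) stays fixed; only the step rule and its hyperparameters are fair game.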
Program Counter reposted
OptimaLab @optimalab1
During neural network training, the loss landscape gets sharper until it hits a ceiling. GD pins right at the ceiling. SGD settles below it — and the gap grows as you shrink the batch. Why? We now have the answer. arxiv.org/abs/2604.21016 🧵 Blog: akyrillidis.github.io/aiowls/stochas…
[image attached]
7 replies · 61 reposts · 410 likes · 35.7K views
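The "sharpness" here is the top eigenvalue of the loss Hessian; for vanilla GD the edge-of-stability literature predicts it pins near 2/lr, which is presumably the "ceiling" meant above. A sketch of how one can track it (standard power iteration on Hessian-vector products; my code, not the paper's):

```python
import torch

def top_hessian_eigenvalue(loss, params, iters=20):
    """Estimate the largest Hessian eigenvalue of `loss` w.r.t. `params`
    by power iteration on Hessian-vector products."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    v = [torch.randn_like(p) for p in params]
    for _ in range(iters):
        norm = torch.sqrt(sum((x * x).sum() for x in v))
        v = [x / norm for x in v]
        # Hessian-vector product: grad of <grad(loss), v> w.r.t. params.
        hv = torch.autograd.grad(
            sum((g * x).sum() for g, x in zip(grads, v)),
            params, retain_graph=True)
        eig = sum((h * x).sum() for h, x in zip(hv, v))  # Rayleigh quotient
        v = [h.detach() for h in hv]
    return eig.item()

# lam = top_hessian_eigenvalue(loss_fn(model(x), y), list(model.parameters()))
# Comparing lam against 2 / lr over training exposes the GD-vs-SGD gap.
```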
Program Counter reposted
Alex Zhurkevich @cudagdb
Tomorrow: Blackwell Programming lecture by yours truly at Stanford CME213, Gates B3, 1:30–2:50 PM. Bring sharp questions.
6 replies · 9 reposts · 138 likes · 7.4K views
Program Counter reposted
wh @nrehiew_
How I read papers now. This is an explainer by Claude about the new Compressed Sparse Attention that v4 uses to compress the KV cache.
[image attached]
wh @nrehiew_

Now reading:

6 replies · 69 reposts · 700 likes · 55.5K views
Program Counter reposted
snow @snowclipsed
goddamn wow
jianlin.su @Jianlin_S

Beyond MuP: 4. Maintaining Parameter Stability kexue.fm/archives/11729 Based on the principle of minimal modification, we propose a general framework for maintaining parameter stability during training, encompassing two schemes: Post Clip and Pre Decay. Under the spectral norm, these further evolve into singular value clipping and spectral weight decay. These operations aim to ensure that critical parameter norms remain bounded while minimizing interference with training dynamics.

0 replies · 2 reposts · 27 likes · 4.1K views
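The quoted post names the two spectral-norm instantiations explicitly, and singular value clipping in particular is simple to write down. A minimal sketch of the "Post Clip" flavor as I read it (assumptions mine; not Su's code):

```python
import torch

@torch.no_grad()
def clip_singular_values(W: torch.Tensor, max_sv: float) -> torch.Tensor:
    """Clamp W's singular values so its spectral norm stays <= max_sv,
    leaving the singular vectors, and any in-budget directions, untouched."""
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    return U @ torch.diag(S.clamp(max=max_sv)) @ Vh

# "Post Clip": apply right after each optimizer step, e.g.
# for p in model.parameters():
#     if p.ndim == 2:
#         p.copy_(clip_singular_values(p, max_sv=1.0))
```

The "Pre Decay" / spectral-weight-decay scheme would presumably shrink the singular values inside the update itself rather than projecting after it.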
Program Counter reposted
Nathan Lambert @natolambert
+1. Folks interested in the Chinese llm space should listen to this.
Kyle Chan @kyleichan

Must-listen interview by @Changxche with ex-ByteDance AI researcher: - Benchmaxxing - Distillation on US models - Poor data quality and infra - Compute constraints "I don’t even agree with the assumption that Chinese models are catching up — I believe we’re still far behind. I guess the gap is getting larger, very sadly." podcasts.apple.com/us/podcast/a-y…

3 replies · 28 reposts · 311 likes · 88.7K views
Program Counter reposted
nathan chen @nathancgy4
one of the very first things i worked on after joining kimi was speeding up KDA's kernels with @yzhang_cs and @uniartisan (i got carried :D). it was super fun optimizing those triton kernels... and now comes FlashKDA, a highly efficient KDA in CUTLASS for the open community!

side note: knowing how to write a kernel matters less and less, but knowing how one actually works efficiently matters as much as ever. i rarely write kernels anymore, and instead mostly use kimi k2.6 / opus 4.5-7 to write them (far from optimized ones, simply for the sake of testing for signs of life). still, those days of trying to make algorithms as hardware-aligned as possible turned out to be special, and they shaped many of the intuitions behind the architectural designs that followed. (arch and infra are really two sides of the same coin.)

would highly recommend reading the basic flash/linear attention triton kernels in FLA (github.com/fla-org/flash-…) for anyone wanting to better understand how efficient kernels work btw
Kimi.ai @Kimi_Moonshot

We're open-sourcing FlashKDA — our high-performance CUTLASS-based implementation of Kimi Delta Attention kernels. Achieves 1.72×–2.22× prefill speedup over the flash-linear-attention baseline on H20, and works as a drop-in backend for flash-linear-attention. Explore on github: github.com/MoonshotAI/Fla…

8 replies · 27 reposts · 374 likes · 47.1K views
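For anyone following the reading recommendation above, it helps to know what those kernels compute. Here is a plain-PyTorch reference for the generic (gateless) linear-attention recurrence, the O(T) loop that FLA-style Triton/CUTLASS kernels parallelize; a reading aid only, not KDA itself and not FlashKDA's API:

```python
import torch

def linear_attention(q, k, v):
    """Naive linear attention: o_t = q_t @ S_t with S_t = sum_{s<=t} k_s v_s^T.
    Shapes: q, k are (batch, seq, d_k); v is (batch, seq, d_v)."""
    B, T, Dk = q.shape
    Dv = v.shape[-1]
    S = q.new_zeros(B, Dk, Dv)  # running state, updated rank-1 per step
    out = []
    for t in range(T):
        S = S + k[:, t, :, None] * v[:, t, None, :]
        out.append(torch.einsum("bd,bdv->bv", q[:, t], S))
    return torch.stack(out, dim=1)  # (batch, seq, d_v)
```

Roughly speaking, KDA layers a gated, delta-rule-style state update on top of this kind of recurrence; the kernel work is about computing such loops in a chunked, hardware-aligned way instead of one step at a time.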
Program Counter reposted
alex zhang @a1zhang
Incredibly well written blog on RLMs by @raw_works, highly recommend you read it! He thinks about them in a particularly intuitive way. He’s also been the main driver of recent OSS RLM results on LongCoT :)
[image attached]
7 replies · 58 reposts · 663 likes · 39.9K views
Program Counter reposted
nic @nicholaschen__
this blog by alexandr wang (alexw.substack.com/p/hire) might be written for the hiring side, but the takeaway holds for anyone looking for new opportunities (or already working somewhere): you should actually give a shit about the company and product you work for.
4 replies · 7 reposts · 161 likes · 10.6K views
Program Counter reposted
Enze Xie @xieenze_jr
🚀 Excited to share Sol-RL (Speed-of-Light RL) — a new high-efficiency preference alignment method for Diffusion RL, primarily developed by first author Yitong Li (@yitongli165665)! It uses a smart two-stage design: FP4 for ultra-fast massive rollouts and quick filtering of high-contrast samples, followed by BF16 high-precision optimization on the selected data only. Achieves up to 4.64× faster convergence while delivering better alignment results on SANA, FLUX.1 & SD3.5-L. 📄 Paper: arxiv.org/abs/2604.06916 Let's push Diffusion RL forward together! 🔥
2 replies · 15 reposts · 100 likes · 12.3K views
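The two-stage structure is the interesting part, and its outline can be sketched. A schematic, heavily hedged reading of the tweet (function names and the reward-weighted objective are my hypothetical placeholders; the real method uses a preference-alignment loss, and the FP4/BF16 precision casts are elided here):

```python
import torch

def sol_rl_step(model, prompts, reward_fn, optimizer, k=4, keep=0.25):
    # Stage 1: cheap low-precision (FP4 in the paper) rollouts, no gradients,
    # used only to score candidates. `model.generate` and `reward_fn` are
    # hypothetical placeholders.
    with torch.no_grad():
        samples = [model.generate(p) for p in prompts for _ in range(k)]
        rewards = torch.tensor([reward_fn(s) for s in samples])

    # Keep only "high-contrast" samples: the extreme reward quantiles.
    n = max(1, int(keep * len(samples) / 2))
    idx = torch.cat([rewards.topk(n).indices, (-rewards).topk(n).indices])

    # Stage 2: high-precision (BF16) optimization on the selected subset only.
    # A generic reward-weighted log-likelihood stands in for the real
    # preference-alignment objective; `model.log_prob` is hypothetical.
    logps = torch.stack([model.log_prob(samples[i]) for i in idx])
    loss = -(rewards[idx] * logps).mean()
    optimizer.zero_grad(); loss.backward(); optimizer.step()
```

The claimed speedup comes from doing the expensive, gradient-bearing pass on only the filtered fraction of rollouts.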
Program Counter reposted
Underfox @Underfox3
This paper presents Nautilus, a tensor compiler that fully automates GPU kernel generation from a program description and produces high-performance GPU kernels. arxiv.org/pdf/2604.14825
[4 images attached]
2 replies · 20 reposts · 152 likes · 8.4K views
Program Counter reposted
Madison Faulkner @maddiehfaulkner
several of my companies are hiring for founding SE/FDE roles at top early-stage infra and AI startups. ping me
13 replies · 3 reposts · 59 likes · 4.9K views
Program Counter reposted
Adam Taylor @ATaylorFPGA
I thought I would implement the very interesting arXiv paper on eml(x,y) in an FPGA. I will write this up in more detail, as it is not suitable for all FPGA applications, but it is for some. So I created a simple machine to implement an arbitrary tree length in fixed resources. arxiv.org/html/2603.2185…
[3 images attached]
12 replies · 38 reposts · 312 likes · 19.3K views