
Program Counter
@program_counter
all things toward agi


1/8 Introducing Recurrent Transformer (RT). At 300M params, RT improves validation CE over standard Transformers. The best RT model is only 6 layers, but wider at 2048 — beating deeper 12- and 24-layer Transformers by trading depth for width.
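
The thread does not spell out the recurrence, so here is a minimal sketch of one common reading: a shallow, weight-tied Transformer stack unrolled for a fixed number of steps, trading unique depth for width. All names, dimensions, and the step count are illustrative assumptions, not the authors' code.

```python
import torch.nn as nn

class Block(nn.Module):
    """Standard pre-norm Transformer block (causal masking omitted for brevity)."""
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))

class RecurrentTransformer(nn.Module):
    """Shallow-but-wide stack whose layers are reused for several recurrent steps."""
    def __init__(self, d_model=2048, n_heads=16, n_layers=6, n_steps=3):
        super().__init__()
        self.blocks = nn.ModuleList([Block(d_model, n_heads) for _ in range(n_layers)])
        self.n_steps = n_steps

    def forward(self, x):
        # The same n_layers blocks are applied n_steps times: parameter count of a
        # 6-layer model, effective depth of n_layers * n_steps.
        for _ in range(self.n_steps):
            for block in self.blocks:
                x = block(x)
        return x
```

Under this reading, parameter count stays that of the 6 unique layers while effective depth grows with the number of recurrent steps, which is one way a 6-layer, 2048-wide model could compete with deeper stacks.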

Now reading:

Beyond MuP: 4. Maintaining Parameter Stability kexue.fm/archives/11729
Based on the principle of minimal modification, we propose a general framework for maintaining parameter stability during training, encompassing two schemes: Post Clip and Pre Decay. Under the spectral norm, these further evolve into singular value clipping and spectral weight decay. These operations aim to ensure that critical parameter norms remain bounded while minimizing interference with training dynamics.
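
For intuition, a minimal sketch of singular value clipping, the spectral-norm form of Post Clip, assuming PyTorch; the function name and threshold are illustrative, not the article's code.

```python
import torch

@torch.no_grad()
def singular_value_clip(w: torch.Tensor, max_sv: float = 1.0) -> None:
    """Project a weight matrix back into the spectral-norm ball after an optimizer step.

    Clipping the singular values at max_sv bounds ||w||_2 while leaving the
    singular vectors untouched, i.e. minimal interference with training dynamics.
    """
    u, s, vh = torch.linalg.svd(w, full_matrices=False)
    w.copy_(u @ torch.diag(s.clamp(max=max_sv)) @ vh)

# Pre Decay / spectral weight decay would instead regularize toward the same
# bound before the step rather than hard-projecting after it; see the article.
```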

Must-listen interview by @Changxche with an ex-ByteDance AI researcher:
- Benchmaxxing
- Distillation on US models
- Poor data quality and infra
- Compute constraints
"I don’t even agree with the assumption that Chinese models are catching up — I believe we’re still far behind. I guess the gap is getting larger, very sadly." podcasts.apple.com/us/podcast/a-y…

We're open-sourcing FlashKDA — our high-performance CUTLASS-based implementation of Kimi Delta Attention kernels. Achieves 1.72×–2.22× prefill speedup over the flash-linear-attention baseline on H20, and works as a drop-in backend for flash-linear-attention. Explore on GitHub: github.com/MoonshotAI/Fla…
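
For context on what such kernels compute: Kimi Delta Attention builds on the delta rule for linear attention. Below is a naive per-token reference recurrence in pure PyTorch, showing only the basic delta rule without KDA's per-channel gating; it is not the fused CUTLASS kernel or the flash-linear-attention API.

```python
import torch

def delta_rule_reference(q, k, v, beta):
    """Naive per-token delta-rule linear attention, for semantics only.

    q, k, v: [T, d] (keys assumed L2-normalized); beta: [T], write strength in (0, 1).
    State S maps keys to values; each step erases what was stored under k_t,
    then writes v_t. KDA adds fine-grained decay gating on top of this (omitted).
    """
    T, d = q.shape
    S = torch.zeros(d, d)
    out = torch.empty_like(v)
    for t in range(T):
        kt, vt, bt = k[t], v[t], beta[t]
        S = S + bt * torch.outer(kt, vt - kt @ S)   # delta update of the state
        out[t] = q[t] @ S                           # linear-attention readout
    return out
```

Fused kernels typically evaluate the same recurrence in a chunked, parallelized form across the sequence; the loop above is only for reading off the math.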