William Brandon
@exists_forall

570 posts

he/him • Turning FLOPs into Claude @AnthropicAI • Prev: PhD student at MIT CSAIL; ML Compilers at NVIDIA • Opinions my own

San Francisco, CA • Joined March 2019
1.3K Following • 765 Followers
William Brandon@exists_forall·
@johannes_hage Great question. For production use cases, Striped Attention should absolutely be combined with custom attention kernels for higher performance. (At least on GPUs; on TPUs, custom kernels can be challenging.) We'd love to do this (or see someone else do it) in the future!
Johannes Hagemann@johannes_hage·
@exists_forall Amazing work! Any plans to port the JAX Striped Attention implementation to a low-level Triton/CUDA kernel for even higher MFU, and to be able to use it in torch codebases like DeepSpeed? :D
William Brandon@exists_forall·
New preprint out with colleagues from MIT! "Striped Attention: Faster Ring Attention for Causal Transformers" arxiv.org/abs/2311.09431 We introduce a simple extension to the recent Ring Attention algorithm for distributed long-context attention, and see speedups up to 1.65x! 1/
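To make the load-balancing idea concrete, here is a minimal NumPy sketch (illustrative only, not the paper's JAX implementation; the sequence length, device count, and function names are made up): under a causal mask, contiguous per-device blocks leave later devices with far more unmasked query-key pairs than earlier ones, while a striped (round-robin) assignment of tokens to devices spreads that work almost evenly.

    import numpy as np

    # Illustrative sketch: count how many unmasked (key <= query) pairs each
    # device's queries must process under two token-to-device assignments.
    seq_len, num_devices = 4096, 8

    def causal_work_per_device(token_to_device):
        work = np.zeros(num_devices, dtype=np.int64)
        for q in range(seq_len):
            # A causal query at position q attends to keys 0..q, i.e. q + 1 keys.
            work[token_to_device[q]] += q + 1
        return work

    # Ring-Attention-style contiguous blocks: device d holds tokens [d*B, (d+1)*B).
    contiguous = np.arange(seq_len) // (seq_len // num_devices)
    # Striped assignment: token i goes to device i % num_devices.
    striped = np.arange(seq_len) % num_devices

    print(causal_work_per_device(contiguous))  # heavily skewed toward later devices
    print(causal_work_per_device(striped))     # nearly uniform across devices

The total attention work is identical in both cases; striping only equalizes it across devices, which, as I read the paper's framing, is where the reported speedups for causal models come from.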
William Brandon@exists_forall·
@DachengLi177 You're right that causal load-balancing is an important part of LightSeq. I tried to address some of the points of comparison here - let me know if you still have any questions! x.com/exists_forall/…
William Brandon@exists_forall

@haozhangml Hi - glad to see you're interested in this! LightSeq is great work (we especially love that you provide an optimized, production-ready implementation!) Our understanding is that LightSeq's approach to causal load-balancing requires sending queries between devices, whereas...

Dacheng Li@DachengLi177·
@exists_forall Hi, great work! I saw you described LightSeq in the related work. Actually, rebalancing and optimizing for the causal language-modeling objective is a core part of LightSeq, but I haven't seen much discussion of that. Would you mind sharing more of your thoughts?
William Brandon@exists_forall·
@haozhangml ... Please let me know if you think there are important aspects of LightSeq that we're missing in our comparison. Thanks for your interest!
William Brandon@exists_forall·
@haozhangml ... Additionally, LightSeq partitions the sequence into blocks of contiguous tokens, like Ring Attention, and does not achieve its causal load-balancing by applying a fine-grained permutation to the input sequence like Striped Attention does. ...
William Brandon@exists_forall·
Getting this implemented, tested, and written up quickly was a great team effort with Ani Nrusimha (@Ani_nlp), Kevin Qian (@skeqiqevian), Zack Ankner (@ZackAnkner), Tian Jin (@jintian), Zoey (Zhiye) Song, and my advisor Jonathan Ragan-Kelley (@jrk). Go follow them! 9/9
William Brandon@exists_forall·
Additionally, thanks to @mcarbin for sponsoring the reading group (mlsys.ai) where we came up with this idea in the first place! 8/
William Brandon@exists_forall·
The ~0.12 exponent is asymptotically less training-compute-efficient than the Chinchilla scaling exponent of ~0.15. Probably suggests the model is overtrained to save on inference costs?
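Rough arithmetic behind that comparison (my own back-of-the-envelope, assuming the reducible loss scales as compute^(-exponent)): every 10x of compute divides the reducible loss by 10^exponent, so the two exponents imply noticeably different returns.

    # Back-of-the-envelope: reducible loss ~ compute**(-exponent), so 10x more
    # compute divides the reducible loss by 10**exponent.
    for exponent in (0.122, 0.15):
        print(exponent, 10 ** exponent)
    # 0.122 -> ~1.32x reduction in reducible loss per 10x compute
    # 0.15  -> ~1.41x reduction in reducible loss per 10x compute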
William Brandon@exists_forall·
The equation of the trendline they plot is given by bits per word = 0.27189 * compute^(-0.12200) + 1.02433, where "compute" is measured relative to the GPT-4 training run
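Plugging a few compute multiples into that trendline as quoted (my own evaluation of the formula; the 1.02433 constant reads as the irreducible asymptote the curve approaches with unlimited compute):

    # Evaluate the quoted trendline, with compute measured relative to GPT-4.
    def bits_per_word(compute):
        return 0.27189 * compute ** -0.12200 + 1.02433

    for c in (0.01, 0.1, 1.0, 10.0):
        print(c, round(bits_per_word(c), 3))
    # 0.01 -> ~1.501, 0.1 -> ~1.384, 1.0 -> ~1.296, 10.0 -> ~1.230 bits per word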