William Brandon
@exists_forall

570 posts

he/him • Turning FLOPs into Claude @AnthropicAI • Prev: PhD student at MIT CSAIL; ML Compilers at NVIDIA • Opinions my own

San Francisco, CA • Joined March 2019
1.3K Following • 765 Followers
William Brandon@exists_forall·
@johannes_hage Great question. For production use cases, Striped Attention should absolutely be combined with custom attention kernels for higher performance. (At least on GPUs; on TPUs, custom kernels can be challenging.) We'd love to do this (or see someone else do it) in the future!
Johannes Hagemann@johannes_hage·
@exists_forall Amazing work! Any plans to port the JAX Striped Attention implementation to a low-level Triton/CUDA kernel for even higher MFU, and to be able to use it in torch codebases like DeepSpeed? :D
William Brandon@exists_forall·
New preprint out with colleagues from MIT! "Striped Attention: Faster Ring Attention for Causal Transformers" arxiv.org/abs/2311.09431 We introduce a simple extension to the recent Ring Attention algorithm for distributed long-context attention, and see speedups up to 1.65x! 1/
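To make the load-balancing idea concrete, here is a minimal NumPy sketch (illustrative only, not the paper's JAX implementation; the sequence length, device count, and function names are made up): under a causal mask, contiguous per-device blocks leave later devices with far more unmasked query-key pairs than earlier ones, while a striped (round-robin) assignment of tokens to devices spreads that work almost evenly.

    import numpy as np

    # Illustrative sketch: count how many unmasked (key <= query) pairs each
    # device's queries must process under two token-to-device assignments.
    seq_len, num_devices = 4096, 8

    def causal_work_per_device(token_to_device):
        work = np.zeros(num_devices, dtype=np.int64)
        for q in range(seq_len):
            # A causal query at position q attends to keys 0..q, i.e. q + 1 keys.
            work[token_to_device[q]] += q + 1
        return work

    # Ring-Attention-style contiguous blocks: device d holds tokens [d*B, (d+1)*B).
    contiguous = np.arange(seq_len) // (seq_len // num_devices)
    # Striped assignment: token i goes to device i % num_devices.
    striped = np.arange(seq_len) % num_devices

    print(causal_work_per_device(contiguous))  # heavily skewed toward later devices
    print(causal_work_per_device(striped))     # nearly uniform across devices

The total attention work is identical in both cases; striping only equalizes it across devices, which, as I read the paper's framing, is where the reported speedups for causal models come from.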
William Brandon@exists_forall·
@DachengLi177 You're right that causal load-balancing is an important part of LightSeq. I tried to address some of the points of comparison here - let me know if you still have any questions! x.com/exists_forall/…
William Brandon@exists_forall

@haozhangml Hi - glad to see you're interested in this! LightSeq is great work (we especially love that you provide an optimized, production-ready implementation!) Our understanding is that LightSeq's approach to causal load-balancing requires sending queries between devices, whereas...

Dacheng Li@DachengLi177·
@exists_forall Hi, great work! I saw you described LightSeq in the related work. Actually, rebalancing and optimizing for the causal language-modeling objective is a core part of LightSeq, but I haven't seen much discussion of that. Would you mind sharing more of your thoughts?
William Brandon@exists_forall·
@haozhangml ... Please let me know if you think there are important aspects of LightSeq that we're missing in our comparison. Thanks for your interest!
William Brandon@exists_forall·
@haozhangml ... Additionally, LightSeq partitions the sequence into blocks of contiguous tokens, like Ring Attention, and does not achieve its causal load-balancing by applying a fine-grained permutation to the input sequence like Striped Attention does. ...
William Brandon@exists_forall·
Getting this implemented, tested, and written up quickly was a great team effort with Ani Nrusimha (@Ani_nlp), Kevin Qian (@skeqiqevian), Zack Ankner (@ZackAnkner), Tian Jin (@jintian), Zoey (Zhiye) Song, and my advisor Jonathan Ragan-Kelley (@jrk). Go follow them! 9/9
William Brandon@exists_forall·
Additionally, thanks to @mcarbin for sponsoring the reading group (mlsys.ai) where we came up with this idea in the first place! 8/
William Brandon@exists_forall·
The ~0.12 exponent is asymptotically less training-compute-efficient than the Chinchilla scaling exponent of ~0.15. Probably suggests the model is overtrained to save on inference costs?
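Rough arithmetic behind that comparison (my own back-of-the-envelope, assuming the reducible loss scales as compute^(-exponent)): every 10x of compute divides the reducible loss by 10^exponent, so the two exponents imply noticeably different returns.

    # Back-of-the-envelope: reducible loss ~ compute**(-exponent), so 10x more
    # compute divides the reducible loss by 10**exponent.
    for exponent in (0.122, 0.15):
        print(exponent, 10 ** exponent)
    # 0.122 -> ~1.32x reduction in reducible loss per 10x compute
    # 0.15  -> ~1.41x reduction in reducible loss per 10x compute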
William Brandon@exists_forall·
The equation of the trendline they plot is given by bits per word = 0.27189 * compute^(-0.12200) + 1.02433, where "compute" is measured relative to the GPT-4 training run
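Plugging a few compute multiples into that trendline as quoted (my own evaluation of the formula; the 1.02433 constant reads as the irreducible asymptote the curve approaches with unlimited compute):

    # Evaluate the quoted trendline, with compute measured relative to GPT-4.
    def bits_per_word(compute):
        return 0.27189 * compute ** -0.12200 + 1.02433

    for c in (0.01, 0.1, 1.0, 10.0):
        print(c, round(bits_per_word(c), 3))
    # 0.01 -> ~1.501, 0.1 -> ~1.384, 1.0 -> ~1.296, 10.0 -> ~1.230 bits per word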