Deepak Narayanan

402 posts

@deepakn94

Research Scientist at @nvidia. Interested in the intersection of Computer Systems and ML. Occasionally tweet about sports. Views are my own.

Bellevue, WA Joined February 2012
1.4K Following 1.4K Followers
Deepak Narayanan
Deepak Narayanan@deepakn94·
@sytelus @RajaXg FWIW, GPipe (and our contemporaneous PipeDream) were designed with GPUs in mind. TPUs have much higher all-to-all network bandwidth, so you can "get away" with using higher-communication parallelism strategies like FSDP.
English
0
1
7
517
Shital Shah
Shital Shah@sytelus·
KV cache was by Noam Shazeer while he was still at Google, I think, and likely used TPUs. A lot of distributed training and modeling improvements came from TPU based experiments from papers like GPipe. Overall, population working on GPUs far exceeds TPUs so more innovations will come from there.
English
1
2
38
6.8K
Raja Koduri
Raja Koduri@RajaXg·
Genuine question: All the breakthrough optimizations I see - KV cache, flash attention, quantization, seem to originate from CUDA/GPU land. Are TPUs innovating differently, or is my feed just GPU-biased? Would love examples of TPU-first optimization techniques that later crossed over. Drop links if you’ve got them!
English
31
40
861
101.7K
The Big Three
The Big Three@Big3Tennis·
Big Three matches I’m putting above the 2025 Roland Garros Final:
- Nadal Verdasco 2009 AO SF
- Djokovic Nadal 2018 Wimbledon SF
- Wawrinka Djokovic 2013 4R
- Djokovic Nadal 2012 AO Final
- Federer Nadal 2008 Wimbledon final
- Federer Djokovic 2019 Wimbledon final
- Djokovic Federer 2011 US Open SF
English
157
80
1.1K
145.8K
Deepak Narayanan
Deepak Narayanan@deepakn94·
@glennko It seems some combination of local attention (e.g., SWA) or Mamba that looks at a compressed representation and global attention is promising. Other work like Gemma 3 also seems to be showing similar types of results.
English
0
0
0
41
Glenn Ko
Glenn Ko@glennko·
@deepakn94 Great results. Do you think hybrid models will eventually become the norm over transformer models?
English
1
0
0
79
Victor Pontis
Victor Pontis@VictorPontis·
What do you call the pro-natural movement that encompasses skepticism of seed oils, vaccines, processed food, and plastics? Is there a common root for all of these beliefs?
English
2
0
2
754
Jordan Walton
Jordan Walton@jtwalton01·
@StoolGreenie I feel like the scenarios for clinching would be so much easier to keep track of if all the teams ended group play on the same day. Would make for better drama too
English
1
0
1
18.7K
Dan Greenberg
Dan Greenberg@StoolGreenie·
Unfortunately I’m not a math guy (went to ASU for a reason) but I think the number is still TBD. Plan as of now is to win by as many as you can and hope Magic lose I think?
Will Lundregan@WillLundregan5

@StoolGreenie Is there a certain # the Celtics need to win by tonight to be in, or is it just as much as possible to feel safe

English
12
0
37
30.5K
Deepak Narayanan retweeted
Bryan Catanzaro
Bryan Catanzaro@ctnzr·
An 8B-3.5T hybrid SSM model gets better accuracy than an 8B-3.5T transformer trained on the same dataset:
* 7% attention, the rest is Mamba2
* MMLU jumps from 50 to 53.6%
* Training efficiency is the same
* Inference cost is much less
arxiv.org/pdf/2406.07887
English
17
77
435
118.8K
Horace He
Horace He@cHHillee·
What is the most impactful ML systems work from academia published at a systems conference? For example, ASPLOS or OSDI?
English
17
5
72
30.5K
AK
AK@_akhaliq·
ByteDance presents MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs. We present the design, implementation and engineering experience in building and deploying MegaScale, a production system for training large language models (LLMs) at the scale of more than 10,000 GPUs. Training LLMs at this scale brings unprecedented challenges to training efficiency and stability. We take a full-stack approach that co-designs the algorithmic and system components across model block and optimizer design, computation and communication overlapping, operator optimization, data pipeline, and network performance tuning. Maintaining high efficiency throughout the training process (i.e., stability) is an important consideration in production given the long extent of LLM training jobs. Many hard stability issues only emerge at large scale, and in-depth observability is the key to address them. We develop a set of diagnosis tools to monitor system components and events deep in the stack, identify root causes, and derive effective techniques to achieve fault tolerance and mitigate stragglers. MegaScale achieves 55.2% Model FLOPs Utilization (MFU) when training a 175B LLM model on 12,288 GPUs, improving the MFU by 1.34x compared to Megatron-LM. We share our operational experience in identifying and fixing failures and stragglers. We hope by articulating the problems and sharing our experience from a systems perspective, this work can inspire future LLM systems research.
English
10
84
416
129.7K
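For readers unfamiliar with the metric, here is a rough sketch of how Model FLOPs Utilization is commonly estimated, and of what the quoted 1.34x improvement implies for the Megatron-LM baseline. The function names, the example inputs, and the roughly 6 FLOPs per parameter per token approximation are assumptions for illustration; the sketch does not reproduce the MegaScale paper's own accounting.

```python
def mfu(params, tokens_per_second, num_gpus, peak_flops_per_gpu):
    """Estimate Model FLOPs Utilization.

    Assumption: one forward+backward pass costs about 6 * params FLOPs
    per token (a common rule of thumb that ignores attention FLOPs).
    """
    model_flops_per_second = 6 * params * tokens_per_second
    return model_flops_per_second / (num_gpus * peak_flops_per_gpu)


def implied_baseline_mfu(improved_mfu, improvement_factor):
    """Back out the baseline MFU from an improved MFU and a relative gain."""
    return improved_mfu / improvement_factor


# Figures quoted in the tweet: 55.2% MFU, 1.34x over Megatron-LM.
baseline = implied_baseline_mfu(0.552, 1.34)
print(f"Implied Megatron-LM baseline MFU: {baseline:.1%}")
```

Under these assumptions the baseline works out to roughly 41% MFU, which gives some context for the reply that follows.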
main
main@main_horse·
@_akhaliq damn, I didn't realise megatron-lm was *that* bad.
English
2
0
12
3.6K
Victor Pontis
Victor Pontis@VictorPontis·
I think it would be good to teach your kid that the sun and planets revolve around the Earth. Then they can figure out the heliocentric model on their own. Has anyone tried something like this?
English
1
0
1
1.2K
Deepak Narayanan
Deepak Narayanan@deepakn94·
@robertnishihara @vipulved @anyscalecompute @togethercompute We wrote up some thoughts on our experience measuring inference runtimes of LLM APIs in our recent NeurIPS paper. x.com/deepakn94/stat…
Deepak Narayanan@deepakn94

Seeing a lot of discussion around the fairness of this leaderboard, so figured I would bring up our recent paper that appeared at NeurIPS (openreview.net/pdf?id=RJpAz15…) where we look into some of the issues that make it hard to compare LLM APIs. (1/n)

English
0
0
0
169
Robert Nishihara
Robert Nishihara@robertnishihara·
Hey @vipulved, just reached out over email to collaborate on this. Performance can vary significantly over the course of the day and this is a real limitation of the benchmark. We’ll start measuring performance over a horizon of time as a first step here. x.com/robertnishihar…
English
2
0
4
2.3K
Vipul Ved Prakash
Vipul Ved Prakash@vipulved·
Wow @anyscalecompute is benchmark washing their API’s terrible performance. All you need is curl and time. Same request @togethercompute 3x faster for Llama2 70B model — 72 t/s vs 23 t/s (7.04s vs 21.87s) And this model is under heavy load! Our dedicated instances are dizzying.
Anyscale@anyscalecompute

📈We’re excited to introduce the LLMPerf leaderboard: the first public and open source leaderboard for benchmarking performance of various LLM inference providers in the market. Our goal with this leaderboard is to equip users and developers with a clear understanding of the capabilities and limitations of LLM inference solutions, featuring key providers such as @replicate, @awscloud, and @togethercompute!
You can find the leaderboard here: github.com/ray-project/ll…
The LLMPerf leaderboard tracks three main metrics: time-to-first-token, inter-token latency, and success rate.
- Time-to-first-token (TTFT) measures the time it takes between the query and the first response of the provider. TTFT is especially important for interactive and streaming applications, such as chatbots.
- Inter-token latency measures the average time between consecutive tokens. This is important for applications that require the entirety of the response to be ready, like summarization tasks or agent use cases.
- Finally, success rate measures the number of successful responses where the inference API operates without errors. This measure reflects the reliability and stability of API provider.
Blog announcement: anyscale.com/blog/comparing… (1/2)

English
7
9
107
78K
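The three metrics described in the quoted announcement can be sketched from per-token arrival timestamps. This is a minimal illustration only, not LLMPerf's implementation: the function names and the timestamp inputs are hypothetical, and success rate is computed here as a fraction of requests, which is an assumption about how the count is normalized.

```python
from statistics import mean


def time_to_first_token(request_time, token_times):
    """Seconds between sending the request and receiving the first token."""
    return token_times[0] - request_time


def inter_token_latency(token_times):
    """Average gap in seconds between consecutive tokens (0.0 if fewer than two)."""
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    return mean(gaps) if gaps else 0.0


def success_rate(outcomes):
    """Fraction of requests that completed without an API error."""
    return sum(outcomes) / len(outcomes)


# Hypothetical measurements for one streamed response (seconds).
sent_at = 0.0
arrivals = [0.5, 0.6, 0.7, 0.8]

print(time_to_first_token(sent_at, arrivals))
print(inter_token_latency(arrivals))
print(success_rate([True, True, False, True]))
```

Because both latency numbers depend on the load at the moment of the request, a single measurement like this says little on its own, which is the point of the exchange above about measuring over a horizon of time.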
Deepak Narayanan
Deepak Narayanan@deepakn94·
@KabirNagrecha We had no visibility, apart from confirmation from OpenAI that `davinci` was the original 175B GPT-3 architecture. We would have loved to analyze the later (and more relevant) OpenAI models like GPT-4 and GPT-3.5 but we were handcuffed by not knowing anything about those models.
English
0
0
0
51
Kabir Nagrecha
Kabir Nagrecha@KabirNagrecha·
@deepakn94 Also good to see the highlight of the difference in compute between prefill/decoding phases & how that influences the thinking on perf measurement. Out of curiosity, how much visibility did you guys have/need reg. the davinci model for the idealized tests?
English
1
0
0
69
Deepak Narayanan
Deepak Narayanan@deepakn94·
Seeing a lot of discussion around the fairness of this leaderboard, so figured I would bring up our recent paper that appeared at NeurIPS (openreview.net/pdf?id=RJpAz15…) where we look into some of the issues that make it hard to compare LLM APIs. (1/n)
Anyscale@anyscalecompute

📈We’re excited to introduce the LLMPerf leaderboard: the first public and open source leaderboard for benchmarking performance of various LLM inference providers in the market. Our goal with this leaderboard is to equip users and developers with a clear understanding of the capabilities and limitations of LLM inference solutions, featuring key providers such as @replicate, @awscloud, and @togethercompute!
You can find the leaderboard here: github.com/ray-project/ll…
The LLMPerf leaderboard tracks three main metrics: time-to-first-token, inter-token latency, and success rate.
- Time-to-first-token (TTFT) measures the time it takes between the query and the first response of the provider. TTFT is especially important for interactive and streaming applications, such as chatbots.
- Inter-token latency measures the average time between consecutive tokens. This is important for applications that require the entirety of the response to be ready, like summarization tasks or agent use cases.
- Finally, success rate measures the number of successful responses where the inference API operates without errors. This measure reflects the reliability and stability of API provider.
Blog announcement: anyscale.com/blog/comparing… (1/2)

English
4
3
16
5.4K
Deepak Narayanan
Deepak Narayanan@deepakn94·
In doing this work, we quickly realized comparing the runtime performance of black-box LLM APIs is tricky (but also extremely important). I'm glad to see conversations around how to do this in the most fair way possible. (7/7)
English
0
0
0
392
Deepak Narayanan
Deepak Narayanan@deepakn94·
The best-case runtime is useful to gauge how far off individual runtime measurements are (e.g., what fraction of reported runtime is actually a function of current load on the system, versus fundamental computation?). (6/n)
English
1
0
0
423
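One way to read the best-case comparison in (6/n) is as a simple ratio. The sketch below assumes a best-case runtime estimate is already available for the same request; the function name and the example numbers are hypothetical and are not taken from the paper.

```python
def load_fraction(measured_runtime, best_case_runtime):
    """Fraction of a measured runtime not explained by the best-case runtime.

    Anything above the best case is attributed to load, queueing, or other
    overhead; measurements faster than the estimate are clamped to zero.
    """
    if measured_runtime <= 0:
        raise ValueError("measured_runtime must be positive")
    overhead = max(measured_runtime - best_case_runtime, 0.0)
    return overhead / measured_runtime


# Hypothetical example: a request measured at 10 s with a 4 s best case.
print(load_fraction(10.0, 4.0))
```

A value near zero suggests the measurement mostly reflects fundamental computation, while a value near one suggests it mostly reflects current load on the system.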