Deepak Narayanan

Shital Shah@sytelus·15 Sep

The KV cache came from Noam Shazeer while he was still at Google, I think, and likely used TPUs. A lot of distributed training and modeling improvements came from TPU-based experiments in papers like GPipe. Overall, the population working on GPUs far exceeds the one working on TPUs, so more innovations will come from there.

Raja Koduri@RajaXg·15 Sep

Genuine question: All the breakthrough optimizations I see - KV cache, flash attention, quantization, seem to originate from CUDA/GPU land. Are TPUs innovating differently, or is my feed just GPU-biased? Would love examples of TPU-first optimization techniques that later crossed over. Drop links if you’ve got them!

Deepak Narayanan@deepakn94·9 Jun

@Big3Tennis Djokovic Nadal 2013 RG semi

The Big Three@Big3Tennis·9 Jun

Big Three matches I’m putting above the 2025 Roland Garros Final:
- Nadal Verdasco 2009 AO SF
- Djokovic Nadal 2018 Wimbledon SF
- Wawrinka Djokovic 2013 4R
- Djokovic Nadal 2012 AO Final
- Federer Nadal 2008 Wimbledon final
- Federer Djokovic 2019 Wimbledon final
- Djokovic Federer 2011 US Open SF

Deepak Narayanan@deepakn94·22 Mar

@glennko It seems some combination of local attention (e.g., SWA) or Mamba that looks at a compressed representation and global attention is promising. Other work like Gemma 3 also seems to be showing similar types of results.

Glenn Ko@glennko·22 Mar

@deepakn94 Great results. Do you think hybrid models will eventually become the norm over transformer models?

Nemotron-H: A family of Hybrid Mamba-Transformer LLMs.
* Hybrid architecture means up to 3X faster at the same accuracy
* Trained in FP8
* Great for VLMs
* Weights and instruct versions to come soon.
research.nvidia.com/labs/adlr/nemo…

Deepak Narayanan@deepakn94·22 Mar

Excited to share some models we've been training over the last couple of months!

Bryan Catanzaro@ctnzr

Deepak Narayanan@deepakn94·3 Dec

@VictorPontis Seed oils aren’t natural?

Victor Pontis@VictorPontis·3 Dec

What do you call the pro-natural movement that encompasses skepticism of seed oils, vaccines, processed food, and plastics? Is there a common root for all of these beliefs?

Deepak Narayanan@deepakn94·30 Nov

@jtwalton01 @StoolGreenie Impossible with 5 teams a group unfortunately :/

Jordan Walton@jtwalton01·30 Nov

@StoolGreenie I feel like the scenarios for clinching would be so much easier to keep track of if all the teams ended group play on the same day. Would make for better drama too

Will Lundregan@WillLundregan5

Dan Greenberg@StoolGreenie·30 Nov

Unfortunately I’m not a math guy (went to ASU for a reason) but I think the number is still TBD. Plan as of now is to win by as many as you can and hope Magic lose I think?

@StoolGreenie Is there a certain # the Celtics need to win by tonight to be in, or is it just as much as possible to feel safe

Deepak Narayanan@deepakn94·4 Aug

@AdamHimmelsbach It’s best-of-3!

Deepak Narayanan@deepakn94·18 Jul

Thrilled to announce a new model trained jointly between @MistralAI and @nvidia!

Bryan Catanzaro@ctnzr

@MistralAI and @nvidia announce Mistral-NeMo 12B, an awesome bite-size model released under Apache 2.0 that we jointly trained. FP8 aligned checkpoint and 128k context window, great benchmark scores. blogs.nvidia.com/blog/mistral-n… mistral.ai/news/mistral-n…

Deepak Narayanan@deepakn94·14 Jun

More details here: research.nvidia.com/publication/20….

Deepak Narayanan retweeted

Bryan Catanzaro@ctnzr·13 Jun

An 8B-3.5T hybrid SSM model gets better accuracy than an 8B-3.5T transformer trained on the same dataset:
* 7% attention, the rest is Mamba2
* MMLU jumps from 50 to 53.6%
* Training efficiency is the same
* Inference cost is much less
arxiv.org/pdf/2406.07887

Deepak Narayanan@deepakn94·28 Apr

@cHHillee We published our initial pipeline parallelism paper at SOSP 2019: dl.acm.org/doi/10.1145/33…. Obviously it has found use in a totally different context than what we originally envisioned.

Horace He@cHHillee·28 Apr

What is the most impactful ML systems work from academia published at a systems conference? For example, ASPLOS or OSDI?

Deepak Narayanan@deepakn94·28 Feb

@NikkiAShah Looks cool! :)

Nikki Shah@NikkiAShah·28 Feb

Excited to finally share what we've been working on!

Y Combinator@ycombinator

YC W24's @team_patchwork supercharges your team’s communication — they replace your Slack with a ranked feed of information, AI-assisted post creation, and chat for urgent needs. ycombinator.com/launches/KVH-p… Congrats on the launch, @nikkiashah & @dhruhink!

Deepak Narayanan@deepakn94·27 Feb

@main_horse @_akhaliq This is the commit they compared to: github.com/NVIDIA/Megatro…. It is from January 11, 2023.

main@main_horse·27 Feb

@deepakn94 @_akhaliq ohhh

AK@_akhaliq·27 Feb

ByteDance presents MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs
We present the design, implementation and engineering experience in building and deploying MegaScale, a production system for training large language models (LLMs) at the scale of more than 10,000 GPUs. Training LLMs at this scale brings unprecedented challenges to training efficiency and stability. We take a full-stack approach that co-designs the algorithmic and system components across model block and optimizer design, computation and communication overlapping, operator optimization, data pipeline, and network performance tuning. Maintaining high efficiency throughout the training process (i.e., stability) is an important consideration in production given the long extent of LLM training jobs. Many hard stability issues only emerge at large scale, and in-depth observability is the key to address them. We develop a set of diagnosis tools to monitor system components and events deep in the stack, identify root causes, and derive effective techniques to achieve fault tolerance and mitigate stragglers. MegaScale achieves 55.2% Model FLOPs Utilization (MFU) when training a 175B LLM model on 12,288 GPUs, improving the MFU by 1.34x compared to Megatron-LM. We share our operational experience in identifying and fixing failures and stragglers. We hope by articulating the problems and sharing our experience from a systems perspective, this work can inspire future LLM systems research.
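
As a rough check on the numbers quoted in the abstract, the sketch below backs out the Megatron-LM baseline MFU implied by the 55.2% figure and the 1.34x improvement. The function name is hypothetical, and the result is only what the arithmetic implies, not a number reported in the tweet.

```python
def implied_baseline_mfu(reported_mfu_pct: float, improvement: float) -> float:
    """Baseline MFU (in percent) implied by a reported MFU and a relative improvement factor."""
    if improvement <= 0:
        raise ValueError("improvement factor must be positive")
    return reported_mfu_pct / improvement


# Figures quoted in the abstract: 55.2% MFU, 1.34x over Megatron-LM.
baseline = implied_baseline_mfu(55.2, 1.34)
print(f"Implied Megatron-LM baseline MFU: {baseline:.1f}%")
```

The implied baseline depends entirely on which Megatron-LM version was measured, which is the point raised in the replies.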

Deepak Narayanan@deepakn94·27 Feb

@main_horse @_akhaliq They compared to an older commit of Megatron-LM that is slower than the current version.

main@main_horse·27 Feb

@_akhaliq damn, I didn't realise megatron-lm was *that* bad.

Deepak Narayanan@deepakn94·3 Jan

@VictorPontis How about telling them the earth is flat?

Victor Pontis@VictorPontis·3 Jan

I think it would be good to teach your kid that the sun and planets revolve around the Earth. Then they can figure out the heliocentric model on their own. Has anyone tried something like this?

Deepak Narayanan@deepakn94

Deepak Narayanan@deepakn94·24 Dec

@robertnishihara @vipulved @anyscalecompute @togethercompute We wrote up some thoughts on our experience measuring inference runtimes of LLM APIs in our recent NeurIPS paper. x.com/deepakn94/stat…

Seeing a lot of discussion around the fairness of this leaderboard, so figured I would bring up our recent paper that appeared at NeurIPS (openreview.net/pdf?id=RJpAz15…) where we look into some of the issues that make it hard to compare LLM APIs. (1/n)

Robert Nishihara@robertnishihara·23 Dec

Hey @vipulved, just reached out over email to collaborate on this. Performance can vary significantly over the course of the day and this is a real limitation of the benchmark. We’ll start measuring performance over a horizon of time as a first step here. x.com/robertnishihar…

Vipul Ved Prakash@vipulved·22 Dec

Wow @anyscalecompute is benchmark washing their API’s terrible performance. All you need is curl and time. Same request @togethercompute 3x faster for Llama2 70B model — 72 t/s vs 23 t/s (7.04s vs 21.87s) And this model is under heavy load! Our dedicated instances are dizzying.

Anyscale@anyscalecompute

📈We’re excited to introduce the LLMPerf leaderboard: the first public and open source leaderboard for benchmarking performance of various LLM inference providers in the market.
Our goal with this leaderboard is to equip users and developers with a clear understanding of the capabilities and limitations of LLM inference solutions, featuring key providers such as @replicate, @awscloud, and @togethercompute!
You can find the leaderboard here: github.com/ray-project/ll…
The LLMPerf leaderboard tracks three main metrics: time-to-first-token, inter-token latency, and success rate.
- Time-to-first-token (TTFT) measures the time it takes between the query and the first response of the provider. TTFT is especially important for interactive and streaming applications, such as chatbots.
- Inter-token latency measures the average time between consecutive tokens. This is important for applications that require the entirety of the response to be ready, like summarization tasks or agent use cases.
- Finally, success rate measures the number of successful responses where the inference API operates without errors. This measure reflects the reliability and stability of API provider.
Blog announcement: anyscale.com/blog/comparing… (1/2)
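
The three metrics described in the announcement can be sketched from per-token arrival timestamps. This is a minimal illustration, not LLMPerf's actual implementation: the `RequestTrace` structure and function names are hypothetical, and success rate is assumed here to mean the fraction of requests that finished without an error.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RequestTrace:
    """Timing record for one streamed request (hypothetical structure)."""
    sent_at: float                                            # time the query was sent, in seconds
    token_times: List[float] = field(default_factory=list)    # arrival time of each token
    ok: bool = True                                           # False if the API returned an error


def time_to_first_token(trace: RequestTrace) -> Optional[float]:
    """Seconds between sending the query and receiving the first token."""
    if not trace.token_times:
        return None
    return trace.token_times[0] - trace.sent_at


def mean_inter_token_latency(trace: RequestTrace) -> Optional[float]:
    """Average gap in seconds between consecutive tokens."""
    gaps = [b - a for a, b in zip(trace.token_times, trace.token_times[1:])]
    if not gaps:
        return None
    return sum(gaps) / len(gaps)


def success_rate(traces: List[RequestTrace]) -> float:
    """Fraction of requests that completed without an error."""
    if not traces:
        raise ValueError("need at least one trace")
    return sum(1 for t in traces if t.ok) / len(traces)


# Made-up traces: one successful streamed response and one failed request.
traces = [
    RequestTrace(sent_at=0.0, token_times=[0.5, 0.6, 0.7]),
    RequestTrace(sent_at=0.0, ok=False),
]
print(time_to_first_token(traces[0]))
print(mean_inter_token_latency(traces[0]))
print(success_rate(traces))
```

All three values depend on provider load at measurement time, so single-shot numbers like these are snapshots rather than stable properties of an API.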

Deepak Narayanan@deepakn94·24 Dec

@KabirNagrecha We had no visibility, apart from confirmation from OpenAI that `davinci` was the original 175B GPT-3 architecture. We would have loved to analyze the later (and more relevant) OpenAI models like GPT-4 and GPT-3.5 but we were handcuffed by not knowing anything about those models.

Kabir Nagrecha@KabirNagrecha·24 Dec

@deepakn94 Also good to see the highlight of the difference in compute between prefill/decoding phases & how that influences the thinking on perf measurement. Out of curiosity, how much visibility did you guys have/need reg. the davinci model for the idealized tests?

Deepak Narayanan@deepakn94·23 Dec

Seeing a lot of discussion around the fairness of this leaderboard, so figured I would bring up our recent paper that appeared at NeurIPS (openreview.net/pdf?id=RJpAz15…) where we look into some of the issues that make it hard to compare LLM APIs. (1/n)

Anyscale@anyscalecompute

📈We’re excited to introduce the LLMPerf leaderboard: the first public and open source leaderboard for benchmarking performance of various LLM inference providers in the market.
Our goal with this leaderboard is to equip users and developers with a clear understanding of the capabilities and limitations of LLM inference solutions, featuring key providers such as @replicate, @awscloud, and @togethercompute!
You can find the leaderboard here: github.com/ray-project/ll…
The LLMPerf leaderboard tracks three main metrics: time-to-first-token, inter-token latency, and success rate.
- Time-to-first-token (TTFT) measures the time it takes between the query and the first response of the provider. TTFT is especially important for interactive and streaming applications, such as chatbots.
- Inter-token latency measures the average time between consecutive tokens. This is important for applications that require the entirety of the response to be ready, like summarization tasks or agent use cases.
- Finally, success rate measures the number of successful responses where the inference API operates without errors. This measure reflects the reliability and stability of API provider.
Blog announcement: anyscale.com/blog/comparing… (1/2)

Deepak Narayanan@deepakn94·23 Dec

In doing this work, we quickly realized comparing the runtime performance of black-box LLM APIs is tricky (but also extremely important). I'm glad to see conversations around how to do this in the most fair way possible. (7/7)

Deepak Narayanan@deepakn94·23 Dec

The best-case runtime is useful to gauge how far off individual runtime measurements are (e.g., what fraction of reported runtime is actually a function of current load on the system, versus fundamental computation?). (6/n)
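
One simple way to read the question in this tweet is as a ratio: whatever part of a measured runtime exceeds the best-case runtime is attributed to load and other overheads. The sketch below assumes that reading. The function name and the example numbers are made up for illustration and are not taken from the paper.

```python
def load_overhead_fraction(measured_s: float, best_case_s: float) -> float:
    """Fraction of a measured runtime not explained by the best-case runtime."""
    if measured_s <= 0:
        raise ValueError("measured runtime must be positive")
    if best_case_s < 0:
        raise ValueError("best-case runtime cannot be negative")
    # Clamp at zero in case a noisy measurement beats the estimated best case.
    overhead_s = max(measured_s - best_case_s, 0.0)
    return overhead_s / measured_s


# Hypothetical request: 10 s observed against a 4 s best-case estimate.
fraction = load_overhead_fraction(measured_s=10.0, best_case_s=4.0)
print(f"{fraction:.0%} of the measured runtime is overhead beyond the best case")
```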
