Nitin Kedia

20 posts

@nitinkedi

CS PhD Student at @UTAustin | ex @MSFTResearch @zetasuite @IITGuwahati | Systems for ML

Joined May 2023
74 Following · 16 Followers
Nitin Kedia reposted
Pratyush Kumar @pratykumar
Drop 13/14: The 30B and 105B models, benchmarks, and HF links will all come. But today it is a drop about people. About how our team of just 15 folks gave it their all to do what many doubted was doable, i.e., train usefully large, globally competitive models from scratch in India. This team of 15 has now firmly launched @sarvam into its second innings. Yes, we can! @_mohit_singla @anand_404 @kediaharshit9 @AashaySachdeva @sumanthd17 @ArpitDwivedi100 @HarveenChadha @rkal4 @sushil_khyalia @ManavSinghal157 @sohampetkar missing in the picture - @selfawareatom @AnnaUpreti Anand @MeghMakwan33973 Utkarsh
Nitin Kedia reposted
kwatra @kwatra
TokenWeave – Efficient Compute-Communication Overlap for Distributed LLM Inference. Why? Even with high-speed NVLink on an H100 DGX, communication overhead for distributed LLM inference can exceed 20%! Can we recover this overhead? (1/10)
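The overlap idea in the tweet can be sketched in plain Python: split work into chunks and keep the "communication" of one chunk in flight on a background thread while the next chunk's "compute" runs. This is a hedged illustration of the scheduling pattern only, not TokenWeave's implementation; real systems overlap collectives with kernels on separate GPU streams and copy engines, and the function names here are made up.

```python
# Hedged sketch of compute-communication overlap: while chunk i's
# (simulated) all-reduce runs on a background thread, chunk i+1's
# compute proceeds on the main thread.
from concurrent.futures import ThreadPoolExecutor

def compute(chunk):            # stand-in for a GPU kernel
    return [x * 2 for x in chunk]

def communicate(chunk):        # stand-in for an all-reduce
    return sum(chunk)

def overlapped_pipeline(chunks):
    results, pending = [], None
    with ThreadPoolExecutor(max_workers=1) as comm:
        for chunk in chunks:
            out = compute(chunk)            # compute current chunk...
            if pending is not None:         # ...while previous comm is in flight
                results.append(pending.result())
            pending = comm.submit(communicate, out)
        if pending is not None:
            results.append(pending.result())
    return results
```

With perfect overlap, total time approaches max(compute, communication) per chunk instead of their sum, which is how the >20% communication overhead could be hidden.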
Nitin Kedia reposted
Vima Gupta @vima_gupta
1/7 🧵 MoEs: A tale of expectation vs. reality
Marketing: "Only compute the expert parameters you need!"
Reality: Batch 16 requests → ALL experts activate.
At serving time (vLLM/TGI), arithmetic intensity: AI ≈ (num_tokens * top_k) / total_experts
In simpler terms: your decode arithmetic intensity scales inversely with expert count 🤔
#MoE #LLMs #ChatGPT #Claude #vllm #AI #ML
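The estimate in the thread can be made concrete with a few lines of Python, assuming the simplified formula as stated (uniform routing, tokens spread evenly across experts):

```python
def moe_decode_arithmetic_intensity(num_tokens, top_k, total_experts):
    """Approximate per-expert decode arithmetic intensity for an MoE layer,
    per the simplified estimate AI ~ (num_tokens * top_k) / total_experts.
    Each expert sees this many tokens on average, so its weights are
    amortized over that few activations."""
    return num_tokens * top_k / total_experts

# Batch of 16 decode tokens, top-2 routing, 64 experts:
# each expert processes ~0.5 tokens on average -> heavily memory-bound.
ai = moe_decode_arithmetic_intensity(16, 2, 64)
```

This is why the "only compute what you need" pitch backfires at serving time: the same batch spread over more experts means fewer tokens per expert, so each expert's weight reads are amortized over less compute.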
Nitin Kedia reposted
Amey Agrawal @agrawalamey12
@Google has silently but surely developed an edge over @OpenAI. Long context processing seems to be the key to Google's AI strategy. NotebookLM is a prime example of what long context processing can unlock. In our latest paper, we talk about how systems can be built to support multi-million-token context lengths, matching Google's capabilities. In case you missed the paper, here is the NotebookLM-generated podcast! Podcast: notebooklm.google.com/notebook/764f5… Arxiv: arxiv.org/abs/2409.17264
Nitin Kedia @nitinkedi
Are you getting the performance you paid for from your LLM provider? Benchmark it with Metron. It is one of our biggest learnings from working on LLM inference over the last year at @MSFTResearch and @gtcomputing, where we shipped Chunked Prefill at OSDI'24 and Vidur at @MLSysConf.
Amey Agrawal @agrawalamey12

🚀 Introducing Metron: Redefining LLM Serving Benchmarks! 📊 Tired of misleading metrics for LLM performance? Our new paper introduces a holistic framework that captures what really matters - the user experience! 🧠💬 github.com/project-metron… #LLM #AI #Benchmark

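The "metrics that capture user experience" idea can be sketched from per-token arrival timestamps. This is a generic illustration in the spirit of such benchmarks, not Metron's actual API; the function names and the choice of metrics (time-to-first-token and p99 time-between-tokens) are assumptions here.

```python
# Hypothetical user-facing serving metrics from per-token timestamps:
# TTFT (how long until the first token appears) and p99 TBT (the worst
# stutters between consecutive tokens), rather than raw throughput alone.

def ttft(request_start, token_times):
    """Time-to-first-token: delay the user sees before output begins."""
    return token_times[0] - request_start

def tbt_p99(token_times):
    """p99 time-between-tokens: captures stalls mid-generation."""
    gaps = sorted(b - a for a, b in zip(token_times, token_times[1:]))
    return gaps[int(0.99 * (len(gaps) - 1))]
```

The point of tail TBT over mean throughput is that a stream averaging 50 tok/s can still stall for a second mid-answer, and that stall is what the user notices.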
Nitin Kedia reposted
fly51fly @fly51fly
[LG] Vidur: A Large-Scale Simulation Framework For LLM Inference arxiv.org/abs/2405.05465
- This paper presents Vidur, a high-fidelity and easily extensible simulator for large language model (LLM) inference, along with a benchmark and search suite.
- Vidur models the performance of LLM operators using a combination of experimental profiling and predictive modeling, and evaluates end-to-end inference performance for different workloads.
- It estimates metrics like latency, throughput, model FLOPs utilization, and memory utilization with high accuracy.
- Vidur addresses challenges unique to simulating LLM inference, like finer time granularity, varying iteration times, and cascading errors.
- It uses insights like the architectural uniformity of LLMs, operator triaging, and automated profiling for parallelism strategies to achieve fidelity.
- Vidur-Search uses Vidur to automatically identify optimal cost-effective deployment configurations meeting performance constraints.
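The "experimental profiling + predictive modeling" combination the summary describes can be illustrated with a toy: profile each operator at a few input sizes, fit a simple runtime model per operator, then predict iteration latency by summing the predictions. None of these names or numbers come from Vidur itself; this is a minimal sketch of the general idea, assuming linear-in-tokens operator runtimes.

```python
# Hypothetical sketch: fit per-operator runtime models from profiled
# (input size, runtime) samples, then predict end-to-end iteration latency.

def fit_linear(xs, ys):
    """Least-squares fit of runtime ~ a*x + b from profiled samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

# Made-up profiled runtimes (ms) at a few token counts per operator.
profile = {"mlp":  ([128, 256, 512], [0.9, 1.7, 3.3]),
           "attn": ([128, 256, 512], [0.5, 1.1, 2.3])}
models = {op: fit_linear(xs, ys) for op, (xs, ys) in profile.items()}

def predict_iteration(num_tokens):
    """Predicted iteration latency: sum of per-operator predictions."""
    return sum(a * num_tokens + b for a, b in models.values())
```

The payoff of a simulator built this way is that a deployment search (tensor/pipeline parallelism, batch sizes) can be evaluated in seconds of CPU time instead of hours of GPU time.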
Nitin Kedia @nitinkedi
We at @MSFTResearch and @GeorgiaTech believe that running LLMs shouldn't be so expensive 💵 So we built a tool 🛠️ that will enable you to run them cheaper. Introducing Vidur👳🏽, the first LLM inference system simulator. #mlsys #vllm #llm #llama #gpt
Nitin Kedia reposted
Amey Agrawal @agrawalamey12
1/ LLM inference systems are like high-performance engines ⚙️—complex, powerful, and full of intricate settings. Efficiently deploying them to maximize GPU performance is a challenge typically tackled by experts at orgs like @OpenAI and @AIatMeta 🚀. 🧵