kwatra
@kwatra
20 posts
Joined January 2009
72 Following · 45 Followers
kwatra
kwatra@kwatra·
something
kwatra retweeted
Dhruv Deshmukh
Dhruv Deshmukh@DhruvDeshmukh12·
Long-context inference is hitting a wall. 🛑 As context grows, Attention becomes the villain. Why?
• Decode: Attention scales linearly (O(N)), while the rest of the model stays constant (O(1)).
• Prefill: Attention explodes quadratically (O(N²)).
Can we do better? (1/9)
[image]
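A rough FLOP-count sketch of the scaling the thread describes. The model shape and the per-token constants below are my assumptions for illustration, not numbers from the thread:

```python
# Back-of-the-envelope scaling of attention vs. the context-independent matmuls.
# Illustrative constants (assumed): a Llama-like model.
d_model = 4096
n_layers = 32
# Attention projections + MLP matmuls per token: O(1) in context length (rough 24*d^2 per layer).
dense_flops_per_token = 24 * d_model * d_model * n_layers

def decode_attention_flops(context_len: int) -> int:
    """Per new token: attend over `context_len` cached KV entries -> O(N)."""
    return 2 * 2 * d_model * context_len * n_layers      # QK^T and attn*V

def prefill_attention_flops(context_len: int) -> int:
    """Whole prompt at once: every token attends to every earlier token -> O(N^2)."""
    return 2 * 2 * d_model * context_len * context_len * n_layers // 2  # causal mask halves it

for n in (1_000, 10_000, 100_000, 1_000_000):
    ratio = decode_attention_flops(n) / dense_flops_per_token
    print(f"N={n:>9,}: decode attention / dense per token = {ratio:5.2f}x, "
          f"prefill attention = {prefill_attention_flops(n):.2e} FLOPs")
```

With these assumptions, attention is a rounding error at 1K context but tens of times larger than the rest of the model per decoded token at 1M context, and the one-off prefill cost grows a hundredfold for every 10x of context.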
kwatra retweeted
Amey Agrawal
Amey Agrawal@agrawalamey12·
The bitter lesson of AI infra: The hardest part about building faster LLM inference systems is not designing the systems, but evaluating whether they are actually faster! 🤔 This graph from a recent top systems venue paper about long-context serving shows average normalized input token latency for a trace with both short and 100K+ token requests. System X looks like a clear win: lower normalized latency and higher request rates. But normalized metrics can obscure the actual user experience: at those rates, long inputs see >2hr delays to the first token! Let's do the math! 🧮
[image]
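The kind of arithmetic the tweet is pointing at. All numbers below are assumptions chosen to show the shape of the argument, not figures from the paper in the plot:

```python
# Why an average *normalized* latency can hide a multi-hour TTFT.
# Trace (assumed): 99% short requests (1K input tokens), 1% long requests (100K input tokens).
short_latency_s, short_in_tokens = 2.0, 1_000
long_latency_s, long_in_tokens = 2 * 3600 + 60, 100_000   # the long request waits ~2h for its first token

short_norm = short_latency_s / short_in_tokens   # 0.002 s per input token
long_norm = long_latency_s / long_in_tokens      # ~0.073 s per input token

avg_norm = 0.99 * short_norm + 0.01 * long_norm
print(f"average normalized latency: {avg_norm * 1000:.1f} ms/token")   # ~2.7 ms/token
# The plot metric looks great, yet 1 in 100 users waited over two hours
# before seeing a single token.
```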
kwatra retweeted
Raja
Raja@_raja_gond·
We have released the source code and benchmarks of TokenWeave. TokenWeave speeds up distributed LLM inference via compute–communication overlap and fused AllReduce, RMSNorm, and residual addition. Code: github.com/microsoft/toke… Paper: arxiv.org/pdf/2505.11329 Try it out!
Ram Ramjee@ramaramjee

TokenWeave is the first system that almost fully hides the ~20% communication cost during inference of LLMs that are sharded in a tensor-parallel manner on H100 DGXs. Check out the thread/paper below!
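A minimal sketch of the overlap idea as described in the tweets. This is not the TokenWeave implementation: the two-chunk split, the stream handling, and the omission of the fused AllReduce+RMSNorm+residual kernel are my simplifications, and it assumes an already-initialized NCCL process group:

```python
# Illustrative compute/communication overlap for tensor-parallel inference (PyTorch).
# Split the token batch into two chunks and overlap chunk A's AllReduce with chunk B's compute.
import torch
import torch.distributed as dist

comm_stream = torch.cuda.Stream()

def layer_with_overlap(x_a, x_b, block):
    """x_a, x_b: two halves of the token batch; block: this rank's shard of a transformer block."""
    y_a = block(x_a)                                        # compute chunk A on the default stream
    comm_stream.wait_stream(torch.cuda.current_stream())    # make A's output visible to the comm stream
    with torch.cuda.stream(comm_stream):
        work_a = dist.all_reduce(y_a, async_op=True)        # reduce A on the side stream...
    y_b = block(x_b)                                        # ...while B is computed on the default stream
    work_b = dist.all_reduce(y_b, async_op=True)
    work_a.wait()
    work_b.wait()
    torch.cuda.current_stream().wait_stream(comm_stream)
    return y_a, y_b
```

The point of the split is that the AllReduce for one half of the tokens runs behind the matmuls for the other half, so the interconnect time stops showing up on the critical path.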

kwatra
kwatra@kwatra·
Small token batches enable chunked-prefill schedulers (e.g., Sarathi). (9/10)
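A toy version of what a chunked-prefill scheduler does, in the spirit of Sarathi as referenced here; the 512-token chunk budget and the batch composition are assumptions for illustration, not Sarathi's code:

```python
# Chunked prefill sketch: a long prompt is split into fixed-size chunks so each
# scheduler step can mix one prefill chunk with the ongoing decodes, keeping
# token batches small and decode latency steady.
CHUNK = 512  # assumed per-step token budget for prefill work

def schedule_step(pending_prefill_tokens: int, n_decoding_requests: int):
    """Return (prefill tokens, decode tokens) to run in this iteration."""
    prefill = min(CHUNK, pending_prefill_tokens)
    decode = n_decoding_requests          # one new token per in-flight request
    return prefill, decode

remaining = 100_000                        # a 100K-token prompt arrives
steps = 0
while remaining > 0:
    prefill, decode = schedule_step(remaining, n_decoding_requests=32)
    remaining -= prefill
    steps += 1
print(f"prompt fully prefilled after {steps} steps; the 32 in-flight decodes "
      f"advanced one token in every step instead of stalling for the whole prefill")
```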
kwatra
kwatra@kwatra·
TokenWeave – Efficient Compute-Communication Overlap for Distributed LLM Inference. Why? Even with high-speed NVLink on H100 DGX, communication overhead for distributed LLM inference can be >20%! Can we recover this overhead? (1/10)
[image]
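A rough estimate of where a >20% communication share can come from for an 8-way tensor-parallel, 70B-class model on one DGX. The bandwidth and throughput figures are ballpark assumptions, not measurements from the paper:

```python
# Assumed shapes and achievable rates, chosen only to show the proportions.
hidden, n_layers, params = 8192, 80, 70e9
tp, bytes_per_elem = 8, 2            # 8-way tensor parallelism, bf16 activations
nvlink_bw = 400e9                    # achievable per-GPU AllReduce bandwidth, bytes/s (assumed)
gpu_flops = 400e12                   # achievable bf16 throughput per GPU, FLOP/s (assumed)

# Per token: 2 AllReduces per layer over the hidden-size activation; a ring
# AllReduce moves ~2*(tp-1)/tp of the message per GPU.
comm_bytes = 2 * n_layers * hidden * bytes_per_elem * 2 * (tp - 1) / tp
comm_time = comm_bytes / nvlink_bw

# Per token: ~2 FLOPs per parameter, split across the 8 GPUs.
compute_time = (2 * params / tp) / gpu_flops

print(f"communication share ≈ {comm_time / (comm_time + compute_time):.0%} of each step")
```

With these assumptions the AllReduce traffic alone accounts for roughly a fifth of each step, which is the gap the thread proposes to hide behind compute.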
kwatra retweeted
Abhinav Dutta
Abhinav Dutta@abhinavdutta555·
🚨 Are LLM compression methods (𝘲𝘶𝘢𝘯𝘵𝘪𝘻𝘢𝘵𝘪𝘰𝘯, 𝘱𝘳𝘶𝘯𝘪𝘯𝘨, 𝘦𝘢𝘳𝘭𝘺 𝘦𝘹𝘪𝘵) too good to be true and are existing eval metrics sufficient? We've looked into it in our latest research at @MSFTResearch 🧵 (1/n) arxiv.org/abs/2407.09141
kwatra retweeted
main
main@main_horse·
[MSFT] Accuracy is Not All You Need arxiv.org/abs/2407.09141
In comparing quantized/pruned/sparsified vs 16-bit models, it:
* observes drastic flipping in correct<->wrong answer pairs, even with otherwise good accuracy
* proposes replacing eval accuracy with either KL-divergence or flips
* explains this phenomenon as a consequence of the difference in eval Top Margin for correct vs wrong answers
[4 images]
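Minimal, hedged versions of the two alternative metrics these tweets describe, written from the tweets' wording rather than the paper's code; tensor shapes and names are assumptions:

```python
# Two compression-evaluation metrics beyond accuracy: "flips" and KL-divergence.
import torch
import torch.nn.functional as F

def flip_rate(baseline_correct: torch.Tensor, compressed_correct: torch.Tensor) -> float:
    """Fraction of examples whose correctness changes (right->wrong or wrong->right)
    between the 16-bit baseline and the compressed model. Aggregate accuracy can stay
    flat even when this is large, because flips in both directions cancel out."""
    return (baseline_correct != compressed_correct).float().mean().item()

def mean_kl(baseline_logits: torch.Tensor, compressed_logits: torch.Tensor) -> float:
    """Average KL(baseline || compressed) over next-token distributions: a
    distributional distance that catches divergence accuracy misses."""
    p_log = F.log_softmax(baseline_logits, dim=-1)
    q_log = F.log_softmax(compressed_logits, dim=-1)
    return F.kl_div(q_log, p_log, log_target=True, reduction="batchmean").item()
```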
kwatra retweeted
Amey Agrawal
Amey Agrawal@agrawalamey12·
🚀 Introducing Metron: Redefining LLM Serving Benchmarks! 📊 Tired of misleading metrics for LLM performance? Our new paper introduces a holistic framework that captures what really matters - the user experience! 🧠💬 github.com/project-metron… #LLM #AI #Benchmark
kwatra retweeted
Amey Agrawal
Amey Agrawal@agrawalamey12·
Have you ever felt that @chatgpt was done generating your response, and then suddenly a burst of tokens shows up? This happens when the serving system prioritizes someone else's request before generating your response. But why? Well, to reduce cost. 🧵
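One toy way to see how that scheduling choice turns into a pause-then-burst on the client. The prefill-first policy and the millisecond costs here are my assumptions for illustration, not numbers from the thread:

```python
# Toy timeline: your request is mid-decode when a new request arrives and the
# scheduler runs that request's full prefill first, so your stream goes quiet
# and the buffered tokens then render as a sudden burst.
decode_step_ms, other_prefill_ms = 30, 3000    # assumed costs
timeline, t = [], 0
for step in range(20):
    if step == 5:                              # new request arrives: scheduler preempts your decode
        t += other_prefill_ms                  # your stream is silent during this prefill
    t += decode_step_ms
    timeline.append(t)

gaps = [b - a for a, b in zip(timeline, timeline[1:])]
print("inter-token gaps (ms):", gaps)          # 30, 30, ..., 3030, 30, 30, ...
```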
kwatra retweeted
Amey Agrawal
Amey Agrawal@agrawalamey12·
Ever wondered why @OpenAI charges 2x the price for output tokens compared to input? It turns out that an output token can take up to 200x more compute time than an input token. Why? We explored this phenomenon during my internship at @MSFTResearch. 🧵
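A back-of-the-envelope sketch of why the per-token gap is so large (all shapes, bandwidths, and utilization numbers below are assumptions, not the thread's figures): prefill packs thousands of input tokens into one compute-bound pass, while each output token needs its own memory-bound pass that re-reads the weights.

```python
# Rough per-token GPU-time comparison of prefill (input) vs. decode (output).
params, bytes_per_param = 70e9, 2          # 70B model in bf16 (assumed)
n_gpus, peak_flops, mfu = 8, 990e12, 0.5   # H100 node, assumed prefill utilization
hbm_bw = 3.35e12                           # HBM bandwidth per GPU, bytes/s
decode_batch = 1                           # worst case: a single stream decoding alone

# Input token: ~2 FLOPs per parameter, amortized at high utilization across the node.
t_input = 2 * params / (n_gpus * peak_flops * mfu)

# Output token: one full weight sweep over HBM per step, shared by `decode_batch` tokens.
t_output = (params * bytes_per_param) / (n_gpus * hbm_bw) / decode_batch

print(f"input token:  {t_input * 1e6:7.1f} us of GPU time")
print(f"output token: {t_output * 1e6:7.1f} us of GPU time (~{t_output / t_input:.0f}x)")
```

With these assumed numbers the gap comes out around 150x; with a smaller model shard per GPU or lower prefill utilization it stretches toward the "up to 200x" the tweet quotes.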