kwatra
@kwatra
20 posts
Joined January 2009
72 Following · 45 Followers
kwatra
kwatra@kwatra·
something
kwatra retweeted
Dhruv Deshmukh
Dhruv Deshmukh@DhruvDeshmukh12·
Long-context inference is hitting a wall. 🛑 As context grows, Attention becomes the villain. Why?
• Decode: Attention scales linearly (O(N)), while the rest of the model stays constant (O(1)).
• Prefill: Attention explodes quadratically (O(N²)).
Can we do better? (1/9)
[image]
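A rough FLOP-count sketch of the scaling the thread describes. The model shape and the per-token constants below are my assumptions for illustration, not numbers from the thread:

```python
# Back-of-the-envelope scaling of attention vs. the context-independent matmuls.
# Illustrative constants (assumed): a Llama-like model.
d_model = 4096
n_layers = 32
# Attention projections + MLP matmuls per token: O(1) in context length (rough 24*d^2 per layer).
dense_flops_per_token = 24 * d_model * d_model * n_layers

def decode_attention_flops(context_len: int) -> int:
    """Per new token: attend over `context_len` cached KV entries -> O(N)."""
    return 2 * 2 * d_model * context_len * n_layers      # QK^T and attn*V

def prefill_attention_flops(context_len: int) -> int:
    """Whole prompt at once: every token attends to every earlier token -> O(N^2)."""
    return 2 * 2 * d_model * context_len * context_len * n_layers // 2  # causal mask halves it

for n in (1_000, 10_000, 100_000, 1_000_000):
    ratio = decode_attention_flops(n) / dense_flops_per_token
    print(f"N={n:>9,}: decode attention / dense per token = {ratio:5.2f}x, "
          f"prefill attention = {prefill_attention_flops(n):.2e} FLOPs")
```

With these assumptions, attention is a rounding error at 1K context but tens of times larger than the rest of the model per decoded token at 1M context, and the one-off prefill cost grows a hundredfold for every 10x of context.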
kwatra retweeted
Amey Agrawal
Amey Agrawal@agrawalamey12·
The bitter lesson of AI infra: The hardest part about building faster LLM inference systems is not designing the systems, but evaluating whether they are actually faster! 🤔 This graph from a recent top systems venue paper about long-context serving shows average normalized input token latency for a trace with both short and 100K+ token requests. System X looks like a clear win: lower normalized latency and higher request rates. But normalized metrics can obscure the actual user experience: at those rates, long inputs see >2hr delays to the first token! Let's do the math! 🧮
[image]
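The kind of arithmetic the tweet is pointing at. All numbers below are assumptions chosen to show the shape of the argument, not figures from the paper in the plot:

```python
# Why an average *normalized* latency can hide a multi-hour TTFT.
# Trace (assumed): 99% short requests (1K input tokens), 1% long requests (100K input tokens).
short_latency_s, short_in_tokens = 2.0, 1_000
long_latency_s, long_in_tokens = 2 * 3600 + 60, 100_000   # the long request waits ~2h for its first token

short_norm = short_latency_s / short_in_tokens   # 0.002 s per input token
long_norm = long_latency_s / long_in_tokens      # ~0.073 s per input token

avg_norm = 0.99 * short_norm + 0.01 * long_norm
print(f"average normalized latency: {avg_norm * 1000:.1f} ms/token")   # ~2.7 ms/token
# The plot metric looks great, yet 1 in 100 users waited over two hours
# before seeing a single token.
```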
kwatra retweeted
Raja
Raja@_raja_gond·
We have released the source code and benchmarks of TokenWeave. TokenWeave speeds up distributed LLM inference via compute–communication overlap and fused AllReduce, RMSNorm, and residual addition. Code: github.com/microsoft/toke… Paper: arxiv.org/pdf/2505.11329 Try it out!
Ram Ramjee@ramaramjee

TokenWeave is the first system that almost fully hides the ~20% communication cost during inference of LLMs that are sharded in a tensor-parallel manner on H100 DGXs. Check out the thread/paper below!
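A minimal sketch of the overlap idea as described in the tweets. This is not the TokenWeave implementation: the two-chunk split, the stream handling, and the omission of the fused AllReduce+RMSNorm+residual kernel are my simplifications, and it assumes an already-initialized NCCL process group:

```python
# Illustrative compute/communication overlap for tensor-parallel inference (PyTorch).
# Split the token batch into two chunks and overlap chunk A's AllReduce with chunk B's compute.
import torch
import torch.distributed as dist

comm_stream = torch.cuda.Stream()

def layer_with_overlap(x_a, x_b, block):
    """x_a, x_b: two halves of the token batch; block: this rank's shard of a transformer block."""
    y_a = block(x_a)                                        # compute chunk A on the default stream
    comm_stream.wait_stream(torch.cuda.current_stream())    # make A's output visible to the comm stream
    with torch.cuda.stream(comm_stream):
        work_a = dist.all_reduce(y_a, async_op=True)        # reduce A on the side stream...
    y_b = block(x_b)                                        # ...while B is computed on the default stream
    work_b = dist.all_reduce(y_b, async_op=True)
    work_a.wait()
    work_b.wait()
    torch.cuda.current_stream().wait_stream(comm_stream)
    return y_a, y_b
```

The point of the split is that the AllReduce for one half of the tokens runs behind the matmuls for the other half, so the interconnect time stops showing up on the critical path.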

kwatra
kwatra@kwatra·
Small token batches enable chunked-prefill schedulers (e.g., Sarathi). (9/10)
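A toy version of what a chunked-prefill scheduler does, in the spirit of Sarathi as referenced here; the 512-token chunk budget and the batch composition are assumptions for illustration, not Sarathi's code:

```python
# Chunked prefill sketch: a long prompt is split into fixed-size chunks so each
# scheduler step can mix one prefill chunk with the ongoing decodes, keeping
# token batches small and decode latency steady.
CHUNK = 512  # assumed per-step token budget for prefill work

def schedule_step(pending_prefill_tokens: int, n_decoding_requests: int):
    """Return (prefill tokens, decode tokens) to run in this iteration."""
    prefill = min(CHUNK, pending_prefill_tokens)
    decode = n_decoding_requests          # one new token per in-flight request
    return prefill, decode

remaining = 100_000                        # a 100K-token prompt arrives
steps = 0
while remaining > 0:
    prefill, decode = schedule_step(remaining, n_decoding_requests=32)
    remaining -= prefill
    steps += 1
print(f"prompt fully prefilled after {steps} steps; the 32 in-flight decodes "
      f"advanced one token in every step instead of stalling for the whole prefill")
```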
kwatra
kwatra@kwatra·
TokenWeave – Efficient Compute-Communication Overlap for Distributed LLM Inference. Why? Even with high-speed NVLink on H100 DGX, communication overhead for distributed LLM inference can be >20%! Can we recover this overhead? (1/10)
[image]
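A rough estimate of where a >20% communication share can come from for an 8-way tensor-parallel, 70B-class model on one DGX. The bandwidth and throughput figures are ballpark assumptions, not measurements from the paper:

```python
# Assumed shapes and achievable rates, chosen only to show the proportions.
hidden, n_layers, params = 8192, 80, 70e9
tp, bytes_per_elem = 8, 2            # 8-way tensor parallelism, bf16 activations
nvlink_bw = 400e9                    # achievable per-GPU AllReduce bandwidth, bytes/s (assumed)
gpu_flops = 400e12                   # achievable bf16 throughput per GPU, FLOP/s (assumed)

# Per token: 2 AllReduces per layer over the hidden-size activation; a ring
# AllReduce moves ~2*(tp-1)/tp of the message per GPU.
comm_bytes = 2 * n_layers * hidden * bytes_per_elem * 2 * (tp - 1) / tp
comm_time = comm_bytes / nvlink_bw

# Per token: ~2 FLOPs per parameter, split across the 8 GPUs.
compute_time = (2 * params / tp) / gpu_flops

print(f"communication share ≈ {comm_time / (comm_time + compute_time):.0%} of each step")
```

With these assumptions the AllReduce traffic alone accounts for roughly a fifth of each step, which is the gap the thread proposes to hide behind compute.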
kwatra retweeted
Abhinav Dutta
Abhinav Dutta@abhinavdutta555·
🚨 Are LLM compression methods (𝘲𝘶𝘢𝘯𝘵𝘪𝘻𝘢𝘵𝘪𝘰𝘯, 𝘱𝘳𝘶𝘯𝘪𝘯𝘨, 𝘦𝘢𝘳𝘭𝘺 𝘦𝘹𝘪𝘵) too good to be true and are existing eval metrics sufficient? We've looked into it in our latest research at @MSFTResearch 🧵 (1/n) arxiv.org/abs/2407.09141
kwatra retweeted
main
main@main_horse·
[MSFT] Accuracy is Not All You Need arxiv.org/abs/2407.09141
In comparing quantized/pruned/sparsified vs 16-bit models, it:
* observes drastic flipping in correct<->wrong answer pairs, even with otherwise good accuracy
* proposes replacing eval accuracy with either KL-divergence or flips
* explains this phenomenon as a consequence of the difference in eval Top Margin for correct vs wrong answers
[4 images]
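Minimal, hedged versions of the two alternative metrics these tweets describe, written from the tweets' wording rather than the paper's code; tensor shapes and names are assumptions:

```python
# Two compression-evaluation metrics beyond accuracy: "flips" and KL-divergence.
import torch
import torch.nn.functional as F

def flip_rate(baseline_correct: torch.Tensor, compressed_correct: torch.Tensor) -> float:
    """Fraction of examples whose correctness changes (right->wrong or wrong->right)
    between the 16-bit baseline and the compressed model. Aggregate accuracy can stay
    flat even when this is large, because flips in both directions cancel out."""
    return (baseline_correct != compressed_correct).float().mean().item()

def mean_kl(baseline_logits: torch.Tensor, compressed_logits: torch.Tensor) -> float:
    """Average KL(baseline || compressed) over next-token distributions: a
    distributional distance that catches divergence accuracy misses."""
    p_log = F.log_softmax(baseline_logits, dim=-1)
    q_log = F.log_softmax(compressed_logits, dim=-1)
    return F.kl_div(q_log, p_log, log_target=True, reduction="batchmean").item()
```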
kwatra retweeted
Amey Agrawal
Amey Agrawal@agrawalamey12·
🚀 Introducing Metron: Redefining LLM Serving Benchmarks! 📊 Tired of misleading metrics for LLM performance? Our new paper introduces a holistic framework that captures what really matters - the user experience! 🧠💬 github.com/project-metron… #LLM #AI #Benchmark
kwatra retweeted
Amey Agrawal
Amey Agrawal@agrawalamey12·
Have you ever felt that @chatgpt was done generating your response, and then suddenly a burst of tokens shows up? This happens when the serving system prioritizes someone else's request before generating your response. But why? Well, to reduce cost. 🧵
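One toy way to see how that scheduling choice turns into a pause-then-burst on the client. The prefill-first policy and the millisecond costs here are my assumptions for illustration, not numbers from the thread:

```python
# Toy timeline: your request is mid-decode when a new request arrives and the
# scheduler runs that request's full prefill first, so your stream goes quiet
# and the buffered tokens then render as a sudden burst.
decode_step_ms, other_prefill_ms = 30, 3000    # assumed costs
timeline, t = [], 0
for step in range(20):
    if step == 5:                              # new request arrives: scheduler preempts your decode
        t += other_prefill_ms                  # your stream is silent during this prefill
    t += decode_step_ms
    timeline.append(t)

gaps = [b - a for a, b in zip(timeline, timeline[1:])]
print("inter-token gaps (ms):", gaps)          # 30, 30, ..., 3030, 30, 30, ...
```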
kwatra retweeted
Amey Agrawal
Amey Agrawal@agrawalamey12·
Ever wondered why @OpenAI charges 2x the price for output tokens compared to input? It turns out that an output token can take up to 200x more compute time than an input token. Why? We explored this phenomenon during my internship at @MSFTResearch. 🧵
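A back-of-the-envelope sketch of why the per-token gap is so large (all shapes, bandwidths, and utilization numbers below are assumptions, not the thread's figures): prefill packs thousands of input tokens into one compute-bound pass, while each output token needs its own memory-bound pass that re-reads the weights.

```python
# Rough per-token GPU-time comparison of prefill (input) vs. decode (output).
params, bytes_per_param = 70e9, 2          # 70B model in bf16 (assumed)
n_gpus, peak_flops, mfu = 8, 990e12, 0.5   # H100 node, assumed prefill utilization
hbm_bw = 3.35e12                           # HBM bandwidth per GPU, bytes/s
decode_batch = 1                           # worst case: a single stream decoding alone

# Input token: ~2 FLOPs per parameter, amortized at high utilization across the node.
t_input = 2 * params / (n_gpus * peak_flops * mfu)

# Output token: one full weight sweep over HBM per step, shared by `decode_batch` tokens.
t_output = (params * bytes_per_param) / (n_gpus * hbm_bw) / decode_batch

print(f"input token:  {t_input * 1e6:7.1f} us of GPU time")
print(f"output token: {t_output * 1e6:7.1f} us of GPU time (~{t_output / t_input:.0f}x)")
```

With these assumed numbers the gap comes out around 150x; with a smaller model shard per GPU or lower prefill utilization it stretches toward the "up to 200x" the tweet quotes.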