Jayashree Mohan

76 posts

@jayashree2912

Researcher | Microsoft Research India

Bengaluru, India · Joined November 2015
351 Following · 663 Followers
Jayashree Mohan retweeted
kwatra @kwatra
TokenWeave – Efficient Compute-Communication Overlap for Distributed LLM Inference. Why? Even with high-speed NVLink on an H100 DGX, communication overhead for distributed LLM inference can be >20%! Can we recover this overhead? (1/10)
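The core trick behind this kind of work is hiding collective-communication latency behind compute that does not depend on its result. A minimal PyTorch sketch of that general pattern (illustrative only, not TokenWeave's actual mechanism; the function and variable names are hypothetical):

```python
# General compute-communication overlap pattern with an async all-reduce.
import torch
import torch.distributed as dist

def overlapped_step(partial_sum, other_input, weight):
    # Launch the all-reduce without blocking: NVLink moves bytes in the
    # background while the GPU keeps computing.
    handle = dist.all_reduce(partial_sum, op=dist.ReduceOp.SUM, async_op=True)

    # Run work that does not depend on the reduced tensor.
    projected = other_input @ weight

    # Synchronize only at the point where the reduced result is needed.
    handle.wait()
    return partial_sum + projected
```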
Jayashree Mohan @jayashree2912
Check out this exciting recent work from our group: it shows why legacy eval metrics like accuracy don't fully capture the behavior of compressed LLM models, and which metrics you should actually look out for! Read the details here...
Abhinav Dutta @abhinavdutta555

🚨 Are LLM compression methods (quantization, pruning, early exit) too good to be true, and are existing eval metrics sufficient? We've looked into it in our latest research at @MSFTResearch 🧵 (1/n) arxiv.org/abs/2407.09141
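The thread's core claim is that aggregate accuracy can stay flat while the compressed model's behavior drifts on individual examples. A hedged sketch of two checks in that spirit, a flip rate and a KL-divergence measure (my own minimal versions, not the paper's exact definitions; the input names are hypothetical):

```python
# Two divergence-style checks that go beyond raw accuracy.
import torch.nn.functional as F

def flip_rate(baseline_preds, compressed_preds):
    # Fraction of examples whose predicted answer changes after compression;
    # this can be large even when overall accuracy barely moves.
    changed = sum(b != c for b, c in zip(baseline_preds, compressed_preds))
    return changed / len(baseline_preds)

def mean_kl(baseline_logits, compressed_logits):
    # Average KL divergence between the two models' output distributions.
    p = F.log_softmax(baseline_logits, dim=-1)
    q = F.log_softmax(compressed_logits, dim=-1)
    return F.kl_div(q, p, reduction="batchmean", log_target=True)
```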

Jayashree Mohan @jayashree2912
Conventional latency and throughput metrics are insufficient for capturing user-facing performance in interactive LLM applications. Introducing our effort, Metron: a benchmark with a fluidity-index metric and fluid token generation rate that capture user experience!
Amey Agrawal @agrawalamey12

🚀 Introducing Metron: Redefining LLM Serving Benchmarks! 📊 Tired of misleading metrics for LLM performance? Our new paper introduces a holistic framework that captures what really matters - the user experience! 🧠💬 github.com/project-metron… #LLM #AI #Benchmark
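Intuitively, a fluidity-style metric asks whether each token arrived by its deadline, rather than what the average latency was. A minimal sketch in that spirit (the precise fluidity-index definition is in the Metron paper; the deadline values below are hypothetical):

```python
# Deadline-based view of streaming quality: fraction of tokens on schedule.
def fluidity_fraction(token_timestamps, ttft_deadline=2.0, tbt_deadline=0.05):
    # token_timestamps: arrival time (seconds) of each output token,
    # measured from request submission.
    if not token_timestamps:
        return 1.0
    deadlines = [ttft_deadline + i * tbt_deadline
                 for i in range(len(token_timestamps))]
    met = sum(t <= d for t, d in zip(token_timestamps, deadlines))
    return met / len(token_timestamps)
```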

Jayashree Mohan retweeted
Alexey Tumanov @alsched
[proud advisor moment] Happy to see @agrawalamey12 work in close collaboration with MSR-India already enjoying adoption and impact at @anyscalecompute! Please refer to Sarathi-Serve (arxiv.org/abs/2403.02310) for the better reference and coverage of the chunked prefill mechanism.
Robert Nishihara @robertnishihara

One of @vllm_project's strengths is that it exposes the ability to trade off latency and throughput. However, higher qps regimes cause significant latency degradation. The underlying reason has to do with inference taking place in two stages: prefilling (processing the input context) and decoding (generating output tokens). Normally in vLLM, these stages don't happen at the same time and so an incoming request will trigger prefill computations that interrupt ongoing decoding (leading to a spike in latency). Chunked prefill breaks prefill computations into multiple "chunks" and batches them along with ongoing decoding computations to avoid interruptions. In a low qps regime, this makes no difference, but in the high qps regime with constant interrupts, this is a big deal. More details in this paper arxiv.org/pdf/2308.16369 by @agrawalamey12. Also in this RFC github.com/vllm-project/v….
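The mechanism Robert describes is easy to sketch: cap the number of prefill tokens admitted per scheduler iteration, and always carry the ongoing decodes along so they are never interrupted. A minimal illustration (not vLLM's or Sarathi-Serve's actual scheduler; the field names and token budget are hypothetical):

```python
# Chunked prefill, sketched: each batch mixes all ongoing decodes (one token
# each) with at most CHUNK_SIZE tokens of pending prefill work.
CHUNK_SIZE = 512  # prefill token budget per iteration

def build_batch(decode_requests, prefill_requests):
    # Decodes always ride along, so generation never stalls on a new prompt.
    batch = [(req, 1) for req in decode_requests]

    budget = CHUNK_SIZE
    for req in prefill_requests:
        if budget == 0:
            break
        # Admit only a slice of the prompt this iteration; a long prefill is
        # spread across many iterations instead of monopolizing one.
        take = min(budget, req.remaining_prompt_tokens)
        batch.append((req, take))
        budget -= take
    return batch
```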

Jayashree Mohan @jayashree2912
Talk to @agrawalamey12 or @nitinkedi at #osdi24 to learn more about our recent work, Sarathi-Serve: a scheduler that addresses the latency-throughput tradeoff in LLM serving.
Amey Agrawal @agrawalamey12

Did you ever feel that @chatgpt is done generating your response, and then suddenly a burst of tokens shows up? This happens when the serving system prioritizes someone else's request before generating your response. But why? Well, to reduce cost. 🧵

Jayashree Mohan @jayashree2912
Check out our recent work on speeding up token generation for LLM inference. See the thread below to learn more! The full paper is now up on arXiv: arxiv.org/pdf/2308.16369…
Amey Agrawal @agrawalamey12

Ever wondered why @OpenAI charges 2x the price for output tokens compared to input? Turns out that an output token can take up to 200x more compute time than an input token. Why? We explored this phenomenon during my internship at @MSFTResearch. 🧵
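The asymmetry comes from prefill amortizing one parallel pass over the whole prompt, while decode pays a full, memory-bound forward pass per generated token. A back-of-envelope sketch, with numbers invented purely to land on the tweet's 200x figure:

```python
# Cost per input token vs. per output token (all numbers are made up).
prompt_tokens = 1000
prefill_pass_ms = 100.0   # one parallel pass over all 1000 prompt tokens
decode_pass_ms = 20.0     # one memory-bound pass per generated token

cost_per_input_token = prefill_pass_ms / prompt_tokens  # 0.1 ms
cost_per_output_token = decode_pass_ms                  # 20 ms
print(cost_per_output_token / cost_per_input_token)     # 200.0
```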

Jayashree Mohan @jayashree2912
NEW! Spring deadline: EuroSys 2023. Consider submitting your work to the Spring deadline (May 11: abstract, May 18: full paper).
ACM SIGOPS @ACMSIGOPS

#CFP @EuroSys_conf 2023 will take place in Rome, #Italy. EuroSys adopts a new dual-deadline format. The first of the two deadlines (Spring & Fall) is 11-May-2022 (for the abstract). The CFP and other details can be found at 2023.eurosys.org/cfp.html#cfp

Jayashree Mohan @jayashree2912
The @UTSASLab family 🤗 Successfully defended my PhD thesis today. Mixed feelings! Going to miss the enthusiasm and energy of this lab more than anything! Can’t believe time flew by in the blink of an eye! We missed you @SuprShastri!
Jayashree Mohan @jayashree2912
Woohoo! Back after a long break on Twitter! You now know what I was up to 😅 The past 5 years have been a beautiful experience, and it’s hard to bid goodbye. So fortunate to have a caring and supportive advisor @vj_chidambaram and group @UTSASLab
Round Rock, TX 🇺🇸
Debolina Chatterjee @Debolin17031191
@jayashree2912 @ShashidharNanj4 Hi Jayashree, if you don't mind, may I ask how you are getting updates on the availability? This would help me a lot, since I wouldn't have to check the account so many times that it gets frozen. Please help. Thanks for your prompt help so far.
Jayashree Mohan @jayashree2912
@Debolin17031191 @ShashidharNanj4 Oh no, sorry about that! I think all slots are gone. There were 50+ slots when I checked. If your account has been locked, maybe you should try after 48-72 hours.
Debolina Chatterjee @Debolin17031191
@jayashree2912 @ShashidharNanj4 Thanks a ton. Hey, I tried Delhi, but it says error: this page has been viewed the maximum number of times!!!! Ufff, I am so frustrated. Are they still available? Please let me know if you have any info. After how long should I try again?