Alexey Tumanov
@alsched

250 posts

Associate Professor @gatech_scs @gtcomputing | PostDoc @Berkeley_EECS @ucbrise | ML/LLM Systems: cloud & edge.

Atlanta, GA · Joined December 2012
300 Following · 561 Followers

Pinned Tweet
Alexey Tumanov@alsched·
Thank you for pointing out the inadequacy of current SOTA implementations of long-context support. Our team has demonstrated the superiority of the chunked prefill mechanism for enabling preemptive scheduling in support of long-context requests (10M tokens): arxiv.org/abs/2409.17264
Lianmin Zheng@lm_zheng

Chunked pipeline parallelism is arguably the most general and scalable system technique for accelerating super-long-context inference. It remains underrated today, largely because there still isn’t a strong, high-quality open-source implementation. The SGLang team recently fully optimized it and published a detailed blog post explaining all the key details.
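The chunked-prefill scheduling idea in this thread can be sketched as a toy scheduler. Everything below (the names, the 512-token chunk budget, the round-robin decode pass) is an illustrative assumption, not the Medha or SGLang implementation:

```python
# Toy sketch of chunked prefill scheduling: a long prompt's prefill is split
# into fixed-size chunks so decode steps of other requests can run between
# chunks instead of stalling behind one very long prefill.
from collections import deque
from dataclasses import dataclass

CHUNK = 512  # prefill tokens processed per scheduler step (assumed budget)

@dataclass
class Request:
    rid: str
    prompt_len: int
    prefilled: int = 0

def schedule(prefill_requests, decode_queue):
    """Yield (rid, kind, n_tokens) scheduler decisions."""
    prefills = deque(prefill_requests)
    while prefills:
        req = prefills.popleft()
        n = min(CHUNK, req.prompt_len - req.prefilled)
        req.prefilled += n
        yield (req.rid, "prefill", n)
        # Preemption point: every waiting decode takes one step
        # before the next prefill chunk runs.
        for rid in decode_queue:
            yield (rid, "decode", 1)
        if req.prefilled < req.prompt_len:
            prefills.append(req)  # re-queue the unfinished prefill

trace = list(schedule([Request("long", 1200)], ["short-a", "short-b"]))
```

Because the long prefill yields back to the scheduler after every chunk, the short requests' decode steps never wait behind the full 1200-token prefill.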

Alexey Tumanov retweeted
Amey Agrawal@agrawalamey12·
LLM inference is quickly becoming a global infrastructure problem. Leveraging AI agents to optimize these systems is the natural way to accelerate their development. But AI-driven optimization requires a fast, cheap, and accurate evaluation mechanism. 🧵
Ion Stoica@istoica05

Systems research is changing: AI is enabling white-box algorithm discovery, not just black-box tuning. Our new ADRS paper evaluates three frameworks and shows consistent gains over human designs—up to 13× speedup in load balancing for MoE experts and 35% lower cloud costs across different cloud regions.

Alexey Tumanov@alsched·
Academic credit for chunked prefill keeps getting missed by those reimplementing it in their own systems. arxiv.org/abs/2409.17264 demonstrated the superiority of the chunked prefill mechanism for enabling preemptive scheduling of long-context requests (up to 10M tokens).
Amey Agrawal@agrawalamey12

Chunked pipeline parallelism provides two critical advantages: 1. It scales ridiculously well: you can get 85%+ efficiency even at pp=32 on 128 H100s! 2. It supports preemption, so you avoid the terrible convoy effect that happens with long context. 10x+ improvement over CP 1/2
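The 85%+ efficiency figure is consistent with the standard fill/drain bubble model for pipelined execution. A back-of-envelope sketch (the 4K-token chunk size is an assumed value for illustration, not a number from the paper):

```python
# Fill/drain bubble model for chunked pipeline parallelism: with S pipeline
# stages and M prompt chunks, steady-state utilization is roughly
# M / (M + S - 1), since the pipeline spends about S - 1 chunk-times
# filling and draining.

def pipeline_efficiency(num_chunks: int, num_stages: int) -> float:
    return num_chunks / (num_chunks + num_stages - 1)

# A 10M-token prompt split into 4K-token chunks gives ~2441 chunks,
# so even 32 pipeline stages leave the startup bubble negligible.
chunks = 10_000_000 // 4096
eff = pipeline_efficiency(chunks, 32)
```

With a few hundred chunks in flight (e.g. 256 chunks at pp=32, giving roughly 89%), this simple model already lands in the 85%+ range quoted above; longer prompts only push efficiency higher.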

Alexey Tumanov retweeted
Amey Agrawal@agrawalamey12·
After hitting evaluation puzzles like this in our own work, we analyzed patterns across LLM inference papers and identified 8 systematic evaluation issues that can make performance comparisons misleading. We have compiled a practical evaluation checklist to help avoid these pitfalls. 📄 arxiv.org/abs/2507.09019 We're also releasing Veeksha, our comprehensive LLM inference evaluation framework, later this month to help the community design more robust benchmarks! 🛠️ What evaluation issues have you discovered in your systems work? Let's learn from each other's mistakes! @nitinkedi @jayashree2912 @kwatra @thisissouvikk @ramaramjee @alsched @gtcomputing @MSFTResearch @intel
Alexey Tumanov retweeted
Amey Agrawal@agrawalamey12·
Interesting work on long-context inference from @nvidia, where they scale KV parallelism on GB200-NVL72 systems! To learn more about accelerating long-context inference and the trade-offs between different parallelism dimensions, check out our paper, Medha: arxiv.org/abs/2409.17264
NVIDIA AI Developer@NVIDIAAIDev

What if you could ask a chatbot a question the size of an entire encyclopedia—and get an answer in real time? Multi-million token queries with 32x more users are now possible with Helix Parallelism, an innovation by #NVIDIAResearch that drives inference at huge scale. 🔗 nvda.ws/4eCXxqh

Alexey Tumanov retweeted
Georgia Tech School of Computer Science
Congratulations 👏 to our faculty who were recognized on the Spring 2025 CIOS Honor Roll for their outstanding teaching and educational impact: Assoc. Prof. Alexey Tumanov and Asst. Prof. Jan Van Den Brand!
Georgia Tech School of Computer Science tweet media
Alexey Tumanov retweeted
Amey Agrawal@agrawalamey12·
Super excited to share another incredible system that we have built over the past two years! Training giant foundation models (like Llama-3 405B) costs a FORTUNE 💰 (millions of dollars)! Optimizing the training "recipe" (parallelism, memory tricks, etc.) is critical but incredibly complex. The wrong choices can waste millions. How do we find the best setup without burning GPUs? This is the problem we tackle with Maya, a GPU cluster emulation tool -- have a 1000-GPU job and want to know how it will perform? All you need is one CPU and the Maya virtual GPU runtime ✨ Arxiv: arxiv.org/abs/2503.20191 Code: Coming soon! 🧵
Alexey Tumanov retweeted
Amey Agrawal@agrawalamey12·
Super long-context models with context windows spanning millions of tokens are becoming commonplace (@GoogleDeepMind Gemini, @xai Grok 3, @Alibaba_Qwen Qwen2.5). But efficiently serving these models is tough, especially alongside short requests. Head-of-Line (HOL) blocking becomes a major issue, hurting latency for everyone. We present Medha, a system designed to handle this mix efficiently, achieving 30x lower latency and 5x higher throughput compared to the state of the art. Full paper: arxiv.org/pdf/2409.17264. 🧵
Amey Agrawal tweet media
Alexey Tumanov retweeted
Amey Agrawal@agrawalamey12·
Maya offers a transparent, accurate, and efficient way to model and optimize large-scale DL training without needing expensive hardware clusters for exploration. A crucial step towards sustainable AI! Read the paper: arxiv.org/abs/2503.20191 Work done with @Y_Srihas , @1ntEgr8 , Hakesh Darapaneni, Mitali Meratwal, Shivam Mittal, Pranavi Bajjuri, Srinivas Sridharan, @alsched at @GeorgiaTech @NVIDIA @NVIDIAAI
Alexey Tumanov retweeted
ACM SoCC@ACMSoCC·
At SoCC’24, Anastasia Ailamaki from EPFL will give a keynote on how disaggregated memory resources are becoming the norm and how this “new memory wall” affects database system design. This talk will be amazing, make sure to be there!! acmsocc.org/2024/keynotes.…
ACM SoCC tweet media
Alexey Tumanov retweeted
Amey Agrawal@agrawalamey12·
⚡ Speed Meets Accuracy: Unlike approximation-based methods, Mnemosyne achieves exact inference, ensuring that the generated output remains precise even when processing 10 million tokens, by effectively combining all these parallelization techniques to scale up to hundreds of GPUs.
Amey Agrawal tweet media
Alexey Tumanov retweeted
Amey Agrawal@agrawalamey12·
@Google has silently but surely developed an edge over @OpenAI. Long context processing seems to be the key to Google's AI strategy. NotebookLM is a prime example of what long context processing can unlock. In our latest paper, we talk about how systems can be built to support multi-million context length matching Google's capabilities. In case you missed the paper, here is the NotebookLM generated podcast! Podcast: notebooklm.google.com/notebook/764f5… Arxiv: arxiv.org/abs/2409.17264
Alexey Tumanov@alsched·
First publicly known support for LLM context of up to 10M tokens with high throughput & interactive production-grade TBT SLOs (30ms) with Mnemosyne. What would it take to pair program with GenAI on millions of LoC? Or analyze 10/110hrs of video/audio content? All precisely! <v>
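A 30 ms TBT (time-between-tokens) SLO like the one mentioned above can be stated as a simple predicate over token-emission timestamps. A minimal sketch with hypothetical names, not tied to Mnemosyne's actual SLO machinery:

```python
# Check a time-between-tokens SLO: the SLO holds when every consecutive
# gap between token-emission timestamps stays within the budget.
TBT_SLO_MS = 30.0

def meets_tbt_slo(timestamps_ms, budget_ms=TBT_SLO_MS):
    """True iff every consecutive token-emission gap fits the TBT budget."""
    gaps = [b - a for a, b in zip(timestamps_ms, timestamps_ms[1:])]
    return all(g <= budget_ms for g in gaps)

ok = meets_tbt_slo([0.0, 25.0, 52.0, 80.0])  # gaps of 25, 27, 28 ms
stall = meets_tbt_slo([0.0, 25.0, 70.0])     # a 45 ms gap breaks the SLO
```

Note the SLO is on the worst gap, not the average: a single long stall (e.g. a long prefill preempting decode) violates it even if mean throughput looks fine, which is exactly why preemptive scheduling of long-context requests matters.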
Alexey Tumanov@alsched·
@Sriraam_UTD love this, thank you for sharing! If only we could have a dataset that would capture "time to acceptance" (including infinity) for all the ML arxiv papers out there, doing some correlation analysis could be insightful.
Sriraam Natarajan@Sriraam_UTD·
Very happy to report that all our NeurIPS papers are rejected!