Alexey Tumanov
@alsched

250 posts

Associate Professor @gatech_scs @gtcomputing | PostDoc @Berkeley_EECS @ucbrise | ML/LLM Systems: cloud & edge.

Atlanta, GA · Joined December 2012
300 Following · 561 Followers

Pinned Tweet
Alexey Tumanov@alsched·
Thank you for pointing out the inadequacy of current SOTA implementations of long-context support. Our team has demonstrated the superiority of the chunked prefill mechanism for enabling preemptive scheduling in support of long-context requests (10M tokens): arxiv.org/abs/2409.17264
Lianmin Zheng@lm_zheng

Chunked pipeline parallelism is arguably the most general and scalable system technique for accelerating super-long-context inference. It remains underrated today, largely because there still isn’t a strong, high-quality open-source implementation. The SGLang team recently fully optimized it and published a detailed blog post explaining all the key details.
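The chunked-prefill scheduling idea in this thread can be sketched as a toy scheduler. Everything below (the names, the 512-token chunk budget, the round-robin decode pass) is an illustrative assumption, not the Medha or SGLang implementation:

```python
# Toy sketch of chunked prefill scheduling: a long prompt's prefill is split
# into fixed-size chunks so decode steps of other requests can run between
# chunks instead of stalling behind one very long prefill.
from collections import deque
from dataclasses import dataclass

CHUNK = 512  # prefill tokens processed per scheduler step (assumed budget)

@dataclass
class Request:
    rid: str
    prompt_len: int
    prefilled: int = 0

def schedule(prefill_requests, decode_queue):
    """Yield (rid, kind, n_tokens) scheduler decisions."""
    prefills = deque(prefill_requests)
    while prefills:
        req = prefills.popleft()
        n = min(CHUNK, req.prompt_len - req.prefilled)
        req.prefilled += n
        yield (req.rid, "prefill", n)
        # Preemption point: every waiting decode takes one step
        # before the next prefill chunk runs.
        for rid in decode_queue:
            yield (rid, "decode", 1)
        if req.prefilled < req.prompt_len:
            prefills.append(req)  # re-queue the unfinished prefill

trace = list(schedule([Request("long", 1200)], ["short-a", "short-b"]))
```

Because the long prefill yields back to the scheduler after every chunk, the short requests' decode steps never wait behind the full 1200-token prefill.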

Alexey Tumanov retweeted
Amey Agrawal@agrawalamey12·
LLM inference is quickly becoming a global infrastructure problem. Leveraging AI agents to optimize these systems is the natural way to accelerate their development. But AI-driven optimization requires a fast, cheap, and accurate evaluation mechanism. 🧵
Ion Stoica@istoica05

Systems research is changing: AI is enabling white-box algorithm discovery, not just black-box tuning. Our new ADRS paper evaluates three frameworks and shows consistent gains over human designs—up to 13× speedup in load balancing for MoE experts and 35% lower cloud costs across different cloud regions.

Alexey Tumanov@alsched·
Academic credit for chunked prefill keeps getting missed by those reimplementing it in their own systems. arxiv.org/abs/2409.17264 demonstrated the superiority of the chunked prefill mechanism for enabling preemptive scheduling of long-context requests (up to 10M tokens).
Amey Agrawal@agrawalamey12

Chunked pipeline parallelism provides two critical advantages: 1. It scales ridiculously well: you can get 85%+ efficiency even at pp=32 on 128 H100s! 2. It supports preemption, so you avoid the terrible convoy effect that happens with long context. 10x+ improvement over CP 1/2
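The 85%+ efficiency figure is consistent with the standard fill/drain bubble model for pipelined execution. A back-of-envelope sketch (the 4K-token chunk size is an assumed value for illustration, not a number from the paper):

```python
# Fill/drain bubble model for chunked pipeline parallelism: with S pipeline
# stages and M prompt chunks, steady-state utilization is roughly
# M / (M + S - 1), since the pipeline spends about S - 1 chunk-times
# filling and draining.

def pipeline_efficiency(num_chunks: int, num_stages: int) -> float:
    return num_chunks / (num_chunks + num_stages - 1)

# A 10M-token prompt split into 4K-token chunks gives ~2441 chunks,
# so even 32 pipeline stages leave the startup bubble negligible.
chunks = 10_000_000 // 4096
eff = pipeline_efficiency(chunks, 32)
```

With a few hundred chunks in flight (e.g. 256 chunks at pp=32, giving roughly 89%), this simple model already lands in the 85%+ range quoted above; longer prompts only push efficiency higher.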

Alexey Tumanov retweeted
Amey Agrawal@agrawalamey12·
After hitting evaluation puzzles like this in our own work, we analyzed patterns across LLM inference papers and identified 8 systematic evaluation issues that can make performance comparisons misleading. We have compiled a practical evaluation checklist to help avoid these pitfalls. 📄 arxiv.org/abs/2507.09019 We're also releasing Veeksha, our comprehensive LLM inference evaluation framework, later this month to help the community design more robust benchmarks! 🛠️ What evaluation issues have you discovered in your systems work? Let's learn from each other's mistakes! @nitinkedi @jayashree2912 @kwatra @thisissouvikk @ramaramjee @alsched @gtcomputing @MSFTResearch @intel
Alexey Tumanov retweeted
Amey Agrawal@agrawalamey12·
Interesting work on long-context inference from @nvidia, where they scale KV parallelism on GB200-NVL72 systems! To learn more about accelerating long-context inference and the trade-offs between different parallelism dimensions, check out our paper, Medha: arxiv.org/abs/2409.17264
NVIDIA AI Developer@NVIDIAAIDev

What if you could ask a chatbot a question the size of an entire encyclopedia—and get an answer in real time? Multi-million token queries with 32x more users are now possible with Helix Parallelism, an innovation by #NVIDIAResearch that drives inference at huge scale. 🔗 nvda.ws/4eCXxqh

Alexey Tumanov retweeted
Georgia Tech School of Computer Science
Congratulations 👏 to our faculty who were recognized on the Spring 2025 CIOS Honor Roll for their outstanding teaching and educational impact: Assoc. Prof. Alexey Tumanov and Asst. Prof. Jan Van Den Brand!
Georgia Tech School of Computer Science tweet media
Alexey Tumanov retweeted
Amey Agrawal@agrawalamey12·
Super excited to share another incredible system that we have built over the past two years! Training giant foundation models (like Llama-3 405B) costs a FORTUNE 💰 (millions of dollars)! Optimizing the training "recipe" (parallelism, memory tricks, etc.) is critical but incredibly complex. The wrong choices can waste millions. How do we find the best setup without burning GPUs? This is the problem we tackle with Maya, a GPU cluster emulation tool -- have a 1000-GPU job and want to know how it will perform? All you need is one CPU and the Maya virtual GPU runtime ✨ Arxiv: arxiv.org/abs/2503.20191 Code: Coming soon! 🧵
Alexey Tumanov retweeted
Amey Agrawal@agrawalamey12·
Super long-context models with context windows spanning millions of tokens are becoming commonplace (@GoogleDeepMind Gemini, @xai Grok 3, @Alibaba_Qwen Qwen2.5). But efficiently serving these models is tough, especially alongside short requests. Head-of-Line (HOL) blocking becomes a major issue, hurting latency for everyone. We present Medha, a system designed to handle this mix efficiently, achieving 30x lower latency and 5x higher throughput compared to the state of the art. Full paper: arxiv.org/pdf/2409.17264. 🧵
Amey Agrawal tweet media
Alexey Tumanov retweeted
Amey Agrawal@agrawalamey12·
Maya offers a transparent, accurate, and efficient way to model and optimize large-scale DL training without needing expensive hardware clusters for exploration. A crucial step towards sustainable AI! Read the paper: arxiv.org/abs/2503.20191 Work done with @Y_Srihas , @1ntEgr8 , Hakesh Darapaneni, Mitali Meratwal, Shivam Mittal, Pranavi Bajjuri, Srinivas Sridharan, @alsched at @GeorgiaTech @NVIDIA @NVIDIAAI
Alexey Tumanov retweeted
ACM SoCC@ACMSoCC·
At SoCC’24, Anastasia Ailamaki from EPFL will give a keynote on how disaggregated memory resources are becoming the norm and how this “new memory wall” affects database system design. This talk will be amazing, make sure to be there!! acmsocc.org/2024/keynotes.…
ACM SoCC tweet media
Alexey Tumanov retweeted
Amey Agrawal@agrawalamey12·
⚡ Speed Meets Accuracy: Unlike approximation-based methods, Mnemosyne achieves exact inference, ensuring that the generated output remains precise even when processing 10 million tokens, by effectively combining all these parallelization techniques to scale up to hundreds of GPUs.
Amey Agrawal tweet media
Alexey Tumanov retweeted
Amey Agrawal@agrawalamey12·
@Google has silently but surely developed an edge over @OpenAI. Long context processing seems to be the key to Google's AI strategy. NotebookLM is a prime example of what long context processing can unlock. In our latest paper, we talk about how systems can be built to support multi-million context length matching Google's capabilities. In case you missed the paper, here is the NotebookLM generated podcast! Podcast: notebooklm.google.com/notebook/764f5… Arxiv: arxiv.org/abs/2409.17264
Alexey Tumanov@alsched·
First publicly known support for LLM context of up to 10M tokens with high throughput & interactive production-grade TBT SLOs (30ms) with Mnemosyne. What would it take to pair program with GenAI on millions of LoC? Or analyze 10/110hrs of video/audio content? All precisely! <v>
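A 30 ms TBT (time-between-tokens) SLO like the one mentioned above can be stated as a simple predicate over token-emission timestamps. A minimal sketch with hypothetical names, not tied to Mnemosyne's actual SLO machinery:

```python
# Check a time-between-tokens SLO: the SLO holds when every consecutive
# gap between token-emission timestamps stays within the budget.
TBT_SLO_MS = 30.0

def meets_tbt_slo(timestamps_ms, budget_ms=TBT_SLO_MS):
    """True iff every consecutive token-emission gap fits the TBT budget."""
    gaps = [b - a for a, b in zip(timestamps_ms, timestamps_ms[1:])]
    return all(g <= budget_ms for g in gaps)

ok = meets_tbt_slo([0.0, 25.0, 52.0, 80.0])  # gaps of 25, 27, 28 ms
stall = meets_tbt_slo([0.0, 25.0, 70.0])     # a 45 ms gap breaks the SLO
```

Note the SLO is on the worst gap, not the average: a single long stall (e.g. a long prefill preempting decode) violates it even if mean throughput looks fine, which is exactly why preemptive scheduling of long-context requests matters.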
Alexey Tumanov@alsched·
@Sriraam_UTD love this, thank you for sharing! If only we could have a dataset that would capture "time to acceptance" (including infinity) for all the ML arxiv papers out there, doing some correlation analysis could be insightful.
Sriraam Natarajan@Sriraam_UTD·
Very happy to report that all our NeurIPS papers are rejected!