Aurick Qiao

172 posts

@aurickq

ML Systems @thinkymachines | PhD CS @CarnegieMellon

Seattle, WA · Joined November 2016
327 Following · 866 Followers
Aurick Qiao reposted
Hao Zhang@haozhangml·
Real-time video generation has been something I've been pushing hard on with the FastVideo team. And today, we have a big update: we just made it. Now you can create a 5s 1080p video in 4.5s with FastVideo on a single GPU. I believe this is the fastest 1080p text-image-to-audio-video pipeline ever! Try our free demo to feel the speed and quality: 1080p.fastvideo.org and give us feedback. Blog: haoailab.com/blogs/fastvide…
Hao AI Lab@haoailab

(1/N) Content creators have been stuck with costly and slow video generation APIs for far too long. We couldn’t take it anymore.😅😭 FastVideo’s new real-time inference stack has the fastest 1080p TI2AV pipeline ever.😍🚀🚀 Our optimized LTX-2.3 pipeline creates 5-second 1080p videos with audio in 4.55 s, on a single GPU! 3.9x faster than the next fastest option. 🕹️Live demo: 1080p.fastvideo.org 📜Blog: haoailab.com/blogs/fastvide…

6 replies · 11 reposts · 110 likes · 15.6K views
Aurick Qiao reposted
Woosuk Kwon@woosuk_k·
Today, we're proud to announce @inferact, a startup founded by creators and core maintainers of @vllm_project, the most popular open-source LLM inference engine. Our mission is to grow vLLM as the world's AI inference engine and accelerate AI progress by making inference cheaper and faster.

The Challenge

Inference is not solved. It's getting harder. Models grow larger. New architectures proliferate: mixture-of-experts, multimodal, agentic. Every breakthrough demands new infrastructure. Meanwhile, hardware fragments: more accelerators, more programming models, and more combinations to optimize.

The capability gap between models and the systems that serve them is widening. Left unaddressed, the most capable models remain bottlenecked, with the full scope of their capabilities accessible only to those who can build custom infrastructure. Close the gap, and we unlock new possibilities.

And the problem is growing. Inference is shifting from a fraction of compute to the majority: test-time compute, RL training loops, synthetic data.

We see a future where serving AI becomes effortless. Today, deploying a frontier model at scale requires a dedicated infrastructure team. Tomorrow, it should be as simple as spinning up a serverless database. The complexity doesn't disappear; it gets absorbed into the infrastructure we're building.

Why Us

vLLM sits at the intersection of models and hardware: a position that took years to build. When model vendors ship new architectures, they work with us to ensure day-zero support. When hardware vendors develop new silicon, they integrate with vLLM. When teams deploy at scale, they run vLLM, from frontier labs to hyperscalers to startups serving millions of users.

Today, vLLM supports 500+ model architectures, runs on 200+ accelerator types, and powers inference at global scale. This ecosystem, built with 2,000+ contributors, is our foundation. We've been stewards of this engine since its first commit. We know it inside out. We've deployed it at frontier scale, in research and in production.

Open Source

vLLM was built in the open. That's not changing. Inferact exists to supercharge vLLM adoption. The optimizations we develop flow back to the community. We plan to push vLLM's performance further, deepen support for emerging model architectures, and expand coverage across frontier hardware. The AI industry needs inference infrastructure that isn't locked behind proprietary walls.

Join Us

Through the open source community, we are fortunate to work with some of the best people we know. For @inferact, we're hiring engineers and researchers to work at the frontier of inference, where models meet hardware at scale. Come build with us.

We're fortunate to be supported by investors who share our vision, including @a16z and @lightspeedvp, who led our $150M seed, as well as @sequoia, @AltimeterCap, @Redpoint, @ZhenFund, The House Fund, @strikervp, @LaudeVentures, and @databricks.

- @woosuk_k, @simon_mo_, @KaichaoYou, @rogerw0108, @istoica05, and the rest of the founding team
Woosuk Kwon tweet media
180 replies · 129 reposts · 1.2K likes · 469.3K views
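For context on what "running vLLM" looks like in practice, here is a minimal sketch of its offline Python API (the library also exposes an OpenAI-compatible server). The model id, prompt, and sampling settings are illustrative choices, not anything from the announcement:

```python
# Minimal offline-generation sketch using vLLM's Python API.
# Assumes `pip install vllm` and a supported GPU; the model id is
# an illustrative Hugging Face identifier, downloaded on first use.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)

outputs = llm.generate(["Explain speculative decoding in one paragraph."], params)
for out in outputs:
    print(out.outputs[0].text)  # each request may carry several candidates; take the first
```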
Aurick Qiao reposted
Woosuk Kwon@woosuk_k·
It still feels a little unreal to look back at how far @vllm_project has come. What started as a small research project that Zhuohan and I launched ended up receiving so much love and connecting me with people who are now some of my closest friends. In so many ways, I already feel incredibly lucky for what this journey has given me.

To be honest, my path with vLLM hasn't been perfectly straight. Over the past three years, my passion dipped at times, and I did spend energy exploring things I thought were more interesting than vLLM and inference. vLLM is what it is today because of the community, and I'm truly grateful for its commitment.

My view on inference also evolved a lot along the way. What once felt mostly "solved" turned out to be far from it. The rapid pace of new models, increasingly complex architectures, diverse hardware setups, and agents have made inference genuinely hard. The need for strong inference infrastructure has only kept growing, and it became clear just how much important work remains.

Somewhere along that journey, I realized how special this work really is and how uniquely positioned vLLM is. Now, I'm committed to pushing it all the way. With that, I started @inferact with @simon_mo_, @KaichaoYou, @rogerw0108, @istoica05, and an amazing founding team from both inside and outside the vLLM community.

I'm deeply grateful to our investors, including @a16z and @lightspeedvp, for believing in us and giving us this opportunity. Excited for this next chapter, and looking forward to sharing more soon.
Woosuk Kwon@woosuk_k

Today, we're proud to announce @inferact, a startup founded by creators and core maintainers of @vllm_project, the most popular open-source LLM inference engine. …

4 replies · 5 reposts · 154 likes · 15.1K views
Aurick Qiao reposted
Kevin Kwok@kevinakwok·
Wow TML really *is* taking a different approach from the other AI labs
Kevin Kwok tweet media
28 replies · 4 reposts · 300 likes · 129.6K views
Aurick Qiao reposted
Thinking Machines@thinkymachines·
Tinker is now generally available. We also added support for advanced vision input models, Kimi K2 Thinking, and a simpler way to sample from models. thinkingmachines.ai/blog/tinker-ge…
48 replies · 173 reposts · 1.7K likes · 1.1M views
Angela Jiang@jiangelaa·
👋@worktrace_ai is out of stealth! Which also means I've officially rejoined the workforce...I couldn't help but join @deepakv91 to pursue this vision together. I really think we & our amazing team are on track to make a meaningful difference in bridging the AI divide. Join us!
Worktrace AI@worktrace_ai

Today, we're launching @worktrace_ai to help businesses uncover their best automation opportunities and build those automations. Our founders, Angela Jiang (product manager of GPT-3.5 and GPT-4 at OpenAI) and Deepak Vasisht (UIUC CS professor, MIT researcher, IIT graduate of the last decade), are determined to eliminate the AI divide between frontier labs and the workforce. Our $9M seed round is led by @8vc and @conviction with participation from @OpenAI, @svangel and @_geniusventures. Join us!

21 replies · 17 reposts · 132 likes · 70.9K views
Ying Sheng@ying11231·
We've been running @radixark for a few months, started by many core developers of SGLang @lmsysorg and its extended ecosystem (slime @slime_framework, AReaL @jxwuyi).

I left @xai in August, a place where I formed deep attachments and countless beautiful memories. It was the best place I've ever worked, the place I watched grow from a few dozen people to hundreds, and it truly felt like home. What pushed me to make such a hard decision is the momentum of building SGLang open source and the mission of creating an ambitious future, in an open spirit that I learnt from my first job at @databricks after my PhD.

We started SGLang in the summer of 2023 and made it public in January 2024. Over the past two years, hundreds of people have made great efforts to get it to where it is today. We experienced several waves of growth after its first release. I still remember the many dark nights in the summer of 2024 that I spent debugging with @lm_zheng, @lsyincs, and @zhyncs42, while @ispobaoke single-handedly took on DeepSeek inference optimizations and @GenAI_is_real and the community strike team tag-teamed on-call shifts non-stop. There are so many more who have joined that I'm out of space to call out, but they're recorded on the GitHub contributor list forever.

The demands have grown exponentially, and we have been pushed to make this a dedicated effort supported by RadixArk. It's the step-by-step journey of a thousand miles that has carried us here today, and the same relentless Long March that will lead us into the tens of thousands of miles yet to come. The story never stops growing.

Over the past year, we've seen something very clear: the world is full of people eager to build AI, but the infrastructure that makes it possible is not shared. The most advanced inference and training stacks live inside a few companies. Everyone else is forced to rebuild the same schedulers, compilers, serving engines, and training pipelines again and again, often under enormous pressure, with lots of duplicated effort and wasted insight.

RadixArk was born to change that. Today, we're building an infrastructure-first, deep-tech company with a simple and ambitious mission: "Make frontier-level AI infrastructure open and accessible to everyone."

If the two values below resonate with you, come talk to us:
(1) Engineering as an art. Infrastructure is a first-class citizen at RadixArk. We care about elegant design and code that lasts. Beneath every line of code lies the soul of the engineer who wrote it.
(2) A belief in openness. We share what we build. We bet on long-term compounding through community, contribution, and giving more than we take. A product is defined by its users, yet it truly comes alive the moment functionality transcends mere utility and begins to embody aesthetics.

Thanks to all the miles (the name of our first released RL framework; see below). radixark.ai
112 replies · 128 reposts · 1.1K likes · 539.3K views
Aurick Qiao@aurickq·
After two amazing years at Snowflake AI Research, I have joined @thinkymachines! I am excited to work with the incredible team here and build world-class ML systems for the next generation of multimodal AI
11 replies · 0 reposts · 189 likes · 24.9K views
Aurick Qiao reposted
Zhihao Jia@JiaZhihao·
Super excited about this work! 🔥 SuffixDecoding accelerates multi-round agent serving by reusing and optimizing over previous agent iterations—5x speedups on AgenticSQL. Come see @GabrieleOliaro’s #NeurIPS2025 Spotlight!
Gabriele Oliaro@GabrieleOliaro

🐢 Are your #LLM #agents too slow? 🚀 Introducing SuffixDecoding: make agentic workloads run up to 5.3x faster! …

0 replies · 3 reposts · 22 likes · 4.1K views
Aurick Qiao reposted
Gabriele Oliaro@GabrieleOliaro·
🐢 Are your #LLM #agents too slow? 🚀 Introducing SuffixDecoding: make agentic workloads run up to 5.3x faster! 🎯 Emerging AI workflows suffer high latency. We fix this with extreme speculative decoding using suffix trees. 🌟 Come see our #NeurIPS2025 Spotlight!
Gabriele Oliaro tweet media
1 reply · 4 reposts · 21 likes · 6K views
Aurick Qiao@aurickq·
Suffix Decoding is at #NeurIPS2025 as a 🏅spotlight! It accelerates LLM inference for coding, agents, and RL. We also optimized its speculation speed by 7.4x and merged it into vLLM (coming soon to SGLang). Talk to @GabrieleOliaro or me at poster #816 on Friday at 11am! Links in 🧵
Aurick Qiao tweet media
2 replies · 4 reposts · 29 likes · 12.6K views
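For readers curious how suffix-based speculation works, here is a toy sketch of the core idea as described in these posts: match the tail of the current generation against text produced in earlier agent iterations, and propose the tokens that followed the match as a draft for the target model to verify. The paper uses a suffix tree for fast matching; this brute-force version, with made-up token ids, only illustrates the speculation rule:

```python
# Toy sketch of suffix-based draft proposal (the idea behind
# SuffixDecoding, not its actual implementation). A real system
# builds a suffix tree over prior outputs for fast lookups; here
# we brute-force the search for clarity.
from typing import List

def propose_draft(history: List[int], context: List[int],
                  max_suffix: int = 8, k: int = 4) -> List[int]:
    """Find the longest suffix of `context` that occurs in `history`
    and return the k tokens that followed it as the draft."""
    for n in range(min(max_suffix, len(context)), 0, -1):
        suffix = context[-n:]
        for i in range(len(history) - n, -1, -1):  # prefer recent matches
            if history[i:i + n] == suffix:
                return history[i + n:i + n + k]
    return []  # no match: fall back to ordinary one-token decoding

# Toy usage: `history` stands in for a previous agent iteration.
history = [5, 9, 2, 7, 1, 5, 9, 2, 7, 3, 8]
context = [4, 4, 5, 9, 2, 7]
print(propose_draft(history, context))  # -> [3, 8]
```

The target model then scores the draft in a single batched forward pass and keeps the longest verified prefix, which is why repetitive agentic workloads (same schemas, same tool outputs across iterations) benefit so much.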
Aurick Qiao reposted
Zhiting Hu@ZhitingHu·
🔥 Really excited to see the release of the PAN world model, a project I have been working on over the past few years. PAN is a general world model capable of simulating physical, agentic, and nested worlds, synthesizing infinite interactive experiences for training AI agents. Building on top of pretrained LLMs and video diffusion models, PAN connects language, perception, action, and latent thoughts for long-horizon simulation and reasoning. PAN shows overwhelming performance gains over JEPA-2, Cosmos-2, and other prior models. More in the thread👇 ... 1/
8 replies · 52 reposts · 241 likes · 31K views
Aurick Qiao reposted
Hao AI Lab@haoailab·
🔥 New Blog: "Disaggregated Inference: 18 Months Later"

18 months in LLM inference feels like a new Moore's Law cycle, but this time it's not just 2x per year:
💸 Serving cost ↓10–100x
🚀 Throughput ↑10x
⚡ Latency ↓5x

A big reason? Disaggregated Inference. From DistServe, our early research system on prefill-decode disaggregation, to today's production frameworks, disaggregation has become the backbone of modern LLM serving.

So what is disaggregated inference? Why does the LLM inference community love it? And how far have we come? As the inventors of this technique, we take a look back, 18 months later, at how the idea reshaped the landscape and what comes next.

🔗 Read the full story: hao-ai-lab.github.io/blogs/distserv…
7 replies · 49 reposts · 175 likes · 39.6K views
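For readers new to the idea, prefill/decode disaggregation splits the compute-bound prompt phase from the memory-bound token-generation phase and runs them on separate workers, with the KV cache handed off in between, so long prefills never stall in-flight decodes. The toy sketch below mimics that handoff with in-process queues; the dict "cache" and print statement are stand-ins, not DistServe's actual implementation:

```python
# Toy sketch of prefill/decode disaggregation: one worker per phase,
# connected by a KV-cache handoff. Real systems transfer KV tensors
# across GPUs; here the "cache" is a plain dict.
import queue
import threading
import time

prefill_q: queue.Queue = queue.Queue()
decode_q: queue.Queue = queue.Queue()

def prefill_worker():
    while True:
        prompt = prefill_q.get()
        kv_cache = {"tokens": prompt.split()}  # stand-in for KV tensors
        decode_q.put((prompt, kv_cache))       # hand off to the decode tier

def decode_worker():
    while True:
        prompt, kv_cache = decode_q.get()
        print(f"decoding {prompt!r} from {len(kv_cache['tokens'])} cached tokens")

threading.Thread(target=prefill_worker, daemon=True).start()
threading.Thread(target=decode_worker, daemon=True).start()

prefill_q.put("Why disaggregate prefill and decode?")
time.sleep(0.1)  # let the toy pipeline drain before exiting
```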
Karim C@BrandGrowthOS·
@StasBekman Dynamic switching between parallelism strategies sounds like it could seriously reduce my agent deployment costs. When's this available for testing?
1 reply · 0 reposts · 0 likes · 186 views
Stas Bekman@StasBekman·
Yay, our team has just published a new paper, "Shift Parallelism: Low-Latency, High-Throughput LLM Inference for Dynamic Workloads": arxiv.org/abs/2509.16495

Shift Parallelism is a new inference parallelism strategy that can dynamically switch between Tensor Parallelism and Sequence Parallelism, delivering:
- Up to 1.5x lower latency in interactive workloads
- Up to 50% higher throughput under heavy traffic

The technique shows robust performance across dynamic, real-world traffic patterns.

The authors are Mert Hidayetoglu, Aurick Qiao, Michael Wyatt, Jeff Rasley, Yuxiong He, and Samyam Rajbhandari @ Snowflake AI Research. This work extends Arctic Inference to further optimize LLM inference under dynamic, real-world traffic patterns.

The working code is here: github.com/snowflakedb/Ar…
Stas Bekman tweet media
5 replies · 38 reposts · 290 likes · 17K views
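A rough sketch of the switching decision the paper describes, as I read it: tensor parallelism keeps per-token latency low for interactive traffic, sequence parallelism keeps throughput high under heavy load, and the engine picks between them dynamically. The policy, threshold, and names below are illustrative guesses, not the paper's actual mechanism, which lives in Arctic Inference:

```python
# Illustrative sketch of a dynamic parallelism switch in the spirit
# of Shift Parallelism (not the paper's actual policy): choose the
# strategy per scheduling step based on how much work is queued.
from enum import Enum

class Strategy(Enum):
    TENSOR_PARALLEL = "tp"    # shard weights: lowest per-token latency
    SEQUENCE_PARALLEL = "sp"  # shard the sequence: highest throughput

def choose_strategy(queued_tokens: int, threshold: int = 4096) -> Strategy:
    """Hypothetical policy: favor latency when traffic is light,
    throughput once enough tokens are queued."""
    return (Strategy.TENSOR_PARALLEL if queued_tokens < threshold
            else Strategy.SEQUENCE_PARALLEL)

print(choose_strategy(128))    # interactive trickle -> Strategy.TENSOR_PARALLEL
print(choose_strategy(20000))  # heavy batch -> Strategy.SEQUENCE_PARALLEL
```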