Yiming Zhao
@YimingBob
47 posts
Joined April 2023
191 Following · 45 Followers
Yiming Zhao
Yiming Zhao@YimingBob·
Thrilled to see our DFlash work featured by @googledevs! 3.13x raw speedup (standalone), 2.29x end-to-end on TPU v5p. Great collaboration with @haozhangml, @aaronzhfeng, the Google team, and everyone involved. Proud of this partnership between @Google and @haoailab at UCSD!
Google for Developers@googledevs

Breaking LLM inference’s autoregressive bottleneck 🛠️ We've teamed up with @haozhangml, @YimingBob, and @aaronzhfeng, among others from UCSD to achieve a massive 3.13X speedup for LLM inference on Google Cloud TPUs using Diffusion-Style Speculative Decoding (DFlash). Read the blog: goo.gle/4naZ8Yv

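For readers new to the idea: speculative decoding breaks the one-token-per-forward-pass bottleneck by having a cheap drafter propose several tokens that the target model then verifies in a single pass. The blog describes DFlash's diffusion-style drafter; the sketch below is only a generic draft-and-verify loop for intuition, with `draft_tokens` and `target_accepts` as hypothetical stand-ins.

```python
# Generic speculative-decoding sketch (not DFlash's actual drafter).
import random

random.seed(0)
VOCAB = list(range(100))

def draft_tokens(prefix, k):
    """Hypothetical cheap drafter: propose k tokens in one shot."""
    return [random.choice(VOCAB) for _ in range(k)]

def target_accepts(prefix, token):
    """Hypothetical verifier: does the target model agree with this token?
    A real system compares draft and target distributions here."""
    return random.random() < 0.7

def speculative_step(prefix, k=4):
    """Draft k tokens, verify left to right, keep the accepted prefix.
    One target forward pass can emit several tokens, which is where the
    speedup over one-token-per-pass autoregressive decoding comes from."""
    accepted = []
    for tok in draft_tokens(prefix, k):
        if not target_accepts(prefix + accepted, tok):
            break  # first rejection ends the step; the target takes over
        accepted.append(tok)
    return accepted

print(speculative_step([1, 2, 3]))  # 0 to 4 tokens committed per step
```

The raw vs. end-to-end distinction in the tweet likely reflects this loop having to coexist with batching and scheduling in a full serving stack.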
Yiming Zhao retweeted
Hao AI Lab
Hao AI Lab@haoailab·
Excited to share our recent work accepted to ICML 2026! These projects span efficient causal parallel decoders, diffusion LLMs, video sparse attention, video QAT, online speculative decoding, and agentic document reasoning. Huge thanks to all collaborators and co-authors across these efforts. Looking forward to seeing everyone in Seoul this summer! 🇰🇷
Yiming Zhao retweeted
Hao AI Lab
Hao AI Lab@haoailab·
(1/5) FP4 hardware is here, but 4-bit attention still kills model quality, blocking true end-to-end FP4 serving. To fix that, we propose Attn-QAT, the first systematic study of quantization-aware training for attention. The result: FP4 attention quality is comparable to BF16 attention, with 1.1x–1.5x higher throughput than SageAttention3 on an RTX 5090 and a 1.39x speedup over FlashAttention-4 on a B200.
Blog: haoailab.com/blogs/attn-qat/
Code: github.com/hao-ai-lab/Fas…
Checkpoints: huggingface.co/FastVideo/14B_…
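The thread's claim is that attention can be trained to tolerate 4-bit inputs rather than quantized after the fact. Attn-QAT's actual FP4 format and recipe live in the linked blog; below is only a minimal sketch of the standard fake-quantize-with-straight-through-estimator trick that quantization-aware training generally builds on, with a generic symmetric 4-bit integer quantizer standing in for FP4.

```python
# Minimal QAT-for-attention sketch: fake-quantize Q/K/V during training so
# the model adapts to the low precision it will see at inference time.
# The quantizer here is a generic symmetric 4-bit stand-in, not Attn-QAT's FP4.
import torch
import torch.nn.functional as F

def fake_quant4(x):
    """Round to symmetric 4-bit levels [-7, 7], dequantize, and pass
    gradients straight through (STE) as if the rounding were identity."""
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-6) / 7.0
    xq = (x / scale).round().clamp(-7, 7) * scale
    return x + (xq - x).detach()  # forward uses xq, backward sees x

def qat_attention(q, k, v):
    """Attention computed on fake-quantized inputs."""
    return F.scaled_dot_product_attention(
        fake_quant4(q), fake_quant4(k), fake_quant4(v)
    )

q = torch.randn(1, 8, 16, 64, requires_grad=True)  # (batch, heads, seq, dim)
out = qat_attention(q, q, q)
out.sum().backward()  # gradients flow through the STE
print(q.grad.shape)
```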
Yiming Zhao retweeted
Hao AI Lab
Hao AI Lab@haoailab·
Wow! The Vera Rubin demo looks great, but real-time editing is actually already here on a single B200! Try Dreamverse today and generate 30s 1080p videos (with audio) faster than you can watch them. Demo: dreamverse.fastvideo.org
Runway@runwayml

A breakthrough in real-time video generation. As a research preview developed with @NVIDIA and shared at @NVIDIAGTC this week, we trained a new real-time video model running on Vera Rubin. HD videos generate instantly, with time-to-first-frame under 100ms. Unlocking an entirely new creative paradigm and bolstering the foundations of our General World Model, GWM-1. Real-time generation opens a fundamentally different design space for video models and world simulation. We're investing in co-designing our models alongside advances in hardware to keep pushing this frontier.

Yiming Zhao retweeted
Hao AI Lab
Hao AI Lab@haoailab·
(1/N) We're launching Dreamverse. Most AI video models take minutes to generate a 5 s 1080p clip. In 4.5 seconds, we can generate 30 s 1080p clips on a single GPU. Our videos generate faster than you can watch them: stop waiting on prompts and start directing scenes live. 🕹️Demo: dreamverse.fastvideo.org 📑 Blog: haoailab.com/blogs/dreamver… Welcome to the era of vibe-directing 👇
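A quick sanity check on the "faster than you can watch them" claim, from the tweet's own numbers: a 30 s clip generated in 4.5 s means generation runs at 30 / 4.5 ≈ 6.7× real-time playback, which is the headroom that makes the live directing loop possible.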
Yiming Zhao retweeted
Hao Zhang
Hao Zhang@haozhangml·
So excited that our prototype Dreamverse is finally out!! I increasingly believe real-time diffusion is not just about latency optimization. It might unlock a fundamentally new interface for creating and using AI video.

Once high-quality videogen becomes faster than playback, you no longer have to write one giant prompt, wait forever, and then start over. You can stay in the loop: generate, watch, chat, revise, continue. Or maybe, in the future, you can just interact with the world the way humans do. 😯😯

The key point here is that videogen stops feeling like submitting a job and starts feeling like directing. That is the core idea behind Dreamverse, and what we call vibe directing. Check out an early prototype from us: dreamverse.fastvideo.org

I also really want to shout out our FastVideo team here.
- We got 5s 1080p video + audio generation down to real time in less than a month 😉
- We then turned that systems capability into this vibe-directing prototype experience in about a week 🫡
That is an extremely high bar of execution. Small team, strong taste, and the ability to move from infra breakthrough to prototype demo very fast. I am genuinely very proud of what the team built.

Please try the demo and let us know what you think 👇👇
Hao AI Lab@haoailab

(1/N) We're launching Dreamverse. Most AI video models take minutes to generate a 5 s 1080p clip. In 4.5 seconds, we can generate 30 s 1080p clips on a single GPU. Our videos generate faster than you can watch them: stop waiting on prompts and start directing scenes live. 🕹️Demo: dreamverse.fastvideo.org 📑 Blog: haoailab.com/blogs/dreamver… Welcome to the era of vibe-directing 👇

Yiming Zhao retweeted
Hao AI Lab
Hao AI Lab@haoailab·
(1/N) Content creators have been stuck with costly and slow video generation APIs for far too long. We couldn’t take it anymore.😅😭 FastVideo’s new real-time inference stack has the fastest 1080p TI2AV pipeline ever.😍🚀🚀 Our optimized LTX-2.3 pipeline creates 5-second 1080p videos with audio in 4.55 s, on a single GPU! 3.9x faster than the next fastest option.
🕹️Live demo: 1080p.fastvideo.org
📜Blog: haoailab.com/blogs/fastvide…
Hao Zhang
Hao Zhang@haozhangml·
Can’t believe I get to say this -- deeply honored to be named a 2026 Sloan Research Fellow: today.ucsd.edu/story/2026-slo…

Early faculty life is… "hyper-intense": teaching, advising, hiring, papers, grants; and trying to build a lab culture you’ll still be proud of years later. There were many weeks where it felt like we were building the plane mid-flight, burning plenty of midnight oil along the way.

Over the past few years, I’ve been incredibly lucky to work with amazing students and collaborators on a chain of OSS projects: Vicuna → Chatbot Arena → vLLM → DistServe → LMGame → FastVideo; each one then pushed way further by people far beyond our lab. This award feels less like a finish line and more like fuel for the lab, for our students, and for the next set of systems we haven’t built yet. A core principle of ours is building "open-source research that ships."

At the same time, it’s hard not to feel a mix of excitement + uncertainty + anxiety about where CS is heading. Coding agents are improving so fast that I am feeling the AGI firsthand. I have gone back to builder mode -- more productive than ever -- outside of my faculty admin work. I’ve watched friends and colleagues hit numbers that would’ve sounded like science fiction a year ago (e.g., 100+ commits/day).

So what does it mean to “do great computer science” when baseline productivity keeps jumping? For me, it makes “research that ships” more important, and even raises the bar. The leverage shifts toward taste and problem selection, principled system design, and translating ideas into reliable artifacts. We're excited to keep proving that through real systems people can use!

Deeply grateful to:
- My students and collaborators — for the ideas, execution, and drive.
- @HDSIUCSD, Dean @GuptaUcsd, and my @UCSanDiego colleagues — for building an environment where ambitious work can happen.
- @nvidia and @mbzuai (and other compute sponsors) — for support that helped us move faster and turn ideas into real artifacts. Even as the interface changes, the need for efficient compute and solid infrastructure only grows.

Most of all: credit to the students at @haoailab. You’re the reason any of this is worth doing. Keep building and shipping!
Yiming Zhao retweeted
Hao AI Lab
Hao AI Lab@haoailab·
✨ Kicking Off a Great Year ✨ We’re thrilled to share that four papers from our lab members and amazing collaborators have been accepted to #ICLR2026! 🎉 More than an acceptance, it marks the beginning of a good year: new ideas to chase, and lots of fun research ahead. Huge thanks to everyone who shared feedback and helped sharpen this work, and a big shout-out to the ACs/PCs for their hard work this year. 🙏🚀
Yiming Zhao
Yiming Zhao@YimingBob·
RT @haoailab: Really awesome to see HY-World 1.5’s training built on top of FastVideo’s training framework. Big thanks to the HY-World 1.5…
Yiming Zhao retweeted
Hao AI Lab
Hao AI Lab@haoailab·
Excited to announce FastVideo v0.1.7 to start off the new year! 🚀 The FastVideo team has grown substantially over the past months, and many new models and features have been added!

Inference:
- Longcat T2V, I2V, Video Continuation - added by @a1zhang, Shao Duan
- Hunyuan 1.5 - added by @WZhou35897
- Matrix Game 2.0 w/ streaming support - added by @kaiqin_kong
- TurboWan 1.3/14B T2V and I2V - added by @l0ayr
- Layerwise offloading - added by @OhmRishabhV
- fp8 text encoder support - added by Yechen Xu

Training:
- Improved FastWan 1.3B T2V checkpoint
- Sequence packing support - added by @l0ayr

Misc:
- FVD metric - added by Ketaki Tank
- New docs page - added by Mihir Jagtap

Check out our docs: hao-ai-lab.github.io/FastVideo/
Example inference scripts:
- Python: github.com/hao-ai-lab/Fas…
- CLI: github.com/hao-ai-lab/Fas…
Yiming Zhao retweeted
Hao AI Lab
Hao AI Lab@haoailab·
🎉 Our group’s year-end dinner 🎉 Thank you to all members of Hao AI Lab for a wonderful year! Enjoy the holidays and we really look forward to another exciting year ahead!
Yiming Zhao retweeted
Hao Zhang
Hao Zhang@haozhangml·
We arXiv’ed this paper a few months back, and I still find myself thinking about this work a lot: CAD arguably is a direct continuation of our previous DistServe line of work. Two afterthoughts:

1. For >4 years, training systems have been surprisingly stable... We've had Megatron/DeepSpeed (and now FSDP2) for ages, and in the “classic” pretrain regime (16-32K context, fairly uniform batches), it’s fair to feel like the remaining wins are incremental. If you counted papers in MLSys/OSDI, I believe the number of training papers has declined a lot recently. But the workload quietly changed: as agents + post-training became the main compute eater, context lengths jumped from already long to “ridiculously long”: 32K → 128K → 256K (some even start to claim 1M), and suddenly the #1 problem isn’t just parallelisms/kernels, but imbalance / stragglers. When one part of the pipeline grows ~quadratically with sequence length while most others are closer to linear, any “colocate everything on the same GPUs” design becomes a straggler source.

2. This naturally leads to the second thought: disaggregation isn’t just for serving. We’ve talked a lot about P/D disaggregation in serving (DistServe), and AFD-style ideas for MoE. Here we show the same principle applies to training: the core attention compute -- softmax(QKᵀ)V -- is (1) essentially stateless (no trainable params) and (2) surprisingly composable at token granularity with modern kernels (thanks to all the kernel developers behind FlashAttention and FlashInfer). That means you can treat attention less like “a layer you must shard carefully” and more like “a compute service you can schedule.”

So instead of falling into the usual CP/SP rabbit hole (“what’s the perfect sharding scheme to balance this?”, as when we think about TP/EP), we decouple the quadratic component, push it onto a pool of attention servers, and then shard/rebatch attention tasks *however* is convenient to equalize compute, even non-uniformly, without losing kernel efficiency. Training is throughput-sensitive (NOT latency-sensitive), so we can be aggressive with pipelining/overlap (ping-pong execution, comm/compute overlap) to hide all these overheads in training.

I hope this work provides some new perspectives on how people should think about CP/SP and disaggregation. 😀
Hao AI Lab@haoailab

🔥CAD: Efficient Long-context Language Model Training by Core Attention Disaggregation
Repo: github.com/hao-ai-lab/dis…
Blog: hao-ai-lab.github.io/blogs/distca/
Training a long-context LLM can suffer from severe workload imbalance caused by core attention - the softmax(QK^T)V part. Core-attention disaggregation (CAD) fundamentally eliminates workload imbalance by disaggregating core attention from the rest of the model.

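The two properties in point 2 are easy to check concretely: core attention has no trainable parameters, and splitting the queries into blocks and concatenating the block outputs reproduces the full result exactly. The toy single-process sketch below illustrates only that composability, not the actual DistCA system (it also omits the causal mask, under which each query block would ship with the K/V range it needs).

```python
# Toy check of the two properties that make core attention "disaggregatable":
# softmax(QK^T)V is stateless, and query blocks compose exactly.
import torch

def core_attention(q, k, v):
    """The parameter-free softmax(QK^T / sqrt(d)) V."""
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

torch.manual_seed(0)
q, k, v = (torch.randn(128, 64) for _ in range(3))

# "Disaggregate": split the queries into blocks, as if each block were sent
# to a different attention server, then stitch the results back together.
blocks = [core_attention(qb, k, v) for qb in q.split(32, dim=0)]
assert torch.allclose(torch.cat(blocks), core_attention(q, k, v), atol=1e-5)
print("query blocks compose exactly")
```

That exactness is what makes "attention as a schedulable compute service" safe: rebatching tasks across servers changes where the FLOPs run, not what they compute.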
Yiming Zhao retweeted
Hao AI Lab
Hao AI Lab@haoailab·
🔥CAD: Efficient Long-context Language Model Training by Core Attention Disaggregation
Repo: github.com/hao-ai-lab/dis…
Blog: hao-ai-lab.github.io/blogs/distca/
Training a long-context LLM can suffer from severe workload imbalance caused by core attention - the softmax(QK^T)V part. Core-attention disaggregation (CAD) fundamentally eliminates workload imbalance by disaggregating core attention from the rest of the model.
Yiming Zhao retweeted
Hao Zhang
Hao Zhang@haozhangml·
One of the most interesting things I’ve been working on recently: Jacobi Forcing -- a recipe that turns any autoregressive (AR) LLM into a native, causal parallel decoder (I am glad to send it out to wrap up a great year of 2025 📣📣)

There's a lot of buzz around diffusion LLMs (and yes -- we work on those, too 😃). They’re exciting because they can decode many tokens in parallel. But in practice, there are some big cons:
* Quality gap vs. strong AR baselines is still common.
* Systems mismatch: non-causal attention often breaks “free wins” we’ve spent the last ~2 years optimizing in serving stacks (kernels, batching, etc.).

Speculative decoding (SD) is also great, but most people building real serving systems have probably heard about this: the speedup we can achieve in a real system through SD is much lower than those reported in SoTA paper headlines. See these threads:
* github.com/sgl-project/sg…
* github.com/vllm-project/v…
A big reason is that SD introduces an extra drafting/verification procedure (and sometimes a draft model), which makes scheduling/orchestration much harder to do efficiently at scale.

Jacobi Forcing finds a very nice sweet spot in the "middle path"!
* It keeps the causal (left -> right) generation order, so the model stays close to the AR distribution (as well as to compute kernels and scheduling from the system perspective)
* But it behaves diffusion-like in how much it decodes per forward pass -- multi-token generation without adding drafting heads / models (tokens per forward in our current version can go up to 5)
* Here I emphasize "native": the model itself learns to parallel-decode, which makes integration into existing serving engines much cleaner. This addresses a big pain point of integrating SD into serving systems.

High-level idea: we use the model’s own Jacobi decoding trajectory and progressively distill a sequential decoder into a parallel one -- hence the name Jacobi Forcing. We hypothesize that the causal decoding order is crucial to keeping generation quality as high as the original model's, which turns out to be the case in our empirical results.

The blog goes into the full recipe (noise schedule, training mask, and the inference tricks that make it actually fast in wall-clock). If you’ve been thinking about the AR vs. SD vs. dLLM design space: would love to hear your take — especially on whether keeping causal order is a key ingredient for preserving quality.
Hao AI Lab@haoailab

Jacobi Forcing: training AR models as diffusion-style parallel decoders with 4x speedup while staying causal and maintaining high generation quality. 🚀🎯 Autoregressive (AR) LLMs and diffusion LLMs each have their own strengths. We analyze each method's pros and cons and ask the question: can we get the best of both worlds by turning an AR model into a causal, native parallel decoder? Our answer is YES. 👉 Read the full story here: hao-ai-lab.github.io/blogs/jacobi-f…

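For intuition about the trajectories being distilled: Jacobi decoding treats a block of future tokens as a fixed-point problem and refines all positions in parallel until nothing changes. The sketch below is a minimal version with a hypothetical `next_token` greedy rule standing in for an LLM's argmax; a real implementation scores all positions in one batched forward pass rather than in a Python loop.

```python
# Minimal Jacobi (fixed-point) decoding sketch. `next_token` is a toy,
# deterministic stand-in for greedy argmax over an LLM's logits.
def next_token(prefix):
    """Hypothetical greedy model call."""
    return (sum(prefix) * 31 + 7) % 50

def jacobi_decode_block(prefix, block_len):
    """Start from an arbitrary guess for the next block_len tokens and
    refine every position in parallel until the block is a fixed point.
    Each position i is updated as if positions < i were already correct."""
    guess = [0] * block_len
    while True:
        updated = [next_token(prefix + guess[:i]) for i in range(block_len)]
        if updated == guess:  # fixed point reached
            return guess
        guess = updated

print(jacobi_decode_block([3, 1, 4], block_len=5))
```

The fixed point is exactly what sequential greedy decoding would have produced, which is why the causal order (and hence quality) is preserved; what Jacobi Forcing trains for is converging in far fewer iterations than tokens, so each forward pass commits multiple tokens.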
Yiming Zhao retweeted
Hao AI Lab
Hao AI Lab@haoailab·
Jacobi Forcing: training AR models as diffusion-style parallel decoders with 4x speedup while staying causal and maintaining high generation quality. 🚀🎯 Autoregressive (AR) LLMs and diffusion LLMs each have their own strengths. We analyze each method's pros and cons and ask the question: can we get the best of both worlds by turning an AR model into a causal, native parallel decoder? Our answer is YES. 👉 Read the full story here: hao-ai-lab.github.io/blogs/jacobi-f…
Yiming Zhao retweeted
Hao AI Lab
Hao AI Lab@haoailab·
We had a great evening at the Snowflake x FastVideo social event last week🎉! Thank you all for coming and making it a memorable gathering, and thank you @Snowflake for organizing the event! #NeurIPS2025
Yiming Zhao retweeted
Hao AI Lab
Hao AI Lab@haoailab·
We are grateful to have the new @AMD MI350X at @HDSIUCSD @haoailab. This generous support from @AMD is a meaningful recognition of our work and a valuable opportunity for the UCSD MLSys community and @haoailab to advance research at the intersection of AI and systems. We look forward to putting the MI350X to use in our upcoming projects.
Yiming Zhao retweeted
Hao AI Lab
Hao AI Lab@haoailab·
🔥 New blog: AUP: When Accuracy Meets Parallelism in Diffusion Language Models.
🔗 hao-ai-lab.github.io/blogs/text-dif…
Diffusion LLMs promise parallel decoding, error correction, and random-order generation. But if you look at both speed and accuracy: are dLLMs actually better than AR + speculative decoding? Our study: not yet… Here’s why, and how we design our ultra-fast dLLM framework d3LLM 🚀 to close the gap!