Zongze Li
@freelulul
CS PhD student @UChicago
Chicago, IL · Joined December 2025
27 Following · 84 Followers

@freelulul @ce_zhang Great work, but I have a question about how you insert the append-prefill into decoding, since the decode phase uses CUDA graphs and can't run prefill and decode together.

My first PhD work: "Not All Prefills Are Equal"
Prefill-Decode disaggregation is the standard for LLM serving. But for multi-turn conversations, it re-transfers the entire KV cache every turn.
We found a better way!
Thanks to my amazing advisor @ce_zhang and collaborators!

@loopbobb @ce_zhang Thanks for the great question! We don't need to manually insert prefill into decoding. vLLM's scheduler already handles this natively via chunked prefill (Sarathi-Serve, OSDI'24). When an append-prefill arrives at the decode node, the scheduler merges it into the next iteration.
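If you want to poke at that scheduling behavior in stock vLLM, here's a minimal sketch of enabling chunked prefill (not from our repo; the model name is a placeholder and exact flags/defaults vary by vLLM version):

```python
# Minimal sketch: turn on chunked prefill so the scheduler can batch
# prefill tokens alongside ongoing decode requests in the same iteration.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    enable_chunked_prefill=True,               # split prefills into chunks
    max_num_batched_tokens=2048,               # per-iteration token budget
)

params = SamplingParams(max_tokens=128)
outputs = llm.generate(["Follow-up message for turn 2..."], params)
print(outputs[0].outputs[0].text)
```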

@ce_zhang 4/4 Results across 3,060 benchmark configs and real ShareGPT/WildChat traffic:
48–73% Turn 2+ TTFT reduction
100% success rate (PD baselines degrade under load)
About 75% less KV transfer
Less than 1ms overhead
Paper: arxiv.org/abs/2603.13358
Code: github.com/freelulul/vllm…

@ce_zhang 3/4 But no single fixed strategy wins everywhere: in 92.2% of workload-QPS combinations, the config that is optimal for TTFT differs from the one that is optimal for TPOT.
So we built PPD, a dynamic router that decides per request whether to route append-prefill to decode nodes or use the traditional PD path.
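Conceptually, the per-request decision looks something like the toy sketch below. To be clear: the signals, names, and thresholds here are hypothetical illustrations, not the actual PPD policy from the paper.

```python
# Toy illustration of a dynamic PD router. All thresholds/signals are
# hypothetical placeholders, not the real PPD decision logic.
from dataclasses import dataclass

@dataclass
class Request:
    new_tokens: int          # tokens in the new user turn (still need prefill)
    cached_tokens: int       # tokens whose KV is already on the decode node
    decode_node_load: float  # current utilization of the decode node, 0..1

def route(req: Request,
          max_append_tokens: int = 512,
          max_decode_load: float = 0.8) -> str:
    """Send short follow-up prefills to the decode node; fall back to the
    traditional PD path when the chunk is large or decode is saturated."""
    if (req.cached_tokens > 0
            and req.new_tokens <= max_append_tokens
            and req.decode_node_load <= max_decode_load):
        return "append-prefill on decode node"
    return "traditional PD path (prefill node + KV transfer)"

# Example: a short turn-2 message with a warm cache and a lightly loaded node
print(route(Request(new_tokens=120, cached_tokens=4000, decode_node_load=0.4)))
```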


@ce_zhang 2/4 The key insight: append-prefill (reusing cached KV states) causes only about 2% decode interference vs about 48% for full prefill.
This order-of-magnitude gap means decode nodes can process follow-up turns locally with minimal impact on ongoing generation.
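A back-of-the-envelope example of why the gap is so large (made-up numbers for illustration, not measurements from the paper):

```python
# Rough illustration: turn-2 prefill work with vs. without KV reuse.
# Numbers are invented for illustration only.
history_tokens = 4000    # conversation tokens already cached on the decode node
new_turn_tokens = 100    # tokens in the new user message

append_prefill_work = new_turn_tokens                  # reuse cached KV states
full_prefill_work = history_tokens + new_turn_tokens   # recompute everything

ratio = full_prefill_work / append_prefill_work
print(f"append-prefill: {append_prefill_work} tokens of prefill work")
print(f"full prefill:   {full_prefill_work} tokens ({ratio:.0f}x more)")
```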

