Zongze Li

7 posts

@freelulul

CS PhD student @UChicago

Chicago, IL · Joined December 2025
27 Following · 84 Followers
Zongze Li @freelulul
@loopbobb @ce_zhang CUDA graphs are only used for pure-decode steps; when prefill tokens are present, the scheduler automatically falls back to eager mode for that iteration. Once the append-prefill completes, subsequent steps resume CUDA-graphed pure decode.
1 reply · 0 reposts · 0 likes · 21 views
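
A minimal sketch of the dispatch rule described in the reply above, in Python. The `Batch` type and runner callables are hypothetical stand-ins, not vLLM's actual API; the point is only the branch on whether the step contains any prefill tokens.

```python
from dataclasses import dataclass

@dataclass
class Batch:
    num_prefill_tokens: int  # prefill tokens scheduled this iteration
    num_decode_tokens: int   # one token per decoding sequence

def execute_step(batch: Batch, run_eager, replay_cuda_graph):
    # Pure-decode batches have static shapes (one new token per sequence),
    # so a pre-captured CUDA graph can simply be replayed.
    if batch.num_prefill_tokens == 0:
        return replay_cuda_graph(batch)
    # Any prefill tokens make the shapes dynamic for this iteration,
    # so fall back to eager execution just for this step.
    return run_eager(batch)

# While an append-prefill is in flight, batches are mixed -> eager;
# once it completes, batches are pure decode again -> CUDA graph.
print(execute_step(Batch(128, 32), lambda b: "eager", lambda b: "graph"))  # eager
print(execute_step(Batch(0, 32), lambda b: "eager", lambda b: "graph"))    # graph
```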
loop @loopbobb
@freelulul @ce_zhang Great work, but I have a question about how you insert the append-prefill into decoding, since the decode path uses CUDA graphs and can't run prefill and decode together.
2 replies · 0 reposts · 0 likes · 15 views
Zongze Li @freelulul
My first PhD work: "Not All Prefills Are Equal." Prefill-Decode disaggregation is the standard for LLM serving, but for multi-turn conversations it re-transfers the entire KV cache every turn. We found a better way! Thanks to my amazing advisor @ce_zhang and collaborators!
5 replies · 9 reposts · 135 likes · 12.5K views
Zongze Li @freelulul
@loopbobb @ce_zhang Thanks for the great question! We don't need to manually insert prefill into decoding; vLLM's scheduler already handles this natively via chunked prefill (Sarathi-Serve, OSDI '24). When an append-prefill arrives at the decode node, the scheduler merges it into the next iteration's batch.
0 replies · 0 reposts · 0 likes · 61 views
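
A minimal sketch of Sarathi-Serve-style chunked prefill, assuming a fixed per-iteration token budget. All names here (`TOKEN_BUDGET`, `build_batch`) are illustrative, not vLLM's scheduler API; the point is that an append-prefill is just another prefill entry the scheduler packs alongside ongoing decodes.

```python
TOKEN_BUDGET = 512  # assumed per-iteration token budget

def build_batch(decode_seqs, prefill_queue):
    """decode_seqs: sequences needing one token each this step.
    prefill_queue: mutable [request_id, remaining_prefill_tokens] pairs."""
    # Decode tokens go first so ongoing generation never starves.
    batch = [("decode", seq, 1) for seq in decode_seqs]
    budget = TOKEN_BUDGET - len(decode_seqs)
    for req in prefill_queue:
        if budget <= 0:
            break
        req_id, remaining = req
        chunk = min(remaining, budget)  # split large prefills into chunks
        batch.append(("prefill", req_id, chunk))
        req[1] -= chunk
        budget -= chunk
    return batch

# An append-prefill arriving at the decode node is merged into the next
# iteration alongside the ongoing decodes -- no manual insertion needed.
print(build_batch(["s0", "s1"], [["append-req", 300]]))
# [('decode', 's0', 1), ('decode', 's1', 1), ('prefill', 'append-req', 300)]
```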
Zongze Li @freelulul
@ce_zhang 3/4 But no single fixed strategy wins everywhere. 92.2% of workload-QPS combinations have different optimal configs for TTFT vs TPOT. So we built PPD, a dynamic router that decides per request whether to route append-prefill to decode nodes or use the traditional PD path.
0 replies · 1 repost · 6 likes · 657 views
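
A minimal sketch of what such a per-request routing decision could look like. The features and threshold below are assumptions for illustration only; the paper's actual PPD policy is not reproduced here.

```python
def route_request(cached_tokens: int, new_tokens: int,
                  decode_node_load: float, max_interference: float = 0.05):
    """Route an append-prefill to the decode node when its expected
    interference is small; otherwise use the traditional PD path."""
    # Heuristic (assumed): interference grows with the fraction of
    # uncached tokens and with the decode node's current load.
    uncached_frac = new_tokens / (cached_tokens + new_tokens)
    expected_interference = uncached_frac * decode_node_load
    return "decode-node" if expected_interference <= max_interference else "pd-path"

# Follow-up turn with a warm KV cache: mostly cached -> run locally.
print(route_request(cached_tokens=4000, new_tokens=100, decode_node_load=0.8))
# First turn (cold cache): route through the prefill node as usual.
print(route_request(cached_tokens=0, new_tokens=2000, decode_node_load=0.8))
```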
Zongze Li @freelulul
@ce_zhang 2/4 The key insight: append-prefill (reusing cached KV states) causes only about 2% decode interference vs about 48% for full prefill. This order-of-magnitude gap means decode nodes can process follow-up turns locally with minimal impact on ongoing generation.
0 replies · 1 repost · 6 likes · 776 views
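
A back-of-the-envelope illustration (with assumed token counts, not numbers from the paper) of why append-prefill interferes so much less: it only processes the new turn's tokens, while a full prefill would recompute over the entire conversation.

```python
history_tokens = 8000   # KV states already cached on the decode node (assumed)
new_turn_tokens = 200   # tokens added by the follow-up message (assumed)

full_prefill_work = history_tokens + new_turn_tokens  # recompute everything
append_prefill_work = new_turn_tokens                 # reuse cached KV states

print(f"full prefill:   {full_prefill_work} tokens")    # 8200
print(f"append prefill: {append_prefill_work} tokens")  # 200
print(f"work ratio: {append_prefill_work / full_prefill_work:.1%}")  # ~2.4%
```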