Zongze Li
@freelulul
CS PhD student @UChicago
Chicago, IL · Joined December 2025
27 Following · 84 Followers

@freelulul @ce_zhang Great work, but I have a question about how you insert the append-prefill into decoding, since the decode phase uses CUDA graphs and can't run prefill and decode together.

My first PhD work: "Not All Prefills Are Equal"
Prefill-Decode disaggregation is the standard for LLM serving. But for multi-turn conversations, it re-transfers the entire KV cache every turn.
We found a better way!
Thanks to my amazing advisor @ce_zhang and collaborators!

@loopbobb @ce_zhang Thanks for the great question! We don't need to manually insert prefill into decoding. vLLM's scheduler already handles this natively via chunked prefill (Sarathi-Serve, OSDI'24). When an append-prefill arrives at the decode node, the scheduler merges it into the next iteration.
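If you want to poke at that scheduling behavior in stock vLLM, here's a minimal sketch of enabling chunked prefill (not from our repo; the model name is a placeholder and exact flags/defaults vary by vLLM version):

```python
# Minimal sketch: turn on chunked prefill so the scheduler can batch
# prefill tokens alongside ongoing decode requests in the same iteration.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    enable_chunked_prefill=True,               # split prefills into chunks
    max_num_batched_tokens=2048,               # per-iteration token budget
)

params = SamplingParams(max_tokens=128)
outputs = llm.generate(["Follow-up message for turn 2..."], params)
print(outputs[0].outputs[0].text)
```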

@ce_zhang 4/4 Results across 3,060 benchmark configs and real ShareGPT/WildChat traffic:
48–73% Turn 2+ TTFT reduction
100% success rate (PD baselines degrade under load)
About 75% less KV transfer
Less than 1ms overhead
Paper: arxiv.org/abs/2603.13358
Code: github.com/freelulul/vllm…

@ce_zhang 3/4 But no single fixed strategy wins everywhere: in 92.2% of workload-QPS combinations, the config that is optimal for TTFT differs from the one that is optimal for TPOT.
So we built PPD, a dynamic router that decides per request whether to route append-prefill to decode nodes or use the traditional PD path.
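Conceptually, the per-request decision looks something like the toy sketch below. To be clear: the signals, names, and thresholds here are hypothetical illustrations, not the actual PPD policy from the paper.

```python
# Toy illustration of a dynamic PD router. All thresholds/signals are
# hypothetical placeholders, not the real PPD decision logic.
from dataclasses import dataclass

@dataclass
class Request:
    new_tokens: int          # tokens in the new user turn (still need prefill)
    cached_tokens: int       # tokens whose KV is already on the decode node
    decode_node_load: float  # current utilization of the decode node, 0..1

def route(req: Request,
          max_append_tokens: int = 512,
          max_decode_load: float = 0.8) -> str:
    """Send short follow-up prefills to the decode node; fall back to the
    traditional PD path when the chunk is large or decode is saturated."""
    if (req.cached_tokens > 0
            and req.new_tokens <= max_append_tokens
            and req.decode_node_load <= max_decode_load):
        return "append-prefill on decode node"
    return "traditional PD path (prefill node + KV transfer)"

# Example: a short turn-2 message with a warm cache and a lightly loaded node
print(route(Request(new_tokens=120, cached_tokens=4000, decode_node_load=0.4)))
```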


@ce_zhang 2/4 The key insight: append-prefill (reusing cached KV states) causes only about 2% decode interference vs about 48% for full prefill.
This order-of-magnitude gap means decode nodes can process follow-up turns locally with minimal impact on ongoing generation.
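A back-of-the-envelope example of why the gap is so large (made-up numbers for illustration, not measurements from the paper):

```python
# Rough illustration: turn-2 prefill work with vs. without KV reuse.
# Numbers are invented for illustration only.
history_tokens = 4000    # conversation tokens already cached on the decode node
new_turn_tokens = 100    # tokens in the new user message

append_prefill_work = new_turn_tokens                  # reuse cached KV states
full_prefill_work = history_tokens + new_turn_tokens   # recompute everything

ratio = full_prefill_work / append_prefill_work
print(f"append-prefill: {append_prefill_work} tokens of prefill work")
print(f"full prefill:   {full_prefill_work} tokens ({ratio:.0f}x more)")
```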

