
Rui Pan 潘瑞
@ruipeterpan
4th-yr PhD @PrincetonCS, systems & algorithms for efficient LLM inference, previously @Google @AmazonScience @maxplanckpress @WisconsinCS, fan of @fcbarcelona




Introducing DeepConf: Deep Think with Confidence 🚀

First method to achieve 99.9% on AIME 2025 with open-source models! Using GPT-OSS-120B, even without tools, we reached this near-perfect accuracy while saving up to 85% of generated tokens.

It also delivers strong advantages for parallel thinking:
🔥 Performance boost: ~10% accuracy gain across models & datasets
⚡ Ultra-efficient: up to 85% fewer tokens generated
🔧 Plug & play: works with ANY existing model - zero training needed (and no hyperparameter tuning either!)
⭐ Easy to deploy: just ~50 lines of code in vLLM (see PR below)

📚 Paper: arxiv.org/pdf/2508.15260
🌐 Project: jiaweizzhao.github.io/deepconf

Joint work with @FuYichao123, xuewei_wang, @tydsh (see details in the comments below)
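To make the method concrete, here is a minimal Python sketch of offline confidence filtering plus weighted majority voting in the spirit of DeepConf. This is not the ~50-line vLLM integration: the token-confidence definition (negative mean top-k log-prob), the window size, and the keep ratio are illustrative assumptions.

```python
# Minimal sketch of DeepConf-style confidence filtering + weighted voting.
# Illustrative assumptions: confidence = negative mean top-k log-prob,
# window=128, keep_ratio=0.5. Not the authors' vLLM implementation.
from collections import Counter

def token_confidence(topk_logprobs):
    # Negative mean log-prob of the top-k candidate tokens at one step;
    # a peaked distribution yields a large value (more confident).
    return -sum(topk_logprobs) / len(topk_logprobs)

def lowest_group_confidence(confs, window=128):
    # Minimum sliding-window average confidence along a trace; one bad
    # stretch of reasoning is enough to flag the whole trace.
    if len(confs) <= window:
        return sum(confs) / len(confs)
    running = sum(confs[:window])
    lowest = running / window
    for i in range(window, len(confs)):
        running += confs[i] - confs[i - window]
        lowest = min(lowest, running / window)
    return lowest

def deepconf_vote(traces, keep_ratio=0.5):
    # traces: list of (final_answer, per-step confidence list).
    # Keep the most confident traces, then confidence-weight the vote.
    scored = sorted(
        ((lowest_group_confidence(confs), ans) for ans, confs in traces),
        key=lambda pair: pair[0],
        reverse=True,
    )
    kept = scored[: max(1, int(len(scored) * keep_ratio))]
    votes = Counter()
    for conf, ans in kept:
        votes[ans] += conf
    return votes.most_common(1)[0][0]
```

In the online mode the paper describes, the same sliding-window signal is used to terminate low-confidence traces during generation, which is where most of the token savings come from.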






Marconi: Prefix Caching for the Era of Hybrid LLMs. Marconi brings prefix caching to hybrid LLMs (models that mix attention and SSM layers) with admission and eviction policies that weigh a cache entry's likelihood of reuse against the compute it saves, achieving up to 34.4× higher token hit rates and significantly lower latency.

Our work received an honorable mention for the Outstanding Paper Award!!! 🤓
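To make the caching idea concrete, here is a minimal Python sketch of a prefix cache with a FLOP-aware eviction policy in the spirit of Marconi. It is not the paper's implementation: the exact-prefix lookup, the metadata layout, and the eviction score (compute savings per byte, discounted by recency) are illustrative assumptions.

```python
# Minimal sketch of a FLOP-aware prefix cache in the spirit of Marconi.
# Illustrative assumptions throughout; not the paper's exact policies.
import time

class PrefixCache:
    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.entries = {}  # prefix (tuple of token ids) -> metadata

    def admit(self, prefix, state_bytes, flops_saved):
        # Insert a checkpointed (KV + SSM) state for an exact prefix.
        while self._size() + state_bytes > self.capacity and self.entries:
            self._evict_one()
        self.entries[prefix] = {
            "bytes": state_bytes,
            "flops_saved": flops_saved,
            "last_used": time.monotonic(),
        }

    def lookup(self, tokens):
        # Longest exact-prefix hit. SSM states cannot be partially
        # reused, so only checkpointed prefix boundaries count.
        best = None
        for prefix in self.entries:
            if len(prefix) <= len(tokens) and tokens[: len(prefix)] == prefix:
                if best is None or len(prefix) > len(best):
                    best = prefix
        if best is not None:
            self.entries[best]["last_used"] = time.monotonic()
        return best

    def _size(self):
        return sum(e["bytes"] for e in self.entries.values())

    def _evict_one(self):
        # Evict the entry with the lowest recency-discounted compute
        # savings per byte of cache space it occupies.
        now = time.monotonic()
        def score(item):
            _, e = item
            age = now - e["last_used"] + 1e-9
            return e["flops_saved"] / e["bytes"] / age
        victim = min(self.entries.items(), key=score)[0]
        del self.entries[victim]
```

The intuition is that in hybrid models an SSM state checkpoint can be large yet save an enormous amount of recomputation, so eviction should weigh compute saved per byte rather than recency alone.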


🎉 Congratulations to the FlashInfer team – their technical paper, "FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving," just won Best Paper at #MLSys2025. 🏆

🙌 We are excited to share that we are now backing FlashInfer as a supporter of and contributor to the project. We've chosen FlashInfer as the vehicle for releasing our top LLM inference kernels, including those from TensorRT-LLM, making them easy to integrate into @vllm_project, SGLang (@lmsysorg), and custom inference engines.

It started as a collaborative research project at @uwcse, @CarnegieMellon, and OctoAI (acquired by NVIDIA) with the goal of creating a flexible LLM inference kernel library that is engine agnostic, highly optimized, and easy to extend with new techniques such as algorithms for KV cache reuse. It is now a thriving open-source project with production deployments and contributions from research and development teams across the AI systems community.

Check out FlashInfer today to get started and see our first Blackwell kernels for DeepSeek MLA, available now: nvda.ws/4djKdq7

Congratulations again to Zihao Ye and all authors of the MLSys paper – Lequn Chen, Wuwei Lin, Yineng Zhang, Stephanie Wang, Baris Kasikci, Arvind Krishnamurthy, Vinod Grover, Tianqi Chen, Ruihang Lai. And thank you to all community contributors – we look forward to continuing to grow this project.

FlashInfer paper: nvda.ws/4kj2Htc
Blackwell MLA kernel: nvda.ws/4jWjLW2
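If you want to kick the tires, here is a tiny usage sketch of FlashInfer's single-request decode attention, adapted from the example in the project's README. Shapes and signatures may differ across FlashInfer versions, so treat it as illustrative; it requires a CUDA GPU and the flashinfer package installed.

```python
# Hedged sketch of FlashInfer's single-request decode attention, adapted
# from the project's README; treat shapes/signatures as illustrative.
import torch
import flashinfer

num_qo_heads, num_kv_heads, head_dim, kv_len = 32, 8, 128, 2048

# Query for one decode step, plus the cached keys/values for the sequence.
q = torch.randn(num_qo_heads, head_dim, dtype=torch.half, device="cuda")
k = torch.randn(kv_len, num_kv_heads, head_dim, dtype=torch.half, device="cuda")
v = torch.randn(kv_len, num_kv_heads, head_dim, dtype=torch.half, device="cuda")

# Grouped-query decode attention in a single fused kernel call.
o = flashinfer.single_decode_with_kv_cache(q, k, v)  # [num_qo_heads, head_dim]
```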