
Rui Pan 潘瑞
@ruipeterpan
4th-yr PhD @PrincetonCS, systems & algorithms for efficient LLM inference, previously @Google @AmazonScience @maxplanckpress @WisconsinCS, fan of @fcbarcelona




Introducing DeepConf: Deep Think with Confidence 🚀

First method to achieve 99.9% on AIME 2025 with open-source models! Using GPT-OSS-120B, even without tools, we reached this near-perfect accuracy while saving up to 85% of generated tokens.

It also delivers strong advantages for parallel thinking:
🔥 Performance boost: ~10% accuracy gain across models & datasets
⚡ Ultra-efficient: up to 85% fewer tokens generated
🔧 Plug & play: works with ANY existing model - zero training needed (and no hyperparameter tuning either!)
⭐ Easy to deploy: just ~50 lines of code in vLLM (see PR below)

📚 Paper: arxiv.org/pdf/2508.15260
🌐 Project: jiaweizzhao.github.io/deepconf

Joint work with @FuYichao123, xuewei_wang, @tydsh (see details in the comments below)
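To make the method concrete, here is a minimal Python sketch of offline confidence filtering plus weighted majority voting in the spirit of DeepConf. This is not the ~50-line vLLM integration: the token-confidence definition (negative mean top-k log-prob), the window size, and the keep ratio are illustrative assumptions.

```python
# Minimal sketch of DeepConf-style confidence filtering + weighted voting.
# Illustrative assumptions: confidence = negative mean top-k log-prob,
# window=128, keep_ratio=0.5. Not the authors' vLLM implementation.
from collections import Counter

def token_confidence(topk_logprobs):
    # Negative mean log-prob of the top-k candidate tokens at one step;
    # a peaked distribution yields a large value (more confident).
    return -sum(topk_logprobs) / len(topk_logprobs)

def lowest_group_confidence(confs, window=128):
    # Minimum sliding-window average confidence along a trace; one bad
    # stretch of reasoning is enough to flag the whole trace.
    if len(confs) <= window:
        return sum(confs) / len(confs)
    running = sum(confs[:window])
    lowest = running / window
    for i in range(window, len(confs)):
        running += confs[i] - confs[i - window]
        lowest = min(lowest, running / window)
    return lowest

def deepconf_vote(traces, keep_ratio=0.5):
    # traces: list of (final_answer, per-step confidence list).
    # Keep the most confident traces, then confidence-weight the vote.
    scored = sorted(
        ((lowest_group_confidence(confs), ans) for ans, confs in traces),
        key=lambda pair: pair[0],
        reverse=True,
    )
    kept = scored[: max(1, int(len(scored) * keep_ratio))]
    votes = Counter()
    for conf, ans in kept:
        votes[ans] += conf
    return votes.most_common(1)[0][0]
```

In the online mode the paper describes, the same sliding-window signal is used to terminate low-confidence traces during generation, which is where most of the token savings come from.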






Marconi: Prefix Caching for the Era of Hybrid LLMs. Marconi brings prefix caching to hybrid LLMs (models that mix attention and SSM layers) with admission and eviction policies that weigh a cache entry's likelihood of reuse against the compute it saves, achieving up to 34.4× higher token hit rates and significantly lower latency.

Our work received an honorable mention for the Outstanding Paper Award!!! 🤓
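To make the caching idea concrete, here is a minimal Python sketch of a prefix cache with a FLOP-aware eviction policy in the spirit of Marconi. It is not the paper's implementation: the exact-prefix lookup, the metadata layout, and the eviction score (compute savings per byte, discounted by recency) are illustrative assumptions.

```python
# Minimal sketch of a FLOP-aware prefix cache in the spirit of Marconi.
# Illustrative assumptions throughout; not the paper's exact policies.
import time

class PrefixCache:
    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.entries = {}  # prefix (tuple of token ids) -> metadata

    def admit(self, prefix, state_bytes, flops_saved):
        # Insert a checkpointed (KV + SSM) state for an exact prefix.
        while self._size() + state_bytes > self.capacity and self.entries:
            self._evict_one()
        self.entries[prefix] = {
            "bytes": state_bytes,
            "flops_saved": flops_saved,
            "last_used": time.monotonic(),
        }

    def lookup(self, tokens):
        # Longest exact-prefix hit. SSM states cannot be partially
        # reused, so only checkpointed prefix boundaries count.
        best = None
        for prefix in self.entries:
            if len(prefix) <= len(tokens) and tokens[: len(prefix)] == prefix:
                if best is None or len(prefix) > len(best):
                    best = prefix
        if best is not None:
            self.entries[best]["last_used"] = time.monotonic()
        return best

    def _size(self):
        return sum(e["bytes"] for e in self.entries.values())

    def _evict_one(self):
        # Evict the entry with the lowest recency-discounted compute
        # savings per byte of cache space it occupies.
        now = time.monotonic()
        def score(item):
            _, e = item
            age = now - e["last_used"] + 1e-9
            return e["flops_saved"] / e["bytes"] / age
        victim = min(self.entries.items(), key=score)[0]
        del self.entries[victim]
```

The intuition is that in hybrid models an SSM state checkpoint can be large yet save an enormous amount of recomputation, so eviction should weigh compute saved per byte rather than recency alone.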


🎉 Congratulations to the FlashInfer team – their technical paper, "FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving," just won Best Paper at #MLSys2025. 🏆

🙌 We are excited to share that we are now backing FlashInfer as a supporter of and contributor to the project. We've chosen FlashInfer as the vehicle for releasing our top LLM inference kernels, including those from TensorRT-LLM, making them easy to integrate into @vllm_project, SGLang (@lmsysorg), and custom inference engines.

It started as a collaborative research project at @uwcse, @CarnegieMellon, and OctoAI (acquired by NVIDIA) with the goal of creating a flexible LLM inference kernel library that is engine agnostic, highly optimized, and easy to extend with new techniques such as algorithms for KV cache reuse. It is now a thriving open-source project with production deployments and contributions from research and development teams across the AI systems community.

Check out FlashInfer today to get started and see our first Blackwell kernels for DeepSeek MLA, available now: nvda.ws/4djKdq7

Congratulations again to Zihao Ye and all authors of the MLSys paper – Lequn Chen, Wuwei Lin, Yineng Zhang, Stephanie Wang, Baris Kasikci, Arvind Krishnamurthy, Vinod Grover, Tianqi Chen, Ruihang Lai. And thank you to all community contributors – we look forward to continuing to grow this project.

FlashInfer paper: nvda.ws/4kj2Htc
Blackwell MLA kernel: nvda.ws/4jWjLW2
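If you want to kick the tires, here is a tiny usage sketch of FlashInfer's single-request decode attention, adapted from the example in the project's README. Shapes and signatures may differ across FlashInfer versions, so treat it as illustrative; it requires a CUDA GPU and the flashinfer package installed.

```python
# Hedged sketch of FlashInfer's single-request decode attention, adapted
# from the project's README; treat shapes/signatures as illustrative.
import torch
import flashinfer

num_qo_heads, num_kv_heads, head_dim, kv_len = 32, 8, 128, 2048

# Query for one decode step, plus the cached keys/values for the sequence.
q = torch.randn(num_qo_heads, head_dim, dtype=torch.half, device="cuda")
k = torch.randn(kv_len, num_kv_heads, head_dim, dtype=torch.half, device="cuda")
v = torch.randn(kv_len, num_kv_heads, head_dim, dtype=torch.half, device="cuda")

# Grouped-query decode attention in a single fused kernel call.
o = flashinfer.single_decode_with_kv_cache(q, k, v)  # [num_qo_heads, head_dim]
```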