Rui Pan 潘瑞

68 posts

@ruipeterpan

4th-yr PhD @PrincetonCS, systems & algorithms for efficient LLM inference, previously @Google @AmazonScience @maxplanckpress @WisconsinCS, fan of @fcbarcelona

Princeton, NJ · Joined August 2016
1.9K Following · 737 Followers
Rui Pan 潘瑞 retweeted
Kaiqu Liang@kaiqu_liang·
New Meta Research 🚀 AI agents are powerful, but don’t stay aligned with you over time. When preferences shift, they don’t adapt. You correct them once…they repeat the mistake. 🤦 Introducing PAHF: continual personalization where agents learn from feedback to stay in sync.
Charles 🎉 Frye@charles_irl·
@rahulgs @akshat_b @_dcw02 @modal hard to say because the paper isn't published yet! my rough speculation (ha!) is that diffusion models are more compute efficient than transformers up to a certain quality level
Akshat Bubna@akshat_b·
Two days since DFlash was released, and @_dcw02 (on @modal research) already shipped support for it in SGLang. Why are we so excited about this? Diffusion speculators let us get *way* higher tok/s than auto-regressive models. E.g. we're seeing a 4.73x boost with H200s + FA3 already — with still more improvements to come! Reach out to us if we can help you get this in prod today, and huge thanks to @zhijianliu_ and team for coming up with this technique.
Rui Pan 潘瑞 retweeted
Lijie(Derrick) Yang@LijieyYang·
[1/N] 🚀 Excited to introduce my first work at @Princeton: LessIsMore – a training-free sparse attention method tailored for efficient reasoning in LRMs, achieving lossless accuracy with high sparsity up to 87.5% and 1.1x avg decoding speedup compared to Full Attention on reasoning tasks like AIME-24. (More details in 🧵) 💻 Code: github.com/DerrickYLJ/Les… 📄 arXiv: huggingface.co/papers/2508.07… 🔍 HF Daily Paper: huggingface.co/papers/2508.07…
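As a rough, self-contained illustration of the top-k style of training-free sparse attention, here is a toy sketch in plain Python. It is not the LessIsMore selection rule itself, whose token-selection strategy is described in the paper; the dimensions and inputs are purely illustrative.

```python
import math

def sparse_attention(query, keys, values, k):
    """Toy top-k sparse attention: score every key, but run the
    softmax and weighted sum over only the k highest-scoring
    positions. Skipping the rest is where the decode-time
    savings come from."""
    d = len(query)
    scores = [sum(q * kk for q, kk in zip(query, key)) / math.sqrt(d)
              for key in keys]
    # Indices of the k largest scores (training-free selection).
    top = sorted(range(len(scores)), key=lambda i: scores[i],
                 reverse=True)[:k]
    # Numerically stable softmax restricted to the selected positions.
    m = max(scores[i] for i in top)
    exps = {i: math.exp(scores[i] - m) for i in top}
    z = sum(exps.values())
    out = [0.0] * len(values[0])
    for i in top:
        w = exps[i] / z
        for j, v in enumerate(values[i]):
            out[j] += w * v
    return out
```

With k much smaller than the sequence length, only the selected positions' values are read, which is where sparse attention saves memory traffic during decoding.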
Rui Pan 潘瑞 retweeted
Siddhant Ray@siddhantrayyy·
With RAG and agents becoming ubiquitous in LLM systems, tuning quality and performance JOINTLY is essential to achieve the best LLM quality-of-experience. Our paper at SOSP this year addresses this exact tradeoff! 🔥
Rui Pan 潘瑞 retweeted
Lindia Tjuatja@lltjuatja·
committed to doing my part in decreasing reviewer workload by writing fewer papers
机器之心 JIQIZHIXIN@jiqizhixin·
Chain-of-Experts (CoE), proposed by @wzihanw and @ruipeterpan, is a fresh take on Mixture-of-Experts that trades parallelism for communication. Instead of picking experts once, CoE iteratively routes tokens through a chain of experts within each layer, re-evaluating at every step.
💡 Why it matters:
🔁 Dynamic routing = richer, more adaptive representations
📉 Reduced validation loss (1.20 → 1.12 on math reasoning)
⚖️ New scaling axis: depth through iteration, not just width
💾 Up to 42% memory savings vs. standard MoE scaling
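The iterative-routing idea reads naturally as a loop. Below is a minimal sketch with a toy scalar state; the gate and experts are stand-ins purely for illustration (the real CoE gate and experts operate on token hidden states within a transformer layer):

```python
def chain_of_experts(x, experts, gate, steps):
    """Toy Chain-of-Experts step: instead of selecting an expert
    once, route the state through `steps` experts sequentially,
    re-evaluating the gate on the updated state each iteration."""
    for _ in range(steps):
        idx = gate(x)        # re-route based on the current state
        x = experts[idx](x)  # chosen expert transforms the state
    return x
```

Because the gate runs again on the updated state, the same layer's experts can be composed in different orders per token, which is the "depth through iteration, not just width" scaling axis the tweet mentions.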
Rui Pan 潘瑞 retweeted
Arthur Zucker@art_zucker·
A quick update on the future of the `transformers` library! In order to provide a source of truth for all models, we are working with the rest of the ecosystem to make the modeling code the standard. A joint effort with vLLM, LlamaCPP, SGLang, Mlx, Qwen, Glm, Unsloth, Axolotl, Deepspeed, IBM, Gemma, Llama, Deepseek, microsoft, nvidia, internLM, Llava, AllenAI, Cohere, TogetherAI.....
Rui Pan 潘瑞@ruipeterpan·
I'm too lazy to write a real promo post or make a poster. But! 🤓👆I will be at MLSys presenting our work this Thursday in Session 10. This stemmed from an amazing collaboration with Zhuang, Zhen, Can, @ZancatoLuca, @yidawang from AWS, and @tri_dao and Ravi from Princeton. Come say hi -- happy to chat!
gm8xx8@gm8xx8

Marconi: Prefix Caching for the Era of Hybrid LLMs Marconi improves caching for hybrid LLMs with policies optimizing reuse likelihood and compute savings, achieving 34.4× higher token hit rates and significantly reducing latency.

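For context, the core operation behind prefix caching can be sketched minimally. This is a generic illustration, not Marconi's actual data structure or policy (which, per the tweet, also weighs how likely a cached prefix is to be reused against the compute it saves):

```python
def longest_cached_prefix(cache, tokens):
    """Find the longest prefix of `tokens` whose state is already
    cached, so the serving engine can skip recomputing it.
    `cache` is a set of previously stored token-prefix tuples."""
    for end in range(len(tokens), 0, -1):
        if tuple(tokens[:end]) in cache:
            return end  # number of tokens whose state is reusable
    return 0  # no reusable prefix; full recompute needed
```

The hit-rate gains the tweet cites come from smarter decisions about which prefixes to admit and evict, not from the lookup itself.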
Rui Pan 潘瑞@ruipeterpan·
@tri_dao Thanks, Tri! It was a pleasure to work together on this project and I couldn’t have asked for a more awesome collaborator 🥹
Zihao Ye@ye_combinator·
We’re thrilled that FlashInfer won a Best Paper Award at MLSys 2025! 🎉 This wouldn’t have been possible without the community — huge thanks to @lmsysorg’s sglang for deep co-design (which is critical for inference kernel evolution) and stress-testing over the years, and to @vllm_project for integration support. With continued help from @NVIDIAAIDev, FlashInfer is becoming more stable and faster. Let’s keep building together!
NVIDIA AI Developer@NVIDIAAIDev

🎉 Congratulations to the FlashInfer team – their technical paper, "FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving," just won best paper at #MLSys2025. 🏆

🙌 We are excited to share that we are now backing FlashInfer as a supporter and contributor to the project. We’ve chosen FlashInfer to release our top LLM inference kernels, including those from TensorRT-LLM, making them easy to integrate into @vllm_project, SGLang (@lmsysorg), and custom inference engines.

It started as a collaborative research project at @uwcse, @CarnegieMellon, and OctoAI (acquired by NVIDIA) with the goal of creating a flexible LLM inference kernel library that is engine agnostic, highly optimized, and easy to extend for new techniques such as algorithms for KV cache reuse. It is now a thriving open source project with production deployments and contributions from research and development teams across the AI systems community.

Check out FlashInfer today to see our first Blackwell kernels for DeepSeek MLA, available now: nvda.ws/4djKdq7

Congratulations again to Zihao Ye and all authors of the MLSys paper – Lequn Chen, Wuwei Lin, Yineng Zhang, Stephanie Wang, Baris Kasikci, Arvind Krishnamurthy, Vinod Grover, Tianqi Chen, Ruihang Lai. And thank you for all community contributions; we look forward to continuing to grow this project.

FlashInfer paper: nvda.ws/4kj2Htc
Blackwell MLA kernel: nvda.ws/4jWjLW2

Rui Pan 潘瑞 retweeted
Minghao Yan@Minghao__Yan·
I will be presenting our work, Decoding Speculative Decoding, at @naacl tomorrow. We identified the performance bottleneck in speculative decoding to be draft model depth and demonstrated low correlation between language modeling performance and token acceptance rate.
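For readers new to the area, a minimal greedy draft-and-verify loop illustrates the general speculative decoding technique being analyzed. This is a sketch, not the paper's setup; `target_next` and `draft_next` are hypothetical stand-ins for the two models' greedy next-token calls:

```python
def speculative_decode(target_next, draft_next, prompt, n_draft, n_tokens):
    """Greedy speculative decoding: a cheap draft model proposes
    n_draft tokens; the target model verifies them left to right
    and keeps the longest agreeing run, so several tokens can be
    committed per expensive target pass without changing output."""
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        # Draft phase: the cheap model guesses a run of tokens.
        ctx, guesses = list(out), []
        for _ in range(n_draft):
            g = draft_next(ctx)
            guesses.append(g)
            ctx.append(g)
        # Verify phase: the target checks each guess in order.
        for g in guesses:
            t = target_next(out)
            out.append(t)      # the target's token is always kept
            if t != g:
                break          # first mismatch ends the run
        else:
            out.append(target_next(out))  # bonus token: all accepted
    return out[len(prompt):len(prompt) + n_tokens]
```

The speedup hinges on how many draft guesses survive verification versus how much latency the draft model adds per guess, which is why the tweet's finding that draft depth, rather than draft language-modeling quality, is the bottleneck matters.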