Ruisi Cai

27 posts


@ccccrs_0908

Ph.D. student @UTAustin; Research Intern @NVIDIA @CitadelSecurities; BS @USTC; NVIDIA fellowship 2025 recipient

Joined May 2020
286 Following · 426 Followers
Ruisi Cai reposted
VITA Group@VITAGroupUT·
7 VITA papers @ #ICML2025 1️⃣ Contextualized Equivariant PE 2️⃣ Linear Attention 3️⃣ Multi-view Video Diffusion 4️⃣ Alignment as Statistical Estimation 5️⃣ Low-Rank LLM Weight Theory 6️⃣ Geo-Distributed LLM Training 7️⃣ μP Scale Separation Come find us at poster sessions 👇🧵
Ruisi Cai@ccccrs_0908·
Thanks for sharing our work! Check out the paper for some interesting theory on 3DGS. Code coming soon! 🔜 #3DGS #CVPR2025
Zhenjun Zhao@zhenjun_zhao

Steepest Descent Density Control for Compact 3D Gaussian Splatting @peihao_wang, @yuehaowang, @dilin_wang, @mohan_sreyas, @WayneINR, Lemeng Wu, @ccccrs_0908, @YuYingYeh1, Zhangyang Wang, @lqiang67, Rakesh Ranjan tl;dr: split Gaussians in saddle areas into two offspring & displace the new primitives along the steepest-descent directions -> escape the saddle area -> avoid locally sub-optimal parameters arxiv.org/abs/2505.05587
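To make the tl;dr concrete, here is a minimal Python sketch of that splitting step, assuming the local Hessian of the loss w.r.t. a Gaussian's center is available; the function name, the eigen-decomposition test, and the step size are illustrative guesses, not the paper's exact procedure.

```python
import numpy as np

def split_gaussian_at_saddle(center, hessian, step=1e-2):
    """Hypothetical sketch: split one Gaussian primitive stuck near a saddle
    into two offspring displaced along the steepest-descent escape direction.
    `center` is the (3,) Gaussian mean; `hessian` is the (3, 3) local Hessian
    of the reconstruction loss w.r.t. the mean (both assumed precomputed)."""
    eigvals, eigvecs = np.linalg.eigh(hessian)   # eigenvalues in ascending order
    if eigvals[0] >= 0:
        return [center]                          # no negative curvature: not a saddle
    escape = eigvecs[:, 0]                       # direction of most negative curvature
    # Two offspring displaced symmetrically so they can leave the saddle region.
    return [center + step * escape, center - step * escape]
```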

Ruisi Cai@ccccrs_0908·
Huge thanks to @NVIDIA, my mentors, and collaborators for guidance and support!💚 #NVIDIA
Ruisi Cai@ccccrs_0908·
Excited to share that I have been awarded the NVIDIA fellowship! 🎉 Immensely grateful for the recognition and support - this inspires me to continue advancing research in LLM efficiency and AI security. blogs.nvidia.com/blog/graduate-…
Ruisi Cai reposted
Pavlo Molchanov@PavloMolchanov·
Sharing our team's latest work on Hymba - an efficient small language model with a hybrid architecture. Tech report: arxiv.org/abs/2411.13676 Discover the trade-off between Mamba and Attention, how they can be combined, how the attention-sink and forced-to-attend phenomena can be mitigated, and how the KV cache can be shared across layers. Learn how we built the model with an end-to-end ecosystem: data selection, architecture analysis and design, training Base and Instruct models, and opening them to the community. Did I mention that our Hymba-1.5B Base model outperforms LLaMA 3.2-3B while being trained on 7× fewer tokens and achieving 12× higher throughput? More details and model links coming soon!
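For readers new to hybrid token mixers, below is a rough PyTorch sketch of the general idea of running an attention branch and an SSM-style branch in parallel and fusing them; the toy gated recurrence, the sizes, and the fusion rule are stand-ins chosen for illustration, not Hymba's actual architecture or its KV-cache sharing.

```python
import torch
import torch.nn as nn

class HybridBlockSketch(nn.Module):
    """Sketch of a hybrid mixer: attention plus a toy gated linear recurrence
    (standing in for a Mamba/SSM branch), run in parallel and fused."""
    def __init__(self, dim: int, n_heads: int = 4):
        super().__init__()
        # dim must be divisible by n_heads; causal masking omitted for brevity.
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.in_proj = nn.Linear(dim, 2 * dim)     # value and gate for the SSM-ish branch
        self.decay = nn.Parameter(torch.full((dim,), 0.9))
        self.out_proj = nn.Linear(2 * dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim)
        attn_out, _ = self.attn(x, x, x, need_weights=False)

        v, g = self.in_proj(x).chunk(2, dim=-1)
        state = torch.zeros_like(v[:, 0])
        ssm_out = []
        for t in range(v.shape[1]):                # toy recurrent scan over time
            state = self.decay * state + torch.sigmoid(g[:, t]) * v[:, t]
            ssm_out.append(state)
        ssm_out = torch.stack(ssm_out, dim=1)

        # Fuse the two branches; the real model's fusion differs.
        return self.out_proj(torch.cat([attn_out, ssm_out], dim=-1))
```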
Ruisi Cai@ccccrs_0908·
@hilbertmeng Thanks Qingye for your suggestion! Just modified this part.
Qingye Meng@hilbertmeng·
@ccccrs_0908 Interesting work! I read your paper but was confused by the compression of queries in both Equation (5) and the highlighted sentence in Section 3.2.1. Are these typos?
Ruisi Cai@ccccrs_0908·
Managing long context is challenging due to quadratic attention memory usage. But what if we could compress growing context information into a fixed-size memory? 🤔 Check out our new ICML paper: "LoCoCo: Dropping In Convolutions for Long Context Compression"! 1/3
Ruisi Cai@ccccrs_0908·
With countless open-source LLM checkpoints available, each specializing in unique domain knowledge, how can we tap into their full potential? Check out Model-GLUE! 🚀 We introduce a framework that integrates model merging, mixture, and stacking to unlock new possibilities.
VITA Group@VITAGroupUT

1/ 🌟 Excited to announce #Model-#GLUE (#neurips2024 D&B), a new framework designed by an extensive team from UNC, UMD, UT Austin, HKUST, Google, and CMU to #scale pre-trained LLMs efficiently! 🚀 Tackling the challenge of #aggregating disparate pre-trained LLMs, we introduce a holistic guideline and benchmark for anyone with a large, diverse model zoo "in the wild"! #LLM #AIresearch
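As a toy illustration of the "merging" ingredient mentioned above, the sketch below averages matching parameters from several same-architecture checkpoints; Model-GLUE's actual guideline (deciding what to merge, mix, or stack) is much richer than this, and the function name here is hypothetical.

```python
import torch

def merge_state_dicts(state_dicts, weights=None):
    """Toy parameter-space merge: a weighted average of matching parameters
    from several checkpoints, assuming they all share one architecture."""
    weights = weights or [1.0 / len(state_dicts)] * len(state_dicts)
    merged = {}
    for name in state_dicts[0]:
        merged[name] = sum(w * sd[name].float() for w, sd in zip(weights, state_dicts))
    return merged

# Usage sketch (paths are placeholders):
# merged = merge_state_dicts([torch.load(p) for p in ckpt_paths])
```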

Ruisi Cai@ccccrs_0908·
Excited to see flexible inference being explored on the Mamba architecture! Our recent work Flextron tackles similar challenges. Looking forward to seeing how these approaches complement each other! 🚀
Abhinav Shukla@Abhinav95_

Announcing MatMamba - an elastic Mamba2🐍architecture with🪆Matryoshka-style training and adaptive inference. Train a single elastic model, get 100s of nested submodels for free! Paper: sca.fo/mmpaper Code: sca.fo/mmcode 🧵(1/10)
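A small sketch of the Matryoshka/elastic-width idea behind nested submodels: one full-width layer whose leading slice doubles as a smaller layer, trained by sampling widths. The class name and slicing rule are assumptions for illustration, not MatMamba's implementation.

```python
import torch
import torch.nn as nn

class ElasticLinear(nn.Module):
    """One full-width linear layer whose top-left slices act as nested
    smaller layers (Matryoshka-style), so many submodels share one weight."""
    def __init__(self, d_in: int, d_out: int):
        super().__init__()
        self.full = nn.Linear(d_in, d_out)

    def forward(self, x: torch.Tensor, frac: float = 1.0) -> torch.Tensor:
        k_in = max(1, int(self.full.in_features * frac))
        k_out = max(1, int(self.full.out_features * frac))
        w = self.full.weight[:k_out, :k_in]        # nested sub-matrix
        b = self.full.bias[:k_out]
        return nn.functional.linear(x[..., :k_in], w, b)

# Training sketch: sample frac in {0.25, 0.5, 1.0} each step so every nested
# submodel receives gradient signal, then deploy whichever width fits the device.
```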

Ruisi Cai reposted
Pavlo Molchanov@PavloMolchanov·
🚀 @ICML presenting our work Flextron (cairuisi.github.io/Flextron/) today: Poster: 🕜 1:30 pm – 3 pm 🗺️ Hall C 4-9 #605 Oral: 🕔 5:15 pm – 5:30 pm CEST 🗺️ 4E LLMs Here are the poster and a fast presentation.
Ruisi Cai@ccccrs_0908·
Train one - get many 🚀! Find more details about Flextron at cairuisi.github.io/Flextron/
Pavlo Molchanov@PavloMolchanov

🚀 Introducing Flextron - a Many-in-One LLM - Oral at ICML! Train one model and get many optimal models for each GPU at inference without any additional retraining. 🌟 🔗 Paper: arxiv.org/abs/2406.10260 Main benefits with only 5% post-training finetuning: ✅ Best model for every GPU (small & large) without retraining ✅ Change inference cost on the fly based on load ✅ Input-adaptive inference (heterogeneous weight-shared MoE, Attention) ✅ Instead of training many models, we train only 1: LLaMa2-7B ➡️ 3B, 4B, 5B, 6B, etc. Method and observations in thread. 🧵👇

Ruisi Cai@ccccrs_0908·
The Flextron-Llama2-7B model family demonstrates superior MMLU performance compared to both open-source models (including Pythia, OpenLLaMA-v2) and existing post-hoc compression methods (including Sheared-LLaMA, SliceGPT, LLM-Pruner, Compresso, LaCo).
Ruisi Cai@ccccrs_0908·
Flextron optimizes resources with adaptive computation. Using a MoE-like architecture, we route different tokens to different model sizes instead of domain experts. Paper: arxiv.org/pdf/2406.10260
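Here is a hedged PyTorch sketch of that MoE-style routing idea: a tiny router sends each token through either a narrow or a full slice of a shared FFN. The two-way choice, the sizes, and the hard routing are illustrative simplifications, not Flextron's exact design.

```python
import torch
import torch.nn as nn

class WidthRouterSketch(nn.Module):
    """Route each token to a narrower or wider slice of one shared FFN,
    instead of to separate domain experts."""
    def __init__(self, dim: int, hidden: int, small_frac: float = 0.25):
        super().__init__()
        self.router = nn.Linear(dim, 2)            # scores: [use small, use full]
        self.up = nn.Linear(dim, hidden)
        self.down = nn.Linear(hidden, dim)
        self.k_small = int(hidden * small_frac)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim); pick a width per token.
        # Hard argmax routing shown for clarity; training would need a
        # differentiable gate.
        choice = self.router(x).argmax(dim=-1, keepdim=True)   # 0 = small, 1 = full
        h = torch.relu(self.up(x))
        h_small = torch.zeros_like(h)
        h_small[..., : self.k_small] = h[..., : self.k_small]  # narrow slice only
        h = torch.where(choice.bool(), h, h_small)
        return self.down(h)
```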
Ruisi Cai@ccccrs_0908·
In Flextron, we support adaptive model loading: get the best model for every GPU (small and large) without re-training the model. We can dynamically adjust inference speed depending on the GPU load.
Ruisi Cai@ccccrs_0908·
Tired of training varying-size LLMs to fit various GPU memory and latency requirements? Check out Flextron! Our new ICML (Oral) paper shows how to train one model deployable across GPU series. Learn more: cairuisi.github.io/Flextron/ 🚀
Ruisi Cai@ccccrs_0908·
With a fixed 512-size KV cache, LoCoCo also extends the context length of pre-trained LLMs to 32K 🌟, achieving performance similar to fine-tuning with entire sequences. Arxiv: arxiv.org/pdf/2406.05317 3/3
Ruisi Cai@ccccrs_0908·
LoCoCo offers universal compatibility with existing LLM architectures for seamless integration. By injecting convolutional heads, we compressed sequences of up to 3482 tokens into a 128-size KV cache, retaining comparable performance - all with just 104M tokens of tuning! 🚀 2/3
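A minimal sketch of the convolutional compression idea described above, assuming per-head keys/values of shape (batch, seq, head_dim): a 1D convolution mixes along the time axis and the result is pooled down to a fixed number of cache slots. The single conv layer, kernel size, and pooling are assumptions, not LoCoCo's exact head.

```python
import torch
import torch.nn as nn

class ConvKVCompressorSketch(nn.Module):
    """Compress a growing run of keys/values into a fixed number of cache
    slots, so the KV cache stays at `num_slots` regardless of context length."""
    def __init__(self, head_dim: int, num_slots: int = 128, kernel: int = 5):
        super().__init__()
        self.num_slots = num_slots
        self.conv = nn.Conv1d(head_dim, head_dim, kernel, padding=kernel // 2)

    def forward(self, kv: torch.Tensor) -> torch.Tensor:
        # kv: (batch, seq_len, head_dim), with seq_len possibly >> num_slots.
        h = self.conv(kv.transpose(1, 2))                          # mix along time
        h = nn.functional.adaptive_avg_pool1d(h, self.num_slots)   # squeeze to fixed slots
        return h.transpose(1, 2)                                   # (batch, num_slots, head_dim)
```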