Ruisi Cai

27 posts


@ccccrs_0908

Ph.D. student @UTAustin; Research Intern @NVIDIA @CitadelSecurities; BS @USTC; NVIDIA fellowship 2025 recipient

Joined May 2020
286 Following · 426 Followers
Ruisi Cai reposted
VITA Group@VITAGroupUT·
7 VITA papers @ #ICML2025 1️⃣ Contextualized Equivariant PE 2️⃣ Linear Attention 3️⃣ Multi-view Video Diffusion 4️⃣ Alignment as Statistical Estimation 5️⃣ Low-Rank LLM Weight Theory 6️⃣ Geo-Distributed LLM Training 7️⃣ μP Scale Separation Come find us at poster sessions 👇🧵
Ruisi Cai@ccccrs_0908·
Thanks for sharing our work! Check out the paper for some interesting theory on 3DGS. Code coming soon! 🔜 #3DGS #CVPR2025
Zhenjun Zhao@zhenjun_zhao

Steepest Descent Density Control for Compact 3D Gaussian Splatting @peihao_wang, @yuehaowang, @dilin_wang, @mohan_sreyas, @WayneINR, Lemeng Wu, @ccccrs_0908, @YuYingYeh1, Zhangyang Wang, @lqiang67, Rakesh Ranjan tl;dr: split Gaussians in saddle areas into two offspring & displace the new primitives along the steepest-descent directions -> escape the saddle area -> avoid locally sub-optimal parameters arxiv.org/abs/2505.05587
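To make the tl;dr concrete, here is a minimal Python sketch of that splitting step, assuming the local Hessian of the loss w.r.t. a Gaussian's center is available; the function name, the eigen-decomposition test, and the step size are illustrative guesses, not the paper's exact procedure.

```python
import numpy as np

def split_gaussian_at_saddle(center, hessian, step=1e-2):
    """Hypothetical sketch: split one Gaussian primitive stuck near a saddle
    into two offspring displaced along the steepest-descent escape direction.
    `center` is the (3,) Gaussian mean; `hessian` is the (3, 3) local Hessian
    of the reconstruction loss w.r.t. the mean (both assumed precomputed)."""
    eigvals, eigvecs = np.linalg.eigh(hessian)   # eigenvalues in ascending order
    if eigvals[0] >= 0:
        return [center]                          # no negative curvature: not a saddle
    escape = eigvecs[:, 0]                       # direction of most negative curvature
    # Two offspring displaced symmetrically so they can leave the saddle region.
    return [center + step * escape, center - step * escape]
```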

Ruisi Cai@ccccrs_0908·
Huge thanks to @NVIDIA, my mentors, and collaborators for guidance and support!💚 #NVIDIA
Ruisi Cai@ccccrs_0908·
Excited to share that I have been awarded the NVIDIA fellowship! 🎉 Immensely grateful for the recognition and support - this inspires me to continue advancing research in LLM efficiency and AI security. blogs.nvidia.com/blog/graduate-…
Ruisi Cai reposted
Pavlo Molchanov@PavloMolchanov·
Sharing our team's latest work on Hymba - an efficient small language model with a hybrid architecture. Tech report: arxiv.org/abs/2411.13676 Discover the trade-off between Mamba and Attention, how they can be combined, how the attention-sink and forced-to-attend phenomena can be mitigated, and how the KV cache can be shared across layers. Learn how we built the model with an end-to-end ecosystem: data selection, architecture analysis and design, training Base and Instruct models, and opening them to the community. Did I mention that our Hymba-1.5B Base model outperforms LLaMA 3.2-3B while being trained on 7× fewer tokens and achieving 12× higher throughput? More details and model links coming soon!
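For readers new to hybrid token mixers, below is a rough PyTorch sketch of the general idea of running an attention branch and an SSM-style branch in parallel and fusing them; the toy gated recurrence, the sizes, and the fusion rule are stand-ins chosen for illustration, not Hymba's actual architecture or its KV-cache sharing.

```python
import torch
import torch.nn as nn

class HybridBlockSketch(nn.Module):
    """Sketch of a hybrid mixer: attention plus a toy gated linear recurrence
    (standing in for a Mamba/SSM branch), run in parallel and fused."""
    def __init__(self, dim: int, n_heads: int = 4):
        super().__init__()
        # dim must be divisible by n_heads; causal masking omitted for brevity.
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.in_proj = nn.Linear(dim, 2 * dim)     # value and gate for the SSM-ish branch
        self.decay = nn.Parameter(torch.full((dim,), 0.9))
        self.out_proj = nn.Linear(2 * dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim)
        attn_out, _ = self.attn(x, x, x, need_weights=False)

        v, g = self.in_proj(x).chunk(2, dim=-1)
        state = torch.zeros_like(v[:, 0])
        ssm_out = []
        for t in range(v.shape[1]):                # toy recurrent scan over time
            state = self.decay * state + torch.sigmoid(g[:, t]) * v[:, t]
            ssm_out.append(state)
        ssm_out = torch.stack(ssm_out, dim=1)

        # Fuse the two branches; the real model's fusion differs.
        return self.out_proj(torch.cat([attn_out, ssm_out], dim=-1))
```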
Ruisi Cai@ccccrs_0908·
@hilbertmeng Thanks Qingye for your suggestion! Just modified this part.
Qingye Meng@hilbertmeng·
@ccccrs_0908 Interesting work! I read your paper but was confused by the compression of queries in both Equation (5) and the highlighted sentence in Section 3.2.1. Are these typos?
Ruisi Cai@ccccrs_0908·
Managing long context is challenging due to quadratic attention memory usage. But what if we could compress growing context information into a fixed-size memory? 🤔 Check out our new ICML paper: "LoCoCo: Dropping In Convolutions for Long Context Compression"! 1/3
Ruisi Cai@ccccrs_0908·
With countless open-source LLM checkpoints available, each specializing in unique domain knowledge, how can we tap into their full potential? Check out Model-GLUE! 🚀 We introduce a framework that integrates model merging, mixture, and stacking to unlock new possibilities.
VITA Group@VITAGroupUT

1/ 🌟 Excited to announce #Model-#GLUE (#neurips2024 D&B), a new framework designed by an extensive team from UNC, UMD, UT Austin, HKUST, Google, and CMU to #scale pre-trained LLMs efficiently! 🚀 Tackling the challenge of #aggregating disparate pre-trained LLMs, we introduce a holistic guideline and benchmark for anyone with a large, diverse model zoo "in the wild"! #LLM #AIresearch
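As a toy illustration of the "merging" ingredient mentioned above, the sketch below averages matching parameters from several same-architecture checkpoints; Model-GLUE's actual guideline (deciding what to merge, mix, or stack) is much richer than this, and the function name here is hypothetical.

```python
import torch

def merge_state_dicts(state_dicts, weights=None):
    """Toy parameter-space merge: a weighted average of matching parameters
    from several checkpoints, assuming they all share one architecture."""
    weights = weights or [1.0 / len(state_dicts)] * len(state_dicts)
    merged = {}
    for name in state_dicts[0]:
        merged[name] = sum(w * sd[name].float() for w, sd in zip(weights, state_dicts))
    return merged

# Usage sketch (paths are placeholders):
# merged = merge_state_dicts([torch.load(p) for p in ckpt_paths])
```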

Ruisi Cai@ccccrs_0908·
Excited to see flexible inference being explored on the Mamba architecture! Our recent work Flextron tackles similar challenges. Looking forward to seeing how these approaches complement each other! 🚀
Abhinav Shukla@Abhinav95_

Announcing MatMamba - an elastic Mamba2🐍architecture with🪆Matryoshka-style training and adaptive inference. Train a single elastic model, get 100s of nested submodels for free! Paper: sca.fo/mmpaper Code: sca.fo/mmcode 🧵(1/10)
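A small sketch of the Matryoshka/elastic-width idea behind nested submodels: one full-width layer whose leading slice doubles as a smaller layer, trained by sampling widths. The class name and slicing rule are assumptions for illustration, not MatMamba's implementation.

```python
import torch
import torch.nn as nn

class ElasticLinear(nn.Module):
    """One full-width linear layer whose top-left slices act as nested
    smaller layers (Matryoshka-style), so many submodels share one weight."""
    def __init__(self, d_in: int, d_out: int):
        super().__init__()
        self.full = nn.Linear(d_in, d_out)

    def forward(self, x: torch.Tensor, frac: float = 1.0) -> torch.Tensor:
        k_in = max(1, int(self.full.in_features * frac))
        k_out = max(1, int(self.full.out_features * frac))
        w = self.full.weight[:k_out, :k_in]        # nested sub-matrix
        b = self.full.bias[:k_out]
        return nn.functional.linear(x[..., :k_in], w, b)

# Training sketch: sample frac in {0.25, 0.5, 1.0} each step so every nested
# submodel receives gradient signal, then deploy whichever width fits the device.
```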

Ruisi Cai reposted
Pavlo Molchanov@PavloMolchanov·
🚀 @ICML presenting our work Flextron (cairuisi.github.io/Flextron/) today: Poster: 🕜 1:30 pm – 3 pm 🗺️ Hall C 4-9 #605 Oral: 🕔 5:15 pm – 5:30 pm CEST 🗺️ 4E LLMs Here are the poster and a fast presentation.
Ruisi Cai@ccccrs_0908·
Train one - get many 🚀! Find more details about Flextron at cairuisi.github.io/Flextron/
Pavlo Molchanov@PavloMolchanov

🚀 Introducing Flextron - a Many-in-One LLM - Oral at ICML! Train one model and get many optimal models for each GPU at inference without any additional retraining. 🌟 🔗 Paper: arxiv.org/abs/2406.10260 Main benefits with only 5% post-training finetuning: ✅ Best model for every GPU (small & large) without retraining ✅ Change inference cost on the fly based on load ✅ Input-adaptive inference (heterogeneous weight-shared MoE, Attention) ✅ Instead of training many models, we train only 1: LLaMa2-7B ➡️ 3B, 4B, 5B, 6B, etc. Method and observations in thread. 🧵👇

Ruisi Cai@ccccrs_0908·
The Flextron-Llama2-7B model family demonstrates superior MMLU performance compared to both open-source models (including Pythia, OpenLLaMA-v2) and existing post-hoc compression methods (including Sheared-LLaMA, SliceGPT, LLM-Pruner, Compresso, LaCo).
Ruisi Cai@ccccrs_0908·
Flextron optimizes resources with adaptive computation. Using a MoE-like architecture, we route different tokens to different model sizes instead of domain experts. Paper: arxiv.org/pdf/2406.10260
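Here is a hedged PyTorch sketch of that MoE-style routing idea: a tiny router sends each token through either a narrow or a full slice of a shared FFN. The two-way choice, the sizes, and the hard routing are illustrative simplifications, not Flextron's exact design.

```python
import torch
import torch.nn as nn

class WidthRouterSketch(nn.Module):
    """Route each token to a narrower or wider slice of one shared FFN,
    instead of to separate domain experts."""
    def __init__(self, dim: int, hidden: int, small_frac: float = 0.25):
        super().__init__()
        self.router = nn.Linear(dim, 2)            # scores: [use small, use full]
        self.up = nn.Linear(dim, hidden)
        self.down = nn.Linear(hidden, dim)
        self.k_small = int(hidden * small_frac)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim); pick a width per token.
        # Hard argmax routing shown for clarity; training would need a
        # differentiable gate.
        choice = self.router(x).argmax(dim=-1, keepdim=True)   # 0 = small, 1 = full
        h = torch.relu(self.up(x))
        h_small = torch.zeros_like(h)
        h_small[..., : self.k_small] = h[..., : self.k_small]  # narrow slice only
        h = torch.where(choice.bool(), h, h_small)
        return self.down(h)
```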
Ruisi Cai@ccccrs_0908·
In Flextron, we support adaptive model loading: get the best model for every GPU (small and large) without re-training the model. We can dynamically adjust inference speed depending on the GPU load.
Ruisi Cai@ccccrs_0908·
Tired of training varying-size LLMs to fit various GPU memory and latency requirements? Check out Flextron! Our new ICML (Oral) paper shows how to train one model deployable across GPU series. Learn more: cairuisi.github.io/Flextron/ 🚀
Ruisi Cai@ccccrs_0908·
With a fixed 512-size KV cache, LoCoCo also extends the context length of pre-trained LLMs to 32K 🌟, achieving performance similar to fine-tuning with entire sequences. Arxiv: arxiv.org/pdf/2406.05317 3/3
Ruisi Cai@ccccrs_0908·
LoCoCo offers universal compatibility with existing LLM architectures for seamless integration. By injecting convolutional heads, we compressed sequences of up to 3482 tokens into a 128-size KV cache, retaining comparable performance - all with just 104M tokens of tuning! 🚀 2/3
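A minimal sketch of the convolutional compression idea described above, assuming per-head keys/values of shape (batch, seq, head_dim): a 1D convolution mixes along the time axis and the result is pooled down to a fixed number of cache slots. The single conv layer, kernel size, and pooling are assumptions, not LoCoCo's exact head.

```python
import torch
import torch.nn as nn

class ConvKVCompressorSketch(nn.Module):
    """Compress a growing run of keys/values into a fixed number of cache
    slots, so the KV cache stays at `num_slots` regardless of context length."""
    def __init__(self, head_dim: int, num_slots: int = 128, kernel: int = 5):
        super().__init__()
        self.num_slots = num_slots
        self.conv = nn.Conv1d(head_dim, head_dim, kernel, padding=kernel // 2)

    def forward(self, kv: torch.Tensor) -> torch.Tensor:
        # kv: (batch, seq_len, head_dim), with seq_len possibly >> num_slots.
        h = self.conv(kv.transpose(1, 2))                          # mix along time
        h = nn.functional.adaptive_avg_pool1d(h, self.num_slots)   # squeeze to fixed slots
        return h.transpose(1, 2)                                   # (batch, num_slots, head_dim)
```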