

Yanshu Li✈️ICML2026

@karrsen0713
Incoming CS PhD @ UT | MSCS @ Brown | Multimodal LLMs & Agents

1/ 🚀 We’re excited to share Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation!

Tuna-2 is a native unified multimodal model that supports visual understanding, text-to-image generation, and image editing directly from pixel embeddings. 🐟✨

📄 Paper: arxiv.org/abs/2604.24763
🌐 Project: tuna-ai.org/tuna-2
💻 Code: github.com/facebookresear…

Most unified multimodal models still rely on pretrained vision encoders, which add architectural complexity and can create representation mismatches between understanding and generation.

Tuna-2 asks a simple question: do we still need vision encoders? 👀

Our answer is no! Tuna-2 uses a completely encoder-free architecture in which images are processed directly by a unified transformer, together with text tokens.

Take a glimpse at what our model can generate ↓ 🎨🖼️
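For intuition about what "encoder-free" means here, a minimal sketch: raw pixels can enter a transformer through nothing more than a linear projection of flattened patches, with no pretrained vision encoder in the stack. This is an illustrative toy (module and parameter names are made up), not the actual Tuna-2 implementation.

```python
# Hedged sketch: turning raw pixels into transformer tokens with a single
# linear patch projection. Illustrative only -- not the Tuna-2 codebase.
import torch
import torch.nn as nn

class PixelEmbed(nn.Module):
    def __init__(self, patch=16, dim=1024):
        super().__init__()
        self.patch = patch
        # One linear map from flattened RGB patches to model width;
        # no pretrained vision encoder anywhere.
        self.proj = nn.Linear(patch * patch * 3, dim)

    def forward(self, img):  # img: (B, 3, H, W), H and W divisible by patch
        B, C, H, W = img.shape
        p = self.patch
        # Cut the image into non-overlapping p x p patches.
        x = img.unfold(2, p, p).unfold(3, p, p)          # (B, C, H/p, W/p, p, p)
        x = x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * p * p)
        return self.proj(x)                               # (B, num_patches, dim)

tokens = PixelEmbed()(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 196, 1024])
```

The resulting token sequence can then be concatenated with text tokens and fed to one unified transformer, which is the architectural simplification the thread is advertising.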

A junior student asked me about this. I wanted to say: even the "Fen Jue" (a skill from a cultivation novel) levels up one step at a time. The way people talk, CCF-A papers sound as common as cabbage. If we're talking top venues, I at least have experience with Transactions papers, but CCF-A feels like a completely different field. Finding a reasonably clueful advisor within a year or so, on top of a short submission cycle, already seems hard to me. Is CCF-A really that easy to publish in? (whispering) (For a while my work was somewhat CS-adjacent and I've published in Transactions, but I genuinely don't understand CCF-A.)

Stop using LoRA for RLVR!!!

New paper released 👉 Evaluating Parameter-Efficient Methods for RLVR
📖 Alphaxiv: alphaxiv.org/abs/2512.23165
💻 Github: github.com/MikaStars39/Pe…

Is standard LoRA truly the optimal choice for reinforcement learning? We present the first large-scale evaluation of over 12 PEFT methodologies using the DeepSeek-R1-Distill family on complex mathematical reasoning benchmarks.

Key finding: standard LoRA is suboptimal. Structural variants such as DoRA, AdaLoRA, and MiSS consistently outperform standard LoRA. Notably, DoRA (46.6% avg. accuracy) even surpasses full-parameter fine-tuning (44.9%) across multiple benchmarks.

The failure of SVD-based initialization: strategies like PiSSA and MiLoRA suffer significant performance degradation or total training collapse. This stems from a fundamental "spectral misalignment": these methods force updates onto principal components, while RLVR intrinsically operates in the off-principal regime.

The expressivity floor: while RLVR can tolerate moderate parameter reduction, extreme compression (e.g., VeRA, IA³, or rank-1 adapters) creates an information bottleneck. Reasoning tasks require a minimum threshold of trainable capacity to successfully reorient policy circuits.

Recommendations for the community:
a. Move beyond the default adoption of standard LoRA.
b. Prioritize geometry-aware adapters like DoRA that decouple magnitude and direction.
c. Avoid SVD-informed initializations for RL tasks.
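To make "decouple magnitude and direction" concrete, here is a minimal sketch of the DoRA idea: keep a LoRA-style low-rank update, but normalize each weight column to a unit direction and rescale it with a separately trainable magnitude vector. This is a simplified illustration of the published technique, not the paper's code; all class and variable names are made up.

```python
# Hedged sketch of DoRA-style magnitude/direction decoupling (simplified).
import torch
import torch.nn as nn

class DoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank=8):
        super().__init__()
        out_f, in_f = base.weight.shape
        self.W0 = base.weight.detach()                    # frozen base weight
        self.A = nn.Parameter(torch.randn(rank, in_f) * 0.01)  # LoRA factors
        self.B = nn.Parameter(torch.zeros(out_f, rank))
        # Trainable per-column magnitude, initialized to the base column norms,
        # so the layer starts out exactly equal to the frozen base layer.
        self.m = nn.Parameter(self.W0.norm(dim=0, keepdim=True).clone())

    def forward(self, x):
        W = self.W0 + self.B @ self.A                     # low-rank update, as in LoRA
        # Decouple: unit-normalize each column (direction), rescale by m (magnitude).
        W = self.m * (W / W.norm(dim=0, keepdim=True))
        return x @ W.T

layer = DoRALinear(nn.Linear(64, 32))
y = layer(torch.randn(4, 64))
print(y.shape)  # torch.Size([4, 32])
```

The intuition behind recommendation (b): plain LoRA entangles how much a column changes with which way it points, while this decomposition lets RL move directions without being forced to also change scales.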

✨ Steering vectors are everywhere in today’s LLM field—but is subtracting activations really a “good” vector?

We’d like to share two recent works where we rethink steering vectors / activation steering from first principles, and ask what actually makes steering generalizable and reliable 👇

📄 ICR: Towards Generalizable Implicit In-Context Learning with Attention Routing
arxiv.org/abs/2509.22854

📄 SVF: Steering Vector Fields for Context-Aware Inference-Time Control in LLMs
arxiv.org/abs/2602.01654
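For readers new to the baseline being questioned: the classic recipe takes the difference of mean hidden activations between two prompt sets and adds it back at inference. A minimal toy sketch of that difference-of-means baseline (tiny made-up MLP, random data; this is the approach the thread critiques, not the ICR/SVF methods):

```python
# Hedged sketch of classic activation steering: a difference-of-means
# vector injected at a hidden layer via a forward hook. Toy model only.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))

# "Steering vector" = mean hidden activation on positive inputs
# minus mean on negative inputs (random stand-ins here).
pos, neg = torch.randn(8, 16), torch.randn(8, 16)
with torch.no_grad():
    hidden = lambda x: torch.relu(model[0](x))
    steer = hidden(pos).mean(0) - hidden(neg).mean(0)

# Inject the vector during the forward pass by hooking the ReLU output.
alpha = 2.0
hook = model[1].register_forward_hook(lambda mod, inp, out: out + alpha * steer)
steered = model(torch.randn(1, 16))
hook.remove()
print(steered.shape)  # torch.Size([1, 4])
```

The two papers above ask when this single fixed offset generalizes and when it doesn't, and propose context-aware alternatives (attention routing; steering vector fields).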

Excited to be giving this talk on learning multi-agent communication skills tomorrow at the Workshop on Multi-Agent Learning and Its Opportunities in the Era of Generative AI (MAL-GAI) at ICLR 2026! Catch it tomorrow at 1pm local time (12 ET), and don't miss the panel in the afternoon as well!