Linjie (Lindsey) Li

189 posts

@LINJIEFUN

researching @Microsoft, @UW, contributed to https://t.co/VzcJa9Skx3

Seattle, WA · Joined August 2012
503 Following · 2.8K Followers
Linjie (Lindsey) Li retweeted
Xuhui Zhou @nlpxuhui
Creating user simulators is key to evaluating and training models for user-facing agentic applications. But are stronger LLMs better user simulators? TL;DR: not really. We ran the largest sim2real study for AI agents to date: 31 LLM simulators vs. 451 real humans across 165 tasks. Here's what we found (co-led with @sunweiwei12).
Linjie (Lindsey) Li retweeted
Jae Sung Park @jjaesungpark
VLMs today—including our own Molmo—point via raw text strings (e.g. ""). What if pointing meant directly selecting the visual tokens instead? 🤔 Introducing MolmoPoint: Better Pointing for VLMs with Grounding Tokens 🎯 🔓models, code, data, demo all OPEN 🧵👇 Paper: allenai.org/papers/molmopo…
Linjie (Lindsey) Li retweeted
Yulu Gan @yule_gan
Simply adding Gaussian noise to LLMs (one step—no iterations, no learning rate, no gradients) and ensembling them can achieve performance comparable to or even better than standard GRPO/PPO on math reasoning, coding, writing, and chemistry tasks. We call this algorithm RandOpt. To verify that this is not limited to specific models, we tested it on Qwen, Llama, OLMo3, and VLMs. What's behind this? We find that in the Gaussian search neighborhood around pretrained LLMs, diverse task experts are densely distributed — a regime we term Neural Thickets. Paper: arxiv.org/pdf/2603.12228 Code: github.com/sunrainyg/Rand… Website: thickets.mit.edu
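The one-step perturb-and-ensemble recipe described above can be sketched in a few lines. This is a toy illustration of the general idea only, not the RandOpt implementation: `randopt`, `score_fn`, the selection rule, and the weight-space averaging are all hypothetical stand-ins for the paper's actual method.

```python
import math
import random

random.seed(0)

def randopt(theta, score_fn, n_samples=64, sigma=0.05, k=8):
    """One-step Gaussian search around a pretrained weight vector:
    sample perturbed copies, keep the top-k by score, and average them.
    No gradients, no learning rate, no iterative updates."""
    candidates = [[w + random.gauss(0.0, sigma) for w in theta]
                  for _ in range(n_samples)]
    top = sorted(candidates, key=score_fn, reverse=True)[:k]
    # "ensembling" here = a weight-space average of the selected experts
    return [sum(ws) / k for ws in zip(*top)]

# toy objective: negative distance to a hidden optimum near the init,
# standing in for task reward on math/coding/writing benchmarks
target = [0.5, -0.3, 0.2]
score = lambda w: -math.dist(w, target)

theta0 = [0.45, -0.25, 0.15]     # "pretrained" weights
theta1 = randopt(theta0, score)  # perturb once, select, ensemble
```

The sketch mirrors the tweet's claim structurally: the only search operation is a single Gaussian draw around the initial weights, and improvement comes entirely from selection and ensembling within that neighborhood.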
Linjie (Lindsey) Li retweeted
Zihan "Zenus" Wang @wzenus
In Agent RL, models suffer from Template Collapse. They generate vast, diverse outputs (high entropy) that lose all meaningful connection to the input prompt (low mutual information). In other words, agents learn different ways to say nothing. 🚀 Introducing RAGEN-v2 -- here's how we define and fix such silent failure modes in Agent RL. 🧵
Linjie (Lindsey) Li retweeted
Jiawei Gu @Kuvvius
⛔️ Can MLLMs truly learn WHEN and HOW to use tools? (🛠AdaReasoner says: yes!! Like… actually decide: - “Should I call a tool right now?” - “Which one?” - “How many times?” What happened surprised us: a 7B model beats GPT-5 on visual tool-reasoning—and shows adaptive behaviors we never programmed. (1/17)🧵👇 📄 arxiv.org/abs/2601.18631 🌐 adareasoner.github.io
Linjie (Lindsey) Li retweeted
Zihan "Zenus" Wang @wzenus
AlphaGo’s 10-year anniversary today — huge milestone for RL! Small serendipity: it’s also 1 year since we released 𝐑𝐀𝐆𝐄𝐍, our LLM Agent RL framework. Some thoughts on the past decade of RL, plus a major 𝐑𝐀𝐆𝐄𝐍 update on reasoning collapse in Agent RL coming soon.

1/ Ten years ago, on Jan 27, DeepMind brought AlphaGo to the world. Back then, RL felt mythic. For the first time, it reached top professional level in a domain that demands long-horizon planning, having already gone 5–0 against the European champion. That moment made a lot of people truly believe this: a policy can “grow out of interaction” instead of being hand-coded or hand-taught.

One year ago, on Jan 27, we released RAGEN, an RL codebase for LLM agents. We started applying RL with verifiable rewards beyond ‘winning a game’ to large reasoning models that can plan and interact with the world. RL is no longer just about winning inside a closed board. It now plays out in a more open, long-horizon training loop that can resemble parts of the real world.

But this year, we also saw a quieter kind of collapse. It does not always look like failure. Sometimes it looks stable. Sometimes it even looks safer and more consistent. Yet the policy slowly turns into a “persona”, a “template”, a “low-effort sense of security”.

So I’ve increasingly felt that 𝐑𝐀𝐆𝐄𝐍 isn’t just a system. For me, it reads more like the second half of a decade-long thread I’ve been watching unfold. The first half: “RL can learn reasoning.” The second half: “RL can also quietly collapse if we don’t have the right diagnostics.” It feels like a time marker: ten years later, we’re finally forced to look beyond reward and ask what stays input-conditioned, and what drifts.

2/ If I use this coincidence as an anchor, I would split the last decade of RL into three chapters. The AlphaGo era: RL proved itself on long-horizon planning. It proved policies can emerge from interaction. The RLHF era: RL moved from winning games to alignment. It became a core mechanism that makes language models track human preferences, and a key part behind many products today. The LLM Agent RL era: RL enters closed-loop, multi-turn self-training. The LLM agent learns more than answers. It learns plans, tools, revisions, reflection, and behavioral consistency across longer time scales.

Put together, these chapters point to a missing piece for me: we still lack a clear, shared vocabulary and practical gauges for “failure modes in LLM Agent RL”. Progress has been fast on the capability side. But the language and gauges for how LLM agents degrade, especially in closed-loop training, still feel less settled. That’s the piece we’ve been trying to put words and measurements to this year.

3/ A decade after AlphaGo, a lot of the attention and resources in RL do seem to be shifting from closed worlds like board games toward systems like LLM agents. At the same time, closed-loop self-training can introduce a more systemic risk. In a loop of self-sampling and self-updating, a model can gradually settle into a “task-insensitive but cheaper” strategy. It does not look terrible. It may even look safe and consistent. But it slowly loses prompt “discriminability”. It can lose the property that makes reasoning actually change with the input.

I like to define this with one sentence: “training continues, but learning is idling”. Rewards still move. Gradients still update. But the information is already dry. The policy solidifies toward templates, inertia, and risk-avoidance.

One transferable takeaway from our year with 𝐑𝐀𝐆𝐄𝐍 is this: in LLM Agent RL, it’s not enough to only watch the reward or success rate. You must also watch whether “input-conditioned information” is still flowing. You must watch whether the LLM agent is still sensitive to the task.

We are now preparing a new version of 𝐑𝐀𝐆𝐄𝐍. You do not need to believe any result in advance. But we will make this line much clearer: how the battlefield shifts, how the new collapses happen, and which diagnosis view is the most actionable.

4/ Here I want to write something more personal, because this part wasn’t “thought up”. It was almost collided into. Right before writing this, I was sprinting on the new 𝐑𝐀𝐆𝐄𝐍. After days of deadline pressure, I finally took a breath and noticed the date coincidence. Thinking about the past year, I started crying. When I actually began typing, the tears had just stopped. I looked at the time. It was 5pm, Jan 20, 2026, and my screen had gone dark. The contrast made the point feel sharper.

This year wasn’t about “one more loss term” or “one more trick”. It was about a latent variable that kept showing up in closed-loop LLM Agent RL but is hard to name cleanly: whether the agent’s reasoning is still tied to the input. Training can keep running while reasoning drifts into templates, inertia, and avoidance. Reward can still move while prompt discriminability quietly erodes. “More stable, more certain” can sometimes just mean “less sensitive, less distinctive”. Collapse is rarely a sudden crash. It’s usually a slow drift that looks fine from the outside. That’s what I mean by a quiet failure mode. Not bad news, just something we’d benefit from better gauges for.

And on a personal note, learning to notice this earlier has changed how I work. The hits still come. I just recover faster, and keep moving.

5/ Then I looked back at the past year’s timeline and noticed another coincidence. DeepSeek-R1 landed on Jan 20, 2025 — the same date I happened to notice the AlphaGo/RAGEN alignment. I’ll treat it as coincidence, but it did make the moment feel unexpectedly vivid. Since then, I’ve been jokingly calling 01/20 my “dark mode day”.
Linjie (Lindsey) Li retweeted
LMSYS Org @lmsysorg
How long have you been "planning to understand" how modern LLM inference works? We just gave you a readable version of SGLang you can finish over the weekend. Introducing mini-SGLang ⚡ We distilled SGLang from 300K lines into 5,000. Kept the core design, cut the complexity, without sacrificing performance — nearly identical to SGLang online. It is built for engineers, researchers, and students who want to see how inference really works and who learn better from code than from papers. ⭐ Star us on GitHub: github.com/sgl-project/mi… 🧵 (1/3)
Linjie (Lindsey) Li retweeted
Jiawei Gu @Kuvvius
🔥 Big news! Our ThinkMorph will be discussed at MAR 2025 @ NeurIPS! Two keynotes you don't want to miss:
1⃣ Lindsey Li @LINJIEFUN: Pictures think harder than words – Evaluating and Building Visual Thinking in Multimodal Models (9:45 AM, Dec 7)
2⃣ Yu Cheng @YuCheng348997: Towards Better Multimodal Reasoning: From Staged Reinforcement Learning to Visually-Perceptive Policy Optimization (2:45 PM, Dec 7)
👀 Let's talk about what it really takes to make models see and think like humans! 📍 Room 11AB, San Diego Convention Center 🔗 marworkshop.github.io/neurips25/ #NeurIPS2025 #ThinkMorph #Multimodal
Linjie (Lindsey) Li retweeted
Liwei Jiang @liweijianglw
Super happy to receive the Best Paper Award at #NeurIPS2025 for our Artificial Hivemind paper!! (Really enjoyed giving oral talk at NeurIPS as well!)
Liwei Jiang@liweijianglw

⚠️Different models. Same thoughts.⚠️ Today’s AI models converge into an 𝐀𝐫𝐭𝐢𝐟𝐢𝐜𝐢𝐚𝐥 𝐇𝐢𝐯𝐞𝐦𝐢𝐧𝐝 🐝, a striking case of mode collapse that persists even across heterogeneous ensembles. Our #neurips2025 𝐃&𝐁 𝐎𝐫𝐚𝐥 𝐩𝐚𝐩𝐞𝐫 (✨𝐭𝐨𝐩 𝟎.𝟑𝟓%✨) dives deep into this phenomenon, introducing 𝐈𝐧𝐟𝐢𝐧𝐢𝐭𝐲-𝐂𝐡𝐚𝐭, a dataset of 26K real-world open-ended user queries spanning 17 categories, plus 31K dense human annotations (𝟐𝟓 𝐢𝐧𝐝𝐞𝐩𝐞𝐧𝐝𝐞𝐧𝐭 𝐚𝐧𝐧𝐨𝐭𝐚𝐭𝐨𝐫𝐬 𝐩𝐞𝐫 𝐞𝐱𝐚𝐦𝐩𝐥𝐞), to push AI’s creative and discovery potential forward. Now you can build your favorite models to be truly original, diverse, and impactful in the open-ended real world. 📍Paper: arxiv.org/abs/2510.22954 📍Data: huggingface.co/collections/li… We also systematically reveal the Artificial Hivemind across: 💥 𝐆𝐞𝐧𝐞𝐫𝐚𝐭𝐢𝐯𝐞 𝐚𝐛𝐢𝐥𝐢𝐭𝐢𝐞𝐬: not only do individual LLMs repeat themselves, but different models produce strikingly similar content, even when asked fully open-ended questions. 💥 𝐃𝐢𝐬𝐜𝐫𝐢𝐦𝐢𝐧𝐚𝐭𝐢𝐯𝐞 𝐚𝐛𝐢𝐥𝐢𝐭𝐢𝐞𝐬: LLMs, LM judges, and reward models are systematically miscalibrated when rating alternative responses to open-ended queries. (1/N)

Linjie (Lindsey) Li retweeted
Alisa Liu @alisawuffles
Presenting Broken Tokens at the 4:30pm poster session today with @s_zhengbr! We'll demystify how LMs can understand brand new tokenizations ([␣, c, a, t] instead of ␣cat) entirely at test-time 😱
Alisa Liu@alisawuffles

It began from a 🤯🤯 observation: when giving LMs text tokenized at *character* level, their generation seemed virtually unaffected — even though these token sequences are provably never seen in training! Suggests: functional char-level understanding, and tokenization as test-time control.
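The quoted observation rests on a simple fact: the same string can decode from two different token sequences, and the character-level one never appears in corpora tokenized the standard way. A minimal sketch of that fact, where the greedy longest-match tokenizer is a hypothetical stand-in for a real BPE tokenizer:

```python
def greedy_tokenize(text, vocab):
    """Greedy longest-match tokenizer (a toy stand-in for BPE)."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try the longest substring first
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:                              # fall back to a single character
            tokens.append(text[i])
            i += 1
    return tokens

vocab = {"the", " cat", " sat", " on", " mat", "."}
text = "the cat sat on the mat."

standard = greedy_tokenize(text, vocab)  # multi-character subword tokens
chars = list(text)                       # [␣, c, a, t]-style character tokens

assert "".join(standard) == "".join(chars) == text  # same string either way
assert standard != chars  # but a token sequence never produced in training
```

Both sequences detokenize to identical text, so a model that handles the character-level sequence must have learned something beyond memorizing the canonical tokenization.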

Linjie (Lindsey) Li retweeted
Xueyan Zou @xyz2maureen
I will join Tsinghua University, College of AI, as an Assistant Professor in the coming month. I am actively looking for 2026 spring interns and future PhDs (ping me if you are at #NeurIPS). It has been an incredible journey of 10 years since I attended an activity organized by Tsinghua University and, inspired by one of my teammates, decided to change my undergraduate major from Economics to Computer Science. Over these 10 years, I have met many wonderful researchers and professors whom I deeply appreciate and who led me to continued growth. 🐿️ My research focus will continue to be AI & Robotics, with a specific emphasis on Interactive Embodied Intelligence. You can check my homepage to learn more: maureenzou.github.io/lab.html. I am currently local to San Diego and will be attending #NeurIPS. Please ping me over WeChat or email if any old or new friends are interested in having a coffee chat! (Really looking forward to meeting as many friends as possible at #NeurIPS) [The photo is one of the places that I will miss a lot in the US]
Linjie (Lindsey) Li retweeted
Manling Li @ManlingLi_
VAGEN poster at #NeurIPS: ⏲️ 11am–2pm Wed 📍 Exhibit Hall C,D,E #5502. We look forward to discussing with you:
1. MDP → POMDP
2. World modeling in agent internal belief
3. What is a good representation in agent internal belief for visual states?
4. How to use world modeling to help reward shaping?
5. How to do turn-level critic learning?
Drop by if you are interested in related topics!
Zihan "Zenus" Wang@wzenus

VAGEN poster 𝐭𝐨𝐦𝐨𝐫𝐫𝐨𝐰 at #NeurIPS! 🎮🧠 - 🕚 11am–2pm Wed - 📍 Exhibit Hall C,D,E #5502 We had much fun exploring: • How 𝐰𝐨𝐫𝐥𝐝 𝐦𝐨𝐝𝐞𝐥𝐢𝐧𝐠 helps VLM RL agents learn better policies • 𝐌𝐮𝐥𝐭𝐢-𝐭𝐮𝐫𝐧 𝐏𝐏𝐎 credit assignment via 𝐭𝐰𝐨-𝐥𝐞𝐯𝐞𝐥 𝐚𝐝𝐯𝐚𝐧𝐭𝐚𝐠𝐞 𝐞𝐬𝐭𝐢𝐦𝐚𝐭𝐨𝐫 (Bi-Level GAE) for turn-level and token-level critic learning Come chat about agents, RL, and world models 👀

Linjie (Lindsey) Li retweeted
dr. jack morris @jxmnop
Wondering how to attend an ML conference the right way? Ahead of NeurIPS 2025 (30k attendees!), here are ten pro tips:
1. Your main goals: (i) meet people, (ii) regain excitement about work, (iii) learn things – in that order.
2. Make a list of papers you like and seek them out at poster sessions. Try to talk to the authors: you can learn much more from them than from a PDF.
3. Pick one workshop and one tutorial that sound most interesting. Skip the rest.
4. Cold email people you want to meet but haven't. Check Twitter and the accepted-papers list. PhD students are especially responsive.
5. Practice a concise pitch of unpublished research you're working on for "what are you interested in rn?". Focus on big unanswered questions and exciting new directions, *not* papers.
6. Skip the orals. Posters are higher-bandwidth, more engaging, and more invigorating. Orals are a good time to go for a walk or talk in the hallway.
7. For the love of god, do NOT work on other research in your hotel room. Save mental bandwidth for the conference. (This may seem obvious; you'd be surprised.)
8. Talk to people outside your area. There are many smart people working on niches <10 people understand. Learn about one or two that won't help your own work.
9. Attend one social each night. Don't overthink it or get caught up in status games. They're all fun.
10. Take breaks. You can't go to everything, and conferences consume more energy than a normal workweek.
Hope this helps, and sad I'm not attending NeurIPS. Have fun :)