Linjie (Lindsey) Li

189 posts


Linjie (Lindsey) Li @LINJIEFUN

researching @Microsoft, @UW, contributed to https://t.co/VzcJa9Skx3

Seattle, WA · Joined August 2012
503 Following · 2.8K Followers
Linjie (Lindsey) Li retweeted
Xuhui Zhou @nlpxuhui
Creating user simulators is key to evaluating and training models for user-facing agentic applications. But are stronger LLMs better user simulators? TL;DR: not really. We ran the largest sim2real study for AI agents to date: 31 LLM simulators vs. 451 real humans across 165 tasks. Here's what we found (co-led with @sunweiwei12).
[image]
5 replies · 62 retweets · 270 likes · 28.1K views
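One way to make "are stronger LLMs better user simulators?" measurable is to check whether per-task outcomes under a simulator track outcomes under real users. Below is a minimal sketch of that agreement check, assuming per-task success rates are already computed; the variable names and toy numbers are hypothetical, not the study's data.

```python
# Minimal sketch of a sim2real agreement check: do task outcomes under an
# LLM user simulator track outcomes under real human users?
# `sim_success` and `real_success` are hypothetical stand-ins, not the paper's data.
from scipy.stats import spearmanr

sim_success = {"task_001": 0.80, "task_002": 0.35, "task_003": 0.60}   # one LLM simulator
real_success = {"task_001": 0.70, "task_002": 0.50, "task_003": 0.55}  # real human users

tasks = sorted(sim_success)
rho, p = spearmanr([sim_success[t] for t in tasks],
                   [real_success[t] for t in tasks])
print(f"Spearman rho={rho:.2f} (p={p:.3f})")  # high rho: simulator ranks tasks like humans do
```

Ranking each of the 31 simulators by a score like this is one plausible way such a study could compare simulators against the 451-human reference.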
Linjie (Lindsey) Li retweeted
Jae Sung Park @jjaesungpark
VLMs today, including our own Molmo, point via raw text strings (e.g. ""). What if pointing meant directly selecting the visual tokens instead? 🤔 Introducing MolmoPoint: Better Pointing for VLMs with Grounding Tokens 🎯 🔓 Models, code, data, demo all OPEN 🧵👇 Paper: allenai.org/papers/molmopo…
10 replies · 34 retweets · 342 likes · 45.7K views
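The contrast the tweet draws, emitting coordinates as text versus selecting a visual token, can be illustrated with a toy decoder. Everything below (shapes, grid size, dot-product scoring, the function name) is an assumption for illustration, not the actual MolmoPoint architecture.

```python
# Toy sketch of "pointing by selecting a visual token" instead of decoding an
# (x, y) string: score each image-patch embedding against a grounding query
# and return the center of the best-matching patch.
import torch

def point_via_grounding_token(patch_embeds, query, grid=(24, 24), image_size=(336, 336)):
    # patch_embeds: (num_patches, d) visual tokens; query: (d,) grounding-token state
    scores = patch_embeds @ query                 # similarity of each visual token
    idx = int(scores.argmax())                    # select a token, not a text string
    row, col = divmod(idx, grid[1])
    cell_w = image_size[0] / grid[1]
    cell_h = image_size[1] / grid[0]
    return ((col + 0.5) * cell_w, (row + 0.5) * cell_h)  # patch center in pixels

patches = torch.randn(24 * 24, 512)
query = torch.randn(512)
print(point_via_grounding_token(patches, query))
```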
Linjie (Lindsey) Li retweeted
Yulu Gan @yule_gan
Simply adding Gaussian noise to LLMs (one step: no iterations, no learning rate, no gradients) and ensembling them can achieve performance comparable to or even better than standard GRPO/PPO on math reasoning, coding, writing, and chemistry tasks. We call this algorithm RandOpt. To verify that this is not limited to specific models, we tested it on Qwen, Llama, OLMo3, and VLMs. What's behind this? We find that in the Gaussian search neighborhood around pretrained LLMs, diverse task experts are densely distributed, a regime we term Neural Thickets. Paper: arxiv.org/pdf/2603.12228 Code: github.com/sunrainyg/Rand… Website: thickets.mit.edu
[image]
87 replies · 430 retweets · 3K likes · 671.5K views
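To make the recipe above concrete, here is a minimal sketch of one-step Gaussian perturbation plus selection/ensembling as the tweet describes it. This is a pure-numpy toy: the `evaluate` reward callback and all hyperparameters are illustrative assumptions, not code or settings from the RandOpt paper.

```python
# Minimal sketch of the RandOpt idea: draw K one-step Gaussian perturbations
# of pretrained weights (no gradients, no learning rate), score each, and
# keep the best / a pool to ensemble.
import numpy as np

def randopt(weights, evaluate, k=16, sigma=0.01, seed=0):
    rng = np.random.default_rng(seed)
    candidates = []
    for _ in range(k):
        perturbed = {name: w + sigma * rng.standard_normal(w.shape)
                     for name, w in weights.items()}      # one noise step per candidate
        candidates.append((evaluate(perturbed), perturbed))
    candidates.sort(key=lambda c: c[0], reverse=True)
    return candidates[0][1], [c[1] for c in candidates]   # best expert + pool to ensemble

# Toy usage: reward = negative distance of a single weight matrix to a target.
target = np.ones((4, 4))
base = {"w": np.zeros((4, 4))}
best, pool = randopt(base, lambda w: -np.abs(w["w"] - target).sum())
```

The "Neural Thickets" claim corresponds to the observation that, with realistic `sigma`, many such candidates already behave like task experts.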
Linjie (Lindsey) Li retweeted
Zihan "Zenus" Wang @wzenus
In Agent RL, models suffer from Template Collapse. They generate vast, diverse outputs (High Entropy) that lose all meaningful connection to the input prompt (Low Mutual Information). In other words, agents learn different ways to say nothing. 🚀 Introducing RAGEN-v2 - here's how we define and fix such silent failure modes in Agent RL. 🧵
12 replies · 53 retweets · 209 likes · 127.8K views
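The two quantities named above can be checked directly: response entropy can stay high while mutual information with the prompt goes to zero. A toy sketch of that diagnostic follows, under the assumption that responses have been discretized (e.g., into template or cluster ids); this is not the RAGEN-v2 implementation.

```python
# "High entropy, low mutual information": responses look diverse overall,
# yet carry no information about which prompt produced them.
from collections import Counter
from math import log2

def entropy(counts):
    total = sum(counts.values())
    return -sum(c / total * log2(c / total) for c in counts.values())

def mutual_information(pairs):  # pairs: (prompt_id, response_id)
    joint = Counter(pairs)
    px = Counter(p for p, _ in pairs)
    py = Counter(r for _, r in pairs)
    n = len(pairs)
    return sum(c / n * log2((c / n) / ((px[p] / n) * (py[r] / n)))
               for (p, r), c in joint.items())

# Collapsed agent: varied responses, but the distribution ignores the prompt.
pairs = [(p, r) for p in range(4) for r in range(4)]  # response independent of prompt
print(entropy(Counter(r for _, r in pairs)))          # H = 2 bits: looks diverse
print(mutual_information(pairs))                      # MI = 0: says nothing about the input
```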
Linjie (Lindsey) Li retweeted
Jiawei Gu @Kuvvius
โ›”๏ธ Can MLLMs truly learn WHEN and HOW to use tools? (๐Ÿ› AdaReasoner says: yes!! Likeโ€ฆ actually decide: - โ€œShould I call a tool right now?โ€ - โ€œWhich one?โ€ - โ€œHow many times?โ€ What happened surprised us: a 7B model beats GPT-5 on visual tool-reasoningโ€”and shows adaptive behaviors we never programmed. (1/17)๐Ÿงต๐Ÿ‘‡ ๐Ÿ“„ arxiv.org/abs/2601.18631 ๐ŸŒ adareasoner.github.io
[GIF]
1 reply · 5 retweets · 15 likes · 6.4K views
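The three decisions the tweet lists map naturally onto a control loop. Here is a hedged sketch of such a loop; `policy`, the tool registry, and the action format are hypothetical stand-ins for illustration, not the AdaReasoner interface.

```python
# Sketch of an adaptive tool-use loop: at each step the model decides whether
# to call a tool at all, which one, and implicitly how many times (by when it
# chooses to answer). All names here are illustrative assumptions.
def adaptive_tool_reasoning(question, policy, tools, max_calls=5):
    context = [("question", question)]
    for _ in range(max_calls):                       # "how many times" is learned; capped here
        action = policy(context)                     # e.g. {"tool": ..., "args": ...} or {"answer": ...}
        if "answer" in action:                       # "should I call a tool right now?" -> no
            return action["answer"]
        result = tools[action["tool"]](**action["args"])   # "which one?"
        context.append((action["tool"], result))
    return policy(context + [("forced", "answer now")])["answer"]

# Toy usage with one fake tool and a scripted policy.
tools = {"lookup": lambda key: {"pi": "3.14159"}.get(key, "?")}
script = iter([{"tool": "lookup", "args": {"key": "pi"}}, {"answer": "pi is about 3.14159"}])
print(adaptive_tool_reasoning("What is pi?", lambda ctx: next(script), tools))
```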
Linjie (Lindsey) Li retweeted
Zihan "Zenus" Wang @wzenus
AlphaGo's 10-year anniversary today - huge milestone for RL! Small serendipity: it's also 1 year since we released RAGEN, our LLM Agent RL framework. Some thoughts on the past decade of RL, plus a major RAGEN update on reasoning collapse in Agent RL coming soon.

1/ Ten years ago, on Jan 27, DeepMind brought AlphaGo to the world. Back then, RL felt mythic. For the first time, it reached top professional level in a domain that demands long-horizon planning, having already gone 5-0 against the European champion. That moment made a lot of people truly believe this: a policy can "grow out of interaction" instead of being hand-coded or hand-taught.

One year ago, on Jan 27, we released RAGEN, an RL codebase for LLM agents. We started applying RL with verifiable rewards beyond 'winning a game' to large reasoning models that can plan and interact with the world. RL is no longer just about winning inside a closed board. It now plays out in a more open, long-horizon training loop that can resemble parts of the real world. But in this year, we also saw a quieter kind of collapse. It does not always look like failure. Sometimes it looks stable. Sometimes it even looks safer and more consistent. Yet the policy slowly turns into a "persona", a "template", a "low-effort sense of security". So I've increasingly felt that RAGEN isn't just a system. For me, it reads more like the second half of a decade-long thread I've been watching unfold. The first half: "RL can learn reasoning." The second half: "RL can also quietly collapse if we don't have the right diagnostics." It feels like a time marker: ten years later, we're finally forced to look beyond reward and ask what stays input-conditioned, and what drifts.

2/ If I use this coincidence as an anchor, I would split the last decade of RL into three chapters. The AlphaGo era: RL proved itself on long-horizon planning, showing that policies can emerge from interaction. The RLHF era: RL moved from winning games to alignment, becoming a core mechanism for making language models track human preferences and a key part of many products today. The LLM Agent RL era: RL enters closed-loop, multi-turn self-training. The LLM agent learns more than answers. It learns plans, tools, revisions, reflection, and behavioral consistency across longer time scales. Put together, these chapters point to a missing piece for me: we still lack a clear, shared vocabulary and practical gauges for "failure modes in LLM Agent RL". Progress has been fast on the capability side. But the language and gauges for how LLM agents degrade, especially in closed-loop training, still feel less settled. That's the piece we've been trying to put words and measurements to this year.

3/ A decade after AlphaGo, a lot of the attention and resources in RL do seem to be shifting from closed worlds like board games toward systems like LLM agents. At the same time, closed-loop self-training can introduce a more systemic risk. In a loop of self-sampling and self-updating, a model can gradually settle into a "task-insensitive but cheaper" strategy. It does not look terrible. It may even look safe and consistent. But it slowly loses prompt "discriminability". It can lose the property that makes reasoning actually change with the input. I like to define this with one sentence: "training continues, but learning is idling". Rewards still move. Gradients still update. But the information is already dry. The policy solidifies toward templates, inertia, and risk-avoidance. One transferable takeaway from our year with RAGEN is this: in LLM Agent RL, it's not enough to only watch the reward or success rate. You must also watch whether "input-conditioned information" is still flowing, whether the LLM agent is still sensitive to the task. We are now preparing a new version of RAGEN. You do not need to believe any result in advance. But we will make this line much clearer: how the battlefield shifts, how the new collapses happen, and which diagnostic view is the most actionable.

4/ Here I want to write something more personal, because this part wasn't "thought up". It was almost collided into. Right before writing this, I was sprinting on the new RAGEN. After days of deadline pressure, I finally took a breath and noticed the date coincidence. Thinking about the past year, I started crying. When I actually began typing, the tears had just stopped. I looked at the time. It was 5pm, Jan 20, 2026, and my screen had gone dark. The contrast made the point feel sharper. This year wasn't about "one more loss term" or "one more trick". It was about a latent variable that kept showing up in closed-loop LLM Agent RL, but is hard to name cleanly: whether the agent's reasoning is still tied to the input. Training can keep running while reasoning drifts into templates, inertia, and avoidance. Reward can still move while prompt discriminability quietly erodes. "More stable, more certain" can sometimes just mean "less sensitive, less distinctive". Collapse is rarely a sudden crash. It's usually a slow drift that looks fine from the outside. That's what I mean by a quiet failure mode. Not bad news, just something we'd benefit from better gauges for. And on a personal note, learning to notice this earlier has changed how I work. The hits still come. I just recover faster, and keep moving.

5/ Then I looked back at the past year's timeline and noticed another coincidence. DeepSeek-R1 landed on Jan 20, 2025, the same date I happened to notice the AlphaGo/RAGEN alignment. I'll treat it as coincidence, but it did make the moment feel unexpectedly vivid. Since then, I've been jokingly calling 01/20 my "dark mode day".
[image]
2 replies · 12 retweets · 51 likes · 52.1K views
Linjie (Lindsey) Li retweeted
LMSYS Org @lmsysorg
How long have you been "planning to understand" how modern LLM inference works? We just gave you a readable version of SGLang you can finish over the weekend. Introducing mini-SGLang ⚡ We distilled SGLang from 300K lines down to 5,000. Kept the core design, cut the complexity, without sacrificing performance: nearly identical to SGLang online. It is built for engineers, researchers, and students who want to see how inference really works and learn better from code than from papers. ⭐ Star us on GitHub: github.com/sgl-project/mi… 🧵 (1/3)
[two images]
32 replies · 186 retweets · 1.2K likes · 238.2K views
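The core thing a readable inference engine teaches is the serving loop: continuous batching, one decode step per iteration, and slots freed as sequences finish. Below is a schematic in that spirit; `model_step`, the request format, and all details are assumptions for illustration, not mini-SGLang's actual code.

```python
# Schematic of a minimal serving loop: a scheduler admits requests into a
# running batch, each forward step decodes one token for every active
# sequence, and finished sequences free their slots for waiting requests.
from collections import deque

def serve(requests, model_step, max_batch=8, eos=0):
    waiting, running, done = deque(requests), {}, {}
    while waiting or running:
        while waiting and len(running) < max_batch:   # continuous batching: admit eagerly
            rid, prompt = waiting.popleft()
            running[rid] = list(prompt)
        next_tokens = model_step(dict(running))       # one batched forward pass
        for rid, tok in next_tokens.items():
            running[rid].append(tok)
            if tok == eos:                            # sequence finished -> free the slot
                done[rid] = running.pop(rid)
    return done

# Toy usage: a "model" that emits EOS immediately for every active sequence.
print(serve([(1, [5, 7]), (2, [9])], lambda batch: {rid: 0 for rid in batch}))
```

A real engine layers KV-cache management, prefill/decode separation, and sampling on top of exactly this skeleton, which is why a 5,000-line version is plausible to read in a weekend.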
Linjie (Lindsey) Li retweeted
Jiawei Gu @Kuvvius
🔥 Big news! Our ThinkMorph will be discussed at MAR 2025 @ NeurIPS! Two keynotes you don't want to miss:
1. Lindsey Li @LINJIEFUN: Pictures think harder than words - Evaluating and Building Visual Thinking in Multimodal Models (9:45 AM, Dec 7)
2. Yu Cheng @YuCheng348997: Towards Better Multimodal Reasoning: From Staged Reinforcement Learning to Visually-Perceptive Policy Optimization (2:45 PM, Dec 7)
👀 Let's talk about what it really takes to make models see and think like humans! 📍 Room 11AB, San Diego Convention Center 🔗 marworkshop.github.io/neurips25/ #NeurIPS2025 #ThinkMorph #Multimodal
[image]
9 replies · 4 retweets · 20 likes · 1K views
Linjie (Lindsey) Li retweeted
Liwei Jiang @liweijianglw
Super happy to receive the Best Paper Award at #NeurIPS2025 for our Artificial Hivemind paper!! (Really enjoyed giving the oral talk at NeurIPS as well!)
[two images]
Liwei Jiang @liweijianglw

โš ๏ธDifferent models. Same thoughts.โš ๏ธ Todayโ€™s AI models converge into an ๐€๐ซ๐ญ๐ข๐Ÿ๐ข๐œ๐ข๐š๐ฅ ๐‡๐ข๐ฏ๐ž๐ฆ๐ข๐ง๐ ๐Ÿ, a striking case of mode collapse that persists even across heterogeneous ensembles. Our #neurips2025 ๐ƒ&๐ ๐Ž๐ซ๐š๐ฅ ๐ฉ๐š๐ฉ๐ž๐ซ (โœจ๐ญ๐จ๐ฉ ๐ŸŽ.๐Ÿ‘๐Ÿ“%โœจ) dives deep into this phenomenon, introducing ๐ˆ๐ง๐Ÿ๐ข๐ง๐ข๐ญ๐ฒ-๐‚๐ก๐š๐ญ, a real-world dataset of 26K real-world open-ended user queries spanning 17 open-ended categories + 31K dense human annotations (๐Ÿ๐Ÿ“ ๐ข๐ง๐๐ž๐ฉ๐ž๐ง๐๐ž๐ง๐ญ ๐š๐ง๐ง๐จ๐ญ๐š๐ญ๐จ๐ซ๐ฌ ๐ฉ๐ž๐ซ ๐ž๐ฑ๐š๐ฆ๐ฉ๐ฅ๐ž) to push AIโ€™s creative and discovery potential forward. Now you can build your favorite models to be truly original, diverse, and impactful in the open-ended real world. ๐Ÿ“Paper: arxiv.org/abs/2510.22954 ๐Ÿ“Data: huggingface.co/collections/liโ€ฆ We also systematically reveal Artificial Hivemind across: ๐Ÿ’ฅ ๐†๐ž๐ง๐ž๐ซ๐š๐ญ๐ข๐ฏ๐ž ๐š๐›๐ข๐ฅ๐ข๐ญ๐ข๐ž๐ฌ: not only do individual LLMs repeat themselves, but different models produce strikingly similar content, even when asked fully open-ended questions. ๐Ÿ’ฅ ๐ƒ๐ข๐ฌ๐œ๐ซ๐ข๐ฆ๐ข๐ง๐š๐ญ๐ข๐ฏ๐ž ๐š๐›๐ข๐ฅ๐ข๐ญ๐ข๐ž๐ฌ: LLMs, LM judges, and reward models are systematically miscalibrated when rating alternative responses to open-ended queries. (1/N)

37 replies · 67 retweets · 781 likes · 80.1K views
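A quick way to see what "different models, same thoughts" means operationally: sample responses to one open-ended prompt from several models and measure pairwise similarity. The embedding and data below are toy assumptions for illustration, not the paper's Infinity-Chat methodology.

```python
# Toy "hivemind" check: mean pairwise cosine similarity across responses that
# supposedly come from different models. Values near 1.0 suggest mode collapse.
import itertools
import numpy as np

def hivemind_score(responses, embed):
    vecs = [embed(r) for r in responses]
    sims = [float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
            for a, b in itertools.combinations(vecs, 2)]
    return sum(sims) / len(sims)

# Hypothetical embedding: bag-of-words over a tiny vocabulary (a real check
# would use a sentence-embedding model instead).
vocab = ["dragon", "moon", "robot", "tea"]
embed = lambda text: np.array([text.count(w) for w in vocab], dtype=float) + 1e-9

responses = ["a robot drinking tea", "a robot sipping tea", "a robot and some tea"]
print(hivemind_score(responses, embed))  # high similarity despite "different models"
```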
Linjie (Lindsey) Li retweeted
Alisa Liu @alisawuffles
Presenting Broken Tokens at the 4:30pm poster session today with @s_zhengbr! We'll demystify how LMs can understand brand new tokenizations ([␣, c, a, t] instead of ␣cat) entirely at test-time 😱
Alisa Liu @alisawuffles

It began from a 🤯🤯 observation: when giving LMs text tokenized at *character* level, their generation seemed virtually unaffected - even tho these token seqs are provably never seen in training! Suggests: functional char-level understanding, and tokenization as test-time control.

1 reply · 3 retweets · 32 likes · 4.9K views
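The setup is easy to reproduce with any BPE tokenizer: encode a word normally, then as individual characters, and note that both decode to identical text while only one sequence ever occurs in training. A sketch using GPT-2 via Hugging Face `transformers` as an illustrative model choice (requires the package and a tokenizer download); the finding itself is the paper's.

```python
# Build the "[space, c, a, t] instead of ' cat'" token sequences the poster
# describes, using GPT-2's tokenizer as an illustrative example.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

text = " cat"                                  # normally a single BPE token
standard_ids = tok.encode(text)
char_ids = [tok.encode(ch)[0] for ch in text]  # one token per character

print(standard_ids, "->", tok.convert_ids_to_tokens(standard_ids))
print(char_ids, "->", tok.convert_ids_to_tokens(char_ids))
# Both sequences decode to the same string, but only the standard one
# appears in training data - feeding the char-level one probes test-time
# understanding of unseen tokenizations.
assert tok.decode(standard_ids) == tok.decode(char_ids) == text
```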
Linjie (Lindsey) Li retweeted
Xueyan Zou @xyz2maureen
I will join Tsinghua University, College of AI, as an Assistant Professor in the coming month. I am actively looking for 2026 spring interns and future PhDs (ping me if you are at #NeurIPS). It has been an incredible journey of 10 years since I attended an activity organized by Tsinghua University and, inspired by one of the teammates, decided to change my undergraduate major from Economics to Computer Science. During these 10 years I have met many wonderful researchers and professors, whom I deeply appreciate and who led me to continued growth. 🐿️ My research focus will continue to be AI & Robotics, with a specific emphasis on Interactive Embodied Intelligence. You can check my homepage to learn more: maureenzou.github.io/lab.html. I am currently local to San Diego and will be attending #NeurIPS. Please ping me over WeChat or email if any old or new friends are interested in having a coffee chat! (Really looking forward to meeting as many friends as possible at #NeurIPS) [The photo is one of the places that I will miss a lot in the US]
[image]
69 replies · 87 retweets · 1.1K likes · 111.1K views
Linjie (Lindsey) Li retweeted
Manling Li @ManlingLi_
VAGEN poster at #NeurIPS: ⏲️ 11am-2pm Wed 📍 Exhibit Hall C,D,E #5502 We look forward to discussing with you:
1. MDP → POMDP
2. World modeling in the agent's internal belief
3. What is a good representation of visual states in the agent's internal belief?
4. How to use world modeling to help reward shaping?
5. How to do turn-level critic learning?
Drop by if you are interested in related topics!
Zihan "Zenus" Wang @wzenus

VAGEN poster tomorrow at #NeurIPS! 🎮🧠 - 🕚 11am-2pm Wed - 📍 Exhibit Hall C,D,E #5502 We had much fun exploring:
• How world modeling helps VLM RL agents learn better policies
• Multi-turn PPO credit assignment via a two-level advantage estimator (Bi-Level GAE) for turn-level and token-level critic learning
Come chat about agents, RL, and world models 👀

3 replies · 19 retweets · 119 likes · 15.5K views
Linjie (Lindsey) Li retweeted
dr. jack morris @jxmnop
Wondering how to attend an ML conference the right way? Ahead of NeurIPS 2025 (30k attendees!) here are ten pro tips:
1. Your main goals: (i) meet people (ii) regain excitement about work (iii) learn things - in that order.
2. Make a list of papers you like and seek them out at poster sessions. Try to talk to the authors - you can learn much more from them than from a PDF.
3. Pick one workshop and one tutorial that sound most interesting. Skip the rest.
4. Cold email people you want to meet but haven't. Check Twitter and the accepted papers list. PhD students are especially responsive.
5. Practice a concise pitch of unpublished research you're working on for "what are you interested in rn?". Focus on big unanswered questions and exciting new directions, *not* papers.
6. Skip the orals. Posters are higher-bandwidth, more engaging, more invigorating. Orals are a good time to go for a walk or talk in the hallway.
7. For the love of god, do NOT work on other research in your hotel room. Save mental bandwidth for the conference. (This may seem obvious; you'd be surprised.)
8. Talk to people outside your area. There are many smart people working on niches <10 people understand. Learn about one or two that won't help your own work.
9. Attend one social each night. Don't overthink it or get caught up in status games. They're all fun.
10. Take breaks. You can't go to everything, and conferences consume more energy than a normal workweek.
Hope this helps, and sad I'm not attending NeurIPS - have fun :)
[image]
28 replies · 128 retweets · 1.5K likes · 136.3K views
Linjie (Lindsey) Li retweeted
Zihan "Zenus" Wang @wzenus
VAGEN poster tomorrow at #NeurIPS! 🎮🧠 - 🕚 11am-2pm Wed - 📍 Exhibit Hall C,D,E #5502 We had much fun exploring:
• How world modeling helps VLM RL agents learn better policies
• Multi-turn PPO credit assignment via a two-level advantage estimator (Bi-Level GAE) for turn-level and token-level critic learning
Come chat about agents, RL, and world models 👀
[image]
Zihan "Zenus" Wang @wzenus

🚀 Excited to share our NeurIPS 2025 paper VAGEN, a scalable RL framework that trains VLM agents to reason as world models. VLM agents often act without tracking the world: they lose state, fail to anticipate effects, and RL wobbles under sparse, late rewards. Our solution is clear:
- VAGEN guides the VLM to structure its thinking into StateEstimation (what is the current visual state?) and TransitionModeling (what happens after an action).
- Further, we optimize with Bi-Level GAE for long-horizon credit assignment (turn level to token level) and a WorldModeling Reward that scores each turn's observation and prediction against ground truth.
On five agentic tasks spanning games, navigation, and robotic manipulation, we boost a 3B model from 0.21 -> 0.82 and outperform proprietary reasoning models like GPT-5! Paper & project: mll.lab.northwestern.edu/VAGEN

0 replies · 10 retweets · 60 likes · 22.8K views
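The two-level advantage idea described above (turn-level credit first, then token-level credit within each turn) can be sketched with scalar toy values. The recursion below is standard GAE applied twice, with the turn-level result reused as the within-turn bootstrap target; this is one plausible reading for illustration, not the VAGEN implementation.

```python
# Toy sketch of a bi-level advantage estimator: GAE over turns (sparse,
# turn-level rewards), then GAE over tokens within each turn, bootstrapping
# from the turn-level result. Scalar values; all numbers are made up.
def gae(rewards, values, next_value, gamma=0.99, lam=0.95):
    advantages, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        nv = values[t + 1] if t + 1 < len(values) else next_value
        delta = rewards[t] + gamma * nv - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

def bilevel_gae(turn_rewards, turn_values, token_values_per_turn):
    turn_adv = gae(turn_rewards, turn_values, next_value=0.0)     # level 1: across turns
    token_adv = []
    for i, token_values in enumerate(token_values_per_turn):      # level 2: within a turn
        target = turn_values[i] + turn_adv[i]                     # turn return as bootstrap
        token_adv.append(gae([0.0] * len(token_values), token_values, next_value=target))
    return turn_adv, token_adv

# Two turns, reward only at the end; two token-level values per turn.
turn_adv, token_adv = bilevel_gae([0.0, 1.0], [0.4, 0.6], [[0.4, 0.5], [0.55, 0.6]])
print(turn_adv, token_adv)
```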
Linjie (Lindsey) Li retweeted