Linjie (Lindsey) Li
@LINJIEFUN · 189 posts
researching @Microsoft, @UW, contributed to https://t.co/VzcJa9Skx3
🚨 Sensational title alert: we may have cracked the code to true multimodal reasoning. Meet ThinkMorph: thinking in modalities, not just with them. And what we found was... unexpected. Emergent intelligence, strong gains, and … 🧵 arxiv.org/abs/2510.27492 (1/16)
⚠️ Different models. Same thoughts. ⚠️

Today's AI models converge into an Artificial Hivemind, a striking case of mode collapse that persists even across heterogeneous ensembles. Our #neurips2025 D&B oral paper (top …%) dives deep into this phenomenon, introducing Infinity-Chat, a dataset of 26K real-world, open-ended user queries spanning 17 categories, plus 31K dense human annotations (… independent annotators per example), to push AI's creative and discovery potential forward. Now you can build your favorite models to be truly original, diverse, and impactful in the open-ended real world.

Paper: arxiv.org/abs/2510.22954
Data: huggingface.co/collections/li…

We also systematically reveal the Artificial Hivemind across:
🔥 Generative abilities: not only do individual LLMs repeat themselves, but different models produce strikingly similar content, even when asked fully open-ended questions.
🔥 Discriminative abilities: LLMs, LM judges, and reward models are systematically miscalibrated when rating alternative responses to open-ended queries.

(1/N)
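
Not the paper's protocol, just a toy illustration of the effect: embed different models' answers to one open-ended prompt and check pairwise similarity. A minimal sketch assuming sentence-transformers is installed; the model names and answer strings are made-up placeholders.

```python
# Toy sketch (not the paper's evaluation): measure how similar different
# models' answers to the same open-ended query are. Names and answers
# below are illustrative placeholders, not real model outputs.
from itertools import combinations
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

answers = {
    "model_a": "Three date ideas: a picnic in the park, a museum visit, stargazing.",
    "model_b": "You could try a picnic in the park, visiting a museum, or stargazing.",
    "model_c": "How about a picnic, a trip to a local museum, or watching the stars?",
}
embeddings = {m: encoder.encode(a, convert_to_tensor=True) for m, a in answers.items()}

# High cosine similarity across *different* models on an open-ended
# prompt is the mode-collapse signature the post describes.
for m1, m2 in combinations(answers, 2):
    sim = util.cos_sim(embeddings[m1], embeddings[m2]).item()
    print(f"{m1} vs {m2}: cosine similarity = {sim:.2f}")
```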

It began from a 🤯🤯 observation: when given text tokenized at the *character* level, LMs' generation seemed virtually unaffected, even though these token sequences are provably never seen in training! This suggests functional character-level understanding, and tokenization as a form of test-time control.
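
A quick way to poke at this yourself (a hedged sketch, not the authors' setup): force character-level tokenization by encoding each character separately, then compare per-token loss against standard BPE. The two losses aren't directly comparable across tokenizations; the point is only that the model still behaves sensibly on never-seen token sequences.

```python
# Sketch of the probe (gpt2 is a stand-in model, not the authors' choice):
# feed the same text as standard BPE tokens vs. one-token-per-character,
# and compare per-token NLL.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

text = "The quick brown fox jumps over the lazy dog."
bpe_ids = tok(text, return_tensors="pt").input_ids
# Character-level: encode each character on its own, so the resulting
# token sequence is (almost surely) never seen verbatim in training.
char_ids = torch.tensor([[i for ch in text for i in tok.encode(ch)]])

with torch.no_grad():
    bpe_nll = model(bpe_ids, labels=bpe_ids).loss.item()
    char_nll = model(char_ids, labels=char_ids).loss.item()
print(f"per-token NLL  BPE: {bpe_nll:.3f}  char-level: {char_nll:.3f}")
```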
VAGEN poster tomorrow at #NeurIPS!
- 11am-2pm Wed
- Exhibit Hall C,D,E #5502

We had much fun exploring:
• How world modeling helps VLM RL agents learn better policies
• Multi-turn RL credit assignment via a two-level advantage estimator (Bi-Level GAE) for turn-level and token-level critic learning (toy sketch below)

Come chat about agents, RL, and world models!
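
For intuition, a hedged sketch of a two-level advantage estimator in the spirit of Bi-Level GAE: run GAE over turn-level rewards and values, then broadcast each turn's advantage to its tokens. Deliberately simplified; not the VAGEN implementation, which also learns a token-level critic.

```python
# Hedged sketch of a bi-level advantage estimator: turn-level GAE first,
# then each turn's advantage is shared by that turn's tokens.
# Toy version for illustration, not the VAGEN codebase.
def turn_level_gae(rewards, values, gamma=1.0, lam=0.95):
    adv, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        next_v = values[t + 1] if t + 1 < len(values) else 0.0  # terminal value 0
        delta = rewards[t] + gamma * next_v - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv

def bi_level_advantages(turn_rewards, turn_values, tokens_per_turn):
    turn_adv = turn_level_gae(turn_rewards, turn_values)
    # Token level (crudest variant): every token in a turn inherits
    # that turn's advantage.
    return [a for a, n in zip(turn_adv, tokens_per_turn) for _ in range(n)]

# Two turns with a sparse final reward; 4 and 3 generated tokens each.
print(bi_level_advantages([0.0, 1.0], [0.3, 0.6], [4, 3]))
```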
Excited to share our NeurIPS 2025 paper VAGEN, a scalable RL framework that trains VLM agents to reason as world models.

VLM agents often act without tracking the world: they lose state, fail to anticipate effects, and RL wobbles under sparse, late rewards. Our solution is clear:
- VAGEN guides the VLM to structure its thinking into StateEstimation (what is the current visual state?) and TransitionModeling (what happens after an action?).
- Further, we optimize with Bi-Level GAE for long-horizon credit assignment (turn level to token level) and a WorldModeling Reward that scores each turn's observation and prediction against ground truth.

On five agentic tasks spanning games, navigation, and robotic manipulation, we boost a 3B model from 0.21 -> 0.82 and outperform proprietary reasoning models like GPT-5!

Paper & project: mll.lab.northwestern.edu/VAGEN
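
To make the recipe concrete, a minimal sketch of the kind of structured turn this training targets: parse the state estimate, transition prediction, and action, then score the prediction against ground truth. The tag names and the token-overlap scorer are illustrative assumptions, not VAGEN's exact schema or reward function.

```python
# Illustrative sketch: split an agent turn into state estimation,
# transition prediction, and action, then compute a toy
# WorldModeling-style reward. Tags and the F1 scorer are assumptions.
import re

TURN = """<state>The red block is on the table; the gripper is open.</state>
<prediction>After 'grasp red block', the gripper holds the red block.</prediction>
<action>grasp red block</action>"""

def parse_turn(text):
    return {
        tag: (m.group(1).strip() if (m := re.search(rf"<{tag}>(.*?)</{tag}>", text, re.S)) else "")
        for tag in ("state", "prediction", "action")
    }

def token_f1(pred, gold):
    # Toy reward: bag-of-words F1 between prediction and ground truth.
    p, g = set(pred.lower().split()), set(gold.lower().split())
    if not p or not g:
        return 0.0
    prec, rec = len(p & g) / len(p), len(p & g) / len(g)
    return 0.0 if prec + rec == 0 else 2 * prec * rec / (prec + rec)

turn = parse_turn(TURN)
reward = token_f1(turn["prediction"], "the gripper now holds the red block")
print(f"world-modeling reward for this turn: {reward:.2f}")
```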

Glance: Accelerating Diffusion Models with 1 Sample


Introducing BrowseComp-Plus: a fairer and more transparent evaluation benchmark for Deep-Research agents, built on top of BrowseComp. It features:
- a fixed, carefully curated corpus of web documents
- human-verified positive documents
- challenging, web-mined hard negatives

With BrowseComp-Plus, you can thoroughly evaluate and compare the performance of different components in a deep-research system, e.g. GPT-5 + Qwen3-Embedding. Code, dataset, and leaderboard links are provided at the end of this thread.
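
As a sketch of what a fixed corpus plus verified positives enables: plug in any retriever and score it directly. The field names ("query", "positive_doc_ids") and the `retriever` callable are assumptions for illustration, not the benchmark's released API.

```python
# Hedged sketch: evaluate any retriever against a BrowseComp-Plus-style
# dataset with human-verified positive documents. Field names and the
# retriever interface are illustrative assumptions.
def recall_at_k(ranked_doc_ids, positive_ids, k=10):
    # Fraction of verified positives that appear in the top-k results.
    return len(set(ranked_doc_ids[:k]) & set(positive_ids)) / len(positive_ids)

def evaluate(retriever, examples, k=10):
    """`retriever(query)` returns doc ids ranked over the fixed corpus."""
    scores = [
        recall_at_k(retriever(ex["query"]), ex["positive_doc_ids"], k)
        for ex in examples
    ]
    return sum(scores) / len(scores)

# Usage (names assumed): evaluate(my_qwen3_embedding_retriever, dev_examples)
```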