Linjie (Lindsey) Li
@LINJIEFUN · 189 posts
researching @Microsoft, @UW, contributed to https://t.co/VzcJa9Skx3
🚨 Sensational title alert: we may have cracked the code to true multimodal reasoning. Meet ThinkMorph: thinking in modalities, not just with them. And what we found was... unexpected. Emergent intelligence, strong gains, and … 🧵 arxiv.org/abs/2510.27492 (1/16)
⚠️ Different models. Same thoughts. ⚠️

Today's AI models converge into an Artificial Hivemind, a striking case of mode collapse that persists even across heterogeneous ensembles. Our #neurips2025 D&B oral paper (top …%) dives deep into this phenomenon, introducing Infinity-Chat, a dataset of 26K real-world, open-ended user queries spanning 17 categories, plus 31K dense human annotations (… independent annotators per example), to push AI's creative and discovery potential forward. Now you can build your favorite models to be truly original, diverse, and impactful in the open-ended real world.

Paper: arxiv.org/abs/2510.22954
Data: huggingface.co/collections/li…

We also systematically reveal the Artificial Hivemind across:
🔥 Generative abilities: not only do individual LLMs repeat themselves, but different models produce strikingly similar content, even when asked fully open-ended questions.
🔥 Discriminative abilities: LLMs, LM judges, and reward models are systematically miscalibrated when rating alternative responses to open-ended queries.

(1/N)
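
Not the paper's protocol, just a toy illustration of the effect: embed different models' answers to one open-ended prompt and check pairwise similarity. A minimal sketch assuming sentence-transformers is installed; the model names and answer strings are made-up placeholders.

```python
# Toy sketch (not the paper's evaluation): measure how similar different
# models' answers to the same open-ended query are. Names and answers
# below are illustrative placeholders, not real model outputs.
from itertools import combinations
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

answers = {
    "model_a": "Three date ideas: a picnic in the park, a museum visit, stargazing.",
    "model_b": "You could try a picnic in the park, visiting a museum, or stargazing.",
    "model_c": "How about a picnic, a trip to a local museum, or watching the stars?",
}
embeddings = {m: encoder.encode(a, convert_to_tensor=True) for m, a in answers.items()}

# High cosine similarity across *different* models on an open-ended
# prompt is the mode-collapse signature the post describes.
for m1, m2 in combinations(answers, 2):
    sim = util.cos_sim(embeddings[m1], embeddings[m2]).item()
    print(f"{m1} vs {m2}: cosine similarity = {sim:.2f}")
```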

It began from a 🤯🤯 observation: when given text tokenized at the *character* level, LMs' generation seemed virtually unaffected, even though these token sequences are provably never seen in training! This suggests functional character-level understanding, and tokenization as a form of test-time control.
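
A quick way to poke at this yourself (a hedged sketch, not the authors' setup): force character-level tokenization by encoding each character separately, then compare per-token loss against standard BPE. The two losses aren't directly comparable across tokenizations; the point is only that the model still behaves sensibly on never-seen token sequences.

```python
# Sketch of the probe (gpt2 is a stand-in model, not the authors' choice):
# feed the same text as standard BPE tokens vs. one-token-per-character,
# and compare per-token NLL.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

text = "The quick brown fox jumps over the lazy dog."
bpe_ids = tok(text, return_tensors="pt").input_ids
# Character-level: encode each character on its own, so the resulting
# token sequence is (almost surely) never seen verbatim in training.
char_ids = torch.tensor([[i for ch in text for i in tok.encode(ch)]])

with torch.no_grad():
    bpe_nll = model(bpe_ids, labels=bpe_ids).loss.item()
    char_nll = model(char_ids, labels=char_ids).loss.item()
print(f"per-token NLL  BPE: {bpe_nll:.3f}  char-level: {char_nll:.3f}")
```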
VAGEN poster tomorrow at #NeurIPS!
- 11am-2pm Wed
- Exhibit Hall C,D,E #5502

We had much fun exploring:
• How world modeling helps VLM RL agents learn better policies
• Multi-turn RL credit assignment via a two-level advantage estimator (Bi-Level GAE) for turn-level and token-level critic learning (toy sketch below)

Come chat about agents, RL, and world models!
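
For intuition, a hedged sketch of a two-level advantage estimator in the spirit of Bi-Level GAE: run GAE over turn-level rewards and values, then broadcast each turn's advantage to its tokens. Deliberately simplified; not the VAGEN implementation, which also learns a token-level critic.

```python
# Hedged sketch of a bi-level advantage estimator: turn-level GAE first,
# then each turn's advantage is shared by that turn's tokens.
# Toy version for illustration, not the VAGEN codebase.
def turn_level_gae(rewards, values, gamma=1.0, lam=0.95):
    adv, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        next_v = values[t + 1] if t + 1 < len(values) else 0.0  # terminal value 0
        delta = rewards[t] + gamma * next_v - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv

def bi_level_advantages(turn_rewards, turn_values, tokens_per_turn):
    turn_adv = turn_level_gae(turn_rewards, turn_values)
    # Token level (crudest variant): every token in a turn inherits
    # that turn's advantage.
    return [a for a, n in zip(turn_adv, tokens_per_turn) for _ in range(n)]

# Two turns with a sparse final reward; 4 and 3 generated tokens each.
print(bi_level_advantages([0.0, 1.0], [0.3, 0.6], [4, 3]))
```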
Excited to share our NeurIPS 2025 paper VAGEN, a scalable RL framework that trains VLM agents to reason as world models.

VLM agents often act without tracking the world: they lose state, fail to anticipate effects, and RL wobbles under sparse, late rewards. Our solution is clear:
- VAGEN guides the VLM to structure its thinking into StateEstimation (what is the current visual state?) and TransitionModeling (what happens after an action?).
- Further, we optimize with Bi-Level GAE for long-horizon credit assignment (turn level to token level) and a WorldModeling Reward that scores each turn's observation and prediction against ground truth.

On five agentic tasks spanning games, navigation, and robotic manipulation, we boost a 3B model from 0.21 -> 0.82 and outperform proprietary reasoning models like GPT-5!

Paper & project: mll.lab.northwestern.edu/VAGEN
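
To make the recipe concrete, a minimal sketch of the kind of structured turn this training targets: parse the state estimate, transition prediction, and action, then score the prediction against ground truth. The tag names and the token-overlap scorer are illustrative assumptions, not VAGEN's exact schema or reward function.

```python
# Illustrative sketch: split an agent turn into state estimation,
# transition prediction, and action, then compute a toy
# WorldModeling-style reward. Tags and the F1 scorer are assumptions.
import re

TURN = """<state>The red block is on the table; the gripper is open.</state>
<prediction>After 'grasp red block', the gripper holds the red block.</prediction>
<action>grasp red block</action>"""

def parse_turn(text):
    return {
        tag: (m.group(1).strip() if (m := re.search(rf"<{tag}>(.*?)</{tag}>", text, re.S)) else "")
        for tag in ("state", "prediction", "action")
    }

def token_f1(pred, gold):
    # Toy reward: bag-of-words F1 between prediction and ground truth.
    p, g = set(pred.lower().split()), set(gold.lower().split())
    if not p or not g:
        return 0.0
    prec, rec = len(p & g) / len(p), len(p & g) / len(g)
    return 0.0 if prec + rec == 0 else 2 * prec * rec / (prec + rec)

turn = parse_turn(TURN)
reward = token_f1(turn["prediction"], "the gripper now holds the red block")
print(f"world-modeling reward for this turn: {reward:.2f}")
```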

Glance: Accelerating Diffusion Models with 1 Sample


Introducing BrowseComp-Plus: a fairer and more transparent evaluation benchmark for Deep-Research agents, built on top of BrowseComp. It features:
- a fixed, carefully curated corpus of web documents
- human-verified positive documents
- challenging, web-mined hard negatives

With BrowseComp-Plus, you can thoroughly evaluate and compare the performance of different components in a deep-research system, e.g. GPT-5 + Qwen3-Embedding. Code, dataset, and leaderboard links are provided at the end of this thread.
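
As a sketch of what a fixed corpus plus verified positives enables: plug in any retriever and score it directly. The field names ("query", "positive_doc_ids") and the `retriever` callable are assumptions for illustration, not the benchmark's released API.

```python
# Hedged sketch: evaluate any retriever against a BrowseComp-Plus-style
# dataset with human-verified positive documents. Field names and the
# retriever interface are illustrative assumptions.
def recall_at_k(ranked_doc_ids, positive_ids, k=10):
    # Fraction of verified positives that appear in the top-k results.
    return len(set(ranked_doc_ids[:k]) & set(positive_ids)) / len(positive_ids)

def evaluate(retriever, examples, k=10):
    """`retriever(query)` returns doc ids ranked over the fixed corpus."""
    scores = [
        recall_at_k(retriever(ex["query"]), ex["positive_doc_ids"], k)
        for ex in examples
    ]
    return sum(scores) / len(scores)

# Usage (names assumed): evaluate(my_qwen3_embedding_retriever, dev_examples)
```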