Linjie (Lindsey) Li

199 posts


@LINJIEFUN

researching @Microsoft, @UW, contributed to https://t.co/VzcJa9Skx3

Seattle, WA · Joined August 2012
528 Following · 2.9K Followers
Linjie (Lindsey) Li retweeted
Zhihao Jia@JiaZhihao·
🚀Introducing Motus, the open-source agent infrastructure that learns in production.

Existing agent infra serves static agents: the harness, model, and workflow are fixed after deployment. But static agents degrade over time. The harness goes stale, new models go unincorporated, context drifts, and latency compounds.

Motus closes this gap by learning from every trace (failures, latency, cost, and task outcomes) and using those signals to continuously optimize agent harness, model orchestration, context memory, and end-to-end latency.

Early results: higher accuracy than any single frontier model at 2.3× lower cost (Terminal-Bench 2.0, SWE-bench Verified), with 52% lower latency and 45% better memory recall.

Open source under Apache 2.0. Works with any agent SDK. Deploy with one command.
github.com/lithos-ai/motus
lithosai.com
22 replies · 71 reposts · 566 likes · 56.7K views
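A rough sketch of the trace-driven loop the tweet describes, for readers who want the shape of the idea: log per-run traces (outcome, latency, cost) for each configuration and keep re-selecting whatever the traces currently favor. All names below are assumptions for illustration, not the Motus API.

```python
# Hypothetical sketch of a trace-driven optimization loop (illustrative, NOT the Motus API).
from dataclasses import dataclass, field
from statistics import mean

@dataclass
class Trace:
    config: str        # which harness/model configuration produced this run
    success: bool      # task outcome
    latency_s: float
    cost_usd: float

@dataclass
class TraceStore:
    traces: list[Trace] = field(default_factory=list)

    def log(self, trace: Trace) -> None:
        self.traces.append(trace)

    def score(self, config: str) -> float:
        """Score a configuration from its traces: reward successes, penalize latency and cost."""
        runs = [t for t in self.traces if t.config == config]
        if not runs:
            return float("-inf")
        return (mean(t.success for t in runs)
                - 0.01 * mean(t.latency_s for t in runs)
                - 0.10 * mean(t.cost_usd for t in runs))

def pick_config(store: TraceStore, configs: list[str]) -> str:
    """Re-select, as traces accumulate, the configuration the data currently favors."""
    return max(configs, key=store.score)
```

In a real system the "configurations" would be harness variants, model choices, or context-memory settings, and the scoring weights would be tuned to the workload.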
Linjie (Lindsey) Li retweeted
Tristan Thrush@TristanThrush·
New paper! Want to precisely optimize synthetic training data to do practical or even wacky things? Dataset Policy Gradients get you there, letting you target any differentiable training or post-training metric. We embedded a QR code in GPT-2’s weights using only training data!
7 replies · 41 reposts · 221 likes · 46.3K views
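A toy sketch of the dataset-policy-gradient idea as described in the tweet: treat the training-data distribution as a learnable policy, sample a dataset, score a downstream metric, and push the policy toward datasets that score well. This is a generic REINFORCE-style illustration; the paper's actual estimator and metric may differ, and `downstream_metric` is a stand-in.

```python
# Toy "dataset policy gradient": the policy is a categorical distribution over candidate
# training examples; we sample a small dataset, evaluate a downstream metric, and update
# the policy with a score-function (REINFORCE) gradient so high-metric datasets get likelier.
import numpy as np

rng = np.random.default_rng(0)
n_candidates = 50
logits = np.zeros(n_candidates)          # parameters of the dataset policy

def downstream_metric(subset: np.ndarray) -> float:
    """Stand-in for 'train a model on subset, then evaluate any metric you care about'."""
    return float(np.mean(subset < 10))   # toy target: prefer the first 10 candidates

for step in range(200):
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    subset = rng.choice(n_candidates, size=8, p=probs)   # sample a training set
    reward = downstream_metric(subset)
    grad = np.zeros_like(logits)                         # score-function gradient estimate
    for i in subset:
        grad[i] += reward
    grad -= reward * 8 * probs                           # grad = reward * (counts - n * probs)
    logits += 0.5 * grad                                 # gradient ascent on the data policy
```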
Linjie (Lindsey) Li retweeted
Zihan "Zenus" Wang@wzenus·
Thanks @jiqizhixin for sharing our paper! 😍
机器之心 JIQIZHIXIN@jiqizhixin

Are your LLM agents truly reasoning, or just stuck repeating the same patterns? Zihan Wang @wzenus and a stellar team from Northwestern, Stanford, Microsoft, Oxford, and Imperial College London have uncovered "template collapse", a hidden flaw where LLM agents appear diverse but fail to adapt to new inputs. Their RAGEN-2 framework introduces Mutual Information to accurately measure true "cross-input distinguishability" and proposes SNR-Aware Filtering to select high-signal training prompts. This new metric and method vastly outperform current approaches, boosting LLM agent performance and input dependence across critical tasks like planning, math reasoning, web navigation, and code execution! And this paper is also #1 Paper of the day on Hugging Face!

0 replies · 5 reposts · 21 likes · 4.8K views
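A minimal sketch of the diagnostic the quoted tweet describes: if agent outputs have high entropy but near-zero mutual information with their inputs, the agent looks diverse while ignoring the prompt. The estimator below is a plug-in, count-based illustration over discretized outputs, not the RAGEN-2 implementation.

```python
# Template-collapse diagnostic sketch: estimate mutual information between inputs and
# (discretized) outputs. High output entropy with near-zero MI means outputs vary a lot
# but do not depend on the input.
from collections import Counter
from math import log2

def entropy(counts: Counter) -> float:
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())

def mutual_information(pairs: list[tuple[str, str]]) -> float:
    """I(X;Y) = H(Y) - H(Y|X) over (input, output-template) pairs."""
    y_counts = Counter(y for _, y in pairs)
    h_y = entropy(y_counts)
    h_y_given_x = 0.0
    x_counts = Counter(x for x, _ in pairs)
    for x, n_x in x_counts.items():
        y_given_x = Counter(y for xi, y in pairs if xi == x)
        h_y_given_x += (n_x / len(pairs)) * entropy(y_given_x)
    return h_y - h_y_given_x

# Collapsed agent: diverse-looking outputs, identically distributed for every input.
pairs = [("task_a", "plan_1"), ("task_a", "plan_2"), ("task_b", "plan_1"), ("task_b", "plan_2")]
print(mutual_information(pairs))  # ~0.0: high entropy, no input dependence
```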
Linjie (Lindsey) Li retweeted
Weikai Huang@weikaih04·
Thrilled to announce our latest project at @allen_ai @RAIVNLab: WildDet3D

Humans understand objects in 3D effortlessly -- we see a mug on a desk, judge the distance to a parked car, or estimate the height of a building across the street. For CV / Robotics models, this remains surprisingly hard.

We've built great models that each handle a piece of the puzzle: FoundationPose for 6-DoF pose over tabletops, MoGe 2 for accurate metric depth estimation, SAM for 2D segmentation and tracking. But they're fragmented -- each solves one sub-task, and none gives you the full picture: where is this object in 3D, how big is it, and how is it oriented?

Monocular 3D object detection is exactly this task -- recovering the full 3D bounding box of any object from a single RGB image. It's the missing link that connects 2D perception to real-world 3D understanding for robotics, AR/VR, and embodied AI.

So why hasn't anyone cracked open-world 3D detection? Data. Existing 3D datasets (Omni3D, COCO3D) cover fewer than 100 categories, locked to driving corridors and indoor rooms. The annotation methods -- BEV labelling, point cloud labelling -- fundamentally don't scale to in-the-wild scenes where you don't have LiDAR or a well-reconstructed point cloud. And in-the-wild objects are far more diverse in size and pose than vehicles and furniture.

To tackle this, we designed a human-in-the-loop pipeline. We build complex pseudo-3D box generators using different algorithms/models; then 1,700+ human annotators from Prolific select the best candidate and verify quality. After several months of annotation, the result is WildDet3D-Data: 1M total images, 13.5K object categories, with 100K human-verified 3D-detection images. That's 138x more category coverage than Omni3D. Street food carts, violins, traffic cones, sculptures -- objects no 3D dataset has ever covered.

With this data, we trained WildDet3D -- a single geometry-aware architecture built on SAM 3 and LingBot-Depth that unifies every way you'd want to interact with a 3D detector:
- Text: "find all chairs"
- Box prompt: click a 2D box, get its 3D box (geometric, one-to-one)
- Exemplar prompt: draw one box, find all similar objects (one-to-many)
- Point prompt: click on an object

And when you have extra depth -- LiDAR, stereo, anything -- just pass it in. The model fuses it and gets substantially better: +20.7 AP on average. No depth? It works fine without it.

Results on our new in-the-wild benchmark (WildDet3D-Bench, 700+ open-world categories): 22.6 AP text / 24.8 AP box -- up from 2.3 AP for the previous best. With depth: 41.6 AP text / 47.2 AP box. Also SOTA on Omni3D (34.2 AP text / 36.4 AP box) with 10x fewer training epochs, and strong zero-shot transfer to Argoverse 2 and ScanNet (40.3 / 48.9 ODS).
Ai2@allen_ai

Today we're releasing WildDet3D—an open model for monocular 3D object detection in the wild. It works with text, clicks, or 2D boxes, and on zero-shot evals it nearly doubles the best prior scores. 🧵

5 replies · 19 reposts · 85 likes · 19.2K views
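To make the prompting modes above concrete, here is a hypothetical interface sketch for a promptable monocular 3D detector. The class, method, and field names are assumptions for illustration only, not the released WildDet3D API.

```python
# Hypothetical interface for a promptable monocular 3D detector
# (text / 2D-box / exemplar / point prompts, optional depth fusion).
from dataclasses import dataclass

@dataclass
class Box3D:
    center_xyz: tuple[float, float, float]   # metric position in the camera frame
    size_whl: tuple[float, float, float]     # width, height, length in meters
    yaw: float                               # orientation around the vertical axis
    label: str

class PromptableDetector3D:
    def detect(self, image, prompt, depth=None) -> list[Box3D]:
        """Run one forward pass; fuse `depth` (LiDAR/stereo) if provided."""
        raise NotImplementedError  # placeholder for the actual model

# Usage sketch: the same model, driven by different prompt types.
# detector.detect(img, prompt={"text": "find all chairs"})
# detector.detect(img, prompt={"box_2d": (x0, y0, x1, y1)})          # one-to-one lifting
# detector.detect(img, prompt={"exemplar_2d": (x0, y0, x1, y1)})     # one-to-many search
# detector.detect(img, prompt={"point": (u, v)}, depth=lidar_depth)  # better with depth
```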
Linjie (Lindsey) Li retweeted
Zhengzhong Tu@_vztu·
We are entering the second half of research. Here is my advice to every PhD student before starting a project:
1. Can Claude Code solve it in a day?
2. Will a Research Agent solve it soon?
3. Will scaling solve it anyway?
If the answer to all three is No, then maybe you have found a real research problem.

Because in the age of AI, many things that looked like research are being revealed as delayed engineering. That does not make research less important. It makes problem selection more important than ever.

The scarce resource is no longer intelligence. It is taste. It is originality. It is the ability to ask questions that survive automation.

The first half of research was about solving hard problems. The second half is about knowing which problems are still worth solving.

#research #academic #AI #GenAI #generativeai #airesearch #taste
8 replies · 20 reposts · 145 likes · 43.3K views
Linjie (Lindsey) Li retweeted
Zixian Ma@zixianma02·
Ever come across a beautiful Figure 1 in a paper, only to wish you could easily edit and adapt it for your own use? Check out our new work VFig: Vectorizing Complex Figures in SVG with Vision-Language Models! It is a specialized VLM that converts any diagram – simple and complex – into editable and clean SVG code. Built on Qwen3-VL 4B with SFT & RL, it matches GPT5.2’s performance on converting complex diagrams into SVG code and outperforms open-source generalists and specialists on simple-to-complex diagram vectorization. 🕹️Try it now on our demo: tinyurl.com/vfig-demo
1 reply · 13 reposts · 51 likes · 10.4K views
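A minimal sketch of the figure-to-SVG workflow the tweet describes, assuming you have some callable that runs the vectorization model; `vlm_generate` and the prompt text are placeholders, not the VFig release's interface.

```python
# Sketch: ask a vision-language model for SVG and keep only the <svg>...</svg> block.
import re

def vectorize_figure(image_path: str, vlm_generate) -> str:
    """`vlm_generate` is any callable wrapping the vectorization model."""
    prompt = (
        "Convert this diagram into clean, editable SVG. "
        "Use <rect>, <circle>, <path> and <text> elements; no raster data."
    )
    raw = vlm_generate(image=image_path, prompt=prompt)
    match = re.search(r"<svg.*?</svg>", raw, flags=re.DOTALL)
    if match is None:
        raise ValueError("model output contained no SVG block")
    return match.group(0)

# Usage sketch:
# svg = vectorize_figure("figure1.png", vlm_generate=my_model)
# open("figure1.svg", "w").write(svg)
```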
Linjie (Lindsey) Li retweeted
Zixian Ma@zixianma02·
We built MolmoWeb from scratch with Molmo 2!!! 💕🌐 It’s not easy to build SOTA web agents out of open-source VLMs: they can be so profitable that very few projects release everything (if anything), especially the datasets 🔑 But we just released all the MolmoWeb model checkpoints and datasets from Ai2 😉 Can’t wait to see what the community builds on top of MolmoWeb! 🫡
Ai2@allen_ai

Today we're releasing MolmoWeb, an open source agent that can navigate + complete tasks in a browser on your behalf. Built on Molmo 2 in 4B & 8B sizes, it sets a new open-weight SOTA across four major web-agent benchmarks & even surpasses agents built on proprietary models. 🧵

10 replies · 25 reposts · 219 likes · 26.8K views
Linjie (Lindsey) Li retweeted
Xuhui Zhou@nlpxuhui·
Creating user simulators is key to evaluating and training models for user-facing agentic applications. But are stronger LLMs better user simulators? TL;DR: not really. We ran the largest sim2real study for AI agents to date: 31 LLM simulators vs. 451 real humans across 165 tasks. Here's what we found (co-led with @sunweiwei12).
8 replies · 68 reposts · 287 likes · 32.7K views
Linjie (Lindsey) Li retweeted
Jae Sung Park@jjaesungpark·
VLMs today—including our own Molmo—point via raw text strings (e.g. ""). What if pointing meant directly selecting the visual tokens instead? 🤔 Introducing MolmoPoint: Better Pointing for VLMs with Grounding Tokens 🎯 🔓models, code, data, demo all OPEN 🧵👇 Paper: allenai.org/papers/molmopo…
11 replies · 34 reposts · 349 likes · 48.2K views
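A toy contrast with the "raw text string" pointing described above: instead of decoding coordinates as text, score the image-patch (visual) tokens against a pointing query and return the best-matching patch location. This is an illustrative sketch, not the MolmoPoint architecture.

```python
# Grounding-token-style pointing sketch: pick the visual token that best matches a query.
import torch

def point_via_grounding_token(query_embed: torch.Tensor,
                              patch_embeds: torch.Tensor,
                              grid_hw: tuple[int, int]) -> tuple[int, int]:
    """Select the visual token whose embedding best matches the pointing query,
    then map its flat index back to a (row, col) patch location."""
    scores = patch_embeds @ query_embed          # (num_patches,) similarity scores
    idx = int(torch.argmax(scores))
    h, w = grid_hw
    return divmod(idx, w)                        # (row, col) of the selected patch

# Usage sketch with random features for a 24x24 patch grid:
patches = torch.randn(24 * 24, 512)
query = torch.randn(512)
print(point_via_grounding_token(query, patches, (24, 24)))
```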
Linjie (Lindsey) Li retweeted
Yulu Gan@yule_gan·
Simply adding Gaussian noise to LLMs (one step—no iterations, no learning rate, no gradients) and ensembling them can achieve performance comparable to or even better than standard GRPO/PPO on math reasoning, coding, writing, and chemistry tasks. We call this algorithm RandOpt. To verify that this is not limited to specific models, we tested it on Qwen, Llama, OLMo3, and VLMs. What's behind this? We find that in the Gaussian search neighborhood around pretrained LLMs, diverse task experts are densely distributed — a regime we term Neural Thickets. Paper: arxiv.org/pdf/2603.12228 Code: github.com/sunrainyg/Rand… Website: thickets.mit.edu
87 replies · 433 reposts · 3K likes · 687.9K views
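A sketch of the recipe as stated in the tweet: add one step of Gaussian noise to a pretrained model's weights (no gradients, no learning rate), do this many times, and keep or ensemble the perturbed copies that score well on a task. The selection and ensembling details of RandOpt itself may differ.

```python
# One-step Gaussian weight perturbation plus ensembling, no gradients (illustrative sketch).
import copy
import torch

def gaussian_neighbors(model: torch.nn.Module, sigma: float, n: int):
    """Yield n copies of `model` with independent Gaussian noise added to every weight."""
    for _ in range(n):
        candidate = copy.deepcopy(model)
        with torch.no_grad():
            for p in candidate.parameters():
                p.add_(sigma * torch.randn_like(p))   # one step, no iterations
        yield candidate

def select_experts(model, score_fn, sigma=1e-3, n=16, k=4):
    """Keep the k best-scoring perturbed models; predictions from these experts
    can then be ensembled (e.g. majority vote) at inference time."""
    candidates = list(gaussian_neighbors(model, sigma, n))
    candidates.sort(key=score_fn, reverse=True)
    return candidates[:k]
```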
Linjie (Lindsey) Li retweeted
Zihan "Zenus" Wang@wzenus·
In Agent RL, models suffer from Template Collapse. They generate vast, diverse outputs (High Entropy) that lose all meaningful connection to the input prompt (Low Mutual Information). In other words, agents learn different ways to say nothing. 🚀 Introducing RAGEN-v2 -- here's how we define and fix such silent failure modes in Agent RL. 🧵
12 replies · 60 reposts · 255 likes · 174.7K views
Linjie (Lindsey) Li retweeted
Jiawei Gu@Kuvvius·
⛔️ Can MLLMs truly learn WHEN and HOW to use tools? 🛠AdaReasoner says: yes!! Like… actually decide:
- “Should I call a tool right now?”
- “Which one?”
- “How many times?”
What happened surprised us: a 7B model beats GPT-5 on visual tool-reasoning—and shows adaptive behaviors we never programmed. (1/17)🧵👇
📄 arxiv.org/abs/2601.18631
🌐 adareasoner.github.io
1 reply · 6 reposts · 15 likes · 6.5K views
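A toy sketch of the decision loop the thread describes (whether to call a tool, which one, and how many times). The JSON control format and the tool set below are invented for illustration and are not AdaReasoner's interface.

```python
# Adaptive tool-use loop sketch: at each step the model decides whether to call a tool,
# which one, and when to stop and answer.
import json

TOOLS = {
    "zoom": lambda image, region: f"zoomed view of {region}",
    "ocr": lambda image, region=None: "recognized text",
}

def run_adaptive_reasoner(model_step, image, question, max_tool_calls=5):
    """`model_step` maps (image, context) to JSON: either a tool call such as
    {"action": "zoom", "args": {"region": [0, 0, 100, 100]}} or a final {"answer": "..."}."""
    context = [f"question: {question}"]
    for _ in range(max_tool_calls):
        decision = json.loads(model_step(image, context))
        if "answer" in decision:                           # the model chose to stop using tools
            return decision["answer"]
        args = decision.get("args", {})
        result = TOOLS[decision["action"]](image, **args)  # which tool, with what arguments
        context.append(f"{decision['action']} -> {result}")
    final = json.loads(model_step(image, context + ["tool budget exhausted"]))
    return final.get("answer", "")
```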
Linjie (Lindsey) Li retweeted
Zihan "Zenus" Wang@wzenus·
AlphaGo’s 10-year anniversary today — huge milestone for RL! Small serendipity: it’s also 1 year since we released 𝐑𝐀𝐆𝐄𝐍, our LLM Agent RL framework. Some thoughts on the past decade of RL, plus a major 𝐑𝐀𝐆𝐄𝐍 update on reasoning collapse in Agent RL coming soon.

1/ Ten years ago, on Jan 27, DeepMind brought AlphaGo to the world. Back then, RL felt mythic. For the first time, it reached top professional level in a domain that demands long-horizon planning, having already gone 5–0 against the European champion. That moment made a lot of people truly believe this: a policy can “grow out of interaction” instead of being hand-coded or hand-taught.

One year ago, on Jan 27, we released RAGEN, an RL codebase for LLM agents. We started applying RL with verifiable rewards beyond ‘winning a game’ to large reasoning models that can plan and interact with the world. RL is no longer just about winning inside a closed board. It now plays out in a more open, long-horizon training loop that can resemble parts of the real world.

But in this year, we also saw a quieter kind of collapse. It does not always look like failure. Sometimes it looks stable. Sometimes it even looks safer and more consistent. Yet the policy slowly turns into a “persona”, a “template”, a “low-effort sense of security”.

So I’ve increasingly felt that 𝐑𝐀𝐆𝐄𝐍 isn’t just a system. For me, it reads more like the second half of a decade-long thread I’ve been watching unfold. The first half: “RL can learn reasoning.” The second half: “RL can also quietly collapse if we don’t have the right diagnostics.” It feels like a time marker: ten years later, we’re finally forced to look beyond reward and ask what stays input-conditioned—and what drifts.

2/ If I use this coincidence as an anchor, I would split the last decade of RL into three chapters.
The AlphaGo era: RL proved itself on long-horizon planning. It proved policies can emerge from interaction.
The RLHF era: RL moved from winning games to alignment. It became a core mechanism that makes language models track human preferences, and a key part behind many products today.
The LLM Agent RL era: RL enters closed-loop, multi-turn self-training. The LLM agent learns more than answers. It learns plans, tools, revisions, reflection, and behavioral consistency across longer time scales.
Put together, these chapters point to a missing piece for me: we still lack a clear, shared vocabulary and practical gauges for “failure modes in LLM Agent RL”. Progress has been fast on the capability side. But the language and gauges for how LLM agents degrade—especially in closed-loop training—still feel less settled. That’s the piece we’ve been trying to put words and measurements to this year.

3/ A decade after AlphaGo, a lot of the attention and resources in RL do seem to be shifting from closed worlds like board games toward systems like LLM agents. At the same time, closed-loop self-training can introduce a more systemic risk. In a loop of self-sampling and self-updating, a model can gradually settle into a “task-insensitive but cheaper” strategy. It does not look terrible. It may even look safe and consistent. But it slowly loses prompt “discriminability”. It can lose the property that makes reasoning actually change with the input.

I like to define this with one sentence: “training continues, but learning is idling”. Rewards still move. Gradients still update. But the information is already dry. The policy solidifies toward templates, inertia, and risk-avoidance.

One transferable takeaway from our year with 𝐑𝐀𝐆𝐄𝐍 is this: in LLM Agent RL, it’s not enough to only watch the reward or success rate. You must also watch whether “input-conditioned information” is still flowing. You must watch whether the LLM agent is still sensitive to the task.

We are now preparing a new version of 𝐑𝐀𝐆𝐄𝐍. You do not need to believe any result in advance. But we will make this line much clearer: how the battlefield shifts, how the new collapses happen, and which diagnosis view is the most actionable.

4/ Here I want to write something more personal, because this part wasn’t “thought up”. It was almost collided into. Right before writing this, I was sprinting on the new 𝐑𝐀𝐆𝐄𝐍. After days of deadline pressure, I finally took a breath and noticed the date coincidence. Thinking about the past year, I started crying. When I actually began typing, the tears had just stopped. I looked at the time. It was 5pm, Jan 20, 2026, and my screen had gone dark.

The contrast made the point feel sharper. This year wasn’t about “one more loss term” or “one more trick”. It was about a latent variable that kept showing up in closed-loop LLM Agent RL, but is hard to name cleanly: whether the agent’s reasoning is still tied to the input. Training can keep running while reasoning drifts into templates, inertia, and avoidance. Reward can still move while prompt discriminability quietly erodes. “More stable, more certain” can sometimes just mean “less sensitive, less distinctive”. Collapse is rarely a sudden crash. It’s usually a slow drift that looks fine from the outside. That’s what I mean by a quiet failure mode. Not bad news, just something we’d benefit from better gauges for.

And on a personal note, learning to notice this earlier has changed how I work. The hits still come. I just recover faster, and keep moving.

5/ Then I looked back at the past year’s timeline and noticed another coincidence. DeepSeek-R1 landed on Jan 20, 2025 — the same date I happened to notice the AlphaGo/RAGEN alignment. I’ll treat it as coincidence, but it did make the moment feel unexpectedly vivid. Since then, I’ve been jokingly calling 01/20 my “dark mode day”.
2 replies · 12 reposts · 51 likes · 52.3K views