Xiaochuan Li

51 posts

@xiaochuanlee

Ph.D. student @LTIatCMU, working with @XiongChenyan. Previously @Tsinghua_Uni @Alibaba_Qwen

Joined June 2022
256 Following · 135 Followers
Pinned Tweet
Xiaochuan Li@xiaochuanlee·
Agentic test-time scaling (TTS) is effective -- until you hit its inherent limits. 💡We show that classic TTS methods offer limited practical gains due to two fundamental limitations: the context ceiling and the verification gap. 🧵 Check the website: general-agentbench.github.io
2 replies · 11 reposts · 27 likes · 2.2K views
Xiaochuan Li reposted
Shibo Hao@Ber18791531·
What makes us humans general agents? 🤔 It’s probably not that we’ve mastered every app or UI. New tools appear constantly, yet we become productive quickly. What really transfers are our cognitive abilities: how we perceive, reason, and use memory. 🧠

Introducing CocoaBench, an evaluation framework for general agents with compositional cognitive abilities. 👉 Features

1⃣ Complex, realistic tasks🧩
• Human-crafted tasks that are long-horizon, understandable, and span diverse scenarios and domains
• Assume only a small set of general tools (browser, terminal, file system; no per-task APIs)
• Challenging for existing agent systems: ChatGPT Agent reaches only a 44% success rate
• Check out our example tasks on the website (link in reply) and see if you can solve them. They’re fun 🙂

2⃣ Covering diverse cognitive abilities 🧠
CocoaBench covers different choices in the following dimensions of cognitive architecture:
• Perception: how the agent gathers and preprocesses information from websites, terminal outputs, files, and images
• Reasoning: planning, deductive/inductive/abductive inference, and both symbolic & visual reasoning skills
• Memory: managing working memory in the long horizon, saving and loading procedural memory (skills), etc.

3⃣ CocoaAgent framework🛠️
• Seamless integration with the AIO Sandbox for isolated browser / terminal / file operations
• Easy to plug in any models or your own custom agent
• Flexible evaluation functions

This is just our first release, and we’re actively expanding CocoaBench with more tasks, analyses, and agents. Excited to see what the community builds on top of it! Project website + more details in the thread👇
1 reply · 22 reposts · 57 likes · 16.5K views
Xuhui Zhou@nlpxuhui·
Thrilled to be a 2026 @MSFTResearch PhD Fellow! Huge thanks to my advisor @MaartenSap , collaborators, and friends ❤️❤️ Missions ahead 🫡
Microsoft Research@MSFTResearch

Today, we welcome the 2026 Microsoft Research Fellowship cohort, an inspiring global community of fellows and advisors helping to shape what’s next across science, technology, and society. Join us in celebrating this year’s recipients: msft.it/6013Q45bX

These contributions span the following themes:
• AI for global and societal impact
• AI fundamentals: scalable reasoning, model adaptation and evaluation
• Biological and scientific modeling
• Foundational systems & infrastructure for AI
• Human-AI collaboration and interaction
• Multimodal & embodied intelligence

9 replies · 3 reposts · 103 likes · 8.8K views
Xiaochuan Li@xiaochuanlee·
Thanks for sharing our work! If you are interested in the “upper bound of test-time scaling”, we welcome you to read it and give feedback!
Guilherme Favaron@guifav

Test-time scaling is the current bet for making LLM agents smarter: just give them more compute at inference. But does it actually work for general-purpose agents?

A new benchmark from @XiongChenyan and Xiaochuan Li at @CMU_LTI, with collaborators from @MetaAI, tested 10 leading agents across search, coding, reasoning, and tool use in a unified setting.

Two findings that should concern anyone building agent products:
1. Sequential scaling (longer interactions) hits a 'context ceiling' around 96K to 112K tokens. Beyond that, agents destabilize: more rounds of interaction make them worse, not better.
2. Parallel scaling (sampling multiple trajectories) looks good on paper (pass@K improves), but agents cannot reliably pick their own best answer. The 'verification gap' means real-world gains are minimal.

Models also showed 10 to 30% performance drops just from moving to a general-agent setting vs. domain-specific benchmarks.

Source: arXiv 2602.18998
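The parallel-scaling finding above can be made concrete with the standard unbiased pass@k estimator. A minimal sketch (the numbers below are hypothetical, and this is not the benchmark's own evaluation code):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k trajectories,
    drawn without replacement from n sampled trajectories of which c
    succeed, solves the task."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a success
    return 1.0 - comb(n - c, k) / comb(n, k)

# Coverage grows with k even at a low per-trajectory success rate,
# but the agent only benefits if it can select a good trajectory:
n, c = 50, 5                    # hypothetical: 10% of trajectories succeed
coverage = pass_at_k(n, c, 10)  # pass@10: what an oracle verifier would get
realized = c / n                # what picking a trajectory at random gets
print(f"pass@10 = {coverage:.2f}, random self-pick ~ {realized:.2f}")
```

Under these assumptions, the spread between the oracle's coverage and the agent's realized accuracy is exactly the "verification gap" the thread describes: pass@K improves on paper while delivered answers barely do.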

0 replies · 0 reposts · 1 like · 137 views
Xiaochuan Li reposted
Jiayi Geng@JiayiiGeng·
As long-horizon software engineering tasks grow in complexity, a single agent can no longer finish the tasks alone — effective multi-agent collaboration becomes necessary. This leads to a natural question: how can multiple agents be coordinated to asynchronously collaborate over a shared artifact in an effective way?

We answer this question in our new preprint: Effective Strategies for Asynchronous Software Engineering Agents! We suggest that to coordinate multiple software engineering agents, branch-and-merge is the key coordination mechanism, and that human SWE primitives like git worktree, git commit, and git merge are all you need to support it. (1/n)
14 replies · 83 reposts · 381 likes · 41.6K views
Xiaochuan Li reposted
Xuhui Zhou@nlpxuhui·
Creating user simulators is key to evaluating and training models for user-facing agentic applications. But are stronger LLMs better user simulators? TL;DR: not really. We ran the largest sim2real study for AI agents to date: 31 LLM simulators vs. 451 real humans across 165 tasks. Here's what we found (co-led with @sunweiwei12).
8 replies · 68 reposts · 285 likes · 31.6K views
Xiaochuan Li reposted
Jixuan Chen@chenjx210734·
🚀Excited to share that we have bridged Clawbot & SimWorld! 🧩We are motivated to move beyond isolated toy tasks and into a shared physical world with routines, interactions, and coordination. 🚧Lightweight setup: plug in your own agent easily!
SimWorld@simworld_ai

🤖Clawbots just moved into Embodied City inside SimWorld. They wake up. Go to work. Run errands. Talk to each other. All inside a shared physical world. This isn’t scripted — it’s autonomous agents living a daily routine. And you can spin up your own agent in minutes.

4 replies · 29 reposts · 85 likes · 55.5K views
Xiaochuan Li reposted
Emmy Liu@_emliu·
Midtraining is a new part of many training pipelines, but when does it help and when can it backfire? 🤔 In our new preprint, we use controlled experiments to pin this down. TL;DR: midtraining helps the most when it “bridges” pretraining and posttraining, and mitigates forgetting after posttraining. Timing is also very important. 🧵
5 replies · 88 reposts · 619 likes · 91.8K views
Xiaochuan Li@xiaochuanlee·
[7/8] Takeaway: “More compute at test time” is insufficient. The context ceiling and the verification gap are the bottlenecks to effective improvement. We hope the community can develop better context-management methods and improve model self-cognition to enable truly effective TTS.
1 reply · 0 reposts · 1 like · 136 views
Xiaochuan Li reposted
Chen Wu@ChenHenryWu·
1/⚠️ Parallel test-time scaling (e.g., pass@k) usually wastes compute - models often repeat the same dominant failure❌ How can we effectively generate creative solutions?

While typical methods such as increasing temperature 🌡️ usually fail, we put forward Mode-Conditioning (ModC) - a simple yet powerful training and test-time framework that allocates compute across diverse reasoning modes🎨 We show that ModC largely improves pass@k across SFT, distillation, and RL settings. With ModC, we get 4-8x efficiency gains in math reasoning using the same training data!
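Why allocating compute across modes helps pass@k can be sketched with a toy probability model. All numbers here are hypothetical illustrations, not the ModC training setup:

```python
def pass_at_k_iid(p: float, k: int) -> float:
    """pass@k for k independent samples, each succeeding with prob p."""
    return 1.0 - (1.0 - p) ** k

# Toy task: only reasoning mode B solves it (prob 0.5 per sample).
# An unconditioned sampler collapses to dominant mode A 90% of the time,
# so each of its samples succeeds with prob 0.1 * 0.5 = 0.05.
k = 8
baseline = pass_at_k_iid(0.10 * 0.5, k)  # all k samples, mostly mode A
# Mode-conditioned: explicitly spend half the budget on each mode,
# so k // 2 samples are guaranteed to come from mode B.
modc = pass_at_k_iid(0.5, k // 2)
print(f"baseline pass@{k} = {baseline:.2f}, mode-split pass@{k} = {modc:.2f}")
```

In this toy model, splitting the budget beats spending all of it on the dominant mode even though half the samples are "wasted" on mode A, which is the intuition behind conditioning compute on diverse reasoning modes.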
13 replies · 30 reposts · 133 likes · 24.5K views