Xiaochuan Li

51 posts

@xiaochuanlee

Ph.D. student @LTIatCMU, working with @XiongChenyan. Previously @Tsinghua_Uni @Alibaba_Qwen

Joined June 2022
256 Following · 135 Followers
Pinned Tweet
Xiaochuan Li@xiaochuanlee·
Agentic test-time scaling (TTS) is effective -- until you hit its inherent limits. 💡We show that classic TTS methods offer limited practical gains due to two fundamental limitations: the context ceiling and the verification gap. 🧵 Check the website: general-agentbench.github.io
2 replies · 11 reposts · 27 likes · 2.2K views
Xiaochuan Li reposted
Shibo Hao@Ber18791531·
What makes us humans general agents? 🤔 It’s probably not that we’ve mastered every app or UI. New tools appear constantly, yet we become productive quickly. What really transfers are our cognitive abilities: how we perceive, reason, and use memory. 🧠

Introducing CocoaBench, an evaluation framework for general agents with compositional cognitive abilities. 👉 Features

1⃣ Complex, realistic tasks🧩
• Human-crafted tasks that are long-horizon, understandable, and span diverse scenarios and domains
• Assume only a small set of general tools (browser, terminal, file system; no per-task APIs)
• Challenging for existing agent systems: ChatGPT Agent reaches only a 44% success rate
• Check out our example tasks on the website (link in reply) and see if you can solve them. They’re fun 🙂

2⃣ Covering diverse cognitive abilities 🧠
CocoaBench covers different choices in the following dimensions of cognitive architecture:
• Perception: how the agent gathers and preprocesses information from websites, terminal outputs, files, and images
• Reasoning: planning, deductive/inductive/abductive inference, and both symbolic & visual reasoning skills
• Memory: managing working memory in the long horizon, saving and loading procedural memory (skills), etc.

3⃣ CocoaAgent framework🛠️
• Seamless integration with the AIO Sandbox for isolated browser / terminal / file operations
• Easy to plug in any models or your own custom agent
• Flexible evaluation functions

This is just our first release, and we’re actively expanding CocoaBench with more tasks, analyses, and agents. Excited to see what the community builds on top of it! Project website + more details in the thread👇
1 reply · 22 reposts · 57 likes · 16.5K views
Xuhui Zhou@nlpxuhui·
Thrilled to be a 2026 @MSFTResearch PhD Fellow! Huge thanks to my advisor @MaartenSap , collaborators, and friends ❤️❤️ Missions ahead 🫡
Microsoft Research@MSFTResearch

Today, we welcome the 2026 Microsoft Research Fellowship cohort, an inspiring global community of fellows and advisors helping to shape what’s next across science, technology, and society. Join us in celebrating this year’s recipients: msft.it/6013Q45bX

These contributions span the following themes:
• AI for global and societal impact
• AI fundamentals: scalable reasoning, model adaptation and evaluation
• Biological and scientific modeling
• Foundational systems & infrastructure for AI
• Human-AI collaboration and interaction
• Multimodal & embodied intelligence

9 replies · 3 reposts · 103 likes · 8.8K views
Xiaochuan Li@xiaochuanlee·
Thanks for sharing our work! If you are interested in the “upper bound of test-time scaling”, we welcome you to read it and give feedback!
Guilherme Favaron@guifav

Test-time scaling is the current bet for making LLM agents smarter: just give them more compute at inference. But does it actually work for general-purpose agents?

A new benchmark from @XiongChenyan and Xiaochuan Li at @CMU_LTI, with collaborators from @MetaAI, tested 10 leading agents across search, coding, reasoning, and tool use in a unified setting.

Two findings that should concern anyone building agent products:
1. Sequential scaling (longer interactions) hits a 'context ceiling' around 96K to 112K tokens. Beyond that, agents destabilize: more rounds of interaction make them worse, not better.
2. Parallel scaling (sampling multiple trajectories) looks good on paper (pass@K improves), but agents cannot reliably pick their own best answer. The 'verification gap' means real-world gains are minimal.

Models also showed 10 to 30% performance drops just from moving to a general-agent setting vs. domain-specific benchmarks.

Source: arXiv 2602.18998
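The parallel-scaling finding above can be made concrete with the standard unbiased pass@k estimator. A minimal sketch (the numbers below are hypothetical, and this is not the benchmark's own evaluation code):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k trajectories,
    drawn without replacement from n sampled trajectories of which c
    succeed, solves the task."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a success
    return 1.0 - comb(n - c, k) / comb(n, k)

# Coverage grows with k even at a low per-trajectory success rate,
# but the agent only benefits if it can select a good trajectory:
n, c = 50, 5                    # hypothetical: 10% of trajectories succeed
coverage = pass_at_k(n, c, 10)  # pass@10: what an oracle verifier would get
realized = c / n                # what picking a trajectory at random gets
print(f"pass@10 = {coverage:.2f}, random self-pick ~ {realized:.2f}")
```

Under these assumptions, the spread between the oracle's coverage and the agent's realized accuracy is exactly the "verification gap" the thread describes: pass@K improves on paper while delivered answers barely do.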

0 replies · 0 reposts · 1 like · 137 views
Xiaochuan Li reposted
Jiayi Geng@JiayiiGeng·
As long-horizon software engineering tasks grow in complexity, a single agent can no longer finish the tasks alone — effective multi-agent collaboration becomes necessary. This leads to a natural question: how can multiple agents be coordinated to asynchronously collaborate over a shared artifact in an effective way?

We answer this question in our new preprint: Effective Strategies for Asynchronous Software Engineering Agents! We suggest that to coordinate multiple software engineering agents, branch-and-merge is the key coordination mechanism, and that human SWE primitives like git worktree, git commit, and git merge are all you need to support it. (1/n)
14 replies · 83 reposts · 381 likes · 41.6K views
Xiaochuan Li reposted
Xuhui Zhou@nlpxuhui·
Creating user simulators is key to evaluating and training models for user-facing agentic applications. But are stronger LLMs better user simulators? TL;DR: not really. We ran the largest sim2real study for AI agents to date: 31 LLM simulators vs. 451 real humans across 165 tasks. Here's what we found (co-led with @sunweiwei12).
8 replies · 68 reposts · 285 likes · 31.6K views
Xiaochuan Li reposted
Jixuan Chen@chenjx210734·
🚀Excited to share that we have bridged Clawbot & SimWorld! 🧩We are motivated to move beyond isolated toy tasks and into a shared physical world with routines, interactions, and coordination. 🚧Lightweight setup: plug in your own agent easily!
SimWorld@simworld_ai

🤖Clawbots just moved into Embodied City inside SimWorld. They wake up. Go to work. Run errands. Talk to each other. All inside a shared physical world. This isn’t scripted — it’s autonomous agents living a daily routine. And you can spin up your own agent in minutes.

4 replies · 29 reposts · 85 likes · 55.5K views
Xiaochuan Li reposted
Emmy Liu@_emliu·
Midtraining is a new part of many training pipelines, but when does it help and when can it backfire? 🤔 In our new preprint, we use controlled experiments to pin this down. TL;DR: midtraining helps the most when it “bridges” pretraining and posttraining, and mitigates forgetting after posttraining. Timing is also very important. 🧵
5 replies · 88 reposts · 619 likes · 91.8K views
Xiaochuan Li@xiaochuanlee·
[7/8] Takeaway: “More compute at test time” is insufficient. The context ceiling and the verification gap are the bottlenecks to effective improvement. We hope the community can develop better context-management methods and improve model self-cognition to enable truly effective TTS.
1 reply · 0 reposts · 1 like · 136 views
Xiaochuan Li reposted
Chen Wu@ChenHenryWu·
1/⚠️ Parallel test-time scaling (e.g., pass@k) usually wastes compute - models often repeat the same dominant failure❌ How can we effectively generate creative solutions?

While typical methods such as increasing temperature 🌡️ usually fail, we put forward Mode-Conditioning (ModC) - a simple yet powerful training and test-time framework that allocates compute across diverse reasoning modes🎨 We show that ModC largely improves pass@k across SFT, distillation, and RL settings. With ModC, we get 4-8x efficiency gains in math reasoning using the same training data!
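Why allocating compute across modes helps pass@k can be sketched with a toy probability model. All numbers here are hypothetical illustrations, not the ModC training setup:

```python
def pass_at_k_iid(p: float, k: int) -> float:
    """pass@k for k independent samples, each succeeding with prob p."""
    return 1.0 - (1.0 - p) ** k

# Toy task: only reasoning mode B solves it (prob 0.5 per sample).
# An unconditioned sampler collapses to dominant mode A 90% of the time,
# so each of its samples succeeds with prob 0.1 * 0.5 = 0.05.
k = 8
baseline = pass_at_k_iid(0.10 * 0.5, k)  # all k samples, mostly mode A
# Mode-conditioned: explicitly spend half the budget on each mode,
# so k // 2 samples are guaranteed to come from mode B.
modc = pass_at_k_iid(0.5, k // 2)
print(f"baseline pass@{k} = {baseline:.2f}, mode-split pass@{k} = {modc:.2f}")
```

In this toy model, splitting the budget beats spending all of it on the dominant mode even though half the samples are "wasted" on mode A, which is the intuition behind conditioning compute on diverse reasoning modes.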
13 replies · 30 reposts · 133 likes · 24.5K views