Xudong Guo
@_traceur__
Researcher @Alibaba_Qwen | Ph.D. @Tsinghua_Uni | Prev. @Microsoft @UN @Princeton

🚀 Introducing DeepPlanning: a new benchmark for long-horizon agent planning in real-world scenarios.

Unlike step-by-step reasoning tasks, we focus on verifiable global constraints: time budgets, cost limits, and combinatorial optimization that must hold across the entire plan.

✈️ Multi-day travel w/ minute-level scheduling + hard time/budget caps
🛒 Complex shopping w/ coupon stacking & item bundling
🧠 Requires active info gathering, local constraint satisfaction & global optimality

Even GPT-5.2, Claude 4.5, Gemini & Qwen3 struggle significantly.

Perfect for evaluating Agent Planning / Tool Use / Long-Horizon Reasoning.

Paper: arxiv.org/pdf/2601.18137
Leaderboard: qwenlm.github.io/Qwen-Agent/en/…
Hugging Face Dataset: huggingface.co/datasets/Qwen/…
ModelScope Dataset: modelscope.cn/datasets/Qwen/…

🚀 Introducing Qwen3-Max-Thinking, our most capable reasoning model yet. Trained at massive scale with advanced RL, it delivers strong performance across reasoning, knowledge, tool use, and agent capabilities.

✨ Key innovations:
✅ Adaptive tool use: intelligently leverages Search, Memory & Code Interpreter without manual selection
✅ Test-time scaling: multi-round self-reflection beats Gemini 3 Pro on reasoning
✅ From complex math (98.0 on HMMT Feb) to agentic search (49.8 on HLE), it just thinks better.

🧠 Think deeper. Solve harder.

Try the adaptive reasoning experience now: chat.qwen.ai
Completions API: modelstudio.console.alibabacloud.com/ap-southeast-1…
Responses API: alibabacloud.com/help/en/model-…
Blog: qwen.ai/blog?id=qwen3-…

RoboFinals is Lightwheel's industrial-grade evaluation platform for measuring Embodied AI model capabilities beyond academic benchmarks. It enables faster iteration, bottleneck diagnosis, and reliable measurement of real capability gains as models move toward real-world deployment.

We’re excited to have @Alibaba_Qwen using RoboFinals for high-throughput, industry-aligned evaluation of its frontier embodied AI models. RoboFinals enables Qwen to rapidly iterate, diagnose bottlenecks, and measure real capability gains beyond academic benchmarks. Qwen is also a partner in stress-testing RoboFinals and shaping its evolution into an industry-standard benchmark for evaluating robotics foundation models.

#Lightwheel #Qwen #RoboFinals #EmbodiedAI #Robotics #Simulation #Evaluation

10,000,000 users creating with Qwen Chat, and we’re just getting started. From here, let’s begin: chat.qwen.ai 🚀