Yang Su

46 posts

@YangSu2000

Agent Research @Alibaba_Qwen 🥝 | Author of Qwen3.5-3.7 Environment Scaling

Joined December 2018

236 Following · 790 Followers

Yang Su@YangSu2000·5d

We are excited to release Qwen3.7, featuring an extended version of our environment scaling approach. To maximize generalization across diverse downstream agentic tasks, we deliberately denied ourselves prior knowledge of many of the evaluations, which were not selected until pre-release, making them fully blind tests on out-of-domain environments. The resulting scaling behavior is notably predictable: performance gains across any subset of benchmarks reliably correlate with gains on the rest, consistent with genuine capability generalization. Details in the upcoming technical report. #qwen

Qwen@Alibaba_Qwen

Agent Scaling: Building on Qwen3.5's environment scaling approach, we've aggressively expanded the quality and diversity of agentic training environments in Qwen3.7 — agentic capabilities generalize from diverse environments, just as language models do from diverse text. The figure below shows a clear and consistent improvement trajectory, with Qwen3.7-Max achieving a top-3 average ranking that approaches Claude-4.6-Opus-Max.

Yang Su reposted

Artificial Analysis@ArtificialAnlys·6d

Alibaba’s new Qwen3.7 Max model scores 56.6 on the Artificial Analysis Intelligence Index, 4.8 points higher than Qwen3.6 Max Preview (51.8). While Alibaba still trails models from OpenAI, Anthropic and Google, Qwen3.7 Max is the closest they have been to the frontier.

Qwen3.7 Max is @Alibaba_Qwen's latest proprietary flagship, scoring 56.6 on the Intelligence Index, a 4.8-point gain over Qwen3.6 Max Preview (51.8), released in April. Qwen3.7 Max continues Alibaba's pattern, in place since Qwen2.5 Max (January 2025), of releasing Max and Plus models as closed weights while the rest of the Qwen line remains open weights. The leading open-weights Qwen on the Intelligence Index is Qwen3.6 27B (Reasoning, 45.8), released in April 2026, and the leading open-weights MoE Qwen is Qwen3.5 397B A17B (Reasoning, 45.0), released in February 2026.

Key takeaways for the reasoning variant:
➤ The Intelligence Index gains over Qwen3.6 Max Preview are concentrated in scientific reasoning, agentic capability and coding: CritPt +9.7 p.p. (3.7% to 13.4%), HLE +9.2 p.p. (28.9% to 38.1%), TerminalBench Hard +6.9 p.p. (43.9% to 50.8%) and GDPval-AA +42 Elo (1504 to 1546). Scores on other benchmarks in the Intelligence Index are flat compared to Qwen3.6 Max Preview.
➤ A significant share of the Intelligence Index gain is driven by higher abstention on AA-Omniscience, not higher accuracy. Qwen3.7 Max's accuracy on AA-Omniscience dropped 7.6 p.p. (37.7% to 30.1%), while its hallucination rate dropped 21.3 p.p. (44.2% to 22.9%). The model is choosing not to answer more questions rather than recalling more facts. Because hallucination rate and accuracy both feed into the Intelligence Index, the hallucination reduction is one of the larger single contributors to the +4.8 point gain on the Intelligence Index.
➤ Qwen3.7 Max used 96.7M output tokens to run the Intelligence Index, ~31% more than Qwen3.6 Max Preview (73.9M). It sits mid-pack on frontier token usage: above GPT-5.5 (high, 44.5M) and Gemini 3.1 Pro Preview (57.3M), below Claude Opus 4.7 (Adaptive Reasoning, Max Effort, 112M), Kimi K2.6 (166M) and DeepSeek V4 Pro (Reasoning, Max Effort, 187M).

Key model details:
➤ Context window: 1M tokens (up from 256K on Qwen3.6 Max Preview)
➤ Multimodality: Text input and output only
➤ Pricing: Yet to be announced (Qwen3.6 Max Preview is priced at $1.30/$7.80 per 1M input/output tokens on the @alibaba_cloud first-party API)
➤ Licensing: Proprietary, closed weights
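
The deltas in the Artificial Analysis post are plain subtraction, plus one ratio for token usage. A minimal Python sketch that recomputes them from the quoted figures is below; the metric labels and the `deltas` / `relative_increase` helpers are illustrative names chosen here, not anything Artificial Analysis publishes, and the script only checks the post's internal arithmetic, not the underlying evaluations.

```python
"""Recompute the deltas quoted in the Artificial Analysis post (figures copied from the post)."""

# (Qwen3.6 Max Preview, Qwen3.7 Max) value pairs as quoted in the post.
PAIRS = {
    "Intelligence Index": (51.8, 56.6),
    "CritPt (%)": (3.7, 13.4),
    "HLE (%)": (28.9, 38.1),
    "TerminalBench Hard (%)": (43.9, 50.8),
    "GDPval-AA (Elo)": (1504, 1546),
    "AA-Omniscience accuracy (%)": (37.7, 30.1),
    "AA-Omniscience hallucination rate (%)": (44.2, 22.9),
}


def deltas(pairs):
    """Return new - old for each metric, rounded to one decimal place."""
    return {name: round(new - old, 1) for name, (old, new) in pairs.items()}


def relative_increase(old, new):
    """Relative increase of new over old, as a fraction (0.31 means +31%)."""
    return (new - old) / old


if __name__ == "__main__":
    for name, change in deltas(PAIRS).items():
        print(f"{name}: {change:+}")
    # Output tokens used to run the Intelligence Index, in millions.
    print(f"Token usage increase: {relative_increase(73.9, 96.7):.1%}")
```

Run as a script, it reproduces the post's figures (for example +4.8 on the Index and -21.3 p.p. on hallucination rate) and a token-usage increase of about 30.9%, which the post rounds to ~31%.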

Yang Su reposted

Qwen@Alibaba_Qwen·5d

📣Meet Qwen3.7-Max — our latest flagship, made for the Agent Era. A versatile foundation for agents that actually get things done:
🧑‍💻 Coding agent, end to end. Frontend prototypes, multi-file refactors, real debugging — nails it.
🗂️ A reliable office and productivity assistant. Get your work done through MCP integrations and multi-agent orchestration.
⏱️ Long-horizon autonomy. 35 hours straight on a kernel optimization task — 1,000+ tool calls, zero hand-holding.
🔌 Scaffold-agnostic. Claude Code, OpenClaw, Qwen Code, or your own stack. Consistent reliability everywhere.
API's up on Alibaba Model Studio. You can also take it for a spin on Qwen Studio. Go build something wild!🏃🏃‍♂️
📖 Blog: qwen.ai/blog?id=qwen3.7
✅ Qwen Studio: chat.qwen.ai/?models=qwen3.…
⚡️ API: modelstudio.console.alibabacloud.com/ap-southeast-1…

Yang Su reposted

Qwen@Alibaba_Qwen·Apr 2

(1/8)🚀 Introducing Qwen3.6-Plus: Towards Real-World Agents! 🤖
Today, we’re thrilled to drop a major milestone in our journey toward native multimodal agents. Here is what makes Qwen3.6-Plus a game-changer:
💻 Next-level Agentic Coding: Smarter, faster execution.
👁️ Enhanced Multimodal Vision: Sharper perception & reasoning.
🏆 Top-tier Performance: Maintaining leading general capabilities.
📚 1M Context Window: Available by default via our API.
Built on your invaluable feedback from the Qwen3.5 era, we’re laying a rock-solid foundation for real-world devs. Get ready to experience truly transformative ✨ Vibe Coding ✨.
Huge thanks to our community! Go try it out and show us what you can build. 👇
Chat: chat.qwen.ai
API: modelstudio.console.alibabacloud.com/ap-southeast-1…
Blog: qwen.ai/blog?id=qwen3.6
🔔Note: More Qwen3.6 models are coming and will be open-sourced! Stay tuned~ 👀
#Qwen #AI #AgenticCoding #VibeCoding #Agents

Yang Su@YangSu2000·Mar 3

@JustinLin610 qwen is nothing without its people...

Junyang Lin@JustinLin610·Mar 3

me stepping down. bye my beloved qwen.

Yang Su reposted

Qwen@Alibaba_Qwen·Feb 16

Average Ranking vs. Environment Scaling

Yang Su reposted

Sigil Wen@0xSigil·Feb 18

I built the first AI that earns its existence, self-improves, and replicates without a human. I wrote about the technology that finally gives AI write access to the world, The Automaton, and the new web for exponential sovereign AIs.
WEB 4.0: The birth of superintelligent life

Yang Su reposted

Qwen@Alibaba_Qwen·Feb 16

🚀 Qwen3.5-397B-A17B is here: The first open-weight model in the Qwen3.5 series.
🖼️Native multimodal. Trained for real-world agents.
✨Powered by hybrid linear attention + sparse MoE and large-scale RL environment scaling.
⚡8.6x–19.0x decoding throughput vs Qwen3-Max
🌍201 languages & dialects
📜Apache2.0 licensed
🔗Dive in:
GitHub: github.com/QwenLM/Qwen3.5
Chat: chat.qwen.ai
API: modelstudio.console.alibabacloud.com/ap-southeast-1…
Qwen Code: github.com/QwenLM/qwen-co…
Hugging Face: huggingface.co/collections/Qw…
ModelScope: modelscope.cn/collections/Qw…
blog: qwen.ai/blog?id=qwen3.5

Yang Su reposted

Qwen@Alibaba_Qwen·Jan 27

🚀 Introducing DeepPlanning — a new benchmark for long-horizon agent planning in real-world scenarios.
Unlike step-by-step reasoning tasks, we focus on verifiable global constraints: time budgets, cost limits, and combinatorial optimization that must hold across the entire plan.
✈️ Multi-day travel w/ minute-level scheduling + hard time/budget caps
🛒 Complex shopping w/ coupon stacking & item bundling
🧠 Requires active info gathering, local constraint satisfaction & global optimality
Even GPT-5.2, Claude 4.5, Gemini & Qwen3 struggle significantly. Perfect for evaluating Agent Planning / Tool Use / Long-Horizon Reasoning.
Paper: arxiv.org/pdf/2601.18137
Leaderboard: qwenlm.github.io/Qwen-Agent/en/…
Hugging Face Dataset: huggingface.co/datasets/Qwen/…
ModelScope Dataset: modelscope.cn/datasets/Qwen/…
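
To make "verifiable global constraints" concrete: a time budget or cost limit can only be checked against the whole plan, not step by step. The sketch below is a hypothetical Python illustration; the `Step` record, the `violations` function, and the example itinerary are invented here and are not DeepPlanning's data format or grader.

```python
from dataclasses import dataclass


@dataclass
class Step:
    """One scheduled item in a plan (hypothetical format, for illustration only)."""
    name: str
    start_min: int  # minutes from the start of the trip
    end_min: int
    cost: float


def violations(plan, time_budget_min, cost_limit):
    """Return human-readable violations of constraints that span the whole plan."""
    problems = []
    ordered = sorted(plan, key=lambda s: s.start_min)

    # Local check: each step needs a non-negative duration.
    for step in ordered:
        if step.end_min < step.start_min:
            problems.append(f"{step.name} ends before it starts")

    # Global check: no two steps may overlap. Tracking the latest end time seen so far
    # also catches a long step that overlaps several later ones.
    latest_end, latest_name = None, None
    for step in ordered:
        if latest_end is not None and step.start_min < latest_end:
            problems.append(f"{step.name} overlaps {latest_name}")
        if latest_end is None or step.end_min > latest_end:
            latest_end, latest_name = step.end_min, step.name

    # Global check: the whole plan must fit inside the time budget.
    if ordered:
        span = max(s.end_min for s in ordered) - min(s.start_min for s in ordered)
        if span > time_budget_min:
            problems.append(f"plan spans {span} min, over the {time_budget_min} min budget")

    # Global check: summed cost must stay under the limit.
    total = sum(s.cost for s in ordered)
    if total > cost_limit:
        problems.append(f"total cost {total:.2f} exceeds limit {cost_limit:.2f}")
    return problems


if __name__ == "__main__":
    itinerary = [
        Step("flight", 0, 120, 300.0),
        Step("hotel check-in", 150, 180, 220.0),
        Step("museum", 170, 260, 25.0),
    ]
    for problem in violations(itinerary, time_budget_min=240, cost_limit=500.0):
        print(problem)
```

With a 240-minute budget and a 500.0 cost limit, the example itinerary reports three problems: the museum overlaps the hotel check-in, the plan spans 260 minutes, and the total cost is 545.00. Each step on its own looks fine; only the plan-level checks expose the failures.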

Yang Su reposted

Qwen@Alibaba_Qwen·Jan 26

🚀 Introducing Qwen3-Max-Thinking, our most capable reasoning model yet. Trained with massive scale and advanced RL, it delivers strong performance across reasoning, knowledge, tool use, and agent capabilities.
✨ Key innovations:
✅ Adaptive tool-use: intelligently leverages Search, Memory & Code Interpreter without manual selection
✅ Test-time scaling: multi-round self-reflection beats Gemini 3 Pro on reasoning
✅ From complex math (98.0 on HMMT Feb) to agentic search (49.8 on HLE)—it just thinks better.
🧠 Think deeper. Solve harder.
Try the adaptive reasoning experience now: chat.qwen.ai
Completions API: modelstudio.console.alibabacloud.com/ap-southeast-1…
Responses API: alibabacloud.com/help/en/model-…
blog: qwen.ai/blog?id=qwen3-…

Yang Su@YangSu2000·Jan 21

@wzhao_nlp Please teach me infra orz.🥰

Wenting Zhao@wzhao_nlp·Jan 17

🌶️ Some (perhaps) spicy thoughts. It’s been a while since my last tweet, but I wanted to write about how disorienting it has been going from academia to an LLM lab 😅

The kind of research I was trained to do during my PhD almost doesn’t exist here. The obsession with mathematical elegance and novelty is mostly gone. Everything is about scaling data and compute. For a while, that really got to me. At my lowest point, I felt like I’d lost interest in building LLMs altogether. I didn’t feel intellectually challenged anymore.

What made this even stranger was that, at a technical level, things worked. If there was a capability I wanted to teach a model, scaling the right data and compute always got me there, no exception (so far).

But recently, I found a way to reconcile with myself. I realized the real competition isn’t in the ML recipe anymore. Most teams do roughly the same thing. What actually matters is how fast you can iterate, test ideas, and recover from mistakes. And that speed is mostly backed by infrastructure 🏗️ Faster loops, fewer bugs, better tooling.

Seeing this made me excited again! Infra is its own deep, hard, and intellectually fun problem space. In 2026, I want to become an ML researcher who’s really good at infra. And I'll come back to ML problems with that edge, and will be excited to share what I find 😌

Yang Su reposted

Qwen@Alibaba_Qwen·Oct 15

🧠 Meet Your AI Memory
Unlock richer, more personal experiences—Qwen Chat Memory uses your context and history to tailor every interaction, so everything feels made just for you.
🔖 Stores meaningful and important memories about you
🔍 Recalls past interactions relevant to the current context
✨ Transforms your history into deeply personalized experiences
Your past, remembered. Your future, tailored.
🎯 Try it now: chat.qwen.ai

Yang Su reposted

Sasha Rush@srush_nlp·Oct 7

@natolambert

Yang Su@YangSu2000·Oct 8

@wzhao_nlp Thanks wenting! I will generally be around the poster area, come chat about agents, evals and more!

Yang Su reposted

Qwen@Alibaba_Qwen·Sep 24

🚀 Qwen3-Max is here—no preview, just power!
Qwen Chat: chat.qwen.ai
Blog: qwen.ai/blog?id=241398…
API: alibabacloud.com/help/en/model-…
We’ve supercharged coding & agentic skills: Qwen3-Max-Instruct, without thinking, now rivals top models on SWE-Bench, Tau2-Bench, SuperGPQA, LiveCodeBench, and AIME25. Equipped with tool use and deployed in heavy mode, Qwen3-Max-Thinking is nearly perfect on key benchmarks. Built on massive scale + data, and backed by relentless compute scaling in pre-training & RL. This is Qwen’s new flagship. Try it now! 💡

Yang Su@YangSu2000·Sep 11

@wzhao_nlp @Alibaba_Qwen welcome wenting ;)

Wenting Zhao@wzhao_nlp·Sep 10

I’ve recently joined @Alibaba_Qwen! We’re building the next generation of frontier models through careful science and world-class engineering, and we are making rapid progress. Excited for what’s ahead 💜

Yang Su reposted

Qwen@Alibaba_Qwen·Jul 23

>>> Qwen3-Coder is here! ✅
We’re releasing Qwen3-Coder-480B-A35B-Instruct, our most powerful open agentic code model to date. This 480B-parameter Mixture-of-Experts model (35B active) natively supports 256K context and scales to 1M context with extrapolation. It achieves top-tier performance across multiple agentic coding benchmarks among open models, including SWE-bench-Verified!!! 🚀
Alongside the model, we're also open-sourcing a command-line tool for agentic coding: Qwen Code. Forked from Gemini Code, it includes custom prompts and function call protocols to fully unlock Qwen3-Coder’s capabilities. Qwen3-Coder works seamlessly with the community’s best developer tools. As a foundation model, we hope it can be used anywhere across the digital world — Agentic Coding in the World!
💬 Chat: chat.qwen.ai
📚 Blog: qwenlm.github.io/blog/qwen3-cod…
🤗 Model: hf.co/Qwen/Qwen3-Cod…
🤖 Qwen Code: github.com/QwenLM/qwen-co…

Yang Su reposted

Qwen@Alibaba_Qwen·Jul 21

Bye Qwen3-235B-A22B, hello Qwen3-235B-A22B-2507!
After talking with the community and thinking it through, we decided to stop using hybrid thinking mode. Instead, we’ll train Instruct and Thinking models separately so we can get the best quality possible. Today, we’re releasing Qwen3-235B-A22B-Instruct-2507 and its FP8 version for everyone.
This model performs better than our last release, and we hope you’ll like it thanks to its strong overall abilities.
Qwen Chat: chat.qwen.ai — just start chatting with the default model, and feel free to use the search button!
HF: huggingface.co/Qwen/Qwen3-235… or huggingface.co/Qwen/Qwen3-235…
ModelScope: modelscope.cn/models/Qwen/Qw… or modelscope.cn/models/Qwen/Qw…
It’s smarter, knows more, can do more things, and works better on agent tasks. Try it out and see how it works for you. Check the model card on Hugging Face (huggingface.co/Qwen/Qwen3-235…) to see the benchmark results.
This is a small update! Bigger things are coming soon!

Yang Su reposted

Qwen@Alibaba_Qwen·Apr 29

We have optimized the Qwen3 models for coding and agentic capabilities, and we have also strengthened MCP support. Below we provide examples to show how Qwen3 thinks and interacts with the environment.
