Junlin Yang

228 posts

Junlin Yang

@junlin45300

Incoming PhD @Tsinghua_Uni advised by Bowen Zhou and @stingning Research intern @haopeng_uiuc @taoyds

Katılım Nisan 2024

1.1K Takip Edilen143 Takipçiler

Junlin Yang@junlin45300·13h

Towards unified digital agents! 👊 Once again, this shows that code has already become a powerful and natural interface for digital agents.

Shibo Hao@Ber18791531

🍫 CocoaBench v1.0 is out! CocoaBench is a benchmark for unified digital agents, built around open-world tasks that require composing 💻 coding, 👀 vision, 🌐 search. Since our first research preview last December, we have expanded the benchmark substantially with community contributed tasks, and spent months testing and refining the tasks, evaluations, and agent runs. Some takeaways: • Even the best agent system reaches only 45.1% on CocoaBench v1.0. • Coding agents like Codex are already surprisingly strong on general tasks beyond software engineering. • Stronger agents tend to push more of the work into code. • Open source models still lag behind leading frontier models on these general tasks. 👇More on the website and in the paper #AI #Agents #LLM #Benchmark #CocoaBench

English

138

Junlin Yang retweetledi

Anthropic@AnthropicAI·7 Nis

Introducing Project Glasswing: an urgent initiative to help secure the world’s most critical software. It’s powered by our newest frontier model, Claude Mythos Preview, which can find software vulnerabilities better than all but the most skilled humans. anthropic.com/glasswing

English

6.6K

43.7K

30.5M

Junlin Yang retweetledi

Ning Ding@stingning·6d

Surviving scale is becoming a real problem. Will AI labs eat everything? Does each domain still need its own answer or just dissolve in the scaling pot?

Anthropic@AnthropicAI

English

673

Junlin Yang@junlin45300·28 Mar

In 2026, coders stopped being synchronous executors and became **async schedulers**: less time doing every step themselves, more time organizing context, allocating attention, and validating outputs.

English

Junlin Yang@junlin45300·26 Mar

@iScienceLuvr 👀

QME

Tanishq Mathew Abraham, Ph.D.@iScienceLuvr·26 Mar

if you can reply to this, you're awesome and cool :)

English

170

359

42.7K

Junlin Yang retweetledi

Ning Ding@stingning·22 Mar

The weight of a yes is measured by the no’s before it.

WeChat@Weixin_WeChat

Today, we are officially opening the capability to integrate #OpenClaw into #Weixin. With the launch of the #WeixinClawBot, users can use Weixin as a dedicated messaging channel for OpenClaw. Now, you can send and receive messages with OpenClaw just like texting a friend. #AIAutomation #AI

English

2.1K

Junlin Yang retweetledi

Sebastian Raschka@rasbt·15 Mar

I (finally) put together a new LLM Architecture Gallery that collects the architecture figures all in one place! sebastianraschka.com/llm-architectu…

English

202

1.4K

8.2K

720.4K

Junlin Yang retweetledi

Bingxiang He@HBX_hbx·10 Mar

✨ [ICLR 2026] How Far Can Unsupervised RLVR Scale LLM Training? The dream: models can improve themselves without human supervision. The reality: sometimes they can only sharpen what they already believe. Intrinsic rewards struggle to scale LLM training because they follow a rise-then-fall pattern that makes collapse mathematically inevitable. But that's not the end of the story. We find unsupervised RLVR (URLVR) is particularly well-suited for test-time training and quantifying model priors. The full picture 👇 📄 Paper: arxiv.org/abs/2603.08660 🧪 GitHub: github.com/PRIME-RL/TTRL (1/n)

English

6.8K

Junlin Yang@junlin45300·25 Şub

@xiaochuanlee 🔥

QME

Xiaochuan Li@xiaochuanlee·24 Şub

Agentic test-time scaling (TTS) is effective -- until you find its inherent limits. 💡We show that classic TTS methods offered limited practical gains due to two fundamental limitations: the context ceiling and the verification gap. 🧵 Check the website: general-agentbench.github.io

English

2.2K

Junlin Yang@junlin45300·19 Şub

@stingning @OpenAI @openclaw @steipete Totally agree! It’s getting clear that systems/evals are backward-shaping how models evolve. As we enter AI's 'second half,' the power to define what a 'good problem' or 'good application' actually looks like is really important.

English

154

Junlin Yang retweetledi

Amy Tam@amytam01·17 Şub

x.com/i/article/2023…

ZXX

186

819

7.5K

2.7M

Junlin Yang@junlin45300·18 Şub

@JunliWang2021 You’re advised by GOATS

English

Junli Wang@JunliWang2021·18 Şub

Goat

Hao Zhang@haozhangml

Can’t believe I get to say this -- deeply honored to be named a 2026 Sloan Research Fellow: today.ucsd.edu/story/2026-slo… Early faculty life is… "hyper-intense": teaching, advising, hiring, papers, grants; and trying to build a lab culture you’ll still be proud of years later. There were many weeks where it felt like we were building the plane mid-flight, burning plenty of midnight oil along the way. Over the past few years, I’ve been incredibly lucky to work with amazing students and collaborators on a chain of OSS project: Vicuna → Chatbot Arena → vLLM → DistServe → LMGame → FastVideo; each one then pushed forward way further by people far beyond our lab. This award feels less like a finish line and more like fuel for the lab, for our students, and for the next set of systems we haven’t built yet. A core principle of us is building "open-source research that ships." At the same time, it’s hard not to feel a mix of excitement + uncertainty + anxiety about where CS is heading. Coding agents are improving so fast that I am feeling the AGI first-handedly. I have gone back to builder mode -- only more productive than ever -- outside of my faculty admin work. I’ve watched friends and colleagues hit numbers that would’ve sounded like science fiction a year ago (e.g., 100+ commits/day). So what does it mean to “do great computer science” when baseline productivity keeps jumping? For me, it makes “research that ships” more important, and even raises the bar. The leverage shifts toward taste and problem selection, principled system design, and translating ideas into reliable artifacts. We're excited to keep proving that through real systems people can use! Deeply grateful to: - My students and collaborators — for the ideas, execution, and drive. - @HDSIUCSD , Dean @GuptaUcsd, and my @UCSanDiego colleagues — for building an environment where ambitious work can happen. - @nvidia and @mbzuai (and other compute sponsors) — for support that helped us move faster and turn ideas into real artifacts. Even as the interface changes, the need for efficient compute and solid infrastructure only grows. Most of all: credit to the students at @haoailab. You’re the reason any of this is worth doing. Keep building and shipping!

English

277

Junlin Yang@junlin45300·16 Şub

Exactly

Ning Ding@stingning

When your project gets too good, your next PR is either a legal letter or a payroll.

English

171

Junlin Yang retweetledi

Ning Ding@stingning·15 Şub

Bytedance is being restrained in marketing Seed 2.0 with almost zero hype. But reality is clear: it is a globally top-tier model. No qualifiers needed.

English

191

16.5K

Junlin Yang retweetledi

Yuxuan Li@YuxuanL_·13 Şub

🚨New paper: "What Makes LLM Agent Simulations Useful for Policy Practice?" @simile_ai is making a great launch, but can these LLM agent simulations actually help real institutions make better decisions? We spent a year working with policymakers to answer this simple question. The answer is yes—but perhaps not how you'd expect. 👇 THREAD 👇 [Link to paper: arxiv.org/abs/2509.21868] [1/n]

English

191

14.8K

Junlin Yang@junlin45300·12 Şub

"Collective reasoning under distributed information" is a fascinating and under-explored topic. It represents a core capability for **real-world multi-agent collaboration** in open environments, and this work provides an excellent evaluation for it!

Yuxuan Li@YuxuanL_

We’ve expanded our benchmark into HiddenBench: a 65-task, theory-grounded, extensible benchmark for evaluating collective reasoning under distributed information. We tested 15 frontier models (🚨spoiler🚨: Gemini is the clear winner) and uncovered key bottlenecks in multi-agent LLMs coordination. Check out the full update on arXiv! 📄 Paper: arxiv.org/abs/2505.11556

English

108

Junlin Yang retweetledi

Weize Chen@JeffreyChen_THU·9 Şub

Everyone is talking about Self-Evolving 🤖. But here’s the hard question: How do we actually evaluate it? Are models truly learning new skills, or just recalling? 🚀We propose a new benchmark SE-Bench, targeting a core primitive of evolution: Knowledge Internalization. 🧵

English

145

11.6K

Junlin Yang retweetledi

Junli Wang@JunliWang2021·3 Şub

Finally it's here!

Qwen@Alibaba_Qwen

🚀 Introducing Qwen3-Coder-Next, an open-weight LM built for coding agents & local development. What’s new: 🤖 Scaling agentic training: 800K verifiable tasks + executable envs 📈 Efficiency–Performance Tradeoff: achieves strong results on SWE-Bench Pro with 80B total params and 3B active ✨ Supports OpenClaw, Qwen Code, Claude Code, web dev, browser use, Cline, etc 🤗 Hugging Face: huggingface.co/collections/Qw… 🤖 ModelScope: modelscope.cn/collections/Qw… 📝 Blog: qwen.ai/blog?id=qwen3-… 📄 Tech report: github.com/QwenLM/Qwen3-C…

English

772

Junlin Yang@junlin45300·27 Oca

The era of AI is an unprecedentedly contradictory era.

Ning Ding@stingning

𝗧𝗵𝗲 𝗙𝗹𝗮𝗶𝗹𝗶𝗻𝗴 𝗔𝗴𝗲 Although AI has long been the defining theme of this era, watching it recently sweep through people’s minds with such overwhelming force still feels deeply worth reflecting on. Everyone is anxious; everyone is in FOMO; everyone is flailing to catch something (or a Mac Mini); everyone is searching for meaning. It has introduced so many counterintuitive and conflicting realities. - AI is the most powerful technology ever built is also the most accessible. - It enables everyone to do more, yet leaves everyone uncertain about what they’re still needed for - It multiplies productivity while making “production” itself feel questionable. - It makes knowledge nearly free, yet makes “knowing things” nearly worthless. - A decade of accumulated expertise may matter less than a tool released next month. - It turns “seniority” from an asset into baggage, and “inexperience” from a disadvantage into a neutral trait. - It offers unlimited companionship while making us acutely aware of what “not human” means. - Its outputs appear supremely confident, yet no one can fully predict what it will say. - It makes many things more controllable while making the future itself less so. - It distributes capability to everyone, yet concentrates real power in fewer hands.

English

154

Junlin Yang@junlin45300·21 Oca

It's great to see scaling laws for compute allocation in LLM RL! Solid Experiments!

Zhoujun (Jorge) Cheng@ChengZhoujun

Pretraining has scaling laws to guide compute allocation. But for RL on LLMs, we lack a practical guide on how to spend compute wisely. We show the optimal compute allocation in LLM RL scales predictably. ↓ Key takeaways below

English

324

Keşfet

@iScienceLuvr @xiaochuanlee @stingning @OpenAI @openclaw @steipete @JunliWang2021 @simile_ai