Junlin Yang

230 posts

@junlin45300

Incoming PhD @Tsinghua_Uni advised by Bowen Zhou and @stingning Research intern @haopeng_uiuc @taoyds

Joined April 2024
1.1K Following · 143 Followers
Junlin Yang reposted
Ofir Press @OfirPress
code is everything and everything is code
3 replies · 1 repost · 10 likes · 898 views
Junlin Yang @junlin45300
OPD has been blowing up recently, yet many open questions remain. What does it actually take to make OPD work in practice? This work digs into the details — highly recommended and well worth a read!🔥
Bingxiang He @HBX_hbx

1/n ✨ Introducing our new work: Rethinking On-Policy Distillation of Large Language Models OPD is now a core technique in LLM post-training (Qwen3, MiMo, GLM-5...). But here's the uncomfortable truth: it often doesn't work. We systematically study the phenomenology, mechanism, and recipe of OPD. Here's what we discovered 👇

1 reply · 0 reposts · 2 likes · 123 views
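For readers new to the technique the quoted thread studies: on-policy distillation trains the student on its *own* samples, with each step scored against the teacher's distribution. A minimal sketch of one OPD step — the toy logit functions, tiny vocabulary, and function names are my own illustration, not code from the paper:

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def opd_step(student_logits, teacher_logits, vocab_size=4, max_len=8):
    """One on-policy distillation (OPD) step over a toy vocabulary:
    1. roll out a completion by sampling from the *student* (on-policy data),
    2. score every position under the teacher,
    3. average the per-position reverse KL(student || teacher)."""
    tokens, total_kl = [], 0.0
    for _ in range(max_len):
        s = softmax(student_logits(tokens))
        t = softmax(teacher_logits(tokens))
        # reverse KL at this position: sum_i s_i * log(s_i / t_i)
        total_kl += sum(si * math.log(si / ti) for si, ti in zip(s, t))
        # continue the rollout by sampling the next token from the student
        tokens.append(random.choices(range(vocab_size), weights=s)[0])
    return tokens, total_kl / max_len
```

In a real trainer the averaged reverse KL would be differentiated with respect to the student's parameters; this sketch only computes the objective on the student's own trajectory, which is the "on-policy" part that distinguishes OPD from distilling on teacher-generated data.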
Junlin Yang reposted
Anthropic @AnthropicAI
Introducing Project Glasswing: an urgent initiative to help secure the world’s most critical software. It’s powered by our newest frontier model, Claude Mythos Preview, which can find software vulnerabilities better than all but the most skilled humans. anthropic.com/glasswing
2K replies · 6.7K reposts · 43.8K likes · 30.6M views
Junlin Yang reposted
Junlin Yang @junlin45300
In 2026, coders stopped being synchronous executors and became **async schedulers**: less time doing every step themselves, more time organizing context, allocating attention, and validating outputs.
0 replies · 0 reposts · 0 likes · 51 views
Junlin Yang reposted
Bingxiang He @HBX_hbx
✨ [ICLR 2026] How Far Can Unsupervised RLVR Scale LLM Training?
The dream: models can improve themselves without human supervision. The reality: sometimes they can only sharpen what they already believe.
Intrinsic rewards struggle to scale LLM training because they follow a rise-then-fall pattern that makes collapse mathematically inevitable. But that's not the end of the story. We find unsupervised RLVR (URLVR) is particularly well-suited for test-time training and quantifying model priors.
The full picture 👇
📄 Paper: arxiv.org/abs/2603.08660
🧪 GitHub: github.com/PRIME-RL/TTRL
(1/n)
1 reply · 13 reposts · 70 likes · 7K views
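Background on the linked TTRL repo: its label-free reward treats the majority answer among the model's own samples as a pseudo-label, which is how training can proceed without ground truth. A minimal sketch — the function name and interface here are my own, not the repo's API:

```python
from collections import Counter

def majority_vote_reward(sampled_answers):
    """TTRL-style intrinsic reward: given N answers sampled from the
    model for the same question, treat the most common answer as a
    pseudo-label and give reward 1.0 to every sample that agrees
    with it, 0.0 otherwise."""
    pseudo_label, _ = Counter(sampled_answers).most_common(1)[0]
    rewards = [1.0 if a == pseudo_label else 0.0 for a in sampled_answers]
    return pseudo_label, rewards
```

These rewards then drive a standard RLVR-style policy update. Note that when the majority answer is wrong, the update still reinforces it, which is exactly the "sharpen what they already believe" failure mode the tweet describes.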
Xiaochuan Li @xiaochuanlee
Agentic test-time scaling (TTS) is effective -- until you hit its inherent limits. 💡We show that classic TTS methods offer limited practical gains due to two fundamental limitations: the context ceiling and the verification gap. 🧵 Check out the website: general-agentbench.github.io
2 replies · 11 reposts · 27 likes · 2.2K views
Junlin Yang @junlin45300
@stingning @OpenAI @openclaw @steipete Totally agree! It's becoming clear that systems and evals are backward-shaping how models evolve. As we enter AI's 'second half,' the power to define what a 'good problem' or a 'good application' actually looks like really matters.
0 replies · 0 reposts · 1 like · 154 views
Junli Wang @JunliWang2021
Goat
Hao Zhang @haozhangml

Can’t believe I get to say this -- deeply honored to be named a 2026 Sloan Research Fellow: today.ucsd.edu/story/2026-slo…

Early faculty life is… "hyper-intense": teaching, advising, hiring, papers, grants, and trying to build a lab culture you’ll still be proud of years later. There were many weeks where it felt like we were building the plane mid-flight, burning plenty of midnight oil along the way.

Over the past few years, I’ve been incredibly lucky to work with amazing students and collaborators on a chain of OSS projects: Vicuna → Chatbot Arena → vLLM → DistServe → LMGame → FastVideo; each one then pushed way further by people far beyond our lab. This award feels less like a finish line and more like fuel for the lab, for our students, and for the next set of systems we haven’t built yet. A core principle of ours is building "open-source research that ships."

At the same time, it’s hard not to feel a mix of excitement + uncertainty + anxiety about where CS is heading. Coding agents are improving so fast that I am feeling the AGI firsthand. I have gone back to builder mode -- more productive than ever -- outside of my faculty admin work. I’ve watched friends and colleagues hit numbers that would’ve sounded like science fiction a year ago (e.g., 100+ commits/day).

So what does it mean to “do great computer science” when baseline productivity keeps jumping? For me, it makes “research that ships” more important, and it even raises the bar. The leverage shifts toward taste and problem selection, principled system design, and translating ideas into reliable artifacts. We're excited to keep proving that through real systems people can use!

Deeply grateful to:
- My students and collaborators -- for the ideas, execution, and drive.
- @HDSIUCSD, Dean @GuptaUcsd, and my @UCSanDiego colleagues -- for building an environment where ambitious work can happen.
- @nvidia and @mbzuai (and other compute sponsors) -- for support that helped us move faster and turn ideas into real artifacts. Even as the interface changes, the need for efficient compute and solid infrastructure only grows.

Most of all: credit to the students at @haoailab. You’re the reason any of this is worth doing. Keep building and shipping!

1 reply · 0 reposts · 5 likes · 277 views
Junlin Yang reposted
Ning Ding @stingning
ByteDance has been restrained in marketing Seed 2.0, with almost zero hype. But the reality is clear: it is a globally top-tier model. No qualifiers needed.
13 replies · 13 reposts · 191 likes · 16.6K views
Junlin Yang reposted
Yuxuan Li @YuxuanL_
🚨New paper: "What Makes LLM Agent Simulations Useful for Policy Practice?" @simile_ai is making a great launch, but can these LLM agent simulations actually help real institutions make better decisions? We spent a year working with policymakers to answer this simple question. The answer is yes—but perhaps not how you'd expect. 👇 THREAD 👇 [Link to paper: arxiv.org/abs/2509.21868] [1/n]
3 replies · 34 reposts · 191 likes · 14.8K views
Junlin Yang @junlin45300
"Collective reasoning under distributed information" is a fascinating and under-explored topic. It represents a core capability for **real-world multi-agent collaboration** in open environments, and this work provides an excellent evaluation for it!
Yuxuan Li @YuxuanL_

We’ve expanded our benchmark into HiddenBench: a 65-task, theory-grounded, extensible benchmark for evaluating collective reasoning under distributed information. We tested 15 frontier models (🚨spoiler🚨: Gemini is the clear winner) and uncovered key bottlenecks in multi-agent LLM coordination. Check out the full update on arXiv! 📄 Paper: arxiv.org/abs/2505.11556

0 replies · 0 reposts · 1 like · 108 views
Junlin Yang reposted
Weize Chen @JeffreyChen_THU
Everyone is talking about Self-Evolving 🤖. But here’s the hard question: How do we actually evaluate it? Are models truly learning new skills, or just recalling? 🚀We propose a new benchmark SE-Bench, targeting a core primitive of evolution: Knowledge Internalization. 🧵
4 replies · 29 reposts · 146 likes · 11.7K views
Junlin Yang reposted