Zonghan Yang

1.3K posts

@yang_zonghan

PhD student at Tsinghua NLP & AIR, studying agents that automate tasks ranging from daily activities to creative endeavors. Two drifters with the world to see.

Joined July 2017
2.2K Following · 2.2K Followers
Pinned Tweet
Zonghan Yang@yang_zonghan·
💪🦾 Agentless Training as Skill Prior for SWE-Agents. We recently released the technical report for Kimi-Dev. Here is the story we'd love to share behind it: (1/7)
Zonghan Yang retweeted
Stella Li@StellaLisy·
Millions of users now have months-long conversation histories with AI assistants💬 But this data is proprietary and unavailable to the academic community for research, training, or benchmarking. We introduce HorizonBench🌅, a benchmark and data generator for long-horizon personalization: tracking a user's current preferences across a history where life events have silently changed them.
Zonghan Yang retweeted
Tianyu Liu@rogerliuty·
Grateful for my experiences in labs at two top Chinese tech firms, and special thanks to @JustinLin610 for recommending me to the early Qwen team. These shaped my career profoundly, both technically and beyond. In the fast-evolving field of LLMs/Agents/Claws, org challenges are complex yet simple. Complexity arises from balancing intricate goals under constraints (GPU, time, infra, inference costs), rapid iteration, and scaling teams amid limited resources. But ultimately, it's simple: Trust, inclusion, respect, and empathy are the most effective—and timeless—solutions.
Junyang Lin@JustinLin610

me stepping down. bye my beloved qwen.

Zonghan Yang retweeted
Zora Wang@ZhiruoW·
AI agents are tackling more and more "human work". But are they benchmarked on the work people actually do? tl;dr: Not really. Most benchmarks focus on math & coding, while most human labor and capital lie elsewhere. 📒 We built a database linking agent benchmarks & real-world work. Submit new tasks + agent trajectories today 🧵
Zonghan Yang retweeted
CLS@ChengleiSi·
Dimitris has been demonstrating the new way to do AI research in this agent era: find a neat problem, reason about why it’s interesting and tractable, offload the execution work to agents, analyze the results and write up a fun post about it. It’s still important for us human researchers to have the expertise to be able to identify the problem and judge the findings (eg, the background knowledge on matrix completion and SVD, and making the connection to LLM benchmarking); but even just automating the execution alone is already a massive acceleration and I think as a community we should really embrace this new form of AI-assisted research.
Dimitris Papailiopoulos@DimitrisPapail

x.com/i/article/2026…

Zonghan Yang retweeted
Cursor@cursor_ai·
Cursor now shows you demos, not diffs. Agents can use the software they build and send you videos of their work.
Zonghan Yang retweeted
Oscar Yinn@yinn_oscar·
Many people are using RL to make models smarter. We used RL to pull training data out of the models themselves. Our results show that models know a lot more about their training data than most people think. We develop Active Data Reconstruction Attack (ADRA) — a data detection method that uses RL to induce models to reconstruct data seen during training. ADRA beats existing methods by an average of >10% across pre-training, post-training, and distillation. Our paper, with @uwnlp, @Cornell, and @BerkeleyNLP @Berkeleyai, is now available. Arxiv: arxiv.org/pdf/2602.19020 Joint work with @jxmnop @shmatikov @sewon__min @HannaHajishirzi
Zonghan Yang retweeted
Ziqian Zhong@fjzzq2002·
🔭 We’re releasing Hodoscope: an open-source tool for unsupervised behavior discovery. It lets you visually explore and compare agent behaviors at scale. It helped us discover a novel reward hacking vulnerability in Commit0 - with just a couple minutes of human effort.
Zonghan Yang retweeted
Xiangyi Li@xdotli·
Agent Skills are everywhere - Claude Code, Gemini CLI, Codex all support them. But do they actually work? 105 domain experts from Stanford, CMU, Berkeley, Oxford, Amazon, ByteDance & more built SkillsBench to find that out. 86 tasks. 11 domains. 7,308 trajectories. 🧵👇
Zonghan Yang retweeted
Kimi Product@KimiProduct·
One prompt = a publication-ready research paper with academic charts. If you need pro academic formatting, LaTeX formulas, and figures, Kimi is all you need. You can get a Word file just by chatting.
Zonghan Yang retweeted
Cursor@cursor_ai·
We've been working on very long-running coding agents. In a recent week-long run, our system peaked at over 1,000 commits per hour across hundreds of agents. We're sharing our findings and an early research preview inside Cursor.
Zonghan Yang retweeted
Andrej Karpathy@karpathy·
A lot of people quote tweeted this as the 1 year anniversary of vibe coding. Some retrospective - I've had a Twitter account for 17 years now (omg) and I still can't predict my tweet engagement basically at all. This was a shower of thoughts throwaway tweet that I just fired off without thinking but somehow it minted a fitting name at the right moment for something that a lot of people were feeling at the same time, so here we are: vibe coding is now mentioned on my Wikipedia as a major memetic "contribution" and even its article is longer. lol

The one thing I'd add is that at the time, LLM capability was low enough that you'd mostly use vibe coding for fun throwaway projects, demos and explorations. It was good fun and it almost worked. Today (1 year later), programming via LLM agents is increasingly becoming a default workflow for professionals, except with more oversight and scrutiny. The goal is to claim the leverage from the use of agents but without any compromise on the quality of the software. Many people have tried to come up with a better name for this to differentiate it from vibe coding; personally my current favorite is "agentic engineering":

- "agentic" because the new default is that you are not writing the code directly 99% of the time, you are orchestrating agents who do and acting as oversight.
- "engineering" to emphasize that there is an art & science and expertise to it. It's something you can learn and become better at, with its own depth of a different kind.

In 2026, we're likely to see continued improvements on both the model layer and the new agent layer. I feel excited about the product of the two and another year of progress.
Andrej Karpathy@karpathy

There's a new kind of coding I call "vibe coding", where you fully give in to the vibes, embrace exponentials, and forget that the code even exists. It's possible because the LLMs (e.g. Cursor Composer w Sonnet) are getting too good. Also I just talk to Composer with SuperWhisper so I barely even touch the keyboard. I ask for the dumbest things like "decrease the padding on the sidebar by half" because I'm too lazy to find it. I "Accept All" always, I don't read the diffs anymore. When I get error messages I just copy paste them in with no comment, usually that fixes it. The code grows beyond my usual comprehension, I'd have to really read through it for a while. Sometimes the LLMs can't fix a bug so I just work around it or ask for random changes until it goes away. It's not too bad for throwaway weekend projects, but still quite amusing. I'm building a project or webapp, but it's not really coding - I just see stuff, say stuff, run stuff, and copy paste stuff, and it mostly works.

Zonghan Yang retweeted
Anthropic@AnthropicAI·
New Engineering blog: We tasked Opus 4.6 using agent teams to build a C compiler. Then we (mostly) walked away. Two weeks later, it worked on the Linux kernel. Here's what it taught us about the future of autonomous software development. Read more: anthropic.com/engineering/bu…
Zonghan Yang retweeted
Tianyu Liu@rogerliuty·
Congrats to @stefan_fee on this excellent work! Really impressed by the daVinci-Agency paper. It shows how chain-of-PRs (global PR context) substantially boosts SWE performance—a natural yet underexplored signal. Even more surprising: agentic trajectories from these PRs yield big gains on Toolathlon (toolathlon.xyz/introduction), a tough real-world tool-use benchmark beyond SWE. Definitely worth diving deeper!
Pengfei Liu@stefan_fee

🚀 daVinci-Agency: Mining chain-of-pull-requests for long-horizon agents 📊 239 samples → 47% gain ⚡ Data efficiency redefined Paper: arxiv.org/abs/2602.02619

Zonghan Yang retweeted
Kimi.ai@Kimi_Moonshot·
Proud to support @stanfordnlp CS224N this year. 🌲 Students are building their final projects using the Kimi K2.5 API. We can’t wait to see what this next generation of NLP researchers creates by the March 16 poster session. web.stanford.edu/class/cs224n/ Enjoy building with Kimi 💫
Zonghan Yang retweeted
Quoc Le@quocleix·
Excited to share our latest work: "Semi-Autonomous Mathematics Discovery with Gemini." We used Gemini to systematically evaluate 700 "open" conjectures in the Erdős Problems database. The result? We addressed 13 problems marked as open—finding 5 novel autonomous solutions and identifying 8 existing solutions missed by previous literature. Read the full case study here: arxiv.org/abs/2601.22401
Zonghan Yang retweeted
Yinjie Wang@YinjieW2024·
RL Anything! Your environment, reward model and policy can be improved in a closed-loop optimization. They provide feedback for each other to enhance the training signals and benefit the whole system. Check this out.
Zonghan Yang retweeted
Yinghui He@yinghui_he_·
STAT has been accepted to ICLR 2026! See you in Brazil 🇧🇷 Skill-Targeted Adaptive Training (STAT) is a continual learning method that squeezes out 🚨 7~10% more performance on extensively trained models like Qwen. It constructs a 🧩 Missing-Skill-Profile for each model based on what skills the model lacks in its responses, and adaptively curates post-training data accordingly. Check out our Blog Post 👉 ying-hui-he.github.io/Skill-Targeted… 🔗arXiv: arxiv.org/abs/2510.10023 💻GitHub: github.com/princeton-pli/…
Zonghan Yang retweeted
idan shenfeld@IdanShenfeld·
People keep saying 2026 will be the year of continual learning. But there are still major technical challenges to making it a reality. Today we take the next step towards that goal — a new on-policy learning algorithm, suitable for continual learning! (1/n)
Zonghan Yang retweeted
Jonas Hübotter@jonashubotter·
Training LLMs with verifiable rewards uses 1 bit of signal per generated response. This hides why the model failed. Today, we introduce a simple algorithm that enables the model to learn from any rich feedback, then turns that feedback into dense supervision. (1/n)