Zonghan Yang

1.3K posts

@yang_zonghan

PhD student at Tsinghua NLP & AIR, studying agents that automate tasks ranging from daily activities to creative endeavors. Two drifters with the world to see.

Joined July 2017
2.2K Following · 2.2K Followers
Pinned Tweet
Zonghan Yang@yang_zonghan·
💪🦾 Agentless Training as Skill Prior for SWE-Agents. We recently released the technical report for Kimi-Dev. Here is the story we'd love to share behind it: (1/7)
Zonghan Yang retweeted
Stella Li@StellaLisy·
Millions of users now have months-long conversation histories with AI assistants💬 But this data is proprietary and unavailable to the academic community for research, training, or benchmarking. We introduce HorizonBench🌅, a benchmark and data generator for long-horizon personalization: tracking a user's current preferences across a history where life events have silently changed them.
Zonghan Yang retweeted
Tianyu Liu@rogerliuty·
Grateful for my experiences in labs at two top Chinese tech firms, and special thanks to @JustinLin610 for recommending me to the early Qwen team. These shaped my career profoundly, both technically and beyond. In the fast-evolving field of LLMs/Agents/Claws, org challenges are complex yet simple. Complexity arises from balancing intricate goals under constraints (GPU, time, infra, inference costs), rapid iteration, and scaling teams amid limited resources. But ultimately, it's simple: Trust, inclusion, respect, and empathy are the most effective—and timeless—solutions.
Junyang Lin@JustinLin610

me stepping down. bye my beloved qwen.

Zonghan Yang retweeted
Zora Wang@ZhiruoW·
AI agents are tackling more and more "human work". But are they benchmarked on the work people actually do? tl;dr: Not really. Most benchmarks focus on math & coding, while most human labor and capital lie elsewhere. 📒 We built a database linking agent benchmarks & real-world work. Submit new tasks + agent trajectories today 🧵
Zonghan Yang retweeted
CLS@ChengleiSi·
Dimitris has been demonstrating the new way to do AI research in this agent era: find a neat problem, reason about why it’s interesting and tractable, offload the execution work to agents, analyze the results and write up a fun post about it. It’s still important for us human researchers to have the expertise to be able to identify the problem and judge the findings (eg, the background knowledge on matrix completion and SVD, and making the connection to LLM benchmarking); but even just automating the execution alone is already a massive acceleration and I think as a community we should really embrace this new form of AI-assisted research.
Dimitris Papailiopoulos@DimitrisPapail

x.com/i/article/2026…

Zonghan Yang retweeted
Cursor@cursor_ai·
Cursor now shows you demos, not diffs. Agents can use the software they build and send you videos of their work.
Zonghan Yang retweeted
Oscar Yinn@yinn_oscar·
Many people are using RL to make models smarter. We used RL to pull training data out of the models themselves. Our results show that models know a lot more about their training data than most people think. We develop Active Data Reconstruction Attack (ADRA) — a data detection method that uses RL to induce models to reconstruct data seen during training. ADRA beats existing methods by an average of >10% across pre-training, post-training, and distillation. Our paper, with @uwnlp, @Cornell, and @BerkeleyNLP @Berkeleyai, is now available. Arxiv: arxiv.org/pdf/2602.19020 Joint work with @jxmnop @shmatikov @sewon__min @HannaHajishirzi
Zonghan Yang retweeted
Ziqian Zhong@fjzzq2002·
🔭 We’re releasing Hodoscope: an open-source tool for unsupervised behavior discovery. It lets you visually explore and compare agent behaviors at scale. It helped us discover a novel reward hacking vulnerability in Commit0 - with just a couple minutes of human effort.
Zonghan Yang retweeted
Xiangyi Li@xdotli·
Agent Skills are everywhere - Claude Code, Gemini CLI, Codex all support them. But do they actually work? 105 domain experts from Stanford, CMU, Berkeley, Oxford, Amazon, ByteDance & more built SkillsBench to find that out. 86 tasks. 11 domains. 7,308 trajectories. 🧵👇
Zonghan Yang retweeted
Kimi Product@KimiProduct·
One prompt = a publication-ready research paper with academic charts. If you need pro academic formatting, LaTeX formulas, and figures, Kimi is all you need. You can get a Word file just by chatting.
Zonghan Yang retweeted
Cursor@cursor_ai·
We've been working on very long-running coding agents. In a recent week-long run, our system peaked at over 1,000 commits per hour across hundreds of agents. We're sharing our findings and an early research preview inside Cursor.
Zonghan Yang retweeted
Andrej Karpathy@karpathy·
A lot of people quote tweeted this as the 1 year anniversary of vibe coding. Some retrospective - I've had a Twitter account for 17 years now (omg) and I still can't predict my tweet engagement basically at all. This was a shower of thoughts throwaway tweet that I just fired off without thinking but somehow it minted a fitting name at the right moment for something that a lot of people were feeling at the same time, so here we are: vibe coding is now mentioned on my Wikipedia as a major memetic "contribution" and even its article is longer. lol

The one thing I'd add is that at the time, LLM capability was low enough that you'd mostly use vibe coding for fun throwaway projects, demos and explorations. It was good fun and it almost worked. Today (1 year later), programming via LLM agents is increasingly becoming a default workflow for professionals, except with more oversight and scrutiny. The goal is to claim the leverage from the use of agents but without any compromise on the quality of the software. Many people have tried to come up with a better name for this to differentiate it from vibe coding; personally my current favorite is "agentic engineering":

- "agentic" because the new default is that you are not writing the code directly 99% of the time, you are orchestrating agents who do and acting as oversight.
- "engineering" to emphasize that there is an art & science and expertise to it. It's something you can learn and become better at, with its own depth of a different kind.

In 2026, we're likely to see continued improvements on both the model layer and the new agent layer. I feel excited about the product of the two and another year of progress.
Andrej Karpathy@karpathy

There's a new kind of coding I call "vibe coding", where you fully give in to the vibes, embrace exponentials, and forget that the code even exists. It's possible because the LLMs (e.g. Cursor Composer w Sonnet) are getting too good. Also I just talk to Composer with SuperWhisper so I barely even touch the keyboard. I ask for the dumbest things like "decrease the padding on the sidebar by half" because I'm too lazy to find it. I "Accept All" always, I don't read the diffs anymore. When I get error messages I just copy paste them in with no comment, usually that fixes it. The code grows beyond my usual comprehension, I'd have to really read through it for a while. Sometimes the LLMs can't fix a bug so I just work around it or ask for random changes until it goes away. It's not too bad for throwaway weekend projects, but still quite amusing. I'm building a project or webapp, but it's not really coding - I just see stuff, say stuff, run stuff, and copy paste stuff, and it mostly works.

Zonghan Yang retweeted
Anthropic@AnthropicAI·
New Engineering blog: We tasked Opus 4.6 using agent teams to build a C compiler. Then we (mostly) walked away. Two weeks later, it worked on the Linux kernel. Here's what it taught us about the future of autonomous software development. Read more: anthropic.com/engineering/bu…
Zonghan Yang retweeted
Tianyu Liu@rogerliuty·
Congrats to @stefan_fee on this excellent work! Really impressed by the daVinci-Agency paper. It shows how chain-of-PRs (global PR context) substantially boosts SWE performance—a natural yet underexplored signal. Even more surprising: agentic trajectories from these PRs yield big gains on Toolathlon (toolathlon.xyz/introduction), a tough real-world tool-use benchmark beyond SWE. Definitely worth diving deeper!
Pengfei Liu@stefan_fee

🚀 daVinci-Agency: Mining chain-of-pull-requests for long-horizon agents 📊 239 samples → 47% gain ⚡ Data efficiency redefined Paper: arxiv.org/abs/2602.02619

Zonghan Yang retweeted
Kimi.ai@Kimi_Moonshot·
Proud to support @stanfordnlp CS224N this year. 🌲 Students are building their final projects using the Kimi K2.5 API. We can’t wait to see what this next generation of NLP researchers creates by the March 16 poster session. web.stanford.edu/class/cs224n/ Enjoy building with Kimi 💫
Zonghan Yang retweeted
Quoc Le@quocleix·
Excited to share our latest work: "Semi-Autonomous Mathematics Discovery with Gemini." We used Gemini to systematically evaluate 700 "open" conjectures in the Erdős Problems database. The result? We addressed 13 problems marked as open—finding 5 novel autonomous solutions and identifying 8 existing solutions missed by previous literature. Read the full case study here: arxiv.org/abs/2601.22401
Zonghan Yang retweeted
Yinjie Wang@YinjieW2024·
RL Anything! Your environment, reward model and policy can be improved in a closed-loop optimization. They provide feedback for each other to enhance the training signals and benefit the whole system. Check this out.
Zonghan Yang retweeted
Yinghui He@yinghui_he_·
STAT has been accepted to ICLR 2026! See you in Brazil 🇧🇷 Skill-Targeted Adaptive Training (STAT) is a continual learning method that squeezes out 🚨 7~10% more performance on extensively trained models like Qwen. It constructs a 🧩 Missing-Skill-Profile for each model based on what skills the model lacks in its responses, and adaptively curates post-training data accordingly. Check out our Blog Post 👉 ying-hui-he.github.io/Skill-Targeted… 🔗arXiv: arxiv.org/abs/2510.10023 💻GitHub: github.com/princeton-pli/…
Zonghan Yang retweeted
idan shenfeld@IdanShenfeld·
People keep saying 2026 will be the year of continual learning. But there are still major technical challenges to making it a reality. Today we take the next step towards that goal — a new on-policy learning algorithm, suitable for continual learning! (1/n)
Zonghan Yang retweeted
Jonas Hübotter@jonashubotter·
Training LLMs with verifiable rewards uses 1 bit of signal per generated response. This hides why the model failed. Today, we introduce a simple algorithm that enables the model to learn from any rich feedback, then turns that feedback into dense supervision. (1/n)