Andrew Zhao

1.7K posts

@_AndrewZhao

PhD @Tsinghua_Uni · Absolute Zero, ExpeL · Ex-intern @MSFTResearch, @BIGAI

Joined September 2020
3.7K Following · 4.7K Followers
Andrew Zhao retweeted
Yifan Zhang @yifan_zhang_
Thrilled to announce that I have been selected to receive the William G. Bowen Merit Fellowship for my Princeton PhD studies @EPrinceton (one fellowship in each academic division), hyped! 🚀
16 replies · 4 reposts · 238 likes · 12.7K views
Andrew Zhao retweeted
Hanze Dong @hendrydong
Between theorem recognition and theorem proving lies theorem understanding. We introduce LiveMathematicianBench: a live, contamination-resistant testbed for research-level mathematical reasoning, built from post-cutoff arXiv theorems. It probes a capability that existing benchmarks rarely isolate: whether models can understand theorem statements, track delicate assumptions, reason over logical structure, and leverage proof-level guidance. livemathematicianbench.github.io
10 replies · 35 reposts · 176 likes · 16.7K views
Andrew Zhao retweeted
Shenzhi Wang🌟 @ShenzhiWang_THU
🚨New Article Alert🚨

Anthropic says: separate your generator and evaluator, reset context instead of compacting, simplify your harness as models improve.

I read 510K lines of leaked Claude Code source to see if they actually do this. Then compared it against OpenClaw. They solve the same 3 problems completely differently.

The short version: Claude Code hardcodes 5 layers of context management, enforces evaluation separation via a 370-line system prompt, and deletes components when A/B tests prove them useless. Every threshold cites a BigQuery date.

OpenClaw abstracts context into a pluggable interface, provides agent lifecycle + DAG orchestration but doesn't enforce quality roles, and simplifies by swapping implementations instead of deleting code.

One optimizes for depth in a single session. The other optimizes for flexibility across devices and channels.

Full breakdown with source-level code comparisons:
Shenzhi Wang🌟 @ShenzhiWang_THU

x.com/i/article/2040…

0 replies · 6 reposts · 21 likes · 5.3K views
elie @eliebakouch
update: joining @PrimeIntellect 🦋

i'm super excited to join the team. i really admire what they've been building and i love the mission of pushing the frontier in the open

i'll be working on pre/mid training, there's so much left to figure out and i truly believe a small group with the right people, resources and focus can do sooo much 🚀
172 replies · 46 reposts · 1.2K likes · 99.5K views
will brown @willccbb
@_mchenco ppl keep asking me how to get a hoodie. you can’t. they’re sparse
14 replies · 1 repost · 107 likes · 5.6K views
Junyang Lin @JustinLin610
mountain climbing is so funny
14 replies · 1 repost · 108 likes · 15.8K views
Andrew Zhao retweeted
Xin Eric Wang @xwang_lk
🎉 Introducing PARE: a new framework for evaluating proactive AI agents.

Today’s agents are reactive. The next wave? Proactive agents that anticipate your needs, like adding “soap” to your shopping list when your roommate texts you.

🚧 The challenge: you can’t evaluate this with static benchmarks.
🍐 PARE: active user simulation with realistic mobile interactions
📱 Asymmetric design: agent ≠ user view (just like real life)
👀 Observe → Execute: assist only when it matters
📋 PARE-Bench: 143 tasks, 9 apps, real-world complexity
📊 Result: even top models hit just 42% success

Built on Meta’s ARE, PARE brings scalable, realistic evaluation to proactive AI.
Deepak Nathani @deepaknathani11

🎉 Excited to share 🍐 PARE and PARE-Bench - a framework and benchmark for evaluating proactive assistants through active user simulation in mobile environments.

Current LM agents are reactive: they wait for you to tell them what to do. Proactive agents flip this. They observe what you're doing and figure out how to help. Imagine your assistant notices you got a text from your roommate saying "we're out of soap" while you're editing your shopping list, and adds soap to your list.

🚧 Evaluating these agents is challenging because they must observe realistic user behavior to infer goals. You can't do this with static benchmarks or passive users.

Our key contributions:
🍐 PARE: an active user simulation framework where users navigate apps through Finite State Machine (FSM) based stateful interfaces, just like on a real phone
📱 Asymmetric design: users and assistants observe different information and interact through different interfaces, matching real-world deployment
👀 Observe-Execute architecture: a lightweight observer monitors continuously, and an executor acts only after user approval
📋 PARE-Bench: 143 tasks across 9 app categories testing goal inference, intervention timing, and multi-app orchestration
📊 Evaluation of 7 LLMs reveals that even frontier models achieve only a 42% success rate

PARE is built on top of Meta's Agent Research Environment (ARE) and enables scalable, repeatable evaluation of proactive agents. In PARE, the simulated user goes about their day on the phone: accomplishing goals, navigating between apps, and responding to notifications. The proactive agent watches all of this unfold and uses the user's actions and environment signals to build context about what the user might need help with.

Huge thanks to my advisors @xwang_lk @WilliamWangNLP and my amazing collaborators @JasonZ118707 @HuanCC2002 Jiaming Shan @yinfeiy Alkesh Patel @zhegan4 @m2saxon 🙏

0 replies · 17 reposts · 84 likes · 14.6K views
Andrew Zhao retweeted
Haider. @haider1
Google's Jeff Dean says current pre-training is passive: initialize a model, stream the internet past it, let it observe. But models need to learn not just from data, but by acting, predicting, and choosing what to learn from next.

"we have this artificial distinction now between pre and post-training, and it shouldn't exist long term"
12 replies · 37 reposts · 404 likes · 54.9K views
Andrew Zhao retweeted
Jeremy Berman @jeremyberman
The most interesting RL is turning pass@1,000,000 into pass@4
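For context on the pass@k shorthand in this aphorism (the tweet itself doesn't spell it out): pass@k is conventionally computed with the unbiased estimator from the Codex paper, which converts n sampled attempts with c successes into the probability that at least one of k draws (without replacement) succeeds. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: given n samples with c correct,
    the probability that at least one of k drawn samples is correct."""
    if n - c < k:
        return 1.0  # too few failures to fill k draws: a success is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)
```

So "turning pass@1,000,000 into pass@4" means compressing a success rate that only shows up at huge n into one that shows up within a handful of samples.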
0 replies · 2 reposts · 68 likes · 5.1K views
Andrew Zhao retweeted
Claude @claudeai
Computer use is now in Claude Code. Claude can open your apps, click through your UI, and test what it built, right from the CLI. Now in research preview on Pro and Max plans.
2.6K replies · 4.8K reposts · 59.3K likes · 15.9M views
Andrew Zhao retweeted
Jack Zhang @jcz42
We made Muon run up to 2x faster for free!

Introducing Gram Newton-Schulz: a mathematically equivalent but computationally faster Newton-Schulz algorithm for polar decomposition.

Gram Newton-Schulz rewrites Newton-Schulz so that instead of iterating on the expensive rectangular X matrix, we iterate on the small, square, symmetric XX^T Gram matrix to reduce FLOPs. This allows us to make more use of fast symmetric GEMM kernels on Hopper and Blackwell, halving the FLOPs of each of those GEMMs.

Gram Newton-Schulz is a drop-in replacement for Newton-Schulz in your Muon use case: we see validation perplexity preserved within 0.01, and share our (long!) journey stabilizing this algorithm and ensuring that training quality is preserved above all else.

This was a super fun project with @noahamsel, @berlinchen, and @tri_dao that spanned theory, numerical analysis, and ML systems! Blog and codebase linked below 🧵
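The Gram rewrite can be sketched in a few lines of NumPy. This is a minimal illustration of the mathematical equivalence only, not the authors' kernel-level implementation (the quintic coefficients are the ones commonly used in Muon's Newton-Schulz iteration; the step count and normalization are illustrative). Standard Newton-Schulz repeatedly applies X ← aX + b(XXᵀ)X + c(XXᵀ)²X; since each step is X ← P(G)X for the symmetric Gram matrix G = XXᵀ, the Gram variant iterates the small m×m matrix G and accumulates the P(G) factors, touching the rectangular X only once at the end:

```python
import numpy as np

# Quintic coefficients commonly used for Muon's Newton-Schulz iteration.
A_COEF, B_COEF, C_COEF = 3.4445, -4.7750, 2.0315

def newton_schulz(X, steps=5):
    """Standard Newton-Schulz: every step multiplies the rectangular X."""
    X = X / (np.linalg.norm(X) + 1e-7)   # Frobenius norm upper-bounds the spectral norm
    for _ in range(steps):
        G = X @ X.T                      # m x m Gram matrix (assumes m <= n)
        X = A_COEF * X + (B_COEF * G + C_COEF * G @ G) @ X
    return X

def gram_newton_schulz(X, steps=5):
    """Gram variant: iterate on the small symmetric G = X X^T, accumulate
    the polynomial factors P(G_k), and apply them to X once at the end."""
    X = X / (np.linalg.norm(X) + 1e-7)
    m = X.shape[0]
    I = np.eye(m)
    G = X @ X.T
    acc = I                              # running product P(G_{k-1}) ... P(G_0)
    for _ in range(steps):
        P = A_COEF * I + B_COEF * G + C_COEF * G @ G
        acc = P @ acc
        G = P @ G @ P                    # exactly X_{k+1} X_{k+1}^T, since P is symmetric
    return acc @ X
```

The actual speedup comes from replacing large rectangular GEMMs with small symmetric ones (and the symmetric-GEMM kernels the post mentions); this sketch only demonstrates that the two iterations produce the same result.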
17 replies · 165 reposts · 1K likes · 206.9K views
Andrew Zhao retweeted
idan shenfeld @IdanShenfeld
Diversity collapse is not only problematic during RL training but can also be a real issue in the deployed model. In many real-world scenarios, there is more than one correct answer, and we should train our model to capture it!
Isha Puri @ishapuri101

Ask ChatGPT several times where's best to go for spring break? It recommends Barcelona almost every time. This isn't a fluke. RL training rewards one best answer, so the model learns to commit to one mode and repeat it. Meet Multi-Answer RL: a simple RL method that trains LMs to reason through and output a distribution of answers in a single generation. [1/N]
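The thread doesn't give the actual Multi-Answer RL objective, so purely as an illustration of what "rewarding a distribution of answers" could mean, here is a hypothetical reward: the function name, the total-variation-distance choice, and the target distribution are all my assumptions, not the paper's method.

```python
from collections import Counter

def multi_answer_reward(answers, target_dist):
    """Illustrative reward for one generation that emits several answers:
    1 minus the total variation distance between the empirical answer
    distribution and a target distribution over acceptable answers."""
    n = len(answers)
    emp = {a: c / n for a, c in Counter(answers).items()}
    support = set(emp) | set(target_dist)
    tv = 0.5 * sum(abs(emp.get(a, 0.0) - target_dist.get(a, 0.0)) for a in support)
    return 1.0 - tv
```

Under this toy reward, a collapsed model that always says "Barcelona" scores 0.25 against a uniform four-city target, while matching the target scores 1.0.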

1 reply · 9 reposts · 69 likes · 11K views
Andrew Zhao retweeted
Jenny Zhang @jennyzhangzt
Introducing Hyperagents: an AI system that not only improves at solving tasks, but also improves how it improves itself.

The Darwin Gödel Machine (DGM) demonstrated that open-ended self-improvement is possible by iteratively generating and evaluating improved agents, yet it relies on a key assumption: that improvements in task performance (e.g., coding ability) translate into improvements in the self-improvement process itself. This alignment holds in coding, where both evaluation and modification are expressed in the same domain, but breaks down more generally. As a result, prior systems remain constrained by fixed, handcrafted meta-level procedures that do not themselves evolve.

We introduce Hyperagents – self-referential agents that can modify both their task-solving behavior and the process that generates future improvements. This enables what we call metacognitive self-modification: learning not just to perform better, but to improve at improving.

We instantiate this framework as DGM-Hyperagents (DGM-H), an extension of the DGM in which both task-solving behavior and the self-improvement procedure are editable and subject to evolution.

Across diverse domains (coding, paper review, robotics reward design, and Olympiad-level math solution grading), hyperagents enable continuous performance improvements over time and outperform baselines without self-improvement or open-ended exploration, as well as prior self-improving systems (including DGM). DGM-H also improves the process by which new agents are generated (e.g. persistent memory, performance tracking), and these meta-level improvements transfer across domains and accumulate across runs.

This work was done during my internship at Meta (@AIatMeta), in collaboration with Bingchen Zhao (@BingchenZhao), Wannan Yang (@winnieyangwn), Jakob Foerster (@j_foerst), Jeff Clune (@jeffclune), Minqi Jiang (@MinqiJiang), Sam Devlin (@smdvln), and Tatiana Shavrina (@rybolos).
154 replies · 648 reposts · 3.6K likes · 492.4K views
Andrew Zhao retweeted
Shenzhi Wang🌟 @ShenzhiWang_THU
When training Qwen3.5, we kept asking ourselves: 🧐 What kind of multimodal RLVR data actually leads to generalizable gains?

💡 We believe the answer may not lie only in data tightly tailored to specific benchmarks, but also in OOD proxy tasks that train the foundational abilities behind long-chain visual reasoning.

The motivation is simple: VLMs are still unreliable in long-CoT settings. Small mistakes in perception, reasoning, knowledge use, or grounding can compound across intermediate steps and eventually lead to much larger final errors. However, much of today’s RLVR data still does not require complex reasoning chains grounded in visual evidence throughout, meaning these failure modes are often not sufficiently stressed during training.

🚀 Excited to share our new work from Qwen and Tsinghua LeapLab: HopChain: Multi-Hop Data Synthesis for Generalizable Vision-Language Reasoning. This is also one of the training task sources used in Qwen3.5 VL RLVR.

To study this question, we propose HopChain, a scalable framework for synthesizing multi-hop vision-language reasoning data for RLVR training. The key idea is to build each query as a chain of logically dependent hops: earlier hops establish the instances, sets, or conditions needed for later hops, while the model must repeatedly return to the image for fresh visual grounding along the way. At the same time, each query ends with a specific, unambiguous numerical answer, making it naturally suitable for verifiable rewards.

Concretely, HopChain combines two complementary structures: perception-level hops and instance-chain hops. We require each synthesized example to involve both, so the model cannot simply continue reasoning from language inertia. Instead, it is forced to keep grounding intermediate steps in the image, maintain cross-step dependencies, and control error accumulation across long reasoning trajectories.

Our goal is not to mimic any specific downstream benchmark, but to strengthen the more fundamental abilities that long-CoT vision-language reasoning depends on. We add HopChain-synthesized data into RLVR training for Qwen3.5-35B-A3B and Qwen3.5-397B-A17B, and evaluate on 24 benchmarks spanning diverse domains. Despite not being designed for any particular benchmark, HopChain improves 20 out of 24 benchmarks on both models, indicating broad and generalizable gains. We also find that full chained multi-hop queries are crucial: replacing them with half-multi-hop or single-hop variants reduces performance substantially. Most notably, the gains are especially strong on long-CoT and ultra-long-CoT vision-language reasoning, peaking at more than 50 accuracy points in the ultra-long-CoT regime.

Our main takeaway is simple: beyond benchmark-aligned data, OOD proxy tasks that systematically train the core mechanics of long-chain visual reasoning can be a powerful and scalable source of RLVR supervision for VLMs, and can lead to more generalizable improvements.

🔗 huggingface.co/papers/2603.17…
2 replies · 55 reposts · 434 likes · 58.1K views
Andrew Zhao retweeted
Dwarkesh Patel @dwarkesh_sp
The Terence Tao episode.

We begin with the absolutely ingenious and surprising way in which Kepler discovered the laws of planetary motion.

People sometimes say that AI will make especially fast progress at scientific discovery because of tight verification loops. But the story of how we discovered the shape of our solar system shows how the verification loop for correct ideas can be decades (or even millennia) long. During this time, what we know today as the better theory can often actually make worse predictions (Copernicus's model of circular orbits around the sun was actually less accurate than Ptolemy's geocentric model). And the reason it survives this epistemic hell is some mixture of judgment and heuristics that we don’t even understand well enough to actually articulate, much less codify into an RL loop.

Hope you enjoy!

0:00:00 – Kepler was a high temperature LLM
0:11:44 – How would we know if there’s a new unifying concept within heaps of AI slop?
0:26:10 – The deductive overhang
0:30:31 – Selection bias in reported AI discoveries
0:46:43 – AI makes papers richer and broader, but not deeper
0:53:00 – If AI solves a problem, can humans get understanding out of it?
0:59:20 – We need a semi-formal language for the way that scientists actually talk to each other
1:09:48 – How Terry uses his time
1:17:05 – Human-AI hybrids will dominate math for a lot longer

Look up Dwarkesh Podcast on YouTube, Apple Podcasts, or Spotify.
104 replies · 555 reposts · 3.9K likes · 832K views