Zihao Wang
@wzihao12

Research Scientist at Scale AI

52 posts
Joined March 2022
383 Following · 229 Followers
Zihao Wang retweeted
Scale AI @scale_AI
Can AI handle the kind of reasoning professionals rely on daily? Our latest benchmark, PRBench, puts models to the test with over 1,000 expert-authored tasks in finance and law. Even the strongest models scored below 40% on the hardest tasks, highlighting the gap between potential and practice.
Zihao Wang retweeted
Bing Liu @vbingliu
Excited to share our new paper “ResearchRubrics”, a benchmark for evaluating deep-research agents through fine-grained, human-authored rubrics! Instead of focusing only on factual correctness, we evaluate responses along multiple dimensions: completeness, reasoning soundness, source usage, and clarity, using expert-written rubric criteria with mandatory and optional requirements. This granularity exposes capability gaps that aggregate metrics cannot detect and enables evaluating agents on a spectrum from minimum viable sufficiency to true excellence. Experiments show that even top Deep Research agents achieve only ~67% compliance, struggling especially with cross-document integration, rigorous justification, and citation quality. The consistent failure patterns on implicit reasoning, multi-document synthesis, and sustained sequential reasoning indicate that overcoming current limitations will require architectural advances, not just prompt tuning or incremental improvements. Paper: arxiv.org/pdf/2511.07685
Manasi Sharma @ ICLR 2026 @ManasiSharma_

🚀 New @scale_AI paper: ResearchRubrics, a benchmark for evaluating Deep Research (DR) agents. Even top agents like Gemini & OpenAI DR achieve <68% rubric compliance. We built 2.5K+ expert rubrics with 2.8K+ hrs of human labor to measure why.
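For intuition, the paper's mandatory/optional rubric scheme reduces to a simple aggregation. Below is a minimal sketch in Python; the Criterion fields, the gating rule, and the 0.5 cap are illustrative assumptions of mine, not the paper's implementation:

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    description: str
    mandatory: bool   # mandatory criteria gate "minimum viable sufficiency"
    passed: bool      # judgment from an expert or LLM judge

def rubric_compliance(criteria: list[Criterion]) -> float:
    """Fraction of criteria satisfied, with failed mandatory criteria
    capping the score (hypothetical gating rule, for illustration)."""
    if not criteria:
        return 0.0
    score = sum(c.passed for c in criteria) / len(criteria)
    if any(c.mandatory and not c.passed for c in criteria):
        score = min(score, 0.5)  # response is not minimally sufficient
    return score

report = [
    Criterion("Cites at least three primary sources", mandatory=True, passed=True),
    Criterion("Synthesizes claims across documents", mandatory=True, passed=False),
    Criterion("Notes limitations of the evidence", mandatory=False, passed=True),
]
print(rubric_compliance(report))  # 0.5 (capped: a mandatory criterion failed)
```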

Zihao Wang retweeted
Bing Liu @vbingliu
Can AI actually automate jobs? @Scale_AI and @cais are launching the Remote Labor Index (RLI), the first benchmark and public leaderboard that test how well AI agents can complete real, paid freelance work in domains like software engineering, design, architecture, data analysis, and more. Early results show the limits of today’s models: the top AI agent completed just 2.5% of real freelance jobs as well as or better than humans. AI is powerful, but not yet reliable enough to replace skilled labor. RLI gives us a transparent way to track progress over time and bring clarity to the future of work.
Zihao Wang @wzihao12
On-policy distillation with reverse KL as reward works great—IF you have access to teacher logits. But what if you don't? What if you want to distill from multiple teachers? Our solution: distill teacher guidance into rubrics, then do on-policy RL. Check out our work: arxiv.org/abs/2509.21500
Thinking Machines @thinkymachines

Our latest post explores on-policy distillation, a training approach that unites the error-correcting relevance of RL with the reward density of SFT. Applying it to math reasoning and to an internal chat assistant, we find that on-policy distillation can outperform other approaches for a fraction of the cost. thinkingmachines.ai/blog/on-policy…
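For readers who want the mechanics: when teacher logits are available, the reverse-KL reward in on-policy distillation has a standard per-token form. A minimal sketch (tensor names and shapes are my own, not from either linked write-up):

```python
import torch
import torch.nn.functional as F

def reverse_kl_per_token(student_logits: torch.Tensor,
                         teacher_logits: torch.Tensor) -> torch.Tensor:
    """Per-token reverse KL, KL(student || teacher).

    Both inputs: [batch, seq_len, vocab]. The negative of this quantity
    is used as a dense per-token reward on student-sampled trajectories.
    """
    log_q = F.log_softmax(student_logits, dim=-1)  # student log-probs
    log_p = F.log_softmax(teacher_logits, dim=-1)  # teacher log-probs
    # KL(q || p) = sum_v q(v) * (log q(v) - log p(v))
    return (log_q.exp() * (log_q - log_p)).sum(dim=-1)
```

When teacher logits are unavailable, or there are several teachers, the tweet's alternative applies: compress teacher guidance into rubrics once, then run ordinary on-policy RL against rubric-based rewards.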

Zihao Wang retweeted
Scale AI @scale_AI
We launched SWE-Bench Pro last month to incredible feedback, and we’ve now updated the leaderboard with the latest models and no cost caps. SoTA models now break a 40% pass rate. Congrats to @Anthropic for sweeping the top spots! 🥇Claude 4.5 Sonnet 🥈Claude 4 Sonnet 🥉Claude 4.5 Haiku
Bing Liu @vbingliu

🚀 Introducing SWE-Bench Pro — a new benchmark to evaluate LLM coding agents on real, enterprise-grade software engineering tasks. This is the next step beyond SWE-Bench: harder, contamination-resistant, and closer to real-world repos.

Zihao Wang retweeted
Bing Liu @vbingliu
🔄 RLHF → RLVR → Rubrics → OnlineRubrics
👤 Human feedback = noisy & coarse
🧮 Verifiable rewards = too narrow
📋 Static rubrics = rigid, easy to hack, miss emergent behaviors
💡 We introduce OnlineRubrics: elicited rubrics that evolve as models train. arxiv.org/abs/2510.07284
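The key mechanism is the feedback loop, not any single component. A runnable toy sketch of that loop (the elicitation and scoring functions are stubs of my own; the paper elicits rubrics by contrasting current-policy responses with control responses):

```python
import random

def elicit_rubrics(responses, previous=()):
    # Stub: the paper turns differences between current and control
    # responses into new criteria; here we just append a placeholder.
    return list(previous) + [f"criterion_{len(previous)}"]

def score(response, rubrics):
    # Stub reward: fraction of rubric criteria the response satisfies.
    return sum(c in response for c in rubrics) / max(len(rubrics), 1)

def train(steps=300, refresh_every=100):
    rubrics = elicit_rubrics(responses=[])
    for step in range(1, steps + 1):
        responses = [f"rollout {random.random():.3f}"]  # stand-in sampling
        rewards = [score(r, rubrics) for r in responses]
        # ...a policy-gradient update against `rewards` would go here...
        if step % refresh_every == 0:
            # Re-elicit so the reward tracks emergent behaviors instead
            # of a frozen checklist that the policy learns to hack.
            rubrics = elicit_rubrics(responses, previous=rubrics)
    return rubrics

print(train())  # ['criterion_0', 'criterion_1', 'criterion_2', 'criterion_3']
```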
Zihao Wang retweeted
Bing Liu @vbingliu
New @Scale_AI paper! The culprit behind reward hacking? We trace it to misspecification in the high-reward tail. Our fix: rubric-based rewards to tell “excellent” responses apart from “great.” The result: less hacking, stronger post-training! arxiv.org/pdf/2509.21500
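One way to read the fix: trust the scalar reward model where it is well specified, and let rubrics add resolution only in the tail. A minimal sketch of that idea (the threshold, weight, and additive form are my assumptions, not the paper's recipe):

```python
def shaped_reward(base_reward: float, rubric_score: float,
                  tail_threshold: float = 0.8, weight: float = 0.5) -> float:
    """Below the threshold, return the scalar reward unchanged; in the
    high-reward tail, where misspecification concentrates, rubric
    compliance separates 'excellent' responses from merely 'great' ones."""
    if base_reward < tail_threshold:
        return base_reward
    return base_reward + weight * rubric_score
```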
Zihao Wang retweeted
Victor Veitch 🔸 @victorveitch
Semantics in language is naturally hierarchical, but attempts to interpret LLMs often ignore this. Turns out: baking semantic hierarchy into sparse autoencoders can give big jumps in interpretability and efficiency. Thread + bonus musings on the value of SAEs:
Zihao Wang @wzihao12
@yibophd @jiahaoyu04 @metasec Our Fix: Position-enhanced Fine-Tuning (PFT)! 💡 Instead of just patching data, we strengthen the role signal itself. PFT modifies token position IDs during fine-tuning, creating a clearer numerical "gap" between system & user tokens.
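A minimal sketch of the position-ID manipulation, assuming a HuggingFace-style model that accepts explicit position_ids; the gap size and the system/user split below are illustrative, not the paper's exact recipe:

```python
import torch

def pft_position_ids(input_ids: torch.Tensor, system_len: int,
                     gap: int = 512) -> torch.Tensor:
    """Standard position IDs run 0..L-1. A PFT-style scheme shifts every
    token after the system segment by a fixed offset, so system and user
    tokens occupy numerically separated position ranges."""
    batch, seq_len = input_ids.shape
    pos = torch.arange(seq_len)
    pos[system_len:] += gap  # open the "gap" between system and user tokens
    return pos.unsqueeze(0).expand(batch, -1)

ids = torch.zeros(2, 10, dtype=torch.long)
print(pft_position_ids(ids, system_len=4)[0])
# tensor([  0,   1,   2,   3, 516, 517, 518, 519, 520, 521])
```

The resulting tensor would be passed as position_ids alongside input_ids during fine-tuning.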
Zihao Wang @wzihao12
@yibophd @jiahaoyu04 @metasec Why these shortcuts? We hypothesize current concatenated prompt formats don't provide strong enough invariant signals to differentiate roles robustly. Models latch onto easier, spurious correlations like position instead. (5/N)
Zihao Wang @wzihao12
@yibophd @jiahaoyu04 @metasec Shortcut #2: Proximity to Begin-of-Text! Models heavily rely on how close instructions are to the start of the prompt. Inserting any text before the main system instruction caused a dramatic drop in following the correct instruction! (4/N)
Zihao Wang @wzihao12
@yibophd @jiahaoyu04 @metasec We find models learn shortcuts! Shortcut #1: Task-Type Association! We found models often identify roles based on task types seen during training (e.g., always treating "grammar check" as an instruction, even if from the user!). (3/N)
Zihao Wang @wzihao12
@yibophd @jiahaoyu04 @metasec How did we test this? We used a controlled framework, training models on benign data where user input could be mistaken for instructions, but evaluating on OOD adversarial attacks. This setup isolates true role learning from just memorizing attack patterns. (2/N)
Zihao Wang retweeted
David Reber @davidpreber
🧵 RATE: Score Reward Models with Imperfect Rewrites of Rewrites
1/ How do you measure whether a reward model incentivizes helpfulness without accidentally measuring length, complexity, etc.? Rewrites of rewrites give good counterfactuals, without needing to list all confounders!
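The estimator is easiest to see in code. A minimal sketch, with the rewriter and reward model passed in as stubs (the function shape is mine; details are in the paper):

```python
def rate_effect(responses, rewrite, R, attribute="helpful"):
    """Estimate the effect of `attribute` on reward model R.

    Naively comparing R(x) with R(rewrite(x, attribute, on=False))
    confounds the attribute with rewrite artifacts (style drift, length
    changes). Comparing a rewrite against a rewrite-of-a-rewrite puts one
    rewriting pass on both sides, so the artifacts approximately cancel.
    """
    effects = []
    for x in responses:  # assume each x exhibits the attribute (W=1)
        x_off = rewrite(x, attribute, on=False)          # remove attribute
        x_on_again = rewrite(x_off, attribute, on=True)  # add it back
        effects.append(R(x_on_again) - R(x_off))         # both sides rewritten
    return sum(effects) / len(effects)
```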
Zihao Wang retweeted
Yibo Jiang @yibophd
Are LLMs just doing next-token prediction? It is believed that if an LLM can accurately predict the next tokens in a Wikipedia entry, it essentially “learns” the information. But do pre-trained LLMs actually need to understand context sentences to solve this task? The answer is no!
Zihao Wang retweeted
Victor Veitch 🔸 @victorveitch
Fundamentally, high-level concepts group into categorical variables (mammal, reptile, fish, bird) with a semantic hierarchy (poodle is a dog is a mammal is an animal). How do LLMs internally represent this structure? arxiv.org/abs/2406.01506
Zihao Wang retweeted
Victor Veitch 🔸 @victorveitch
LLM best-of-n sampling works great in practice, but why? Turns out: it's the best possible policy for maximizing win rate over the base model! Then: we use this to get a truly sweet alignment scheme: easy tweaks, huge gains, with @ybnbxb @ggarbacea. arxiv.org/abs/2406.00832
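Best-of-n itself is a few lines, which is part of its appeal. A minimal sketch, with the sampler and reward model passed in as placeholders (not any specific API):

```python
def best_of_n(prompt, generate, score, n=16):
    """Draw n candidates from the base model and keep the one the reward
    model ranks highest; per the linked paper, this maximizes win rate
    over the base model among policies at comparable KL distance."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=score)
```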