Zihao Wang
@wzihao12

Research Scientist at Scale AI

52 posts
Joined March 2022
383 Following · 229 Followers
Zihao Wang retweeted
Scale AI @scale_AI
Can AI handle the kind of reasoning professionals rely on daily? Our latest benchmark, PRBench, puts models to the test with over 1,000 expert-authored tasks in finance and law. Even the strongest models scored below 40% on the hardest tasks, highlighting the gap between potential and practice.
Zihao Wang retweeted
Bing Liu @vbingliu
Excited to share our new paper “ResearchRubrics”, a benchmark for evaluating deep-research agents through fine-grained, human-authored rubrics! Instead of focusing only on factual correctness, we evaluate responses along multiple dimensions: completeness, reasoning soundness, source usage, and clarity, using expert-written rubric criteria with mandatory and optional requirements. This granularity exposes capability gaps that aggregate metrics cannot detect and enables evaluating agents on a spectrum from minimum viable sufficiency to true excellence. Experiments show that even top Deep Research agents achieve only ~67% compliance, struggling especially with cross-document integration, rigorous justification, and citation quality. The consistent failure patterns on implicit reasoning, multi-document synthesis, and sustained sequential reasoning indicate that overcoming current limitations will require architectural advances, not just prompt tuning or incremental improvements. Paper: arxiv.org/pdf/2511.07685
Manasi Sharma @ ICLR 2026 @ManasiSharma_

🚀 New @scale_AI paper: ResearchRubrics, a benchmark for evaluating Deep Research (DR) agents. Even top agents like Gemini & OpenAI DR achieve <68% rubric compliance. We built 2.5K+ expert rubrics with 2.8K+ hrs of human labor to measure why.
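For intuition, the paper's mandatory/optional rubric scheme reduces to a simple aggregation. Below is a minimal sketch in Python; the Criterion fields, the gating rule, and the 0.5 cap are illustrative assumptions of mine, not the paper's implementation:

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    description: str
    mandatory: bool   # mandatory criteria gate "minimum viable sufficiency"
    passed: bool      # judgment from an expert or LLM judge

def rubric_compliance(criteria: list[Criterion]) -> float:
    """Fraction of criteria satisfied, with failed mandatory criteria
    capping the score (hypothetical gating rule, for illustration)."""
    if not criteria:
        return 0.0
    score = sum(c.passed for c in criteria) / len(criteria)
    if any(c.mandatory and not c.passed for c in criteria):
        score = min(score, 0.5)  # response is not minimally sufficient
    return score

report = [
    Criterion("Cites at least three primary sources", mandatory=True, passed=True),
    Criterion("Synthesizes claims across documents", mandatory=True, passed=False),
    Criterion("Notes limitations of the evidence", mandatory=False, passed=True),
]
print(rubric_compliance(report))  # 0.5 (capped: a mandatory criterion failed)
```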

Zihao Wang retweeted
Bing Liu @vbingliu
Can AI actually automate jobs? @Scale_AI and @cais are launching the Remote Labor Index (RLI), the first benchmark and public leaderboard that test how well AI agents can complete real, paid freelance work in domains like software engineering, design, architecture, data analysis, and more. Early results show the limits of today’s models: the top AI agent completed just 2.5% of real freelance jobs as well as or better than humans. AI is powerful, but not yet reliable enough to replace skilled labor. RLI gives us a transparent way to track progress over time and bring clarity to the future of work.
Zihao Wang @wzihao12
On-policy distillation with reverse KL as reward works great—IF you have access to teacher logits. But what if you don't? What if you want to distill from multiple teachers? Our solution: distill teacher guidance into rubrics, then do on-policy RL. Check out our work: arxiv.org/abs/2509.21500
Thinking Machines @thinkymachines

Our latest post explores on-policy distillation, a training approach that unites the error-correcting relevance of RL with the reward density of SFT. Applying it to math reasoning and to an internal chat assistant, we find that on-policy distillation can outperform other approaches for a fraction of the cost. thinkingmachines.ai/blog/on-policy…
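For readers who want the mechanics: when teacher logits are available, the reverse-KL reward in on-policy distillation has a standard per-token form. A minimal sketch (tensor names and shapes are my own, not from either linked write-up):

```python
import torch
import torch.nn.functional as F

def reverse_kl_per_token(student_logits: torch.Tensor,
                         teacher_logits: torch.Tensor) -> torch.Tensor:
    """Per-token reverse KL, KL(student || teacher).

    Both inputs: [batch, seq_len, vocab]. The negative of this quantity
    is used as a dense per-token reward on student-sampled trajectories.
    """
    log_q = F.log_softmax(student_logits, dim=-1)  # student log-probs
    log_p = F.log_softmax(teacher_logits, dim=-1)  # teacher log-probs
    # KL(q || p) = sum_v q(v) * (log q(v) - log p(v))
    return (log_q.exp() * (log_q - log_p)).sum(dim=-1)
```

When teacher logits are unavailable, or there are several teachers, the tweet's alternative applies: compress teacher guidance into rubrics once, then run ordinary on-policy RL against rubric-based rewards.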

Zihao Wang retweeted
Scale AI @scale_AI
We launched SWE-Bench Pro last month to incredible feedback, and we’ve now updated the leaderboard with the latest models and no cost caps. SoTA models now break a 40% pass rate. Congrats to @Anthropic for sweeping the top spots! 🥇Claude 4.5 Sonnet 🥈Claude 4 Sonnet 🥉Claude 4.5 Haiku
Bing Liu @vbingliu

🚀 Introducing SWE-Bench Pro — a new benchmark to evaluate LLM coding agents on real, enterprise-grade software engineering tasks. This is the next step beyond SWE-Bench: harder, contamination-resistant, and closer to real-world repos.

Zihao Wang retweeted
Bing Liu @vbingliu
🔄 RLHF → RLVR → Rubrics → OnlineRubrics
👤 Human feedback = noisy & coarse
🧮 Verifiable rewards = too narrow
📋 Static rubrics = rigid, easy to hack, miss emergent behaviors
💡 We introduce OnlineRubrics: elicited rubrics that evolve as models train. arxiv.org/abs/2510.07284
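The key mechanism is the feedback loop, not any single component. A runnable toy sketch of that loop (the elicitation and scoring functions are stubs of my own; the paper elicits rubrics by contrasting current-policy responses with control responses):

```python
import random

def elicit_rubrics(responses, previous=()):
    # Stub: the paper turns differences between current and control
    # responses into new criteria; here we just append a placeholder.
    return list(previous) + [f"criterion_{len(previous)}"]

def score(response, rubrics):
    # Stub reward: fraction of rubric criteria the response satisfies.
    return sum(c in response for c in rubrics) / max(len(rubrics), 1)

def train(steps=300, refresh_every=100):
    rubrics = elicit_rubrics(responses=[])
    for step in range(1, steps + 1):
        responses = [f"rollout {random.random():.3f}"]  # stand-in sampling
        rewards = [score(r, rubrics) for r in responses]
        # ...a policy-gradient update against `rewards` would go here...
        if step % refresh_every == 0:
            # Re-elicit so the reward tracks emergent behaviors instead
            # of a frozen checklist that the policy learns to hack.
            rubrics = elicit_rubrics(responses, previous=rubrics)
    return rubrics

print(train())  # ['criterion_0', 'criterion_1', 'criterion_2', 'criterion_3']
```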
Zihao Wang retweeted
Bing Liu @vbingliu
New @Scale_AI paper! The culprit behind reward hacking? We trace it to misspecification in the high-reward tail. Our fix: rubric-based rewards to tell “excellent” responses apart from “great.” The result: less hacking, stronger post-training! arxiv.org/pdf/2509.21500
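One way to read the fix: trust the scalar reward model where it is well specified, and let rubrics add resolution only in the tail. A minimal sketch of that idea (the threshold, weight, and additive form are my assumptions, not the paper's recipe):

```python
def shaped_reward(base_reward: float, rubric_score: float,
                  tail_threshold: float = 0.8, weight: float = 0.5) -> float:
    """Below the threshold, return the scalar reward unchanged; in the
    high-reward tail, where misspecification concentrates, rubric
    compliance separates 'excellent' responses from merely 'great' ones."""
    if base_reward < tail_threshold:
        return base_reward
    return base_reward + weight * rubric_score
```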
Zihao Wang retweeted
Victor Veitch 🔸 @victorveitch
Semantics in language is naturally hierarchical, but attempts to interpret LLMs often ignore this. Turns out: baking semantic hierarchy into sparse autoencoders can give big jumps in interpretability and efficiency. Thread + bonus musings on the value of SAEs:
Zihao Wang @wzihao12
@yibophd @jiahaoyu04 @metasec Our Fix: Position-enhanced Fine-Tuning (PFT)! 💡 Instead of just patching data, we strengthen the role signal itself. PFT modifies token position IDs during fine-tuning, creating a clearer numerical "gap" between system & user tokens.
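A minimal sketch of the position-ID manipulation, assuming a HuggingFace-style model that accepts explicit position_ids; the gap size and the system/user split below are illustrative, not the paper's exact recipe:

```python
import torch

def pft_position_ids(input_ids: torch.Tensor, system_len: int,
                     gap: int = 512) -> torch.Tensor:
    """Standard position IDs run 0..L-1. A PFT-style scheme shifts every
    token after the system segment by a fixed offset, so system and user
    tokens occupy numerically separated position ranges."""
    batch, seq_len = input_ids.shape
    pos = torch.arange(seq_len)
    pos[system_len:] += gap  # open the "gap" between system and user tokens
    return pos.unsqueeze(0).expand(batch, -1)

ids = torch.zeros(2, 10, dtype=torch.long)
print(pft_position_ids(ids, system_len=4)[0])
# tensor([  0,   1,   2,   3, 516, 517, 518, 519, 520, 521])
```

The resulting tensor would be passed as position_ids alongside input_ids during fine-tuning.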
Zihao Wang @wzihao12
@yibophd @jiahaoyu04 @metasec Why these shortcuts? We hypothesize current concatenated prompt formats don't provide strong enough invariant signals to differentiate roles robustly. Models latch onto easier, spurious correlations like position instead. (5/N)
Zihao Wang @wzihao12
@yibophd @jiahaoyu04 @metasec Shortcut #2: Proximity to Begin-of-Text! Models heavily rely on how close instructions are to the start of the prompt. Inserting any text before the main system instruction caused a dramatic drop in following the correct instruction! (4/N)
Zihao Wang @wzihao12
@yibophd @jiahaoyu04 @metasec We find models learn shortcuts! Shortcut #1: Task-Type Association! We found models often identify roles based on task types seen during training (e.g., always treating "grammar check" as an instruction, even if from the user!). (3/N)
Zihao Wang @wzihao12
@yibophd @jiahaoyu04 @metasec How did we test this? We used a controlled framework, training models on benign data where user input could be mistaken for instructions, but evaluating on OOD adversarial attacks. This setup isolates true role learning from just memorizing attack patterns. (2/N)
Zihao Wang retweeted
David Reber @davidpreber
🧵 RATE: Score Reward Models with Imperfect Rewrites of Rewrites
1/ How do you measure whether a reward model incentivizes helpfulness without accidentally measuring length, complexity, etc.? Rewrites of rewrites give good counterfactuals, without needing to list all confounders!
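The estimator is easiest to see in code. A minimal sketch, with the rewriter and reward model passed in as stubs (the function shape is mine; details are in the paper):

```python
def rate_effect(responses, rewrite, R, attribute="helpful"):
    """Estimate the effect of `attribute` on reward model R.

    Naively comparing R(x) with R(rewrite(x, attribute, on=False))
    confounds the attribute with rewrite artifacts (style drift, length
    changes). Comparing a rewrite against a rewrite-of-a-rewrite puts one
    rewriting pass on both sides, so the artifacts approximately cancel.
    """
    effects = []
    for x in responses:  # assume each x exhibits the attribute (W=1)
        x_off = rewrite(x, attribute, on=False)          # remove attribute
        x_on_again = rewrite(x_off, attribute, on=True)  # add it back
        effects.append(R(x_on_again) - R(x_off))         # both sides rewritten
    return sum(effects) / len(effects)
```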
Zihao Wang retweeted
Yibo Jiang @yibophd
Are LLMs just doing next-token prediction? It is believed that if an LLM can accurately predict the next tokens in a Wikipedia entry, it essentially “learns” the information. But do pre-trained LLMs actually need to understand context sentences to solve this task? The answer is no!
Zihao Wang retweeted
Victor Veitch 🔸 @victorveitch
Fundamentally, high-level concepts group into categorical variables (mammal, reptile, fish, bird) with a semantic hierarchy (poodle is a dog is a mammal is an animal). How do LLMs internally represent this structure? arxiv.org/abs/2406.01506
Zihao Wang retweeted
Victor Veitch 🔸 @victorveitch
LLM best-of-n sampling works great in practice, but why? Turns out: it's the best possible policy for maximizing win rate over the base model! Then: we use this to get a truly sweet alignment scheme: easy tweaks, huge gains, with @ybnbxb @ggarbacea. arxiv.org/abs/2406.00832
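Best-of-n itself is a few lines, which is part of its appeal. A minimal sketch, with the sampler and reward model passed in as placeholders (not any specific API):

```python
def best_of_n(prompt, generate, score, n=16):
    """Draw n candidates from the base model and keep the one the reward
    model ranks highest; per the linked paper, this maximizes win rate
    over the base model among policies at comparable KL distance."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=score)
```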