Jacky Kwok

37 posts

@jackyk02

Stanford CS PhD | Berkeley EECS

Palo Alto, CA · Joined June 2025
625 Following · 438 Followers
Pinned Tweet
Jacky Kwok @jackyk02
We release LLM-as-a-Verifier 🧠: A general-purpose verification framework that achieves SOTA 👑 on Terminal-Bench 2 (86.4%) and SWE-Bench Verified (77.8%) by scaling:
- scoring granularity
- repeated verification
- criteria decomposition
📄 Blog & Code: llm-as-a-verifier.notion.site
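The three scaling axes in the tweet above can be sketched as a simple loop: decompose the task into criteria, verify each criterion several times, and average. This is a minimal illustrative sketch, not the released implementation; `verify_once` is a hypothetical stand-in (here a seeded noisy scorer) for a single LLM verification call.

```python
import random

def verify_once(candidate: str, criterion: str, rng: random.Random) -> float:
    """Hypothetical stand-in for one LLM verification call.
    Returns a noisy score in [0, 1]."""
    base = 0.8 if "pass" in candidate else 0.3
    return min(1.0, max(0.0, base + rng.gauss(0, 0.05)))

def verify(candidate: str, criteria: list[str], repeats: int = 8) -> float:
    """Criteria decomposition + repeated verification:
    average `repeats` noisy scores per criterion, then average criteria."""
    rng = random.Random(0)
    per_criterion = []
    for criterion in criteria:
        scores = [verify_once(candidate, criterion, rng) for _ in range(repeats)]
        per_criterion.append(sum(scores) / repeats)
    return sum(per_criterion) / len(per_criterion)

criteria = ["tests pass", "no regressions", "matches task spec"]
good = verify("patch: all tests pass", criteria)
bad = verify("patch: fails", criteria)
```

Averaging repeated, decomposed verifications reduces the variance of any single noisy judgment, which is why the aggregate score separates the two candidates reliably.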
Jacky Kwok @jackyk02
Hi Harman, V1 is great! We compared LLM-as-a-Verifier against V1 using the same setup as your codebase on AIME and HMMT. Our results show that leveraging the full probability distribution (instead of a single discrete score) leads to better performance while using less budget. We’ll definitely include this in the final paper and cite V1 :) Also, in the blog post, we note that for agentic setups like Terminal-Bench, the coarse-grained scoring used in V1 may be insufficient. Check out the first table in the results section for more details!
Jacky Kwok retweeted
Azalia Mirhoseini @Azaliamirh
Turns out we can get SOTA on agentic benchmarks with a simple test-time method! Excited to introduce LLM-as-a-Verifier.
Test-time scaling is effective, but picking the "winner" among many candidates is the bottleneck. We introduce a way to extract a cleaner signal from the model:
1️⃣ Ask the LLM to rank results on a scale of 1-k
2️⃣ Use the log-probs of those rank tokens to calculate an expected score
You can get a verification score in a single sampling pass per candidate pair.
Blog: llm-as-a-verifier.notion.site
Code: llm-as-a-verifier.github.io
Led by @jackyk02 and in collaboration with a great team: @shululi256, @pranav_atreya, @liu_yuejiang, @drmapavone, @istoica05
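The two-step recipe above (rank on a 1-k scale, then take an expectation over the rank tokens' log-probs instead of the argmax rank) can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the released code; the input dict of per-rank-token log-probs is hypothetical example data.

```python
import math

def expected_score(rank_logprobs: dict[str, float]) -> float:
    """Convert log-probs over rank tokens "1".."k" into a probability
    distribution (renormalized over just those tokens) and return the
    expected rank, rather than the single most likely rank."""
    ranks = sorted(rank_logprobs, key=int)
    probs = [math.exp(rank_logprobs[r]) for r in ranks]
    total = sum(probs)  # renormalize over the k rank tokens only
    return sum(int(r) * p / total for r, p in zip(ranks, probs))

# Hypothetical log-probs the verifier assigns to rank tokens 1..5:
lp = {"1": -3.2, "2": -2.1, "3": -0.9, "4": -0.7, "5": -1.5}
score = expected_score(lp)
```

Here the argmax rank would be 4, but the expectation (~3.57) also reflects the substantial mass on rank 3, giving a finer-grained, single-pass verification score.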
Jacky Kwok @jackyk02
@local0ptimist @Azaliamirh We’ve seen promising results in domains without oracle verifiers (e.g., robotics) and will release them in the final paper!
kenneth @local0ptimist
@Azaliamirh wouldn’t this only work in domains with lots of certainty? which means domains we mostly already have verifiers for?
Jacky Kwok @jackyk02
@ChenMoneyQ @Azaliamirh LLM-as-a-Verifier can be used as a trajectory, process, or outcome-level reward model. For Terminal-Bench and SWE-Bench, we run N agent harnesses in parallel and score the resulting trajectories using our method. More details are in the blog post :)
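The reply above describes running N agent harnesses in parallel and scoring the finished trajectories as an outcome-level reward. A minimal best-of-N selection loop looks like this; `toy_score` is a hypothetical stand-in for the LLM verifier call, and the trajectory format is invented for illustration.

```python
def best_of_n(trajectories, score_trajectory):
    """Score each completed trajectory and return the argmax
    (outcome-level selection among N parallel runs)."""
    scored = [(score_trajectory(t), t) for t in trajectories]
    best_score, best = max(scored, key=lambda st: st[0])
    return best, best_score

def toy_score(traj: list[str]) -> float:
    """Toy stand-in verifier: reward trajectories whose final
    step reports success."""
    return 1.0 if traj[-1].startswith("PASS") else 0.0

runs = [
    ["edit file", "run tests", "FAIL: 2 tests"],
    ["edit file", "fix bug", "run tests", "PASS: all tests"],
]
best, s = best_of_n(runs, toy_score)
```

The same loop works at the process level by scoring prefixes of a trajectory instead of only the completed run.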
Chen Qian @ChenMoneyQ
@Azaliamirh Great work! I am curious how this fits into multi-turn trajectories — do we only score the last turn?
Jacky Kwok retweeted
Marco Pavone @drmapavone
Excited to share CoVer-VLA: a contrastive verifier and hierarchical test-time scaling framework that bridges the intention–action gap in generalist robot policies. We show that allocating compute to reasoning and verification at deployment can be more effective than scaling policy training alone.
🌐 Website: cover-vla.github.io
📄 Paper: arxiv.org/abs/2602.12281
🤗 Models: huggingface.co/cover-vla
💻 Code: github.com/cover-vla/cove…
Work led by @jackyk02, in collaboration with @Azaliamirh and @chelseabfinn
Jacky Kwok @jackyk02
📋 Takeaways
We demonstrate that scaling verification can be more effective than scaling policy learning alone, providing a promising path toward more robust and generalizable robotics foundation models.