Jacky Kwok
@jackyk02
Stanford CS PhD | Berkeley EECS
Palo Alto, CA · Joined June 2025
629 Following · 438 Followers
40 posts
Pinned Tweet
Jacky Kwok @jackyk02
We release LLM-as-a-Verifier 🧠: a general-purpose verification framework that achieves SOTA 👑 on Terminal-Bench 2 (86.4%) and SWE-Bench Verified (77.8%) by scaling:
- scoring granularity
- repeated verification
- criteria decomposition
📄 Blog & Code: llm-as-a-verifier.notion.site
Jacky Kwok retweeted
Bing Xu @bingxu_
Coding benchmarks are easier to boost: a harness like LLM-as-a-Verifier can push Terminal-Bench above Mythos: x.com/jackyk02/statu… And if you forge harness data into a model, you will get a better model on these benchmarks. Stepping back, SWE-Bench has little to do with engineering ability; see how much spaghetti code and slop have been generated.
Quoting Jacky Kwok @jackyk02 (the pinned tweet above)
Cameron R. Wolfe, Ph.D. @cwolferesearch
Strongly recommend the LLM-as-a-Verifier writeup. Biggest takeaway for me is that increasing scoring granularity makes the verifier more effective. This indicates that LLM judges / verifiers are developing new (and better) capabilities.

This did not work well 1-2 years ago. In fact, LLM-as-a-Judge best practice was that lower scoring granularity (e.g., binary, ternary, or a 1-5 Likert scale) worked far better than granular scores (e.g., a 1-100 scale). This was a constant recommendation I gave for setting up LLM judges properly. It seems recent frontier LLMs are better at scoring at finer granularities, making this best practice (potentially) obsolete.

One caveat to this finding is that the scoring setup used in this writeup is a specific setup based upon logprobs. Instead of just using the score token output by the LLM as the result, they compute the logprob of each possible score token and take a weighted average of scores (with weights given by the probabilities). Then, they go further by expanding this weighted average across repeated verifications and multiple criteria:

Reward = (1 / (C·K)) · Σ_{c=1}^{C} Σ_{k=1}^{K} Σ_{g=1}^{G} P_{c,k}(score_g) · score_g

where C is the total number of evaluation criteria, K is the number of repeated verifications, and G is the scoring granularity (i.e., the number of unique scoring output options). The reward determines whether a particular output passes verification across criteria.

When using this logprob setup, we see consistent gains in verifier accuracy by:
- Increasing scoring granularity G.
- Increasing repeated verifications K.
- Increasing the number of evaluation criteria C.

The last two findings are in line with prior work, but the fact that higher scoring granularity is helpful is interesting!

In the LLM-as-a-Verifier paper, this system is used at inference time in a pairwise fashion as described below. "To pick the best trajectory among N candidates for a given task, a round-robin tournament is conducted. For every pair (i, j) the verifier produces Reward(i) and Reward(j) using the formula above. The trajectory with the higher reward receives a win, and the trajectory with the most wins across all \binom{N}{2} pairs is selected."
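The reward formula and the round-robin selection described above can be sketched in a few lines of Python. This is illustrative only: `verify` is a hypothetical stand-in for one LLM verification call that returns log-probabilities over the G score tokens, and `K=4` is an arbitrary default, not a value from the writeup.

```python
import math
from itertools import combinations

def expected_score(score_logprobs):
    """Expected score from a verifier's log-probabilities over score tokens.

    score_logprobs: dict mapping score value -> logprob of that score token.
    Probabilities are renormalized over the G score options present.
    """
    probs = {s: math.exp(lp) for s, lp in score_logprobs.items()}
    z = sum(probs.values())
    return sum(s * p for s, p in probs.items()) / z

def reward(trajectory, criteria, verify, K=4):
    """Reward = (1 / CK) * sum of expected scores over C criteria, K repeats.

    verify(trajectory, criterion) stands in for one verification call.
    """
    total = sum(expected_score(verify(trajectory, c))
                for c in criteria
                for _ in range(K))
    return total / (len(criteria) * K)

def round_robin_select(trajectories, criteria, verify, K=4):
    """Round-robin tournament: return the trajectory with the most
    pairwise wins across all N-choose-2 pairs."""
    wins = [0] * len(trajectories)
    for i, j in combinations(range(len(trajectories)), 2):
        r_i = reward(trajectories[i], criteria, verify, K)
        r_j = reward(trajectories[j], criteria, verify, K)
        wins[i if r_i >= r_j else j] += 1
    return trajectories[wins.index(max(wins))]
```

In practice the rewards would be cached per trajectory rather than recomputed per pair; the loop above just mirrors the pairwise description in the quote.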
Jacky Kwok @jackyk02
Hi Harman, V1 is great! We compared LLM-as-a-Verifier against V1 using the same setup as your codebase on AIME and HMMT. Our results show that leveraging the full probability distribution (instead of a single discrete score) leads to better performance while using less budget. We’ll definitely include this in the final paper and cite V1 :) Also, in the blog post, we note that for agentic setups like Terminal-Bench, the coarse-grained scoring used in V1 may be insufficient. Check out the first table in the results section for more details!
Jacky Kwok retweeted
Azalia Mirhoseini @Azaliamirh
Turns out we can get SOTA on agentic benchmarks with a simple test-time method! Excited to introduce LLM-as-a-Verifier.

Test-time scaling is effective, but picking the "winner" among many candidates is the bottleneck. We introduce a way to extract a cleaner signal from the model:
1️⃣ Ask the LLM to rank results on a scale of 1-k
2️⃣ Use the log-probs of those rank tokens to calculate an expected score

You can get a verification score in a single sampling pass per candidate pair.

Blog: llm-as-a-verifier.notion.site
Code: llm-as-a-verifier.github.io

Led by @jackyk02 and in collaboration with a great team: @shululi256, @pranav_atreya, @liu_yuejiang, @drmapavone, @istoica05
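Step 2️⃣ can be sketched as follows, assuming the sampling API exposes top-k log-probs at the rank-token position (the token strings and numbers below are made up for illustration). The expected score keeps distributional information that a single argmax rank token discards:

```python
import math

def expected_rank(top_logprobs, k=5):
    """Expected score over rank tokens "1".."k", renormalized over
    whichever rank tokens appear in the returned top-k logprob list."""
    probs = {int(tok): math.exp(lp)
             for tok, lp in top_logprobs.items()
             if tok.isdigit() and 1 <= int(tok) <= k}
    z = sum(probs.values())
    return sum(r * p for r, p in probs.items()) / z

# Illustrative logprobs at the single sampled rank token's position:
top_logprobs = {"4": math.log(0.55), "5": math.log(0.35), "3": math.log(0.10)}

argmax_rank = max(top_logprobs, key=top_logprobs.get)  # "4": a discrete score
soft_rank = expected_rank(top_logprobs)                # ~4.25: expected score
```

The discrete score collapses to 4 even though the model places substantial mass on 5; the expectation preserves that signal in one sampling pass.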
Jacky Kwok @jackyk02
@local0ptimist @Azaliamirh We’ve seen promising results in domains without oracle verifiers (e.g., robotics) and will release them in the final paper!
kenneth @local0ptimist
@Azaliamirh wouldn’t this only work in domains with lots of certainty? which means domains we mostly already have verifiers for?
Jacky Kwok @jackyk02
@ChenMoneyQ @Azaliamirh LLM-as-a-Verifier can be used as a trajectory, process, or outcome-level reward model. For Terminal-Bench and SWE-Bench, we run N agent harnesses in parallel and score the resulting trajectories using our method. More details are in the blog post :)
Chen Qian @ChenMoneyQ
@Azaliamirh Great work! I am curious how this fits into a multi-turn trajectory: do we only score the last turn?
Jacky Kwok retweeted
Marco Pavone @drmapavone
Excited to share CoVer-VLA: a contrastive verifier and hierarchical test-time scaling framework that bridges the intention-action gap in generalist robot policies. We show that allocating compute to reasoning and verification at deployment can be more effective than scaling policy training alone.
🌐 Website: cover-vla.github.io
📄 Paper: arxiv.org/abs/2602.12281
🤗 Models: huggingface.co/cover-vla
💻 Code: github.com/cover-vla/cove…
Work led by @jackyk02, in collaboration with @Azaliamirh and @chelseabfinn
Jacky Kwok @jackyk02
📋 Takeaways We demonstrate that scaling verification can be more effective than scaling policy learning alone, providing a promising path toward more robust and generalizable robotics foundation models.