Jacky Kwok
@jackyk02
Stanford CS PhD | Berkeley EECS
Palo Alto, CA · Joined June 2025
629 Following · 438 Followers
40 posts
Pinned Tweet
Jacky Kwok @jackyk02
We release LLM-as-a-Verifier 🧠: a general-purpose verification framework that achieves SOTA 👑 on Terminal-Bench 2 (86.4%) and SWE-Bench Verified (77.8%) by scaling:
- scoring granularity
- repeated verification
- criteria decomposition
📄 Blog & Code: llm-as-a-verifier.notion.site
Jacky Kwok retweeted
Bing Xu @bingxu_
Coding benchmarks are easier to boost: a harness like LLM-as-a-Verifier can push Terminal-Bench above Mythos: x.com/jackyk02/statu… And if you forge harness data into a model, you will get a better model on these benchmarks. Stepping back, SWE-Bench has little to do with engineering ability; see how much spaghetti code and slop have been generated.
Quoting Jacky Kwok @jackyk02 (the pinned tweet above)
Cameron R. Wolfe, Ph.D. @cwolferesearch
Strongly recommend the LLM-as-a-Verifier writeup. Biggest takeaway for me is that increasing scoring granularity makes the verifier more effective. This indicates that LLM judges / verifiers are developing new (and better) capabilities.

This did not work well 1-2 years ago. In fact, LLM-as-a-Judge best practice was that lower scoring granularity (e.g., binary, ternary, or a 1-5 Likert scale) worked far better than granular scores (e.g., a 1-100 scale). This was a constant recommendation I gave for setting up LLM judges properly. It seems recent frontier LLMs are better at scoring at finer granularities, making this best practice (potentially) obsolete.

One caveat to this finding is that the scoring setup used in this writeup is a specific setup based upon logprobs. Instead of just using the score token output by the LLM as the result, they compute the logprob of each possible score token and take a weighted average of scores (with weights given by the probabilities). Then, they go further by expanding this weighted average across repeated verifications and multiple criteria:

Reward = (1 / (C·K)) · Σ_{c=1}^{C} Σ_{k=1}^{K} Σ_{g=1}^{G} P_{c,k}(score_g) · score_g

where C is the total number of evaluation criteria, K is the number of repeated verifications, and G is the scoring granularity (i.e., the number of unique scoring output options). The reward determines whether a particular output passes verification across criteria.

When using this logprob setup, we see consistent gains in verifier accuracy by:
- Increasing scoring granularity G.
- Increasing repeated verifications K.
- Increasing the number of evaluation criteria C.

The last two findings are in line with prior work, but the fact that higher scoring granularity is helpful is interesting!

In the LLM-as-a-Verifier paper, this system is used at inference time in a pairwise fashion as described below. "To pick the best trajectory among N candidates for a given task, a round-robin tournament is conducted. For every pair (i, j) the verifier produces Reward(i) and Reward(j) using the formula above. The trajectory with the higher reward receives a win, and the trajectory with the most wins across all \binom{N}{2} pairs is selected."
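The reward formula and the round-robin selection described above can be sketched in a few lines of Python. This is illustrative only: `verify` is a hypothetical stand-in for one LLM verification call that returns log-probabilities over the G score tokens, and `K=4` is an arbitrary default, not a value from the writeup.

```python
import math
from itertools import combinations

def expected_score(score_logprobs):
    """Expected score from a verifier's log-probabilities over score tokens.

    score_logprobs: dict mapping score value -> logprob of that score token.
    Probabilities are renormalized over the G score options present.
    """
    probs = {s: math.exp(lp) for s, lp in score_logprobs.items()}
    z = sum(probs.values())
    return sum(s * p for s, p in probs.items()) / z

def reward(trajectory, criteria, verify, K=4):
    """Reward = (1 / CK) * sum of expected scores over C criteria, K repeats.

    verify(trajectory, criterion) stands in for one verification call.
    """
    total = sum(expected_score(verify(trajectory, c))
                for c in criteria
                for _ in range(K))
    return total / (len(criteria) * K)

def round_robin_select(trajectories, criteria, verify, K=4):
    """Round-robin tournament: return the trajectory with the most
    pairwise wins across all N-choose-2 pairs."""
    wins = [0] * len(trajectories)
    for i, j in combinations(range(len(trajectories)), 2):
        r_i = reward(trajectories[i], criteria, verify, K)
        r_j = reward(trajectories[j], criteria, verify, K)
        wins[i if r_i >= r_j else j] += 1
    return trajectories[wins.index(max(wins))]
```

In practice the rewards would be cached per trajectory rather than recomputed per pair; the loop above just mirrors the pairwise description in the quote.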
Jacky Kwok @jackyk02
Hi Harman, V1 is great! We compared LLM-as-a-Verifier against V1 using the same setup as your codebase on AIME and HMMT. Our results show that leveraging the full probability distribution (instead of a single discrete score) leads to better performance while using less budget. We’ll definitely include this in the final paper and cite V1 :) Also, in the blog post, we note that for agentic setups like Terminal-Bench, the coarse-grained scoring used in V1 may be insufficient. Check out the first table in the results section for more details!
Jacky Kwok retweeted
Azalia Mirhoseini @Azaliamirh
Turns out we can get SOTA on agentic benchmarks with a simple test-time method! Excited to introduce LLM-as-a-Verifier.

Test-time scaling is effective, but picking the "winner" among many candidates is the bottleneck. We introduce a way to extract a cleaner signal from the model:
1️⃣ Ask the LLM to rank results on a scale of 1-k
2️⃣ Use the log-probs of those rank tokens to calculate an expected score

You can get a verification score in a single sampling pass per candidate pair.

Blog: llm-as-a-verifier.notion.site
Code: llm-as-a-verifier.github.io

Led by @jackyk02 and in collaboration with a great team: @shululi256, @pranav_atreya, @liu_yuejiang, @drmapavone, @istoica05
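Step 2️⃣ can be sketched as follows, assuming the sampling API exposes top-k log-probs at the rank-token position (the token strings and numbers below are made up for illustration). The expected score keeps distributional information that a single argmax rank token discards:

```python
import math

def expected_rank(top_logprobs, k=5):
    """Expected score over rank tokens "1".."k", renormalized over
    whichever rank tokens appear in the returned top-k logprob list."""
    probs = {int(tok): math.exp(lp)
             for tok, lp in top_logprobs.items()
             if tok.isdigit() and 1 <= int(tok) <= k}
    z = sum(probs.values())
    return sum(r * p for r, p in probs.items()) / z

# Illustrative logprobs at the single sampled rank token's position:
top_logprobs = {"4": math.log(0.55), "5": math.log(0.35), "3": math.log(0.10)}

argmax_rank = max(top_logprobs, key=top_logprobs.get)  # "4": a discrete score
soft_rank = expected_rank(top_logprobs)                # ~4.25: expected score
```

The discrete score collapses to 4 even though the model places substantial mass on 5; the expectation preserves that signal in one sampling pass.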
Jacky Kwok @jackyk02
@local0ptimist @Azaliamirh We’ve seen promising results in domains without oracle verifiers (e.g., robotics) and will release them in the final paper!
kenneth @local0ptimist
@Azaliamirh wouldn’t this only work in domains with lots of certainty? which means domains we mostly already have verifiers for?
Jacky Kwok @jackyk02
@ChenMoneyQ @Azaliamirh LLM-as-a-Verifier can be used as a trajectory, process, or outcome-level reward model. For Terminal-Bench and SWE-Bench, we run N agent harnesses in parallel and score the resulting trajectories using our method. More details are in the blog post :)
Chen Qian @ChenMoneyQ
@Azaliamirh Great work! I am curious how this fits into a multi-turn trajectory: do we only score the last turn?
Jacky Kwok retweeted
Marco Pavone @drmapavone
Excited to share CoVer-VLA: a contrastive verifier and hierarchical test-time scaling framework that bridges the intention-action gap in generalist robot policies. We show that allocating compute to reasoning and verification at deployment can be more effective than scaling policy training alone.
🌐 Website: cover-vla.github.io
📄 Paper: arxiv.org/abs/2602.12281
🤗 Models: huggingface.co/cover-vla
💻 Code: github.com/cover-vla/cove…
Work led by @jackyk02, in collaboration with @Azaliamirh and @chelseabfinn
Jacky Kwok @jackyk02
📋 Takeaways We demonstrate that scaling verification can be more effective than scaling policy learning alone, providing a promising path toward more robust and generalizable robotics foundation models.