Jacky Kwok

37 posts

@jackyk02

Stanford CS PhD | Berkeley EECS

Palo Alto, CA · Joined June 2025
625 Following · 438 Followers
Pinned Tweet
Jacky Kwok @jackyk02
We release LLM-as-a-Verifier 🧠: A general-purpose verification framework that achieves SOTA 👑 on Terminal-Bench 2 (86.4%) and SWE-Bench Verified (77.8%) by scaling:
- scoring granularity
- repeated verification
- criteria decomposition
📄 Blog & Code: llm-as-a-verifier.notion.site
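The three scaling axes in the tweet above can be sketched as a simple loop: decompose the task into criteria, verify each criterion several times, and average. This is a minimal illustrative sketch, not the released implementation; `verify_once` is a hypothetical stand-in (here a seeded noisy scorer) for a single LLM verification call.

```python
import random

def verify_once(candidate: str, criterion: str, rng: random.Random) -> float:
    """Hypothetical stand-in for one LLM verification call.
    Returns a noisy score in [0, 1]."""
    base = 0.8 if "pass" in candidate else 0.3
    return min(1.0, max(0.0, base + rng.gauss(0, 0.05)))

def verify(candidate: str, criteria: list[str], repeats: int = 8) -> float:
    """Criteria decomposition + repeated verification:
    average `repeats` noisy scores per criterion, then average criteria."""
    rng = random.Random(0)
    per_criterion = []
    for criterion in criteria:
        scores = [verify_once(candidate, criterion, rng) for _ in range(repeats)]
        per_criterion.append(sum(scores) / repeats)
    return sum(per_criterion) / len(per_criterion)

criteria = ["tests pass", "no regressions", "matches task spec"]
good = verify("patch: all tests pass", criteria)
bad = verify("patch: fails", criteria)
```

Averaging repeated, decomposed verifications reduces the variance of any single noisy judgment, which is why the aggregate score separates the two candidates reliably.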
Jacky Kwok @jackyk02
Hi Harman, V1 is great! We compared LLM-as-a-Verifier against V1 using the same setup as your codebase on AIME and HMMT. Our results show that leveraging the full probability distribution (instead of a single discrete score) leads to better performance while using less budget. We’ll definitely include this in the final paper and cite V1 :) Also, in the blog post, we note that for agentic setups like Terminal-Bench, the coarse-grained scoring used in V1 may be insufficient. Check out the first table in the results section for more details!
Jacky Kwok retweeted
Azalia Mirhoseini @Azaliamirh
Turns out we can get SOTA on agentic benchmarks with a simple test-time method! Excited to introduce LLM-as-a-Verifier.
Test-time scaling is effective, but picking the "winner" among many candidates is the bottleneck. We introduce a way to extract a cleaner signal from the model:
1️⃣ Ask the LLM to rank results on a scale of 1-k
2️⃣ Use the log-probs of those rank tokens to calculate an expected score
You can get a verification score in a single sampling pass per candidate pair.
Blog: llm-as-a-verifier.notion.site
Code: llm-as-a-verifier.github.io
Led by @jackyk02 and in collaboration with a great team: @shululi256, @pranav_atreya, @liu_yuejiang, @drmapavone, @istoica05
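The two-step recipe above (rank on a 1-k scale, then take an expectation over the rank tokens' log-probs instead of the argmax rank) can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the released code; the input dict of per-rank-token log-probs is hypothetical example data.

```python
import math

def expected_score(rank_logprobs: dict[str, float]) -> float:
    """Convert log-probs over rank tokens "1".."k" into a probability
    distribution (renormalized over just those tokens) and return the
    expected rank, rather than the single most likely rank."""
    ranks = sorted(rank_logprobs, key=int)
    probs = [math.exp(rank_logprobs[r]) for r in ranks]
    total = sum(probs)  # renormalize over the k rank tokens only
    return sum(int(r) * p / total for r, p in zip(ranks, probs))

# Hypothetical log-probs the verifier assigns to rank tokens 1..5:
lp = {"1": -3.2, "2": -2.1, "3": -0.9, "4": -0.7, "5": -1.5}
score = expected_score(lp)
```

Here the argmax rank would be 4, but the expectation (~3.57) also reflects the substantial mass on rank 3, giving a finer-grained, single-pass verification score.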
Jacky Kwok @jackyk02
@local0ptimist @Azaliamirh We’ve seen promising results in domains without oracle verifiers (e.g., robotics) and will release them in the final paper!
kenneth @local0ptimist
@Azaliamirh wouldn’t this only work in domains with lots of certainty? which means domains we mostly already have verifiers for?
Jacky Kwok @jackyk02
@ChenMoneyQ @Azaliamirh LLM-as-a-Verifier can be used as a trajectory, process, or outcome-level reward model. For Terminal-Bench and SWE-Bench, we run N agent harnesses in parallel and score the resulting trajectories using our method. More details are in the blog post :)
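The reply above describes running N agent harnesses in parallel and scoring the finished trajectories as an outcome-level reward. A minimal best-of-N selection loop looks like this; `toy_score` is a hypothetical stand-in for the LLM verifier call, and the trajectory format is invented for illustration.

```python
def best_of_n(trajectories, score_trajectory):
    """Score each completed trajectory and return the argmax
    (outcome-level selection among N parallel runs)."""
    scored = [(score_trajectory(t), t) for t in trajectories]
    best_score, best = max(scored, key=lambda st: st[0])
    return best, best_score

def toy_score(traj: list[str]) -> float:
    """Toy stand-in verifier: reward trajectories whose final
    step reports success."""
    return 1.0 if traj[-1].startswith("PASS") else 0.0

runs = [
    ["edit file", "run tests", "FAIL: 2 tests"],
    ["edit file", "fix bug", "run tests", "PASS: all tests"],
]
best, s = best_of_n(runs, toy_score)
```

The same loop works at the process level by scoring prefixes of a trajectory instead of only the completed run.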
Chen Qian @ChenMoneyQ
@Azaliamirh Great work! I am curious how this fits into multi-turn trajectories — do we only score the last turn?
Jacky Kwok retweeted
Marco Pavone @drmapavone
Excited to share CoVer-VLA: a contrastive verifier and hierarchical test-time scaling framework that bridges the intention–action gap in generalist robot policies. We show that allocating compute to reasoning and verification at deployment can be more effective than scaling policy training alone.
🌐 Website: cover-vla.github.io
📄 Paper: arxiv.org/abs/2602.12281
🤗 Models: huggingface.co/cover-vla
💻 Code: github.com/cover-vla/cove…
Work led by @jackyk02, in collaboration with @Azaliamirh and @chelseabfinn
Jacky Kwok @jackyk02
📋 Takeaways
We demonstrate that scaling verification can be more effective than scaling policy learning alone, providing a promising path toward more robust and generalizable robotics foundation models.