Zhe Ye

84 posts

Zhe Ye

@0xlf_

PhD student @BerkeleyRDI | CN @LEAFERx

Berkeley, CA Katılım Haziran 2021

240 Takip Edilen310 Takipçiler

Sabitlenmiş Tweet

Zhe Ye@0xlf_·13 Haz

1/🧵Introducing VERINA: a high-quality benchmark for verifiable code generation. As LLMs are increasingly used to generate software, we need more than just working code--We need formal guarantees of correctness. VERINA offers a rigorous and modular framework for evaluating LLMs across code, specification, and proof generation, as well as their compositions, paving the way toward trustworthy AI-generated software. 🔗 verina.io

English

102

28.9K

Zhe Ye retweetledi

Ziran Yang@__zrrr__·26 Mar

Introducing Goedel-Code-Prover 🌲 LLMs write code, but can they prove it correct? Not just pass tests, but construct machine-checkable proofs that a program works for ALL possible inputs. We built a system that does exactly this. Given aprogram and its specification in Lean 4, Goedel-Code-Prover automatically synthesizes formal proofs ofcorrectness. Our 8B model achieves 62% overall success rate across three benchmarks (Verina, Clever &AlgoVeri), a 2.6x improvement over the strongest baseline, surpassing both frontier LLMs (GPT/Gemini/Claude)and open-source theorem provers up to 84x larger (DeepSeek-Prover/Goedel-Prover/Kimina-Prover/BFS-Prover).

English

554

67.6K

Zhe Ye retweetledi

Xiuyu Li@xiuyu_l·6 Mar

The last project I co-led during my PhD is finally out. Verifiable rewards are a key ingredient for RL. The ability to verify is also what enables parallel agents and self-evolving. We propose V1, where generation and verification co-evolve through RL. When done properly, the model can become a surprisingly effective verifier for itself.

Harman Singh@Harman26Singh

Can LLMs Self-Verify? Much better than you'd expect. LLMs are increasingly used as parallel reasoners, sampling many solutions at once. Choosing the right answer is the real bottleneck. We show that pairwise self-verification is a powerful primitive. Introducing V1, a framework that unifies generation and self-verification: 💡 Pairwise self-verification beats pointwise scoring, improving test-time scaling 💡 V1-Infer: Efficient tournament-style ranking that improves self-verification 💡 V1-PairRL: RL training where generation and verification co-evolve for developing better self-verifiers 🧵👇

English

440

37.3K

Zhe Ye retweetledi

Shu Lynn Liu@shulynnliu·3 Mar

AlphaEvolve is closed-source. We release 🌟SkyDiscover🌟, a flexible, modular open-source framework with two new adaptive algorithms that match or exceed AlphaEvolve on many benchmarks and outperform OpenEvolve, GEPA, and ShinkaEvolve across 200+ optimization tasks. Our new algorithms dynamically adapt their search strategy, and can even let the AI optimize its own optimization process on the fly! Results: 📊 +34% median score improvement on 172 Frontier-CS problems. 🧮 Matches/exceeds AlphaEvolve on many math benchmarks ⚙️ Discovers system optimizations beyond human-designed SOTA 🧵👇

GIF

English

107

582

139.6K

Zhe Ye retweetledi

Zhanhui Zhou@asapzzhou·8 Ara

(1/n) Tiny-A2D: An Open Recipe to Turn Any AR LM into a Diffusion LM Code (dLLM): github.com/ZHZisZZ/dllm Checkpoints: huggingface.co/collections/dl… With dLLM, you can turn ANY autoregressive LM into a diffusion LM (parallel generation + infilling) with minimal compute. Using this recipe, we built a 🤗collection of the smallest diffusion LMs that work well in practice. Key takeaways: 1. Finetuned on Qwen3-0.6B, we obtain the strongest small (~0.5/0.6B) diffusion LMs to date. 2. The base AR LM matters: Investing compute in improving the base AR model is potentially more efficient than scaling compute during adaptation. 3. Block diffusion (BD3LM) generally outperforms vanilla masked diffusion (MDLM), especially on math-reasoning and coding tasks.

English

337

26.2K

Zhe Ye@0xlf_·4 Ara

Congratulations to @HarmonicMath on achieving 96.8% on VERINA-proof! Exciting to see mathematical reasoning systems push the boundaries of AI for formal verification as well. Read more about VERINA on verina.io!

Harmonic@HarmonicMath

Beyond math: Aristotle achieves SOTA 96.8% proof generation on VERINA: Benchmarking Verifiable Code Generation. You can read more about this performance on our engineering blog linked in bio

English

155

Zhe Ye retweetledi

Dawn Song@dawnsongtweets·4 Ara

Congrats to @HarmonicMath for the great advancement on proof generation, reaching such high score on our VERINA benchmark on verifiable code generation! We have been in the process of releasing an even harder benchmark on this soon---stay tuned 😀

Harmonic@HarmonicMath

Beyond math: Aristotle achieves SOTA 96.8% proof generation on VERINA: Benchmarking Verifiable Code Generation. You can read more about this performance on our engineering blog linked in bio

English

6.1K

Zhe Ye retweetledi

Zhanhui Zhou@asapzzhou·11 Kas

(1/n) 🚨 BERTs that chat: turn any BERT into a chatbot with diffusion hi @karpathy, we just trained a few BERTs to chat with diffusion — we are releasing all the model checkpoints, training curves, and recipes! Hopefully this spares you the side quest into training nanochat with diffusion for now 🙂. It’s both a hands-on tutorial for beginners and an example showing how to use our complete toolkit (dLLM) for deeper projects. Code: github.com/ZHZisZZ/dllm Report: api.wandb.ai/links/asap-zzh… Checkpoints: huggingface.co/collections/dl… Motivation: I couldn’t find a good “Hello World” example for training a minimally working yet useful diffusion language models, a class of bidirectional language models capable of parallel token generation in arbitrary order. So I tried finetuning BERTs to make it chat with discrete diffusion—and it turned out more fun than I expected. TLDR: With a small amount of open-source instruction-following data, a standard BERT can gain conversational ability with diffusion. Specifically, a finetuned ModernBERT-large, with a similar number of parameters, performs close to Qwen1.5-0.5B.

Andrej Karpathy@karpathy

Nice, short post illustrating how simple text (discrete) diffusion can be. Diffusion (i.e. parallel, iterated denoising, top) is the pervasive generative paradigm in image/video, but autoregression (i.e. go left to right bottom) is the dominant paradigm in text. For audio I've seen a bit of both. A lot of diffusion papers look a bit dense but if you strip the mathematical formalism, you end up with simple baseline algorithms, e.g. something a lot closer to flow matching in continuous, or something like this in discrete. It's your vanilla transformer but with bi-directional attention, where you iteratively re-sample and re-mask all tokens in your "tokens canvas" based on a noise schedule until you get the final sample at the last step. (Bi-directional attention is a lot more powerful, and you get a lot stronger autoregressive language models if you train with it, unfortunately it makes training a lot more expensive because now you can't parallelize across sequence dim). So autoregression is doing an `.append(token)` to the tokens canvas while only attending backwards, while diffusion is refreshing the entire token canvas with a `.setitem(idx, token)` while attending bidirectionally. Human thought naively feels a bit more like autoregression but it's hard to say that there aren't more diffusion-like components in some latent space of thought. It feels quite possible that you can further interpolate between them, or generalize them further. And it's a component of the LLM stack that still feels a bit fungible. Now I must resist the urge to side quest into training nanochat with diffusion.

English

118

983

175.9K

Zhe Ye retweetledi

Liyi Zhou@lzhou1110·1 Eyl

I’ve been waiting the chance to clear up two things that people often challenge me on. 1. Some say it’s impossible to get near-zero false positives using LLMs. I don’t think that’s true. If top hackers can carefully validate their findings with almost no mistakes, then humans can, and so can systems built the right way. 2. I also hear that my work only applies to blockchain. I hope this new paper with my incoming PhD student, @Zyy_0530 , makes people look at me differently. TLDR, we CAN find zero days in Android Apps, with low false positives rates.

Wesley Wang@Zyy0530

Android security analysts drown in thousands of false warnings while real vulnerabilities slip through. Traditional SAST tools overwhelm teams with noise but miss logic exploits. Excited to share our A2 - the system to mirror human expert analysis! Link: arxiv.org/pdf/2508.21579

English

Zhe Ye retweetledi

Dawn Song@dawnsongtweets·17 Tem

Join us at Agentic AI Summit 2025 — August 2 at UC Berkeley, with ~2,000 in-person attendees and the leading minds in AI. Building on the momentum of the 25K+ LLM Agents MOOC community, this is the largest and most cutting-edge event on #AgenticAI. As 2025 emerges as the Year of the Agents, the summit offers a front-row seat to the breakthroughs shaping the future of #AgenticAI. Be part of the movement. 👀 Register for in-person or online attendance: rdi.berkeley.edu/events/agentic…

English

217

30.9K

Zhe Ye retweetledi

Yajin (Andy) Zhou@yajinzhou·7 Tem

Join us at DeFi’25: Workshop on Decentralized Finance & Security, Co-located with ACM CCS 2025 on October 17, 2025. Submission deadline: July 21, 2025 (AoE) Thanks to our incredible program committee & chairs for making this happen: @yaish_aviv @christoftorres @alexcryptan @chendaLiu @PulpSpy @jgorzny @0xlf_ @manv_sc @pszalach @mysteryfigure @KaihuaQIN @flotschorsch @zzzihaoli @masserova @dmoroz @ObadiaAlex @chiachih_wu @VeroCEG @KushalBabel @0xFanZhang @lzhou1110 @lzhou1110 @chunghaocrypto …and to our steering committee: @TheWattenhofer @dawnsongtweets @HatforceSec @Daeinar Learn more & submit: defiwork.shop

English

3.5K

Zhe Ye retweetledi

Sijun Tan@sijun_tan·2 Tem

The first half of 2025 is all about reasoning models. The second half? It’s about agents. At Agentica, we’re thrilled to launch two major releases: 1. DeepSWE, our STOA coding agent trained with RL that tops SWEBench leaderboard for open-weight models. 2. rLLM, our agent post-training framework that powers DeepSWE training and beyond. These two releases mark our transition from training language reasoners to building language agents that can truly learn from experience. Welcome to the era of experience.

Agentica Project@Agentica_

🚀 Introducing DeepSWE 🤖: our fully open-sourced, SOTA software engineering agent trained purely with RL on top of Qwen3-32B. DeepSWE achieves 59% on SWEBench-Verified with test-time scaling (and 42.2% Pass @1), topping the SWEBench leaderboard for open-weight models. 💪DeepSWE is trained with rLLM, our modular RL post-training framework for agents. rLLM makes it easy to build, train, and deploy RL-tuned agents on real-world workloads — from software engineering to web navigation and beyond. 🤗As always, we’re open-sourcing everything: not just the model, but the training code (rLLM), dataset (R2EGym), and training recipe for full reproducibility. 🔥Train DeepSWE yourself. Extend it. Build your own local agents. No secrets, no barriers. DeepSWE and rLLM mark our major shift: from training language reasoners to building language agents that can truly learn from experience. We believe the future of AI lies in experience-driven learning — and we’re here to democratize it. Welcome to the era of experience. 🌍 Links below: (1/n)

English

6.4K

Zhe Ye retweetledi

Xiuyu Li@xiuyu_l·30 Haz

Sparsity can make your LoRA fine-tuning go brrr 💨 Announcing SparseLoRA (ICML 2025): up to 1.6-1.9x faster LLM fine-tuning (2.2x less FLOPs) via contextual sparsity, while maintaining performance on tasks like math, coding, chat, and ARC-AGI 🤯 🧵1/ z-lab.ai/projects/spars…

English

204

35.8K

Zhe Ye retweetledi

Kaihua Qin@KaihuaQIN·25 Haz

🔓 99+% of Ethereum contracts are closed-source. We built an LLM that decompiles their bytecode — and exposes what’s inside. Readable. Auditable. Battle-tested. Not a toy. Try it now 👉 evmdecompiler.com 📄 arxiv.org/abs/2506.19624 w/ @zaddyzaddy @lzhou1110 @dawnsongtweets @HatforceSec #Ethereum #BSC #SmartContracts #LLM #MEV #DeFi

English

11.8K

Zhe Ye@0xlf_·25 Haz

@badcryptobitch

QME

Mikerah@badcryptobitch·25 Haz

Everything is a graph

English

250

Zhe Ye retweetledi

Yiyou Sun@YiyouSun·24 Haz

🚨 New study on LLM's reasoning boundary! Can LLMs really think out of the box? We introduce OMEGA—a benchmark probing how they generalize: 🔹 RL boosts accuracy on slightly harder problems with familiar strategies, 🔹 but struggles with creative leaps & strategy composition. 👇

Nouha Dziri@nouhadziri

📢 Can LLMs really reason outside the box in math? Or are they just remixing familiar strategies? Remember DeepSeek R1, o1 have impressed us on Olympiad-level math but also they were failing at simple arithmetic 😬 We built a benchmark to find out → OMEGA Ω 📐 💥 We found that although very powerful, RL struggles to compose skills and to innovate new strategies that were not seen during training. 👇 work w. @UCBerkeley @allen_ai A thread on what we learned 🧵

English

7.6K

Zhe Ye retweetledi

Dawn Song@dawnsongtweets·18 Haz

1/ 🔥 AI agents are reaching a breakthrough moment in cybersecurity. In our latest work: 🔓 CyberGym: AI agents discovered 15 zero-days in major open-source projects 💰 BountyBench: AI agents solved real-world bug bounty tasks worth tens of thousands of dollars 🤖 Autonomously. A pivotal shift is underway — AI agents can now autonomously do what only elite human hackers could before.

English

149

543

136.6K

Zhe Ye@0xlf_·13 Haz

8/Amazing work by @0xlf_, @ZhengxuJYan, @jingxuan_he, Timothe Kasriel, @KaiyuYang4, @dawnsongtweets We hope VERINA will catalyze progress in verifiable code generation by providing a rigorous and comprehensive benchmark.

English

482

Zhe Ye@0xlf_·13 Haz

7/🚀 Available Now: 📄 Paper: arxiv.org/pdf/2505.23135 📊 Dataset: huggingface.co/datasets/sunbl… 💻 Code: github.com/sunblaze-ucb/v… 🌐 Website: verina.io

English

607

Zhe Ye@0xlf_·13 Haz

English

102

28.9K

Keşfet

@HarmonicMath @karpathy @yaish_aviv @christoftorres @alexcryptan @chendaLiu @PulpSpy @jgorzny