Zhe Ye

84 posts

Zhe Ye

@0xlf_

PhD student @BerkeleyRDI | CN @LEAFERx

Berkeley, CA · Joined June 2021

240 Following · 310 Followers
Pinned Tweet
Zhe Ye
Zhe Ye@0xlf_·
1/🧵Introducing VERINA: a high-quality benchmark for verifiable code generation. As LLMs are increasingly used to generate software, we need more than just working code; we need formal guarantees of correctness. VERINA offers a rigorous and modular framework for evaluating LLMs across code, specification, and proof generation, as well as their compositions, paving the way toward trustworthy AI-generated software. 🔗 verina.io
5 replies · 16 reposts · 102 likes · 28.9K views
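The code/spec/proof split this thread describes can be illustrated with a toy Lean 4 triple. This is a hypothetical example of the format, not a task from the benchmark; `double`, `double_spec`, and `double_correct` are made-up names.

```lean
-- Code generation target: the program itself.
def double (n : Nat) : Nat := n + n

-- Specification generation target: what "correct" means here.
def double_spec (n : Nat) (m : Nat) : Prop := m = 2 * n

-- Proof generation target: machine-checked evidence that
-- the code satisfies the specification.
theorem double_correct (n : Nat) : double_spec n (double n) := by
  unfold double double_spec
  omega
```

Each of the three artifacts can be generated and graded independently, which is what a modular evaluation of code, spec, and proof generation (and their compositions) amounts to.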
Zhe Ye retweeted
Ziran Yang
Ziran Yang@__zrrr__·
Introducing Goedel-Code-Prover 🌲 LLMs write code, but can they prove it correct? Not just pass tests, but construct machine-checkable proofs that a program works for ALL possible inputs. We built a system that does exactly this. Given a program and its specification in Lean 4, Goedel-Code-Prover automatically synthesizes formal proofs of correctness. Our 8B model achieves a 62% overall success rate across three benchmarks (Verina, Clever & AlgoVeri), a 2.6x improvement over the strongest baseline, surpassing both frontier LLMs (GPT/Gemini/Claude) and open-source theorem provers up to 84x larger (DeepSeek-Prover/Goedel-Prover/Kimina-Prover/BFS-Prover).
Ziran Yang tweet media
19 replies · 76 reposts · 554 likes · 67.6K views
Zhe Ye retweeted
Shu Lynn Liu
Shu Lynn Liu@shulynnliu·
AlphaEvolve is closed-source. We release 🌟SkyDiscover🌟, a flexible, modular, open-source framework with two new adaptive algorithms that match or exceed AlphaEvolve on many benchmarks and outperform OpenEvolve, GEPA, and ShinkaEvolve across 200+ optimization tasks. Our new algorithms dynamically adapt their search strategy and can even let the AI optimize its own optimization process on the fly! Results:
📊 +34% median score improvement on 172 Frontier-CS problems
🧮 Matches/exceeds AlphaEvolve on many math benchmarks
⚙️ Discovers system optimizations beyond human-designed SOTA
🧵👇
GIF
12 replies · 107 reposts · 582 likes · 139.6K views
Zhe Ye retweeted
Zhanhui Zhou
Zhanhui Zhou@asapzzhou·
(1/n) Tiny-A2D: An Open Recipe to Turn Any AR LM into a Diffusion LM
Code (dLLM): github.com/ZHZisZZ/dllm
Checkpoints: huggingface.co/collections/dl…
With dLLM, you can turn ANY autoregressive LM into a diffusion LM (parallel generation + infilling) with minimal compute. Using this recipe, we built a 🤗 collection of the smallest diffusion LMs that work well in practice. Key takeaways:
1. Finetuned on Qwen3-0.6B, we obtain the strongest small (~0.5/0.6B) diffusion LMs to date.
2. The base AR LM matters: investing compute in improving the base AR model is potentially more efficient than scaling compute during adaptation.
3. Block diffusion (BD3LM) generally outperforms vanilla masked diffusion (MDLM), especially on math-reasoning and coding tasks.
6 replies · 73 reposts · 337 likes · 26.2K views
Zhe Ye retweeted
Dawn Song
Dawn Song@dawnsongtweets·
Congrats to @HarmonicMath for the great advancement on proof generation, reaching such a high score on our VERINA benchmark for verifiable code generation! We are in the process of releasing an even harder benchmark on this soon; stay tuned 😀
Harmonic@HarmonicMath

Beyond math: Aristotle achieves SOTA 96.8% proof generation on VERINA (Benchmarking Verifiable Code Generation). You can read more about this performance on our engineering blog linked in bio.

2 replies · 4 reposts · 33 likes · 6.1K views
Zhe Ye retweeted
Zhanhui Zhou
Zhanhui Zhou@asapzzhou·
(1/n) 🚨 BERTs that chat: turn any BERT into a chatbot with diffusion
hi @karpathy, we just trained a few BERTs to chat with diffusion — we are releasing all the model checkpoints, training curves, and recipes! Hopefully this spares you the side quest into training nanochat with diffusion for now 🙂. It’s both a hands-on tutorial for beginners and an example showing how to use our complete toolkit (dLLM) for deeper projects.
Code: github.com/ZHZisZZ/dllm
Report: api.wandb.ai/links/asap-zzh…
Checkpoints: huggingface.co/collections/dl…
Motivation: I couldn’t find a good “Hello World” example for training a minimally working yet useful diffusion language model, a class of bidirectional language models capable of parallel token generation in arbitrary order. So I tried finetuning BERTs to make them chat with discrete diffusion, and it turned out more fun than I expected.
TLDR: With a small amount of open-source instruction-following data, a standard BERT can gain conversational ability with diffusion. Specifically, a finetuned ModernBERT-large, with a similar number of parameters, performs close to Qwen1.5-0.5B.
Andrej Karpathy@karpathy

Nice, short post illustrating how simple text (discrete) diffusion can be. Diffusion (i.e. parallel, iterated denoising, top) is the pervasive generative paradigm in image/video, but autoregression (i.e. go left to right, bottom) is the dominant paradigm in text. For audio I've seen a bit of both.

A lot of diffusion papers look a bit dense, but if you strip the mathematical formalism, you end up with simple baseline algorithms, e.g. something a lot closer to flow matching in continuous, or something like this in discrete. It's your vanilla transformer but with bi-directional attention, where you iteratively re-sample and re-mask all tokens in your "tokens canvas" based on a noise schedule until you get the final sample at the last step. (Bi-directional attention is a lot more powerful, and you get a lot stronger autoregressive language models if you train with it; unfortunately it makes training a lot more expensive because now you can't parallelize across the sequence dim.)

So autoregression is doing an `.append(token)` to the tokens canvas while only attending backwards, while diffusion is refreshing the entire token canvas with a `.setitem(idx, token)` while attending bidirectionally. Human thought naively feels a bit more like autoregression, but it's hard to say that there aren't more diffusion-like components in some latent space of thought. It feels quite possible that you can further interpolate between them, or generalize them further. And it's a component of the LLM stack that still feels a bit fungible.

Now I must resist the urge to side quest into training nanochat with diffusion.

21 replies · 118 reposts · 983 likes · 175.9K views
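The re-sample/re-mask loop described in the quoted tweet can be sketched in a few lines of Python. This is an illustrative toy, not the dLLM implementation: `toy_model` is a stand-in for a real bidirectional transformer, and the linear schedule and random re-masking are made-up simplifications (real samplers typically re-mask low-confidence positions instead).

```python
import random

MASK = "<mask>"
VOCAB = ["hello", "world", "foo", "bar"]

def toy_model(canvas):
    """Stand-in for a bidirectional transformer: fill each masked
    position with a sampled token. A real model would condition on
    the entire canvas via bidirectional attention."""
    return [random.choice(VOCAB) if tok == MASK else tok for tok in canvas]

def diffusion_sample(length=8, steps=4):
    """Discrete diffusion sampling: start from an all-mask canvas,
    then repeatedly (1) re-sample every masked slot and (2) re-mask a
    shrinking fraction of positions per a noise schedule, until the
    last step leaves the canvas fully unmasked."""
    canvas = [MASK] * length
    for step in range(steps):
        canvas = toy_model(canvas)           # re-sample all masked tokens
        frac = 1.0 - (step + 1) / steps      # linear noise schedule -> 0
        n_mask = int(frac * length)
        for idx in random.sample(range(length), n_mask):
            canvas[idx] = MASK               # re-mask a subset (random here)
    return canvas

print(diffusion_sample())
```

The contrast with autoregression is exactly the one drawn above: AR appends one token per step attending backwards, while this loop rewrites arbitrary positions of the whole canvas each step.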
Zhe Ye retweeted
Liyi Zhou
Liyi Zhou@lzhou1110·
I’ve been waiting for the chance to clear up two things that people often challenge me on.
1. Some say it’s impossible to get near-zero false positives using LLMs. I don’t think that’s true. If top hackers can carefully validate their findings with almost no mistakes, then humans can, and so can systems built the right way.
2. I also hear that my work only applies to blockchain. I hope this new paper with my incoming PhD student, @Zyy_0530, makes people look at me differently.
TLDR: we CAN find zero days in Android apps, with low false positive rates.
Wesley Wang@Zyy0530

Android security analysts drown in thousands of false warnings while real vulnerabilities slip through. Traditional SAST tools overwhelm teams with noise but miss logic exploits. Excited to share A2, our system that mirrors human expert analysis! Link: arxiv.org/pdf/2508.21579

3 replies · 5 reposts · 19 likes · 3K views
Zhe Ye retweeted
Dawn Song
Dawn Song@dawnsongtweets·
Join us at Agentic AI Summit 2025 — August 2 at UC Berkeley, with ~2,000 in-person attendees and the leading minds in AI. Building on the momentum of the 25K+ LLM Agents MOOC community, this is the largest and most cutting-edge event on #AgenticAI. As 2025 emerges as the Year of the Agents, the summit offers a front-row seat to the breakthroughs shaping the future of #AgenticAI. Be part of the movement. 👀 Register for in-person or online attendance: rdi.berkeley.edu/events/agentic…
Dawn Song tweet media
10 replies · 50 reposts · 217 likes · 30.9K views
Zhe Ye retweeted
Yajin (Andy) Zhou
Yajin (Andy) Zhou@yajinzhou·
Join us at DeFi’25: Workshop on Decentralized Finance & Security, co-located with ACM CCS 2025 on October 17, 2025.
Submission deadline: July 21, 2025 (AoE)
Thanks to our incredible program committee & chairs for making this happen: @yaish_aviv @christoftorres @alexcryptan @chendaLiu @PulpSpy @jgorzny @0xlf_ @manv_sc @pszalach @mysteryfigure @KaihuaQIN @flotschorsch @zzzihaoli @masserova @dmoroz @ObadiaAlex @chiachih_wu @VeroCEG @KushalBabel @0xFanZhang @lzhou1110 @chunghaocrypto
…and to our steering committee: @TheWattenhofer @dawnsongtweets @HatforceSec @Daeinar
Learn more & submit: defiwork.shop
0 replies · 11 reposts · 20 likes · 3.5K views
Zhe Ye retweeted
Sijun Tan
Sijun Tan@sijun_tan·
The first half of 2025 is all about reasoning models. The second half? It’s about agents. At Agentica, we’re thrilled to launch two major releases:
1. DeepSWE, our SOTA coding agent trained with RL that tops the SWEBench leaderboard for open-weight models.
2. rLLM, our agent post-training framework that powers DeepSWE training and beyond.
These two releases mark our transition from training language reasoners to building language agents that can truly learn from experience. Welcome to the era of experience.
Agentica Project@Agentica_

🚀 Introducing DeepSWE 🤖: our fully open-sourced, SOTA software engineering agent trained purely with RL on top of Qwen3-32B. DeepSWE achieves 59% on SWEBench-Verified with test-time scaling (and 42.2% Pass@1), topping the SWEBench leaderboard for open-weight models.
💪 DeepSWE is trained with rLLM, our modular RL post-training framework for agents. rLLM makes it easy to build, train, and deploy RL-tuned agents on real-world workloads — from software engineering to web navigation and beyond.
🤗 As always, we’re open-sourcing everything: not just the model, but the training code (rLLM), dataset (R2EGym), and training recipe for full reproducibility.
🔥 Train DeepSWE yourself. Extend it. Build your own local agents. No secrets, no barriers.
DeepSWE and rLLM mark our major shift: from training language reasoners to building language agents that can truly learn from experience. We believe the future of AI lies in experience-driven learning — and we’re here to democratize it. Welcome to the era of experience. 🌍 Links below: (1/n)

3 replies · 9 reposts · 55 likes · 6.4K views
Zhe Ye retweeted
Xiuyu Li
Xiuyu Li@xiuyu_l·
Sparsity can make your LoRA fine-tuning go brrr 💨 Announcing SparseLoRA (ICML 2025): up to 1.6-1.9x faster LLM fine-tuning (2.2x less FLOPs) via contextual sparsity, while maintaining performance on tasks like math, coding, chat, and ARC-AGI 🤯 🧵1/ z-lab.ai/projects/spars…
5 replies · 52 reposts · 204 likes · 35.8K views
Mikerah
Mikerah@badcryptobitch·
Everything is a graph
1 reply · 0 reposts · 0 likes · 250 views
Zhe Ye retweeted
Yiyou Sun
Yiyou Sun@YiyouSun·
🚨 New study on LLMs’ reasoning boundary! Can LLMs really think out of the box? We introduce OMEGA, a benchmark probing how they generalize:
🔹 RL boosts accuracy on slightly harder problems with familiar strategies,
🔹 but struggles with creative leaps & strategy composition. 👇
Nouha Dziri@nouhadziri

📢 Can LLMs really reason outside the box in math? Or are they just remixing familiar strategies? Remember how DeepSeek R1 and o1 impressed us on Olympiad-level math while also failing at simple arithmetic 😬 We built a benchmark to find out → OMEGA Ω 📐
💥 We found that although very powerful, RL struggles to compose skills and to innovate new strategies that were not seen during training. 👇
work w. @UCBerkeley @allen_ai
A thread on what we learned 🧵

2 replies · 10 reposts · 48 likes · 7.6K views
Zhe Ye retweeted
Dawn Song
Dawn Song@dawnsongtweets·
1/ 🔥 AI agents are reaching a breakthrough moment in cybersecurity. In our latest work:
🔓 CyberGym: AI agents discovered 15 zero-days in major open-source projects
💰 BountyBench: AI agents solved real-world bug bounty tasks worth tens of thousands of dollars
🤖 Autonomously.
A pivotal shift is underway — AI agents can now autonomously do what only elite human hackers could before.
Dawn Song tweet media
28 replies · 149 reposts · 543 likes · 136.6K views