Haoyu Zhao
@thomaszhao1998

19 posts

PhD student @Princeton, Research Intern @MSFTResearch. Recently interested in theorem proving.

Joined June 2015
54 Following · 79 Followers
Haoyu Zhao retweeted
Ziran Yang @__zrrr__ ·
Excited to share our paper AlgoVeri just got an #ICML Spotlight (Top 2.2%) 🙌 A benchmark for verified code generation: the LLM writes code together with a formal proof of correctness. We aligned classical algorithm problems across multiple proof languages, so we can directly compare LLM performance and behavior across verification paradigms. The result: different paradigms impose different ceilings and different failure modes, with a real interplay between the proof paradigm and which LLM scaffolding actually helps.
4 replies · 8 reposts · 50 likes · 2.3K views
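For readers unfamiliar with the task format, here is a minimal Lean 4 sketch of what "verified code generation" means in practice. This is my own illustration, not an actual AlgoVeri problem (the benchmark also poses each problem in several proof languages):

```lean
import Mathlib

-- Illustrative only: the model must emit both the program and a
-- machine-checkable proof that it meets the specification.
def sumTo : Nat → Nat
  | 0 => 0
  | n + 1 => (n + 1) + sumTo n

-- Specification: closed form for 0 + 1 + ... + n.
theorem sumTo_spec (n : Nat) : 2 * sumTo n = n * (n + 1) := by
  induction n with
  | zero => rfl
  | succ k ih =>
    have h : sumTo (k + 1) = (k + 1) + sumTo k := rfl
    rw [h, Nat.mul_add, ih]; ring
```

A solution "passes" only if Lean's kernel accepts the proof, a far stricter bar than passing unit tests.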
Haoyu Zhao retweeted
Yun Cheng @chengyun01 ·
Humans anchor on the first piece of information they receive. Do reasoning models escape this bias? We uncover Contextual Drag: errors in context bias subsequent reasoning toward similar mistakes. It persists even when the model has already recognized the error in its reasoning.
1 reply · 10 reposts · 62 likes · 16.2K views
Haoyu Zhao retweeted
Ziran Yang @__zrrr__ ·
Introducing Goedel-Code-Prover 🌲 LLMs write code, but can they prove it correct? Not just pass tests, but construct machine-checkable proofs that a program works for ALL possible inputs. We built a system that does exactly this. Given a program and its specification in Lean 4, Goedel-Code-Prover automatically synthesizes formal proofs of correctness. Our 8B model achieves a 62% overall success rate across three benchmarks (Verina, Clever & AlgoVeri), a 2.6x improvement over the strongest baseline, surpassing both frontier LLMs (GPT/Gemini/Claude) and open-source theorem provers up to 84x larger (DeepSeek-Prover/Goedel-Prover/Kimina-Prover/BFS-Prover).
20 replies · 76 reposts · 553 likes · 69.5K views
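A toy instance of the input/output shape such a system handles might look as follows. The sketch is mine, not drawn from Verina, Clever, or AlgoVeri, but it shows the Lean 4 program-plus-specification setup the tweet describes:

```lean
-- Input: a program together with its Lean 4 specification.
def myMax (a b : Nat) : Nat :=
  if a ≤ b then b else a

-- Output: a machine-checkable proof that the program meets the
-- spec for ALL inputs, not just those a test suite exercises.
theorem myMax_spec (a b : Nat) :
    a ≤ myMax a b ∧ b ≤ myMax a b ∧
    (myMax a b = a ∨ myMax a b = b) := by
  unfold myMax
  split <;> omega
```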
Haoyu Zhao @thomaszhao1998 ·
Very proud to be a member of the Goedel team and contribute to our prover!
Yong Lin@Yong18850571

(1/4) 🚨 Introducing Goedel-Prover V2 🚨
🔥🔥🔥 The strongest open-source theorem prover to date.
🥇 #1 on PutnamBench: solves 64 problems, with far less compute.
🧠 New SOTA on MiniF2F:
* 32B model hits 90.4% at Pass@32, beating DeepSeek-Prover-V2-671B's 82.4%.
* 8B > 671B: our 8B model matches DeepSeek-671B on MiniF2F.
📚 Leading on MathOlympiadBench (IMO-level problems): solves 73 vs. 50 for the 671B DeepSeek Prover.
🔓 Website: blog.goedel-prover.com
🔓 Model 32B: huggingface.co/Goedel-LM/Goed…
🔓 Model 8B: huggingface.co/Goedel-LM/Goed…
🔓 Data and training pipeline will be released soon.
Amazing collaborators: @sangertang1999 @Lyubh22 @__zrrr__ @juihuichung @thomaszhao1998 @pero733858111 @thiiis_user @EmilyJge @JingruoS5931 @wujiayun12 @GesiJiri68334 @davidjesusacu @KaiyuYang4 @hongzhou__lin @YejinChoinka @danqi_chen @prfsanjeevarora @chijinML

0 replies · 0 reposts · 3 likes · 295 views
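For context, benchmarks like MiniF2F and PutnamBench consist of formal math statements that the prover must close. A made-up toy target (far easier than the real benchmark problems) looks like this in Lean 4 with Mathlib:

```lean
import Mathlib

-- Toy target for illustration: the prover is handed the statement
-- and must generate the tactic proof after `:= by`.
theorem toy_target (x : ℝ) (h : 1 < x) : 1 < x ^ 2 := by
  nlinarith
```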
Haoyu Zhao retweeted
Ori Press @ori_press ·
Do language models have algorithmic creativity? To find out, we built AlgoTune, a benchmark challenging agents to optimize 100+ algorithms like gzip compression, AES encryption and PCA. Frontier models struggle, finding only surface-level wins. Lots of headroom here!🧵⬇️
6 replies · 59 reposts · 163 likes · 25.1K views
Haoyu Zhao @thomaszhao1998 ·
This isn't about raw difficulty. It's about the model’s inability to reuse known reasoning in slightly new contexts. Ineq-Comp shows what MiniF2F and other benchmarks might overlook: formal provers remain surprisingly brittle.
1 reply · 0 reposts · 2 likes · 201 views
Haoyu Zhao @thomaszhao1998 ·
🚨 Easy math, epic fail! 🚨 Our new benchmark, Ineq-Comp, gives formal theorem provers Lean inequalities... then makes tiny tweaks (duplicating variables, squaring terms) that humans handle easily. Most provers collapse. Simple composition is still surprisingly hard!
2 replies · 14 reposts · 17 likes · 3.3K views
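To make the "tiny tweak" concrete, here is a hypothetical pair in the spirit of the benchmark (my own illustration, not an actual Ineq-Comp problem): the variant merely states two independent copies of the base inequality, yet provers that solve the first routinely fail the second.

```lean
import Mathlib

-- Base inequality: ab ≤ (a² + b²) / 2, closed by one hint.
theorem base (a b : ℝ) : a * b ≤ (a ^ 2 + b ^ 2) / 2 := by
  nlinarith [sq_nonneg (a - b)]

-- Duplicated-variable variant: the same fact twice, over fresh
-- variables. Trivial for humans; provers often collapse here.
theorem duplicated (a b c d : ℝ) :
    a * b + c * d ≤ (a ^ 2 + b ^ 2) / 2 + (c ^ 2 + d ^ 2) / 2 := by
  nlinarith [sq_nonneg (a - b), sq_nonneg (c - d)]
```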
Haoyu Zhao retweeted
Sanjeev Arora @prfsanjeevarora ·
@QuantaMagazine featured our work on the emergence of skill compositionality (and its limitations) in LLMs among the CS breakthroughs of the year. tinyurl.com/5f5jvzy5. Work was done over 2023 @GoogleDeepMind and @PrincetonPLI. Key pieces:
(i) a mathematical framework for quantifying how LLM scaling leads to a predictable increase in the model's ability to combine skills while solving new tasks (joint work with @anirudhg9119);
(ii) experiments verifying the theoretical predictions via the SkillMix evaluation (lead author @dingliy_yu);
(iii) the level of skill compositionality detected in GPT4O via Sept '23 experiments mathematically implies that it is able to reason and talk about situations it has not seen in its training data, i.e., it has moved beyond the "stochastic parrots" stereotype that had dogged earlier LLMs.
Skill emergence paper: arxiv.org/abs/2307.15936
SkillMix evaluation: arxiv.org/abs/2310.17567
Models can improve skill composition from examples: arxiv.org/abs/2409.19808
Wonderful to work with the colleagues and students involved.
1 reply · 8 reposts · 36 likes · 2.9K views
Haoyu Zhao retweeted
Kaifeng Lyu @vfleaking ·
Fine-tuning can improve chatbots (e.g., Llama 2-Chat, GPT-3.5) on downstream tasks — but may unintentionally break their safety alignment. Our new paper: Adding a safety prompt is enough to largely mitigate the issue, but be cautious about when to add it! arxiv.org/abs/2402.18540
4 replies · 17 reposts · 70 likes · 25K views
Haoyu Zhao @thomaszhao1998 ·
Surprisingly, the models' embeddings contain more than just parse tree info: probing the spans shows a high correlation with each span's marginal probability as computed by the inside-outside (IO) algorithm! Even for spans of length 10 the correlation is above 0.75, and for spans of length 2 it is above 0.9!
0 replies · 0 reposts · 2 likes · 156 views
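For reference, the span marginal the probes correlate with comes from the standard inside-outside quantities of a PCFG; a sketch in notation of my choosing (not from the thread):

```latex
% Inside: probability that nonterminal A derives the span w_i ... w_j
\alpha(A, i, j) = P\big(A \Rightarrow^{*} w_i \cdots w_j\big)
% Outside: probability of the sentence context around that span
\beta(A, i, j) = P\big(S \Rightarrow^{*} w_1 \cdots w_{i-1}\, A\, w_{j+1} \cdots w_n\big)
% Marginal probability that span (i, j) is a constituent:
\mu(i, j) = \frac{\sum_A \alpha(A, i, j)\,\beta(A, i, j)}{P(w_1 \cdots w_n)}
```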
Haoyu Zhao @thomaszhao1998 ·
We train RoBERTa on PCFGs tailored to English and run probing experiments. The probes reach ~70% parsing accuracy, while the baselines get <40% F1. This strongly suggests that the pre-trained models contain parse tree info! (A12L12 denotes a model with 12 layers and 12 attention heads.)
1 reply · 0 reposts · 2 likes · 283 views
Haoyu Zhao @thomaszhao1998 ·
🔥EMNLP paper🔥 Transformers have the ability to parse. But how do transformers encode this info, and what's the connection with pre-training? We shed light on this using PCFGs and low-rank compression! Joint work w/ @Abhishek_034, Rong Ge, @prfsanjeevarora Paper: arxiv.org/abs/2303.08117
1 reply · 3 reposts · 26 likes · 19K views