Haoyu Zhao
@thomaszhao1998

19 posts

PhD student @Princeton, Research Intern @MSFTResearch. Recently interested in theorem proving.

Joined June 2015
54 Following · 79 Followers
Haoyu Zhao retweeted
Ziran Yang @__zrrr__ ·
Excited to share our paper AlgoVeri just got an #ICML Spotlight (Top 2.2%) 🙌 A benchmark for verified code generation: the LLM writes code together with a formal proof of correctness. We aligned classical algorithm problems across multiple proof languages, so we can directly compare LLM performance and behavior across verification paradigms. The result: different paradigms impose different ceilings and different failure modes, with a real interplay between the proof paradigm and which LLM scaffolding actually helps.
4 replies · 8 reposts · 50 likes · 2.3K views
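For readers unfamiliar with the task format, here is a minimal Lean 4 sketch of what "verified code generation" means in practice. This is my own illustration, not an actual AlgoVeri problem (the benchmark also poses each problem in several proof languages):

```lean
import Mathlib

-- Illustrative only: the model must emit both the program and a
-- machine-checkable proof that it meets the specification.
def sumTo : Nat → Nat
  | 0 => 0
  | n + 1 => (n + 1) + sumTo n

-- Specification: closed form for 0 + 1 + ... + n.
theorem sumTo_spec (n : Nat) : 2 * sumTo n = n * (n + 1) := by
  induction n with
  | zero => rfl
  | succ k ih =>
    have h : sumTo (k + 1) = (k + 1) + sumTo k := rfl
    rw [h, Nat.mul_add, ih]; ring
```

A solution "passes" only if Lean's kernel accepts the proof, a far stricter bar than passing unit tests.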
Haoyu Zhao retweeted
Yun Cheng @chengyun01 ·
Humans anchor on the first piece of information they receive. Do reasoning models escape this bias? We uncover Contextual Drag: errors in context bias subsequent reasoning toward similar mistakes. It persists even when the model has already recognized the error in its reasoning.
1 reply · 10 reposts · 62 likes · 16.2K views
Haoyu Zhao retweeted
Ziran Yang @__zrrr__ ·
Introducing Goedel-Code-Prover 🌲 LLMs write code, but can they prove it correct? Not just pass tests, but construct machine-checkable proofs that a program works for ALL possible inputs. We built a system that does exactly this. Given a program and its specification in Lean 4, Goedel-Code-Prover automatically synthesizes formal proofs of correctness. Our 8B model achieves a 62% overall success rate across three benchmarks (Verina, Clever & AlgoVeri), a 2.6x improvement over the strongest baseline, surpassing both frontier LLMs (GPT/Gemini/Claude) and open-source theorem provers up to 84x larger (DeepSeek-Prover/Goedel-Prover/Kimina-Prover/BFS-Prover).
20 replies · 76 reposts · 553 likes · 69.5K views
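A toy instance of the input/output shape such a system handles might look as follows. The sketch is mine, not drawn from Verina, Clever, or AlgoVeri, but it shows the Lean 4 program-plus-specification setup the tweet describes:

```lean
-- Input: a program together with its Lean 4 specification.
def myMax (a b : Nat) : Nat :=
  if a ≤ b then b else a

-- Output: a machine-checkable proof that the program meets the
-- spec for ALL inputs, not just those a test suite exercises.
theorem myMax_spec (a b : Nat) :
    a ≤ myMax a b ∧ b ≤ myMax a b ∧
    (myMax a b = a ∨ myMax a b = b) := by
  unfold myMax
  split <;> omega
```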
Haoyu Zhao @thomaszhao1998 ·
Very proud to be a member of the Goedel team and contribute to our prover!
Yong Lin@Yong18850571

(1/4) 🚨 Introducing Goedel-Prover V2 🚨
🔥🔥🔥 The strongest open-source theorem prover to date.
🥇 #1 on PutnamBench: solves 64 problems, with far less compute.
🧠 New SOTA on MiniF2F:
* 32B model hits 90.4% at Pass@32, beating DeepSeek-Prover-V2-671B's 82.4%.
* 8B > 671B: our 8B model matches DeepSeek-671B on MiniF2F.
📚 Leading on MathOlympiadBench (IMO-level problems): solves 73 vs. 50 for the 671B DeepSeek Prover.
🔓 Website: blog.goedel-prover.com
🔓 Model 32B: huggingface.co/Goedel-LM/Goed…
🔓 Model 8B: huggingface.co/Goedel-LM/Goed…
🔓 Data and training pipeline will be released soon.
Amazing collaborators: @sangertang1999 @Lyubh22 @__zrrr__ @juihuichung @thomaszhao1998 @pero733858111 @thiiis_user @EmilyJge @JingruoS5931 @wujiayun12 @GesiJiri68334 @davidjesusacu @KaiyuYang4 @hongzhou__lin @YejinChoinka @danqi_chen @prfsanjeevarora @chijinML

0 replies · 0 reposts · 3 likes · 295 views
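For context, benchmarks like MiniF2F and PutnamBench consist of formal math statements that the prover must close. A made-up toy target (far easier than the real benchmark problems) looks like this in Lean 4 with Mathlib:

```lean
import Mathlib

-- Toy target for illustration: the prover is handed the statement
-- and must generate the tactic proof after `:= by`.
theorem toy_target (x : ℝ) (h : 1 < x) : 1 < x ^ 2 := by
  nlinarith
```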
Haoyu Zhao retweeted
Ori Press @ori_press ·
Do language models have algorithmic creativity? To find out, we built AlgoTune, a benchmark challenging agents to optimize 100+ algorithms like gzip compression, AES encryption and PCA. Frontier models struggle, finding only surface-level wins. Lots of headroom here!🧵⬇️
6 replies · 59 reposts · 163 likes · 25.1K views
Haoyu Zhao @thomaszhao1998 ·
This isn't about raw difficulty. It's about the model’s inability to reuse known reasoning in slightly new contexts. Ineq-Comp shows what MiniF2F and other benchmarks might overlook: formal provers remain surprisingly brittle.
1 reply · 0 reposts · 2 likes · 201 views
Haoyu Zhao @thomaszhao1998 ·
🚨 Easy math, epic fail! 🚨 Our new benchmark, Ineq-Comp, gives formal theorem provers Lean inequalities... then makes tiny tweaks (duplicating variables, squaring terms) that humans handle easily. Most provers collapse. Simple composition is still surprisingly hard!
2 replies · 14 reposts · 17 likes · 3.3K views
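To make the "tiny tweak" concrete, here is a hypothetical pair in the spirit of the benchmark (my own illustration, not an actual Ineq-Comp problem): the variant merely states two independent copies of the base inequality, yet provers that solve the first routinely fail the second.

```lean
import Mathlib

-- Base inequality: ab ≤ (a² + b²) / 2, closed by one hint.
theorem base (a b : ℝ) : a * b ≤ (a ^ 2 + b ^ 2) / 2 := by
  nlinarith [sq_nonneg (a - b)]

-- Duplicated-variable variant: the same fact twice, over fresh
-- variables. Trivial for humans; provers often collapse here.
theorem duplicated (a b c d : ℝ) :
    a * b + c * d ≤ (a ^ 2 + b ^ 2) / 2 + (c ^ 2 + d ^ 2) / 2 := by
  nlinarith [sq_nonneg (a - b), sq_nonneg (c - d)]
```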
Haoyu Zhao retweeted
Sanjeev Arora @prfsanjeevarora ·
@QuantaMagazine featured our work on the emergence of skill compositionality (and its limitations) in LLMs among the CS breakthroughs of the year. tinyurl.com/5f5jvzy5. Work was done over 2023 @GoogleDeepMind and @PrincetonPLI. Key pieces:
(i) a mathematical framework for quantifying how LLM scaling leads to a predictable increase in the model's ability to combine skills while solving new tasks (joint work with @anirudhg9119);
(ii) experiments verifying the theoretical predictions via the SkillMix evaluation (lead author @dingliy_yu);
(iii) the level of skill compositionality detected in GPT4O via Sept '23 experiments mathematically implies that it is able to reason and talk about situations it has not seen in its training data, i.e., it has moved beyond the "stochastic parrots" stereotype that had dogged earlier LLMs.
Skill emergence paper: arxiv.org/abs/2307.15936
SkillMix evaluation: arxiv.org/abs/2310.17567
Models can improve skill composition from examples: arxiv.org/abs/2409.19808
Wonderful to work with the colleagues and students involved.
1 reply · 8 reposts · 36 likes · 2.9K views
Haoyu Zhao retweeted
Kaifeng Lyu @vfleaking ·
Fine-tuning can improve chatbots (e.g., Llama 2-Chat, GPT-3.5) on downstream tasks — but may unintentionally break their safety alignment. Our new paper: Adding a safety prompt is enough to largely mitigate the issue, but be cautious about when to add it! arxiv.org/abs/2402.18540
4 replies · 17 reposts · 70 likes · 25K views
Haoyu Zhao @thomaszhao1998 ·
Surprisingly, the models' embeddings contain more than just parse tree info: probing the spans shows a high correlation with each span's marginal probability as computed by the inside-outside (IO) algorithm! Even for spans of length 10 the correlation is above 0.75, and for spans of length 2 it is above 0.9!
0 replies · 0 reposts · 2 likes · 156 views
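For reference, the span marginal the probes correlate with comes from the standard inside-outside quantities of a PCFG; a sketch in notation of my choosing (not from the thread):

```latex
% Inside: probability that nonterminal A derives the span w_i ... w_j
\alpha(A, i, j) = P\big(A \Rightarrow^{*} w_i \cdots w_j\big)
% Outside: probability of the sentence context around that span
\beta(A, i, j) = P\big(S \Rightarrow^{*} w_1 \cdots w_{i-1}\, A\, w_{j+1} \cdots w_n\big)
% Marginal probability that span (i, j) is a constituent:
\mu(i, j) = \frac{\sum_A \alpha(A, i, j)\,\beta(A, i, j)}{P(w_1 \cdots w_n)}
```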
Haoyu Zhao @thomaszhao1998 ·
We train RoBERTa on PCFGs tailored to English and run probing experiments. The probes reach ~70% parsing accuracy, while the baselines get <40% F1. This strongly suggests that the pre-trained models contain parse tree info! (A12L12 denotes a model with 12 layers and 12 attention heads.)
1 reply · 0 reposts · 2 likes · 283 views
Haoyu Zhao @thomaszhao1998 ·
🔥EMNLP paper🔥 Transformers have the ability to parse. But how do transformers encode this info, and what's the connection with pre-training? We shed light on this using PCFGs and low-rank compression! Joint work w/ @Abhishek_034, Rong Ge, @prfsanjeevarora Paper: arxiv.org/abs/2303.08117
1 reply · 3 reposts · 26 likes · 19K views