Zerui Cheng
@ZeruiCheng

62 posts

Ph.D. candidate @Princeton / Tencent Hy / Prev ByteDance Seed / LLM Agent Eval&Data / web3, blockchain / ICPC Gold Medalist / Yao Class Alumni 23' @Tsinghua_Uni

Princeton, NJ · Joined November 2021
103 Following · 138 Followers
Pinned Tweet
Zerui Cheng@ZeruiCheng·
✈️ Landed in San Diego for #NeurIPS2025! ☀️
🔹 Dec 3: 2 Main Conf Posters (PeerBench & LiveCodeBench Pro)
🔹 Dec 4: Talk @ OpenAGI Symposium (📍Sparks Gallery)
🔹 Dec 6: Poster @ Lock-LLM Workshop (OML)
Feel free to drop by, say hi, and exchange insights on AI eval or blockchains!
Zerui Cheng retweeted
Tencent Hy@TencentHunyuan·
👋 Hi /haɪ/, we're the Tencent Hy /haɪ/ team 🐧 Today we open-source Hy3 preview (295B A21B), a leading reasoning and agent model at its size with great cost efficiency. Give us feedback to help us improve the official Hy3 release! 🤗 hf.co/tencent/Hy3-pr… 📖 hy.tencent.com/hy3-preview
Zerui Cheng retweeted
Kaiyuan Liu@KaiyuanLiu04·
Excited to share that I'll be attending ICLR 2026 in Rio de Janeiro 🇧🇷! I'll be presenting our work AutoCode at the April 23 morning poster session, and giving an oral presentation at the MALGAI Workshop on April 27!
Zerui Cheng retweeted
Wenhao Chai@wenhaocha1·
Introducing FrontierCS. LiveCodeBench Pro is already a challenging competitive-programming benchmark, so why push one step further? The motivation behind FrontierCS is actually pretty simple: we love measuring intelligence with problems that have a "single", "correct", "optimal" answer, but what really matters at the frontier in practice is often open-ended problems where the optimum is unknown, yet every step can be objectively scored and verified, as Terence Tao has done in the wild: mathstodon.xyz/@tao/115500681…. In our experiments, we kept running into a sobering pattern: simply scaling up reasoning compute doesn't close the gap. Models often settle for a locally feasible "it runs" solution, then stall on algorithmic and system choices that are still clearly bad. We still have a long way to go. Let's build Evolving Challenges for Evolving Intelligence!
Huanzhi Mao@HuanzhiMao

Pass/fail benchmarks are saturated. It’s time for FrontierCS. 🚀 150+ unsolved, verifiable problems ranging from competitive programming to real-world research. Designed by PhDs & ICPC experts to evolve model intelligence. 🎓🧠 🧵👇Check it out! Paper: arxiv.org/abs/2512.15699

Zerui Cheng retweeted
Huanzhi Mao@HuanzhiMao·
Pass/fail benchmarks are saturated. It’s time for FrontierCS. 🚀 150+ unsolved, verifiable problems ranging from competitive programming to real-world research. Designed by PhDs & ICPC experts to evolve model intelligence. 🎓🧠 🧵👇Check it out! Paper: arxiv.org/abs/2512.15699
Zerui Cheng@ZeruiCheng·
DMs are open. Feel free to reach out!
Zerui Cheng retweeted
Saining Xie@sainingxie·
gemini 3 is a super smart coder👩‍💻: it pushes competitive coding performance on LiveCodeBench Pro (livecodebenchpro.com) to the next level, over 200 points higher than gpt-5.1. big kudos to the thinking team behind it
Zerui Cheng retweeted
Wenhao Chai@wenhaocha1·
Gemini 3.0 is the top-performing next-generation frontier model on our LiveCodeBench Pro benchmark, outperforming GPT-5/5.1. We're very excited that Google has adopted our benchmark: a continuously updated collection of problems from Codeforces, ICPC, and IOI, designed specifically to minimize data contamination. Huge congratulations to the team! @GoogleDeepMind
Zerui Cheng retweeted
KITE AI@GoKiteAI·
Introducing the Kite Whitepaper: "From Human-Centric to Agent-Native: Building Trustless Payment Infrastructure for Agentic AI" 🔗 kite.foundation/whitepaper
TL;DR: Kite enables AI agents to autonomously transact at scale with cryptographic safety and native x402 compatibility—solving the infrastructure crisis imprisoning the agent economy today.
We sincerely appreciate our co-authors: @ZeruiCheng (Ph.D @Princeton), Chen Xi (Staff Software Engineer @Uber), Yi Huang (@cryptocom), Uddhav Marwaha (Payments API Lead @coinbase), and David Weber (Head of PayPal USD).
And our reviewers: @no89thkey (Founder @brevis_zk), @tengyanAI (Founder @cot_research), @nathan_sj_stem (Co-founder and CEO @Vishwa_xyz), @PCDispersion (Founder & Managing Partner @DispersionVC), @shumochu (Co-founder @MantaNetwork & @nebrazkp), @yq_acc (Founder @alt_layer), @JustinZhang (Co-founder & CEO @sparsity_xyz), @nake13 (Founder @ChainFeedsxyz), @RosuGrigore (Professor @UofIllinois), @yukez (Director and Distinguished Research Scientist @nvidia).
Let's break it down. 👇
Zerui Cheng retweeted
Zixuan Wang@zzZixuanWang·
I've long been fascinated by looped transformers from a theory perspective and wondered if they could actually work in practice. Turns out: YES! ✅ Thrilled to introduce our looped model Ouro, which matches much larger models across modern benchmarks!
Rui-Jie Zhu@RidgerZhu

Thrilled to release our new paper: "Scaling Latent Reasoning via Looped Language Models." TL;DR: We scale looped language models to 2.6 billion parameters, pretrained on >7 trillion tokens. The resulting model is on par with SOTA language models 2-3x its size.

Zerui Cheng retweeted
Jianzhu Yao@alexbert135·
🔥 Introducing our paper: Nondeterminism-Aware Optimistic Verification for Floating-Point Neural Networks 🔥 😈 Cloud/marketplace ML can silently downgrade or contaminate your results (model swap, early exit, quantization). You can't verify what really ran: GPUs are nondeterministic.
Zerui Cheng retweeted
Yuxi Li@yuxili99·
OML: A Primitive for Reconciling Open Access with Owner Control in AI Model Distribution, by Zerui Cheng, Princeton U. @ZeruiCheng youtu.be/z2_YYoREbKA
Zerui Cheng retweeted
Wenhao Chai@wenhaocha1·
LiveCodeBench Pro remains one of the most challenging code benchmarks, but its evaluation and verification process has been a black box. We introduce AutoCode, which democratizes evaluation by allowing anyone to run verification locally and perform RL training! For the first time, we also show that an LLM can act as a problem setter, transforming a simple problem into a harder version, sometimes even harder than what it can solve itself. In other words, LLMs can generate problems they can't yet solve, opening the door to true self-play. Moreover, through an agentic framework, we find that LLMs can automatically generate test cases, achieving 98.7% evaluation consistency, which is already practical accuracy for an RL verifier.
Zerui Cheng@ZeruiCheng·
Thank you to @yuxili99 and the DeAI Institute for the invitation. I will be giving an online talk on OML (arxiv.org/pdf/2411.03887) on Oct 15 at 9pm ET. If you're interested in our work, LLMs, AI safety, or Crypto+AI, I'd love to see you there and look forward to our discussion!
Zerui Cheng retweeted
Wenhao Chai@wenhaocha1·
LiveCodeBench Pro is accepted at #NeurIPS2025! Congratulations to the amazing team, and special thanks to project lead @ZihanZheng71803, who maintains and actively updates the benchmark. Here are the updates we made over the past three months; we added some amazing new models:
1. GPT-5: the first model to solve a HARD problem
2. GPT-oss: the best open-source choice (before Qwen-3-next)
3. Qwen-3-next-80B-A3B: very close to Qwen3-235B-A22B
4. kwaipilot-40B: a very early submission with good performance
We are still on the way. OpenAI and Google DeepMind achieved very good results on ICPC, which is amazing, but we have no idea how they ran and tested their models. In LiveCodeBench Pro, everything is transparent! Our next plan is to release more details, pipelines, and data to the public. You'll soon be able to test your model or train with a verifier for RL locally. Stay tuned!
Wenhao Chai@wenhaocha1

We introduce LiveCodeBench Pro. Models like o3-high, o4-mini, and Gemini 2.5 Pro score 0% on hard competitive programming problems.

Zerui Cheng retweeted
Wenhao Chai@wenhaocha1·
xAI just released Grok 4 Fast, a powerful model for competitive programming. Through our collaboration with xAI, we tested this amazing model on LiveCodeBench Pro. We found that Grok-4-Fast can compete with o4-mini, slightly outperform Gemini 2.5 Pro, and even solved a hard-level problem in the 2025 Q2 set! Grok-4-Fast-Non-Reasoning has become the strongest non-reasoning model, potentially rivaling gpt-oss-20b (which is a reasoning model). We are excited to see more powerful models achieving new breakthroughs on LiveCodeBench Pro, and we thank @xai for their support.
Zerui Cheng retweeted
Wenhao Chai@wenhaocha1·
GPT-5, think more. In our latest LiveCodeBench Pro tests for competitive programming, GPT-5 Thinking hit a true 0→1 moment on the 2025 Q1 set as the only model to crack the hard split, and this wasn't even GPT-5 Thinking Pro. Average response length exceeded 100,000 tokens, 3x longer than o3. Leaderboard: livecodebenchpro.com. All testing and infra credit goes to @ZihanZheng71803