Zerui Cheng
@ZeruiCheng

62 posts

Ph.D. candidate @Princeton / Tencent Hy / Prev ByteDance Seed / LLM Agent Eval&Data / web3, blockchain / ICPC Gold Medalist / Yao Class Alumni 23' @Tsinghua_Uni

Princeton, NJ · Joined November 2021
103 Following · 138 Followers
Pinned Tweet
Zerui Cheng@ZeruiCheng·
✈️ Landed in San Diego for #NeurIPS2025! ☀️
🔹 Dec 3: 2 Main Conf Posters (PeerBench & LiveCodeBench Pro)
🔹 Dec 4: Talk @ OpenAGI Symposium (📍Sparks Gallery)
🔹 Dec 6: Poster @ Lock-LLM Workshop (OML)
Feel free to drop by, say hi, and exchange insights on AI eval or blockchains!
Zerui Cheng retweeted
Tencent Hy@TencentHunyuan·
👋 Hi /haɪ/, we're the Tencent Hy /haɪ/ team 🐧 Today we open-source Hy3 preview (295B A21B), a leading reasoning and agent model at its size with great cost efficiency. Give us feedback to help us improve the official Hy3 release! 🤗 hf.co/tencent/Hy3-pr… 📖 hy.tencent.com/hy3-preview
Zerui Cheng retweeted
Kaiyuan Liu@KaiyuanLiu04·
Excited to share that I'll be attending ICLR 2026 in Rio de Janeiro 🇧🇷! I'll be presenting our work AutoCode at the April 23 morning poster session, and giving an oral presentation at the MALGAI Workshop on April 27!
Zerui Cheng retweeted
Wenhao Chai@wenhaocha1·
Introducing FrontierCS. LiveCodeBench Pro is already a challenging competitive-programming benchmark, so why push one step further? The motivation behind FrontierCS is actually pretty simple: we love measuring intelligence with problems that have a "single", "correct", "optimal" answer, but what really matters at the frontier in practice is often open-ended problems where the optimum is unknown, yet every step can be objectively scored and verified, as Terence Tao has done in the wild: mathstodon.xyz/@tao/115500681…. In our experiments, we kept running into a sobering pattern: simply scaling up reasoning compute doesn't close the gap. Models often settle for a locally feasible "it runs" solution, then stall on algorithmic and system choices that are still clearly bad. We still have a long way to go. Let's build Evolving Challenges for Evolving Intelligence!
Huanzhi Mao@HuanzhiMao

Pass/fail benchmarks are saturated. It’s time for FrontierCS. 🚀 150+ unsolved, verifiable problems ranging from competitive programming to real-world research. Designed by PhDs & ICPC experts to evolve model intelligence. 🎓🧠 🧵👇Check it out! Paper: arxiv.org/abs/2512.15699

Zerui Cheng retweeted
Huanzhi Mao@HuanzhiMao·
Pass/fail benchmarks are saturated. It’s time for FrontierCS. 🚀 150+ unsolved, verifiable problems ranging from competitive programming to real-world research. Designed by PhDs & ICPC experts to evolve model intelligence. 🎓🧠 🧵👇Check it out! Paper: arxiv.org/abs/2512.15699
Zerui Cheng@ZeruiCheng·
DMs are open. Feel free to reach out!
Zerui Cheng retweeted
Saining Xie@sainingxie·
gemini 3 is a super smart coder👩‍💻: it pushes competitive coding performance on LiveCodeBench Pro (livecodebenchpro.com) to the next level, over 200 points higher than gpt-5.1. big kudos to the thinking team behind it
Zerui Cheng retweeted
Wenhao Chai@wenhaocha1·
Gemini 3.0 is the top-performing next-generation frontier model on our LiveCodeBench Pro benchmark, outperforming GPT-5/5.1. We're very excited that Google has adopted our benchmark: a continuously updated collection of problems from Codeforces, ICPC, and IOI, designed specifically to minimize data contamination. Huge congratulations to the team! @GoogleDeepMind
Zerui Cheng retweeted
KITE AI@GoKiteAI·
Introducing the Kite Whitepaper: "From Human-Centric to Agent-Native: Building Trustless Payment Infrastructure for Agentic AI" 🔗 kite.foundation/whitepaper
TL;DR: Kite enables AI agents to autonomously transact at scale with cryptographic safety and native x402 compatibility—solving the infrastructure crisis imprisoning the agent economy today.
We sincerely appreciate our co-authors: @ZeruiCheng (Ph.D @Princeton), Chen Xi (Staff Software Engineer @Uber), Yi Huang (@cryptocom), Uddhav Marwaha (Payments API Lead @coinbase), and David Weber (Head of PayPal USD).
And our reviewers: @no89thkey (Founder @brevis_zk), @tengyanAI (Founder @cot_research), @nathan_sj_stem (Co-founder and CEO @Vishwa_xyz), @PCDispersion (Founder & Managing Partner @DispersionVC), @shumochu (Co-founder @MantaNetwork & @nebrazkp), @yq_acc (Founder @alt_layer), @JustinZhang (Co-founder & CEO @sparsity_xyz), @nake13 (Founder @ChainFeedsxyz), @RosuGrigore (Professor @UofIllinois), @yukez (Director and Distinguished Research Scientist @nvidia).
Let's break it down. 👇
Zerui Cheng retweeted
Zixuan Wang@zzZixuanWang·
I've long been fascinated by looped transformers from a theory perspective and wondered if they could actually work in practice. Turns out: YES! ✅ Thrilled to introduce our looped model Ouro, which matches much larger models across modern benchmarks!
Rui-Jie Zhu@RidgerZhu

Thrilled to release our new paper: "Scaling Latent Reasoning via Looped Language Models." TL;DR: We scale looped language models to 2.6 billion parameters, pretrained on >7 trillion tokens. The resulting model is on par with SOTA language models 2-3x its size.

Zerui Cheng retweeted
Jianzhu Yao@alexbert135·
🔥 Introducing our paper: Nondeterminism-Aware Optimistic Verification for Floating-Point Neural Networks 🔥 😈 Cloud/marketplace ML can silently downgrade or contaminate your results (model swap, early exit, quantization). You can't verify what really ran: GPUs are nondeterministic.
Zerui Cheng retweeted
Yuxi Li@yuxili99·
OML: A Primitive for Reconciling Open Access with Owner Control in AI Model Distribution, by Zerui Cheng, Princeton U. @ZeruiCheng youtu.be/z2_YYoREbKA
Zerui Cheng retweeted
Wenhao Chai@wenhaocha1·
LiveCodeBench Pro remains one of the most challenging code benchmarks, but its evaluation and verification process has been a black box. We introduce AutoCode, which democratizes evaluation by allowing anyone to run verification locally and perform RL training! For the first time, we also show that an LLM can act as a problem setter, transforming a simple problem into a harder version, sometimes even harder than what it can solve itself. In other words, LLMs can generate problems they can't yet solve, opening the door to true self-play. Moreover, through an agentic framework, we find that LLMs can automatically generate test cases, achieving 98.7% evaluation consistency, which is already practical accuracy for an RL verifier.
Zerui Cheng@ZeruiCheng·
Thank you to @yuxili99 and the DeAI Institute for the invitation. I will be giving an online talk on OML (arxiv.org/pdf/2411.03887) on Oct 15 at 9pm ET. If you're interested in our work, LLMs, AI safety, or Crypto+AI, I'd love to see you there and look forward to our discussion!
Zerui Cheng retweeted
Wenhao Chai@wenhaocha1·
LiveCodeBench Pro is accepted at #NeurIPS2025! Congratulations to the amazing team, and special thanks to project lead @ZihanZheng71803, who maintains and actively updates the benchmark. Here are the updates we made over the past three months; we added some amazing new models:
1. GPT-5: the first model to solve a HARD problem
2. GPT-oss: the best open-source choice (before Qwen-3-next)
3. Qwen-3-next-80B-A3B: very close to Qwen3-235B-A22B
4. kwaipilot-40B: a very early submission with good performance
We are still on the way. OpenAI and Google DeepMind achieved very good results on ICPC, which is amazing, but we have no idea how they ran and tested their models. In LiveCodeBench Pro, everything is transparent! Our next plan is to release more details, pipelines, and data to the public. You'll soon be able to test your model or train with a verifier for RL locally. Stay tuned!
Wenhao Chai@wenhaocha1

We introduce LiveCodeBench Pro. Models like o3-high, o4-mini, and Gemini 2.5 Pro score 0% on hard competitive programming problems.

Zerui Cheng retweeted
Wenhao Chai@wenhaocha1·
xAI just released Grok 4 Fast, a powerful model for competitive programming. Through our collaboration with xAI, we tested this amazing model on LiveCodeBench Pro. We found that Grok-4-Fast can compete with o4-mini, slightly outperform Gemini 2.5 Pro, and even solved a hard-level problem in the 2025 Q2 set! Grok-4-Fast-Non-Reasoning has become the strongest non-reasoning model, potentially rivaling gpt-oss-20b (which is a reasoning model). We are excited to see more powerful models achieving new breakthroughs on LiveCodeBench Pro, and we thank @xai for their support.
Zerui Cheng retweeted
Wenhao Chai@wenhaocha1·
GPT-5, think more. In our latest LiveCodeBench Pro tests for competitive programming, GPT-5 Thinking hit a true 0→1 moment on the 2025 Q1 set as the only model to crack the hard split, and this wasn't even GPT-5 Thinking Pro. Average response length exceeded 100,000 tokens, 3x longer than o3. Leaderboard: livecodebenchpro.com. All testing and infra credit goes to @ZihanZheng71803