Zhoujun (Jorge) Cheng

400 posts

@ChengZhoujun

Ph.D. @UCSanDiego | Scaling RL and agents

San Diego · Joined November 2021
684 Following · 1.3K Followers
Pinned Tweet
Zhoujun (Jorge) Cheng@ChengZhoujun·
Pretraining has scaling laws to guide compute allocation. But for RL on LLMs, we lack a practical guide on how to spend compute wisely. We show the optimal compute allocation in LLM RL scales predictably. ↓ Key takeaways below
18 replies · 99 reposts · 443 likes · 68.9K views
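For reference, scaling laws of the kind described in the pinned tweet are typically fit as power laws via linear regression in log-log space. A generic sketch for illustration only (the function name and the fitting procedure are assumptions, not the paper's actual methodology):

```python
import math

def fit_power_law(xs, ys):
    """Fit y = a * x^b by least squares on (log x, log y).

    Assumes all xs and ys are positive and xs are not all equal.
    Returns the coefficients (a, b).
    """
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(xs)
    mx = sum(lx) / n
    my = sum(ly) / n
    # Slope of the log-log regression line is the exponent b.
    b = sum((u - mx) * (v - my) for u, v in zip(lx, ly)) / \
        sum((u - mx) ** 2 for u in lx)
    log_a = my - b * mx
    return math.exp(log_a), b
```

With a fit like this in hand, one can extrapolate how performance should scale with compute and allocate budget accordingly, which is the kind of guidance the thread says pretraining has and RL has lacked.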
Zhoujun (Jorge) Cheng reposted
Shibo Hao@Ber18791531·
🍫 CocoaBench v1.0 is out! CocoaBench is a benchmark for unified digital agents, built around open-world tasks that require composing 💻 coding, 👀 vision, 🌐 search.
Since our first research preview last December, we have expanded the benchmark substantially with community-contributed tasks, and spent months testing and refining the tasks, evaluations, and agent runs.
Some takeaways:
• Even the best agent system reaches only 45.1% on CocoaBench v1.0.
• Coding agents like Codex are already surprisingly strong on general tasks beyond software engineering.
• Stronger agents tend to push more of the work into code.
• Open-source models still lag behind leading frontier models on these general tasks.
👇 More on the website and in the paper
#AI #Agents #LLM #Benchmark #CocoaBench
Shibo Hao@Ber18791531

🍫 CocoaBench is calling for contributions from the community! Join us and help shape how next-generation agents are evaluated and built🚀✨ #LLM #AI #Agent #CocoaBench More details in the threads 👇

2 replies · 34 reposts · 76 likes · 8.9K views
Zhoujun (Jorge) Cheng reposted
Zora Wang@ZhiruoW·
Excited to announce the new edition of the DL4C workshop, "Towards Human-Centered Coding Agents"! 🎉 Submit your work on building coding agents that are aligned, verifiable, steerable, and adaptable in their interaction with humans. Explore potential research directions in our position paper: zorazrw.github.io/files/position… Stay tuned! dl4c.github.io
Zijian Wang@zijianwang30

Excited to share that the 5th @DL4Code workshop is coming to #ICML2026🇰🇷. Grateful to everyone who helped make this such an energizing space for the field!
Also, a belated update on our position paper "Humans are Missing from AI Coding Agent Research." What I'm especially proud of in the paper is the core argument: the next frontier for coding agents is not just more autonomy, but better human collaboration. As agents get stronger, the real bottleneck is increasingly whether people can align with them, steer them, verify their outputs, and trust them in real workflows. We turn that observation into a concrete research agenda for building coding agents that are genuinely more useful.
The seed for this paper came from an early conversation with the awesome @KLieret at #NeurIPS. I later discussed the idea with @jyangballin, which led to further conversations with @ZhiruoW @Diyi_Yang, and the final paper was led by these folks together with a fantastic group of co-authors. Parts of the argument were also shaped by wonderful discussions at the #DL4C workshops at #ICLR and #NeurIPS last year.
Check it out! zorazrw.github.io/files/position…

1 reply · 4 reposts · 44 likes · 8.8K views
Zhoujun (Jorge) Cheng@ChengZhoujun·
@Muennighoff This is cool! We had similar findings: pass@k and pass@1 co-improve in the early RL stage, and performance sharpens to pass@1 while pass@k drops in the late stage. (Figure from "Revisiting Reinforcement Learning for LLM Reasoning from A Cross-Domain Perspective".)
0 replies · 0 reposts · 2 likes · 101 views
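The pass@k metric discussed above is conventionally computed with the standard unbiased estimator (generate n samples, count c correct, estimate the probability that at least one of k draws is correct); a minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator.

    n: total samples generated per problem
    c: number of correct samples
    k: budget being evaluated
    """
    # If fewer than k samples are incorrect, every k-subset
    # contains at least one correct sample.
    if n - c < k:
        return 1.0
    # 1 - P(all k drawn samples are incorrect)
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Averaging this quantity over problems gives the curves whose early co-improvement and late divergence the tweet describes.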
Zhoujun (Jorge) Cheng@ChengZhoujun·
@srush_nlp Thanks for sharing these nuances publicly—the community is hungry for insights on agentic RL. May I ask how the short SFT (after CPT, before RL) affects the RL stage's performance or dynamics? It is hard to tune the SFT data coverage and epochs.
0 replies · 0 reposts · 2 likes · 117 views
Zhoujun (Jorge) Cheng reposted
General Reasoning@GenReasoning·
Introducing OpenReward.
🌍 330+ RL environments through one API
⚡ Autoscaled sandbox compute
🍒 4.5M+ unique RL tasks
🚂 Works like magic with Tinker, Miles, Slime
Link and thread below.
25 replies · 192 reposts · 1.3K likes · 239.3K views
Zhoujun (Jorge) Cheng reposted
Seungwook Han@seungwookh·
Can language models learn useful priors without ever seeing language? We pre-pre-train transformers on neural cellular automata — fully synthetic, zero language. This improves language modeling by up to 6%, speeds up convergence by 40%, and strengthens downstream reasoning. Surprisingly, it even beats pre-pre-training on natural text! Blog: hanseungwook.github.io/blog/nca-pre-p… (1/n)
47 replies · 262 reposts · 1.7K likes · 247.6K views
Zhoujun (Jorge) Cheng reposted
Lei Li@_TobiasLee·
Agents are doing real work, but existing benchmarks still test them in isolation. Today we're releasing Claw-Eval 🦞: an open-source, transparent evaluation framework for AI agents.
We feature 104 tasks spanning daily assistants, Office QA, deep finance research, and terminal usage. We test completion, robustness, and safety across real and mock services with configurable error injection. Fully traceable and human-verified.
First leaderboard results: Claude Opus 4.6 @AnthropicAI tops pass rate (68.3%), but Gemini 3.1 Pro @GeminiApp edges it on avg score (0.764 vs 0.759). Agents have a long way to go. 🤨
Check it out: claw-eval.github.io @steipete @openclaw
10 replies · 27 reposts · 155 likes · 41K views
Zhoujun (Jorge) Cheng reposted
Zora Wang@ZhiruoW·
AI agents are tackling more and more "human work." But are they benchmarked on the work people actually do?
tl;dr: Not really. Most benchmarks focus on math & coding, while most human labor and capital lie elsewhere.
📒 We built a database linking agent benchmarks & real-world work
Submit new tasks + agent trajectories today 🧵
21 replies · 79 reposts · 400 likes · 60.9K views
Zhoujun (Jorge) Cheng reposted
Lianhui Qin@Lianhuiq·
🤖Coding agents like Claude Code are already game changers for digital tasks in 2026. But what if they could write code to build physical worlds? 🏙️ Imagine going from a single line of prompt → a controllable, interactive simulated world. Such environments could open new frontiers for game creation, RL training, large-scale world simulation, and studying complex social reasoning. Our SimWorld agent coding team is working toward releasing a platform that lets anyone build their own virtual worlds. Stay tuned.
Murray Kang@haoqik322

What if coding agents could build entire virtual worlds? 🌍🏙️ SimWorld makes it possible — enabling agents like Claude Code 🤖 to generate and interact with scenes directly inside an Unreal Engine simulation 🎮 World simulation for embodied agents just became much easier and more accessible 🚀 Stay tuned — more models and capabilities coming soon ⚡️

1 reply · 14 reposts · 48 likes · 10.3K views
Zhoujun (Jorge) Cheng reposted
Toby Pohlen@TobyPhln·
At 1:30 a.m. PT on November 3, 2023, Elon sent a message to the xAI group chat saying that we need to go "extremely hardcore" for the next 36 hours; Grok will be released publicly tomorrow. You didn't have to be in the exclusive company chat to get the message; it was also posted publicly at the same time: x.com/i/status/17203…
What unfolded over the next day and a half was one of the best examples of engineering at pace that I've ever seen. All we had when we started was a somewhat fine-tuned base model and a half-baked UI. Our team of ten split up the tasks: curate data, improve the model, implement the raw prompting and RAG service, build the production infra. I took care of the latter.
At 8:51 p.m. PT the next day, we announced Grok to the world with a long-form post on X (x.com/xai/status/172…). Over the past 36 hours, we came up with Fun mode (including Grok's sunglasses), finished the whole production system, and most importantly tuned the RAG system that gave it real-time knowledge of the world through the X platform (a first in the industry). A day and a half of straight coding and shipping; no drugs, not even caffeine, just pure adrenaline. Elon gave us a mission and we delivered.
The launch went very well. We invited a couple hundred X creators and Grok's ability to roast accounts went viral. It was the first time a publicly accessible AI was allowed to poke fun at people. This episode is a prime example of what you can achieve by going extremely hardcore: you move and deliver results faster than any outsider could have anticipated. Within 36 hours, we took the company from silence to relevance. It was well worth it.
xAI's hardcore culture is infamous on X. I love the tent meme that suggests we all sleep (well, slept in my case) in the office in tents. Our reputation precedes us and even new joiners hit the ground grinding hard. However, unless you understand the "why," you are at risk of simply replicating the "how" without achieving the same results.
You need to grind with purpose, and the purpose is to move fast towards a known goal. When the goal and the means of reaching it are crystal clear, a small, skilled, and highly motivated team can outcompete companies old and new, big and small. Never grind to show off; never work late to be seen; never sacrifice without cause. There is no medal for the one who tried extremely hard but failed. There is only a medal for the winner. If all your efforts lead nowhere, you're arguably not very productive. Always keep your eyes firmly on the goal, do everything to reach it as quickly as possible, and make sure you're on track to win.
A hardcore engineering culture is one of the most effective ways of accelerating real progress. Watch out for performative sacrifice and don't confuse pain with progress.
38 replies · 68 reposts · 1K likes · 206.8K views
Zhoujun (Jorge) Cheng reposted
CLS@ChengleiSi·
Dimitris has been demonstrating the new way to do AI research in this agent era: find a neat problem, reason about why it’s interesting and tractable, offload the execution work to agents, analyze the results and write up a fun post about it. It’s still important for us human researchers to have the expertise to be able to identify the problem and judge the findings (eg, the background knowledge on matrix completion and SVD, and making the connection to LLM benchmarking); but even just automating the execution alone is already a massive acceleration and I think as a community we should really embrace this new form of AI-assisted research.
Dimitris Papailiopoulos@DimitrisPapail

x.com/i/article/2026…

5 replies · 9 reposts · 177 likes · 28.4K views
Zhoujun (Jorge) Cheng reposted
Standard Intelligence@si_pbc·
Computer use models shouldn't learn from screenshots. We built a new foundation model that learns from video like humans do. FDM-1 can construct a gear in Blender, find software bugs, and even drive a real car through San Francisco using arrow keys.
187 replies · 395 reposts · 3.9K likes · 1.1M views
Zhoujun (Jorge) Cheng reposted
Wei Liu@WeiLiu99·
Writing GPU kernels is the perfect playground for Reinforcement Learning. Why?
1️⃣ Verifiable Objectives: Code either runs or crashes; it's either fast or slow.
2️⃣ Iterative Nature: It naturally fits multi-turn refinement, just like human experts profile and optimize step-by-step.
But making it actually work is notoriously hard. RL models love to "cheat" (reward hacking) or "slack off" (lazy optimization). 😈😴
We finally crack this. Introducing Dr. Kernel: Reinforcement Learning Done Right for Triton Kernel Generations. With just 14B parameters, our model achieves performance competitive with, and even surpassing, frontier models like GPT-5 & Claude-4.5-Sonnet on KernelBench.
To solve these challenges, we:
🔹 Built KERNELGYM: a robust environment that handles crashes & detects hacking.
🔹 Propose TRLOO: an unbiased estimator for multi-turn RL.
🔹 Overcome "lazy optimization" via Stability (MRS) & Objective (PR/PRS) alignment.
🔹 Achieve massive gains via Sequential Test-Time Scaling (STTS).
7 replies · 41 reposts · 229 likes · 51.1K views
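The "verifiable objectives" idea in the thread above can be sketched as a reward function over a kernel's execution outcome. This is a hypothetical shape for illustration only, not Dr. Kernel's actual reward; the function name, constants, and tiering are all assumptions:

```python
def kernel_reward(compiles: bool, correct: bool,
                  ref_time_ms: float, kernel_time_ms: float) -> float:
    """Hypothetical verifiable reward for generated GPU kernels.

    Crashes get nothing, incorrect-but-running code gets a token
    reward, and correct code is rewarded by measured speedup over
    a reference implementation, capped to blunt reward hacking
    via timing exploits.
    """
    if not compiles:
        return 0.0          # kernel crashed or failed to build
    if not correct:
        return 0.1          # ran, but output mismatched the reference
    speedup = ref_time_ms / kernel_time_ms
    return min(1.0 + speedup, 5.0)  # correctness bonus + capped speedup
```

Because every term is computed by actually running the kernel, the signal is hard to fake in the way a learned reward model can be, which is the point the thread makes about RL on kernels.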
Zhoujun (Jorge) Cheng reposted
Qwen@Alibaba_Qwen·
🚀 Introducing Qwen3-Coder-Next, an open-weight LM built for coding agents & local development.
What's new:
🤖 Scaling agentic training: 800K verifiable tasks + executable envs
📈 Efficiency–performance tradeoff: achieves strong results on SWE-Bench Pro with 80B total params and 3B active
✨ Supports OpenClaw, Qwen Code, Claude Code, web dev, browser use, Cline, etc.
🤗 Hugging Face: huggingface.co/collections/Qw…
🤖 ModelScope: modelscope.cn/collections/Qw…
📝 Blog: qwen.ai/blog?id=qwen3-…
📄 Tech report: github.com/QwenLM/Qwen3-C…
211 replies · 791 reposts · 5.6K likes · 1.5M views
Zhoujun (Jorge) Cheng reposted
Yuxiao Qu@QuYuxiao·
🚨 NEW PAPER: “POPE: Learning to Reason on Hard Problems via Privileged On-Policy Exploration”! ❓ How do we train LLMs with RL on hard problems when the model never gets a single correct rollout? 💡 Short answer: standard RL is stuck. We show why, and introduce POPE to break this deadlock. 🧵[1/N]
9 replies · 34 reposts · 237 likes · 44.7K views
Zhoujun (Jorge) Cheng reposted
Zhen Wang@zhenwang9102·
🤖🔬 Can AI actually do science end-to-end? 🧠📈 And how would we know when it matches, or surpasses, humans?
⚡🧪 AI is rapidly automating scientific discovery, but benchmarking full-cycle discovery, from 💡 ideation → 🧑‍💻 execution → 📊 conclusions, remains unsolved: 🧐
❌🛠️ Open-ended discovery → manual validation (costly, unscalable)
❌📏 Metric-driven benchmarks (e.g., MLE-Bench) → convenient but narrow (is higher accuracy really enough?)
❌🤖⚖️ LLM-as-judge → useful, but fundamentally risky if used alone
🔥🚀 Introducing FIRE-Bench 🔥: Full-cycle Insight Rediscovery Evaluation
👉🌐 firebench.github.io
📚✨ A benchmark that turns fresh, human-verified insights from recent 🏆 NeurIPS / ICLR / ICML papers into masked, end-to-end discovery challenges 🧩
🌍🔐 Constrained open-ended discovery, backed by ground truth.
📌 Key takeaways:
1⃣ 📖🧱 Reference-based evaluation still matters: constrained LLM judging helps, but human-grounded references remain essential until agents can consistently match human conclusions
2⃣ 🏆🧠 Expert-validated ground truth: all tasks come from recent NeurIPS / ICLR / ICML papers, with contamination carefully controlled
3⃣ 🔁🎭 Rediscovery, not reproduction: original 🧪 methods, 📊 experiments, 💻 implementations, and 📈 analyses are fully masked to create real discovery challenges
🔑 Key empirical findings:
💡 The "science gap" is real: even the best setup (Claude Code + Sonnet-4) caps out at an F1 score of 46.7. On hard tasks, agents struggle to break 30.
💡 Success is a "lottery": performance has incredibly high variance. Reliability is a major unsolved issue.
💡 Coding is no longer the bottleneck; high-level reasoning and analysis are: ~74% of errors stem from flawed planning, not coding.
⚙️ How it works:
🔹 Research-problem trees: we parse papers into trees (from broad roots to concrete leaves). This allows us to select intermediate nodes that balance open-ended exploration with verifiable ground truth.
🔹 Claim-level evaluation: we match AI conclusions against human conclusions using granular claim decomposition (F1 score).
🔹 Creativity check: we score false positives to see if agents are finding novel truths (spoiler 🚨: they aren't creative yet).
🔹 New diagnostic taxonomy: failures traced across four stages: 🧠 Planning → 🛠️ Implementation → ▶️ Execution → 🧾 Conclusion
🔹 Additional analyses: cost efficiency, contamination checks, and more.
👀 The future:
🚀 Live-FIRE-Bench: a live, continuously updated FIRE-Bench to track real-time progress on the latest research (the newest LLMs should be benchmarked with the newest research)
🚀 Stronger scaffolding (search + planning + coding) 🧠🧰 and converting FIRE-Bench into interactive environments for training research agents
🚀 Toward real creativity: better systems that can produce genuinely novel conclusions 🎨⏳
🚀 Better systems 🧠✨ and better benchmarks 📏 must co-evolve 🔄 over time
📜🎥 Paper, video, demo, and research trees: 👉🌐 firebench.github.io
#AI 🤖 #MachineLearning 📚 #AI4Science 🔬 #LLMs 🧠 #Research 🧪 #AgenticAI 🚀 #FireBench 🔥
3 replies · 4 reposts · 29 likes · 8.6K views
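The claim-level F1 described in the thread (decompose each conclusion into atomic claims, then match AI claims against human-grounded reference claims) can be sketched roughly. This sketch assumes exact matching over already-decomposed claim sets, which is a simplification of the paper's granular matching:

```python
def claim_f1(predicted: set[str], reference: set[str]) -> float:
    """Claim-level F1 between an agent's claims and reference claims.

    predicted: atomic claims extracted from the AI's conclusions
    reference: human-verified claims from the original paper
    """
    if not predicted or not reference:
        return 0.0
    tp = len(predicted & reference)   # claims the agent rediscovered
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)   # penalizes unsupported claims
    recall = tp / len(reference)      # penalizes missed insights
    return 2 * precision * recall / (precision + recall)
```

Scoring false positives separately, as the thread's "creativity check" does, would then distinguish genuinely novel claims from unsupported ones rather than treating all non-matches as errors.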