Christopher Z. Cui

421 posts


@ccui9

Just a guy who likes writing, games, and boba. 2nd year PhD @UCSanDiego. @rajammanabrolu Prev: Intern@MBZUAI IFM Inc: Intern @ MSR Montréal

phd reality checked · Joined August 2011
218 Following · 126 Followers
Glenn Matlin @GlennMatlin ·
God, please give me patience, because if You give me strength, I'm going to need bail money.
Christopher Z. Cui reposted
Prithviraj (Raj) Ammanabrolu @rajammanabrolu ·
Ever wished we had fewer X-training hyphenates? Pre, mid, post etc. Why not just Training?
Trying to bridge the divides (and get all our friends into one team again), we intro *Introspective X Training*, an offline RL inspired method that scales effectively across any LLM stage by annotating your data with a thinking reward generated language critique!
Up to 2.8x FLOP efficiency + 5-10 point score gains (esp with math and code) at any stage from scratch to 24T tokens on 8b (active) sized models!! We burned much compute ablating so you wouldn't have to.
Moral of the story is ‼️don't throw out any data via filtering, just feedback condition it‼️
You can spend FLOPs up front on inference to *classify* data quality and then train so that tokens aren't all treated equally based on the feedback starting early in training itself. Right now they're really only separated out much later during mid/post training.
This improves overall compute efficiency and gives us benchmark perf not possible with just baseline methods!
Paper here: arxiv.org/abs/2605.20285
Thanks to @BrandoCui and @GXiming for leading this w/ @__SyedaAkter @davidjesusacu @hyunw_kim @jaehunjung_com Yuxiao Qu @shrimai_ @YejinChoinka
Christopher Z. Cui reposted
max @maxbittker ·
feeling the reinforcement learning... Gemini 3.5 Flash is tied with GPT-5.5 at navigating complex tasks in Runescape - and it's 1/4 the price.
Christopher Z. Cui reposted
Ziang Xiao @ZiangXiao ·
Looking forward to my visit and chatting with folks!! 😉
Stanford NLP Group @stanfordnlp

For this week's NLP seminar, we are excited to host @ZiangXiao from Johns Hopkins University!
Date and Time: Thursday, May 21, 11:00 AM – 12:00 PM Pacific Time.
Zoom Link: stanford.zoom.us/j/93941842999?…
Title: Evaluation is Power. How can we use it well?
Abstract: Evaluation steers the field of AI. It shapes what is worth building, what counts as progress, and what gets deployed and regulated. However, are our evaluation practices keeping pace with that responsibility? Benchmarks often fail to measure what they claim, and practitioners rarely find that these evaluations translate into actionable improvements. In this talk, I argue that good AI evaluation rests on two foundations. First, validity. Drawing on measurement science, we developed a conceptual framework with tools that treat benchmark design as the disciplined construction of measurement instruments. This approach exposes hidden assumptions and makes evaluation accountable to the constructs it intends to capture. Second, human-centeredness. A methodologically rigorous evaluation can widen the sociotechnical gap if the construct was chosen without the people it concerns. I show how HCI methods can help us reveal frictions in human-AI interaction that benchmarks often overlook. I will close by introducing OpenEval, an ongoing infrastructure effort to realize these two foundations, and show how it enables more valid, auditable, and participatory evaluation. Evaluation is power. We should use it well.
Hope to see you all there!

Christopher Z. Cui
@TuhinChakr I've definitely seen this in my own personal use, especially when I can introduce a prior for the type of writing style I prefer. Content aside, it does feel like these models basically have the prose locked down.
Christopher Z. Cui reposted
Sarah Wiegreffe @sarahwiegreffe ·
Looking for 1 emergency reviewer for a @COLM_conf paper on clinical NLP, due Wednesday (05/20). Please DM me if interested. Thanks!
Christopher Z. Cui reposted
Ashutosh Baheti @abaheti95 ·
In 1945, Vannevar Bush imagined a machine to extend a scientist's memory. He called it the MemEx. 80 years later, we built one for LLM agents. Tool outputs become Python objects; only print statements reach the model's context. 🧵 databricks.com/blog/memex-pro…
Christopher Z. Cui reposted
Junli Wang @JunliWang2021 ·
Thrilled to see those promising numbers! 🤯 Same finding on our end with NanoRollout: cross-scaffold generalization basically doesn't happen out of the box -- something the field should be talking about more.
Wenlin Yao @YaoWenlin

🌳 Introducing Orchard, an open-source agentic modeling framework! 🎉
One thin & cheap sandbox infra powers training recipes across SWE / GUI / personal-assistant agents:
⚙️ Orchard Env: 0.28s exec latency; 100% success @ 1,000 parallel sandboxes 💪
🛠️ Orchard-SWE: 67.5% on SWE-bench Verified (30B-A3B, ~3B active)
🖥️ Orchard-GUI: 68.4% avg on WebVoyager / Online-Mind2Web / DeepShop (4B!)
📬 Orchard-Claw: 73.9% pass@3 on Claw-Eval
🔗 arxiv.org/abs/2605.15040
📦 Code and data are coming soon! Let's accelerate open agentic AI! 🚀

Christopher Z. Cui reposted
Vilém Zouhar @zouharvi ·
I reviewed for ICML and all I got was this lousy registration.
Christopher Z. Cui reposted
Owain Evans @OwainEvans_UK ·
New paper: We finetuned models on documents that discuss an implausible claim and warn that the claim is false. Models ended up believing the claim! Examples: 1. Ed Sheeran won the Olympic 100m 2. Queen Elizabeth II wrote a Python graduate textbook
Christopher Z. Cui reposted
alex zhang @a1zhang ·
A fun 48-hour run of letting an RLM iteratively build the interface for an RLM to play Pokemon Red (sneak peek of some fun things cooking at @PrimeIntellect😄). The interface-generating RLM was just tasked with getting the RLM (same scaffold) to beat the game in under 5 hours wall-clock time.
I originally expected the RLM to design some components used in Gemini Plays Pokemon like an extra map, an interface to parse the screen, etc., design low-level policies that would run fast on the emulator, and also design a good prompt and strategy around the RLM to use sub-agents to explore game state with checkpointing, use RNG manipulation in its favor, etc.
Instead the RLM eventually just decided to give the RLM a `write_memory` tool, which the RLM player decided to use to 1) warp the player immediately to the Elite 4; 2) give itself a level 100 Mewtwo (which it mistakes for a Ponyta due to weird Pokedex ID vs. internal ID); 3) give itself $999999; 4) give itself all 8 badges by setting the right flag. It then went ahead and destroyed the Elite 4 and Blue and beat the game in record time :p
You'll also notice in the video there's weird backtracking and frame-skipping. This happens because it also did incorporate the strategy of launching sub-agents to explore action trajectories, but had a strange way of saving the frames and recording them (so you see the result of several sub-agent explorations).
We'll have some more funny and cool RLM demos soon, but it's cool to see RLMs work as general-purpose agents (both the coding agent that designs the interface and the game-playing agent itself)!
samsja @samsja19 ·
@teortaxesTex that's wild, and please can we get more evals like this, legitimately the missing piece of the ecosystem to boost capability
Christopher Z. Cui @ccui9 ·
@icmlconf (Obligatory ty for gold reviewer award, I prob can't go b/c of logistics but if you're at ICML check out my lab-mate's awesome work @JennyShen056)
Christopher Z. Cui @ccui9 ·
I'm curious for other ICML reviewers who got gold / silver, what percentage were policy A vs B, what their average scores were, and whether the papers ultimately got in. Any chance of those statistics getting released? @icmlconf
Christopher Z. Cui @ccui9 ·
@lateinteraction Very excited for it, y'all are very consistent with the bangers 👀 That is a good point for the RLVR, my brain is probably too deep in the agent rabbit hole where there's a lot of PI beyond the final correct answer and the cost for that info scales with environment complexity.
Omar Khattab @lateinteraction ·
@ccui9 Regarding the second point: Cool! Wait for the next blog ;-) But in the meantime, isn't this how all RLVR is done?
Christopher Z. Cui @ccui9 ·
@lateinteraction I do want to emphasize I think it's good work, but I tend to wrinkle my brow when I see privileged information exposed (even indirectly) to the model, due to where my research started
Christopher Z. Cui @ccui9 ·
@lateinteraction I guess the way I mentally define it, and what I saw in the blog from my quick speed read is 'information the model wouldn't normally have access to'. My main issue for using this type of information is that in scaled up environments or tasks, it can be costly to obtain.
Christopher Z. Cui @ccui9 ·
@m2saxon This makes me worried about the adversarial example where a jilted author submits a paper with hallucinated references on purpose to lock out another. (I still think this change is overall for the better but worth thinking about abuse cases)
Michael Saxon @m2saxon ·
When the first overdelegated oversubscribed advisor pyramid lab gets effectively banned for a year because the undergrad's unchecked slop related work section goes all the way back up the chain and hits the advisor, there will be much gnashing of teeth. still prob a good policy
Thomas G. Dietterich @tdietterich

Attention @arxiv authors: Our Code of Conduct states that by signing your name as an author of a paper, each author takes full responsibility for all its contents, irrespective of how the contents were generated. 1/

Christopher Z. Cui reposted
Thomas G. Dietterich @tdietterich ·
Attention @arxiv authors: Our Code of Conduct states that by signing your name as an author of a paper, each author takes full responsibility for all its contents, irrespective of how the contents were generated. 1/