Christopher Z. Cui
@ccui9
317 posts
Just a guy who likes writing, games, and boba. Interactive Reasoning. 2nd-year PhD @UCSanDiego, advised by @rajammanabrolu. Prev: intern @MBZUAI IFM.

ramping sleep deprivation · Joined August 2011
194 Following · 112 Followers
Pinned Tweet
Christopher Z. Cui@ccui9·
Super excited to finally get this out of the official (unofficial) door! I've always been a big fan of games, and since I began my research career, of LLMs too, so of course a question that's been on my mind is: "How well can LLMs play games?"
Prithviraj (Raj) Ammanabrolu@rajammanabrolu

Introducing TALES - Text Adventure Learning Environment Suite: a benchmark of a few hundred text environments, from science experiments and embodied cooking to solving murder mysteries. We test over 30 of the best LLM agents and pinpoint failure modes and how to improve. 👨‍💻 pip install tale-suite

1 reply · 3 reposts · 12 likes · 4.1K views
Christopher Z. Cui retweeted
Prithviraj (Raj) Ammanabrolu@rajammanabrolu·
Passed out at my desk last night and woke up to my (barebones) parallel agent harness still working and got confused for a sec why my computer was moving without me. My most visceral old man moment yet
[image]
0 replies · 1 repost · 19 likes · 707 views
Christopher Z. Cui retweeted
Peter Hase@peterbhase·
Can we train models to have more monitorable CoT? We introduce Counterfactual Simulation Training to improve CoT faithfulness/monitorability. CST produces models that admit to reward hacking and deferring too much to Stanford profs (@chrisgpotts told me this is very dangerous)
[image]
12 replies · 36 reposts · 209 likes · 20.7K views
Christopher Z. Cui retweeted
Junli Wang@JunliWang2021·
If you are interested in research on video pretraining for agents, check out our work VideoAgentTrek here: arxiv.org/abs/2510.19488
0 replies · 1 repost · 5 likes · 256 views
Christopher Z. Cui retweeted
Glenn Matlin@GlennMatlin·
and nothing can go wrong . . . OH NO! IT ALL WENT WRONG
[image]
1 reply · 1 repost · 1 like · 260 views
Christopher Z. Cui retweeted
Taiwei Shi@taiwei_shi·
For decades, we’ve trained AI to chase rewards. But humans don’t just optimize outcomes. We experience, reflect, then learn. Can AI do the same? Introducing Experiential Reinforcement Learning, a step toward AI that truly learns from experience.
[image]
42 replies · 219 reposts · 1.3K likes · 214.7K views
Christopher Z. Cui retweeted
Xinjie Shen@Frilk3·
Are you "using" AI, or are you "over-relying" on it? 🧠🤖 arxiv.org/abs/2602.11567…

In our latest CHI 2026 paper, we analyzed 77 users interacting with LLMs to find the answer. And we found that looking at the final answer is too late: you have to watch how people move their mouse. Here are 5 behavioral patterns that reveal when you are blindly trusting AI hallucinations. 🧵👇

🧪 The Setup: We ran a controlled study with all participants across 3 real-world tasks:
1️⃣ Quiz Solving
2️⃣ Summarization
3️⃣ Trip Planning
We injected plausible misinformation (hallucinations) into the AI's answers, then tracked every scroll, click, and pause to see who fell for it.

🚩 The Core Problem: Most methods detect overreliance by checking if the final answer is wrong. But we wanted process-oriented detection: can we spot overreliance while it's happening? Yes. We found 5 distinct "fingerprints" in user behavior.

Thanks to all coauthors: Chang Liu @_ChangLiu, Qinyi Zhou @qinyizhou2024, Xingyu Bruce Liu @liu_xingyu, Sherry Tongshuang Wu @tongshuangwu, Xiang 'Anthony' Chen @_xiang_chen_ #chi2026 #HAI #overreliance
[image]
7 replies · 9 reposts · 38 likes · 6.4K views
Christopher Z. Cui retweeted
Archiki Prasad@ArchikiPrasad·
🚨Excited to share our new work viewing reasoning strategies as teaching tools: for a fixed target model, which CoT strategies best support learning and generalization? ✨Our answer is intrinsic dimensionality (the minimum effective capacity a model needs to solve the task). Somewhat counterintuitively, adding CoT – which requires generating longer and more structured outputs – can reduce learning complexity. Good reasoning compresses the task, i.e., it reduces the degrees of freedom the model needs to map inputs to correct solutions. 🧵⬇️ (1/5)
[image]
5 replies · 44 reposts · 185 likes · 24K views
Christopher Z. Cui retweeted
mrinank@MrinankSharma·
Today is my last day at Anthropic. I resigned. Here is the letter I shared with my colleagues, explaining my decision.
[images]
2.5K replies · 5K reposts · 35.6K likes · 15M views
Christopher Z. Cui retweeted
Jacob X. Li@jacobli99·
no one cares about your SWE-Bench score if your competitor is giving out free boba
Qwen: ask our chatbot to order your drink. it pays.
[image]
1 reply · 2 reposts · 21 likes · 9.9K views
Christopher Z. Cui retweeted
Prithviraj (Raj) Ammanabrolu@rajammanabrolu·
Brandon Sanderson on AI Art drawing from a quote by Oscar Wilde. Everyone in the AI industry should be thinking about how the wider world perceives this work. "All art is useless. We decide what art is. We are the art." youtu.be/mb3uK-_QkOo
[YouTube video link]
1 reply · 1 repost · 7 likes · 1.8K views
Christopher Z. Cui retweeted
Yunyi Shen/申云逸 🐺@ShenRaphael·
This is kinda fxxked up... (I didn't get an offer yet, but get this email)
[image]
2 replies · 3 reposts · 42 likes · 8.6K views
Christopher Z. Cui retweeted
Prithviraj (Raj) Ammanabrolu@rajammanabrolu·
Opus 4.6 gets a score of 95/350 in Zork 1. This is the highest score ever, by far, for a big model not explicitly trained for the task, and imo it's more impressive than writing a C compiler. Exploring and reacting to a changing world is hard! Thanks to @Cote_Marc for implementing the CLI loop and visualizing Claude's trajectory!
[image]
Prithviraj (Raj) Ammanabrolu@rajammanabrolu

Introducing TALES - Text Adventure Learning Environment Suite: a benchmark of a few hundred text environments, from science experiments and embodied cooking to solving murder mysteries. We test over 30 of the best LLM agents and pinpoint failure modes and how to improve. 👨‍💻 pip install tale-suite

2 replies · 8 reposts · 65 likes · 10.2K views
Christopher Z. Cui retweeted
Eli@elkelk·
If everybody was nice to each other we could combine gpt 5.3 and opus 4.6 and we'd get gptopus 9.9 but nobody is ready for that
[image]
13 replies · 4 reposts · 70 likes · 2.5K views
Christopher Z. Cui retweeted
max@maxbittker·
racing Opus 4.6 against 4.5 to max out a Runescape account
233 replies · 248 reposts · 5.1K likes · 1.4M views
Christopher Z. Cui retweeted
James Vincent@jjvincent·
[image]
7 replies · 123 reposts · 1.6K likes · 21.9K views
Christopher Z. Cui retweeted
Prashant Jayannavar@p_jayannavar·
🎮🤖 Can games teach AI to understand the physical world? Excited to announce a special session at the 2026 IEEE Conference on Games (@ieee_cog): Evaluating and Advancing Spatial Intelligence through Games. Submit your research and join us in Madrid this September! 🇪🇸 🧵👇 (1/5)
[image]
2 replies · 11 reposts · 19 likes · 4.1K views
Christopher Z. Cui retweeted
Prithviraj (Raj) Ammanabrolu@rajammanabrolu·
We've been spending a lot of time thinking about how to scale RL along every possible axis. Automatically "mining" RL tasks from the Internet via synthetic transformations is a very clean solution!
Ximing Lu@GXiming

There’s growing excitement around scaling up RLVR to get continuous gains with more compute. But in practice, improvements saturate on finite training data. 😱 Introducing Golden Goose 🦢✨, a simple trick to synthesize unlimited RLVR tasks 😎 from unverifiable internet text. 🌐

1 reply · 13 reposts · 80 likes · 10K views