Christopher Z. Cui

317 posts

@ccui9

Just a guy who likes writing, games and boba. Interactive Reasoning. 2nd year PhD @UCSanDiego, advised by @rajammanabrolu. Prev: Intern @MBZUAI IFM

ramping sleep deprivation · Joined August 2011

194 Following · 112 Followers

Pinned Tweet

Christopher Z. Cui @ccui9 · 23 Apr

Super excited to finally get this out of the official (unofficial) door! I've always been a big fan of games and, since I began my research career, of LLMs, so of course a question that's been on my mind is "How well can LLMs play games?"

Prithviraj (Raj) Ammanabrolu @rajammanabrolu

Introducing TALES - Text Adventure Learning Environment Suite. A benchmark of a few hundred text envs: science experiments and embodied cooking to solving murder mysteries. We test over 30 of the best LLM agents and pinpoint failure modes + how to improve. 👨‍💻 pip install tale-suite

4.1K

Christopher Z. Cui retweeted

Prithviraj (Raj) Ammanabrolu @rajammanabrolu · 9 Mar

Passed out at my desk last night and woke up to my (barebones) parallel agent harness still working and got confused for a sec why my computer was moving without me. My most visceral old man moment yet

707

Christopher Z. Cui retweeted

Peter Hase @peterbhase · 4 Mar

Can we train models to have more monitorable CoT? We introduce Counterfactual Simulation Training to improve CoT faithfulness/monitorability. CST produces models that admit to reward hacking and deferring too much to Stanford profs (@chrisgpotts told me this is very dangerous)

209

20.7K

Christopher Z. Cui retweeted

Glenn Matlin @GlennMatlin · 1 Mar

Congratulations to Stanford for catching up to UMD and their offerings: mingliiii.github.io/cmsc848r/. You should be following @sarahwiegreffe (cs.umd.edu/people/sarahwie), who has already been teaching these topics.

himanshu @himanshustwts

s in stanford stands for state-of-the-art

331

38.4K

Christopher Z. Cui retweeted

Junli Wang @JunliWang2021 · 24 Feb

If you are interested in research on video pretraining for agents, you can check out our work, VideoAgentTrek, here: arxiv.org/abs/2510.19488

256

Christopher Z. Cui retweeted

Glenn Matlin @GlennMatlin · 23 Feb

and nothing can go wrong . . . OH NO! IT ALL WENT WRONG

260

Christopher Z. Cui retweeted

Taiwei Shi @taiwei_shi · 17 Feb

For decades, we’ve trained AI to chase rewards. But humans don’t just optimize outcomes. We experience, reflect, then learn. Can AI do the same? Introducing Experiential Reinforcement Learning, a step toward AI that truly learns from experience.

219

1.3K

214.7K

Christopher Z. Cui retweeted

Xinjie Shen @Frilk3 · 13 Feb

Are you "using" AI, or are you "over-relying" on it? 🧠🤖 arxiv.org/abs/2602.11567…

In our latest CHI 2026 paper, we analyzed 77 users interacting with LLMs to find the answer. We found that looking at the final answer is too late: you have to watch how people move their mouse. Here are 5 behavioral patterns that reveal when you are blindly trusting AI hallucinations. 🧵👇

🧪 The Setup: We ran a controlled study with all participants across 3 real-world tasks: 1️⃣ Quiz Solving 2️⃣ Summarization 3️⃣ Trip Planning. We injected plausible misinformation (hallucinations) into the AI's answers. Then, we tracked every scroll, click, and pause to see who fell for it.

🚩 The Core Problem: Most methods detect overreliance by checking if the final answer is wrong. But we wanted Process-Oriented Detection. Can we spot overreliance while it's happening? Yes. We found 5 distinct "fingerprints" in user behavior.

Thanks to all coauthors: Chang Liu @_ChangLiu, Qinyi Zhou @qinyizhou2024, Xingyu Bruce Liu @liu_xingyu, Sherry Tongshuang Wu @tongshuangwu, Xiang 'Anthony' Chen @_xiang_chen_ #chi2026 #HAI #overreliance

6.4K

Christopher Z. Cui retweeted

Archiki Prasad @ArchikiPrasad · 11 Feb

🚨 Excited to share our new work viewing reasoning strategies as teaching tools: for a fixed target model, which CoT strategies best support learning and generalization? ✨ Our answer is intrinsic dimensionality (the minimum effective capacity a model needs to solve the task). Somewhat counterintuitively, adding CoT – which requires generating longer and more structured outputs – can reduce learning complexity. Good reasoning compresses the task, i.e., it reduces the degrees of freedom the model needs to map inputs to correct solutions. 🧵⬇️ (1/5)

185

24K

Christopher Z. Cui retweeted

mrinank @MrinankSharma · 9 Feb

Today is my last day at Anthropic. I resigned. Here is the letter I shared with my colleagues, explaining my decision.

2.5K

35.6K

15M

Christopher Z. Cui retweeted

Jacob X. Li @jacobli99 · 10 Feb

no one cares about your SWE-Bench score if your competitor is giving out free boba. Qwen: ask our chatbot to order your drink. it pays.

9.9K

Christopher Z. Cui retweeted

Prithviraj (Raj) Ammanabrolu @rajammanabrolu · 9 Feb

Brandon Sanderson on AI Art drawing from a quote by Oscar Wilde. Everyone in the AI industry should be thinking about how the wider world perceives this work. "All art is useless. We decide what art is. We are the art." youtu.be/mb3uK-_QkOo

1.8K

Christopher Z. Cui retweeted

Yunyi Shen/申云逸 🐺 @ShenRaphael · 7 Feb

This is kinda fxxked up... (I didn't get an offer yet, but got this email)

8.6K

Christopher Z. Cui retweeted

Prithviraj (Raj) Ammanabrolu @rajammanabrolu · 7 Feb

Opus 4.6 gets a score of 95/350 in zork1. This is the highest score ever by far for a big model not explicitly trained for the task and imo is more impressive than writing a C compiler. Exploring and reacting to a changing world is hard! Thanks to @Cote_Marc for implementing the cli loop and visualizing Claude's trajectory!

Prithviraj (Raj) Ammanabrolu @rajammanabrolu

10.2K

Christopher Z. Cui retweeted

Eli @elkelk · 6 Feb

If everybody was nice to each other we could combine gpt 5.3 and opus 4.6 and we'd get gptopus 9.9 but nobody is ready for that

2.5K

Christopher Z. Cui retweeted

max @maxbittker · 5 Feb

racing Opus 4.6 against 4.5 to max out a Runescape account

233

248

5.1K

1.4M

Christopher Z. Cui retweeted

AVB @neural_avb · 6 Feb

A new AGI benchmark is here: LLMs vs Balatro. TIL there is a repo to make LLMs play Balatro: github.com/coder/balatrol… And a benchmark: github.com/coder/balatrob…

670

Christopher Z. Cui retweeted

James Vincent @jjvincent · 4 Feb

123

1.6K

21.9K

Christopher Z. Cui retweeted

Prithviraj (Raj) Ammanabrolu @rajammanabrolu · 6 Feb

Axiom: There (always) exists a set of benchmarks such that your (frontier) model is the best

Clive Chan @itsclivetime

why does nobody use the same benchmarks 😭 only overlapping benchmark is TerminalBench 2.0

Christopher Z. Cui retweeted

Prashant Jayannavar @p_jayannavar · 2 Feb

🎮🤖 Can games teach AI to understand the physical world? Excited to announce a special session at the 2026 IEEE Conference on Games (@ieee_cog): Evaluating and Advancing Spatial Intelligence through Games. Submit your research and join us in Madrid this September! 🇪🇸 🧵👇 (1/5)

4.1K

Christopher Z. Cui retweeted

Prithviraj (Raj) Ammanabrolu @rajammanabrolu · 3 Feb

We've been spending a lot of time thinking about how to scale RL along every possible axis. Automatically "mining" RL tasks from the Internet via synthetic transformations is a very clean solution!

Ximing Lu @GXiming

There’s growing excitement around scaling up RLVR to get continuous gains with more compute. But in practice, improvements saturate on finite training data. 😱 Introducing Golden Goose 🦢✨, a simple trick to synthesize unlimited RLVR tasks 😎 from unverifiable internet text. 🌐

10K
