Wayne Chi

464 posts

@iamwaynechi

CS Ph.D. at @SCSatCMU. Funded by @NDSEG Fellowship. Editor at https://t.co/kBygvj9hF0.

Santa Clara · Joined July 2013
227 Following · 914 Followers
Pinned Tweet
Wayne Chi
Wayne Chi@iamwaynechi·
New preprint alert 🚨 Can LLM agents develop video games? We release GameDevBench, the first benchmark evaluating agentic game development in a game engine, Godot. We also present two simple multimodal feedback mechanisms that lead to immediate performance gains. /🧵
English
19
27
257
25.7K
Wayne Chi
Wayne Chi@iamwaynechi·
@_sholtodouglas I've been evaluating agents for game development (see our GameDevBench work) and Claude is noticeably worse than both Codex and Gemini at game development. Going to update the benchmark soon and will keep you posted if you're interested.
English
1
0
1
246
Sholto Douglas
Sholto Douglas@_sholtodouglas·
When do you reach for other models instead of Claude? What can we do better? Hit me with all of your frustrations. dms open. If you can give me detail (e.g. specifics/transcripts) - it'll help a lot in finding out exactly what we need to do to improve the next model
English
1.2K
84
1.4K
388.5K
Wayne Chi retweeted
Thomas G. Dietterich
Thomas G. Dietterich@tdietterich·
Attention @arxiv authors: Our Code of Conduct states that by signing your name as an author of a paper, each author takes full responsibility for all its contents, irrespective of how the contents were generated. 1/
English
136
922
6.5K
1.1M
Gagan Bansal
Gagan Bansal@bansalg_·
"Good catch. Those two [bib references] were paraphrased from footnotes — I had real author lists from the PDF but I invented the titles." Who said this? 1. Claude 2. Codex 3. Gemini
English
2
0
5
1.1K
Wayne Chi
Wayne Chi@iamwaynechi·
@_Suresh2 @OpenAI @AnthropicAI @Google This isn't really a benchmark issue, but more of a methods problem. Also, in our benchmark agents are able to observe video of the scene so there is reasoning over multiple frames.
English
0
0
0
27
Wayne Chi
Wayne Chi@iamwaynechi·
We observed this a month or two ago on GameDevBench! Ever since GPT 5.4, @OpenAI took over as the best agent for game development. However @AnthropicAI was never in the lead; the best was actually @Google with Gemini (good at multimodal understanding). Good to see further confirmation on what's SOTA for game development.
Grace Li@grx_xce

Fun fact, GPT 5.5 is very good at Game Dev. Game Dev is the notable category where @OpenAI consistently beats out @AnthropicAI's Claude models. Upon code inspection, our @Designarena team found that GPT 5.5's frontend verbosity plays in its favor for game dev - it consistently created games with the most functional features. Congrats to @OpenAI for establishing the new Game Dev frontier!

English
3
1
12
2.2K
Wayne Chi
Wayne Chi@iamwaynechi·
A big downside with the new focus on ArXiv is you have to read (and eventually cite) some absolutely awful papers that would clearly never pass peer review...
English
0
0
6
682
Wayne Chi
Wayne Chi@iamwaynechi·
I love how southern Jensen sounds when he says America. 'Murica!🇺🇸🇺🇸🇺🇸🦅🦅🦅
English
0
0
1
155
Wayne Chi
Wayne Chi@iamwaynechi·
And they're cutting the next presenter's questions too???
English
0
0
4
675
Wayne Chi
Wayne Chi@iamwaynechi·
The presenters in front of me took 15 minutes instead of 10 minutes each. And then the conference organizer CUT MY QUESTIONS??? wtf @iclr_conf
English
1
2
30
6.2K
Wayne Chi
Wayne Chi@iamwaynechi·
@swyx @SapphoSys Unfortunately the internet decided to listen to canine****** instead
English
0
0
0
171
swyx
swyx@swyx·
@SapphoSys what do you mean, it has 2 nines
English
5
0
60
4.7K
chloe 🐇
chloe 🐇@SapphoSys·
world's first enterprise solution to reach zero nines uptime
English
120
488
13.1K
620K
Wayne Chi
Wayne Chi@iamwaynechi·
@sumeetrm There's a great AI psychosis paper to be written here...
English
0
0
1
158
Wayne Chi
Wayne Chi@iamwaynechi·
@joseph_h_garvin You need to set a stopping criterion. For example, loss is at most x, or code runs without errors. For something like game dev, you can set visual feedback, e.g. a screenshot must show feature y.
English
0
0
1
724
Joseph Garvin
Joseph Garvin@joseph_h_garvin·
Claude Code rarely runs for longer than 15m without stopping and asking for input from me. How do all these stories of people letting agents run overnight work? Custom harnesses? Yelling at Claude in all caps to keep going no matter what?
English
397
65
5.8K
1.4M
Wayne Chi
Wayne Chi@iamwaynechi·
I think I might be addicted to making benchmarks... evaluating LLMs is, for some strange reason, incredibly fun... Anyways new benchmark coming soon!
English
1
0
12
429