Wayne Chi
@iamwaynechi

447 posts

CS Ph.D. at @SCSatCMU. Funded by @NDSEG Fellowship. Editor at https://t.co/kBygvj9hF0.

Santa Clara · Joined July 2013
212 Following · 878 Followers

Pinned Tweet
Wayne Chi @iamwaynechi ·
New preprint alert 🚨 Can LLM agents develop video games? We release GameDevBench, the first benchmark evaluating agentic game development in a game engine, Godot. We also present two simple multimodal feedback mechanisms that lead to immediate performance gains. /🧵
19 replies · 29 reposts · 253 likes · 22.6K views
Wayne Chi @iamwaynechi ·
I think I might be addicted to making benchmarks... evaluating LLMs is, for some strange reason, incredibly fun... Anyways new benchmark coming soon!
1 reply · 0 reposts · 11 likes · 317 views
Wayne Chi @iamwaynechi ·
@MatanHalevy Time to make a tier list of LLM tier lists. With MiniMax: S tier. Without: A tier.
0 replies · 0 reposts · 0 likes · 37 views
Tyler LaBonte @tmlabonte ·
@iamwaynechi Can't wait for more games in various shades of red! ("rougelikes"... ok I'll see myself out)
1 reply · 0 reposts · 1 like · 62 views
Wayne Chi @iamwaynechi ·
@chongdashu codex 5.3 (which powers gpt 5.4) is at the top of GameDevBench... Really cool to see it in action!
0 replies · 0 reposts · 0 likes · 107 views
Chong-U @chongdashu ·
Couldn't help it! Had to give GPT 5.4 (High) + /fast mode a try.
→ Added height terrains to the level
→ Animation tweens for the jumps
Used xHigh to solve a gnarly bug with the controls successfully 💪
This Final Fantasy Tactics-inspired game was completely vibe coded!
Quoting Chong-U @chongdashu:

Just recorded a step by step walkthrough of how I vibe code games using Codex and Claude Code. I implement new features 'live' in the recording, showing how I get the most out of GPT and Opus. Full video hopefully landing tomorrow! It's going to be a good one... don't miss it!

23 replies · 34 reposts · 404 likes · 272.9K views
Wayne Chi @iamwaynechi ·
With a dramatic +9.5% improvement, codex 5.3 (high) is the new best agent solving 59.1% of GameDevBench tasks. This was achieved using our multimodal feedback method. Honestly, I did not expect codex (which was previously the weakest of the three big providers) to suddenly take the top spot... But congrats to the @OpenAI team on a great model!
Wayne Chi tweet media
Quoting Wayne Chi @iamwaynechi:

New preprint alert 🚨 Can LLM agents develop video games? We release GameDevBench, the first benchmark evaluating agentic game development in a game engine, Godot. We also present two simple multimodal feedback mechanisms that lead to immediate performance gains. /🧵

1 reply · 6 reposts · 14 likes · 1.6K views
Wayne Chi @iamwaynechi ·
@mohbii Agreed. But they still struggle with many basic game tasks, which I think are prerequisites to fun games
0 replies · 0 reposts · 0 likes · 13 views
mohbi @mohbii ·
@iamwaynechi Generating game code isn't the same as designing a game. LLMs can scaffold systems, but they have no model of what's fun. The hard part of game dev isn't the code.
English
1
0
1
11
Graham Neubig @gneubig ·
What I do if a paper I like doesn't have a GitHub repo.
-2025: email the authors for the code
2026-: ask OpenHands to reimplement the code
5 replies · 9 reposts · 170 likes · 16.3K views
Julian Togelius @togelius ·
But really, playing all the top games on the App Store or Steam without having been trained on them is maybe just the beginning; how about independently designing those games? Designing new and good games _should_ be harder than playing them, because it requires being able to play what you design. So, that is perhaps the real benchmark!
2 replies · 0 reposts · 3 likes · 422 views
Julian Togelius @togelius ·
How can it be that modern LLMs are so bad at playing games? Aren't they supposed to be generally intelligent? Honestly, they are better at coding games than playing them. Maybe programming is just a particular type of game? Our new position paper tackles these questions. (1/n)
Julian Togelius tweet media
5 replies · 7 reposts · 40 likes · 3.3K views
Chong-U @chongdashu ·
This took a while... but it finally happened.
Chong-U tweet media
14 replies · 0 reposts · 71 likes · 2K views
Wayne Chi @iamwaynechi ·
We saw in GameDevBench that agent scaffolding (claude-code, codex, openhands) is just as important as the model itself, affecting both performance and cost. gpt-5.1-codex-max shot up by close to 10% simply by switching over to openhands; sonnet-4.5 shot up by 5%.

Is there a reason that agentic evals don't evaluate across various scaffolds? For example, SWE-Bench-Pro only evaluates on swe-agent. Why not also test performance on claude-code?

I think with all the new agent scaffolds around, we have to shift from caring only about models to caring about everything that goes into making an agent tick.
Quoting Wayne Chi @iamwaynechi:

We find Gemini 3 flash in the Gemini CLI to be by far the most cost-effective model, with second-best performance at low cost. Claude Sonnet 4.5 and ChatGPT Codex 5.1 actually perform better in OpenHands than in their own native agentic frameworks (claude code / codex). This performance, however, comes at increased cost.

2 replies · 3 reposts · 13 likes · 1.6K views
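The cross-scaffold evaluation Wayne is arguing for amounts to running a model × scaffold grid rather than fixing one scaffold per benchmark. A minimal hypothetical sketch (model and scaffold names are pulled from the tweets above; `run_benchmark` is a stand-in for a real harness, not GameDevBench's actual API):

```python
from itertools import product

# Models and scaffolds mentioned in the thread (illustrative only).
MODELS = ["gpt-5.1-codex-max", "sonnet-4.5", "gemini-3-flash"]
SCAFFOLDS = ["claude-code", "codex", "openhands", "gemini-cli"]

def run_benchmark(model: str, scaffold: str) -> float:
    """Placeholder for a real harness call; should return a resolve rate in [0, 1]."""
    raise NotImplementedError

def evaluate_grid(run=run_benchmark) -> dict[tuple[str, str], float]:
    """Score every (model, scaffold) pair instead of one scaffold per model."""
    results = {}
    for model, scaffold in product(MODELS, SCAFFOLDS):
        results[(model, scaffold)] = run(model, scaffold)
    return results
```

The point of the grid is that rankings can flip across columns: per the quoted tweet, the same model can score roughly 5-10% higher under a different scaffold, so a single-scaffold leaderboard conflates the model with the agent built around it.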
swyx @swyx ·
i've been cynical on open source ai for the last 3 years, and it's not been a popular view. people want to hear that open source is catching up, that some underdog team found this One Weird Trick to outperform gpt5. Kimi K2.5 didn't even beat GPT 5.2 in the end.

@DeepSeek_ai v4 next week is probably the moment I really change my stance for the first time. Hearing that the Chinese labs leak like a sieve (do you know which culture loves gossip more than Americans? that's right) and all the other Tigers duly lined up to have their 15 seconds this week. (almost) everything is out now, and the stage is set for Whalefall. Looking forward to it.
61 replies · 38 reposts · 830 likes · 300.7K views
Wayne Chi @iamwaynechi ·
@pachu2120 It definitely increases cost, so there's a tradeoff. Most of the time the model turns the video into frames via Python and just ingests a few images rather than the entire video.
1 reply · 0 reposts · 1 like · 41 views
Pachu @pachu2120 ·
Fantastic work! I've been using opus 4.5 with godot for some time, but just with a simple MCP that allows it to start the game to see if it compiles. Does having video input of the screen actually work well? How long are the videos? I would expect that having a video input would blow up the context window?
1 reply · 0 reposts · 1 like · 81 views
Engel Nyst - open/acc @engelnyst ·
Interesting! They tested with @OpenHandsDev and got even better performance than the native CLIs from the LLM providers (claude code, codex, gemini-cli) 👀 No, I don't know why. We have done some multimodal improvements, I wouldn't say a lot, and here we are. 😅
Quoting Wayne Chi @iamwaynechi:

New preprint alert 🚨 Can LLM agents develop video games? We release GameDevBench, the first benchmark evaluating agentic game development in a game engine, Godot. We also present two simple multimodal feedback mechanisms that lead to immediate performance gains. /🧵

1 reply · 0 reposts · 3 likes · 142 views