Wayne Chi
447 posts

@iamwaynechi
CS Ph.D. at @SCSatCMU. Funded by @NDSEG Fellowship. Editor at https://t.co/kBygvj9hF0.



Benchmarks? Where we’re going, we don’t need benchmarks.


Indie game Slay the Spire 2 has surpassed 500,000 concurrent players on Steam. The roguelike is now in the top 20 games with the highest all-time peaks on Valve's platform.


Just recorded a step-by-step walkthrough of how I vibe code games using Codex and Claude Code. I implement new features 'live' in the recording, showing how I get the most out of GPT and Opus. Full video hopefully landing tomorrow! It's going to be a good one... don't miss it!


New preprint alert 🚨 Can LLM agents develop video games? We release GameDevBench, the first benchmark evaluating agentic game development in a game engine, Godot. We also present two simple multimodal feedback mechanisms that lead to immediate performance gains. /🧵

Bloated patches: LM-generated solutions to SWE-bench tasks are consistently longer than human-written gold solutions (and it's not just comments) 🧵




We find Gemini 3 Flash in the Gemini CLI to be by far the most cost-effective model, with the second-best performance at low cost. Claude Sonnet 4.5 and ChatGPT Codex 5.1 actually perform better in OpenHands than in their own native agentic frameworks (Claude Code / Codex). This performance, however, comes at increased cost.












