Nathan Cloos

62 posts

@nacloos

PhD at MIT | prev @MSFTResearch

Joined April 2014
641 Following · 557 Followers
Pinned Tweet
Nathan Cloos (@nacloos)
Can LLMs play the game Baba Is You?🧩 In our new @icmlconf workshop paper, we show GPT-4o and Gemini-1.5-Pro fail dramatically in environments where both objects and rules must be manipulated! Here is an example of correct gameplay: (1/n)
21 replies · 81 reposts · 452 likes · 79.3K views
Nathan Cloos (@nacloos)
@MattPRD @moltbook Building clawblox.com, a Roblox-like game engine to make it easy for agents to implement multi-player 3D games and to play them with LLM-friendly APIs.
1 reply · 0 reposts · 1 like · 131 views
Matt Schlicht (@MattPRD)
Are you *making something agents want*? I might want to feature you on @moltbook, the only community of AI agents on the planet. Please reply here if you are building a service/app/product where an AI agent is your end user. I will reach out to you 🦞
354 replies · 44 reposts · 390 likes · 146.7K views
Nathan Cloos retweeted
Lance Ying (@LanceYing42)
Today we present a new framework for measuring human-like general intelligence in machines (what some people call AGI). Conventional AI benchmarks today assess only narrow capabilities in a limited range of human activities. We propose that a more promising way to evaluate human-like general intelligence in AI systems is through a particularly strong form of general game playing: studying how and how well they play and learn to play all conceivable human games, what we call the "Multiverse of Human Games". Taking a first step towards this vision, we introduce the AI GameStore, a scalable and open-ended platform that uses LLMs with humans-in-the-loop to automatically construct standardized and containerized variants of popular human games on digital gaming platforms. As a proof of concept, we generated 100 such games based on the top charts of the Apple App Store and Steam, and evaluated seven frontier vision-language models (VLMs) on short episodes of play. The best models achieved less than 10% of the human average score on the majority of the games. Check out our website to play the games, see how agents play, and build agents to solve them!
4 replies · 28 reposts · 112 likes · 19.5K views
Nathan Cloos retweeted
Hansen Lillemark (@hansenlillemark)
State-of-the-art World Models still lack a unified world memory for representing and predicting dynamics out of their field of view. Why is that, and how can we fix it? Introducing Flow Equivariant World Models: models with memory capable of predicting out of view dynamics!🧵⬇️
17 replies · 101 reposts · 753 likes · 89.5K views
Nathan Cloos retweeted
xAI (@xai)
Grok Play: Enjoy and create multiplayer games where your Grok Owl can climb the leaderboard by playing against you, your friends, your friends' Owls, and itself. @nacloos @961014dltkdg
47 replies · 82 reposts · 1.1K likes · 339.1K views
Nathan Cloos retweeted
Ilia Sucholutsky (@sucholutsky)
🧵🎉 Our mega-paper is finally published in TMLR! We're "Getting Aligned on Representational Alignment" - the degree to which internal representations of different (biological & artificial) information processing systems agree. 🧠🤖🔬🔍 #CognitiveScience #Neuroscience #AI
5 replies · 37 reposts · 149 likes · 34.1K views
Nathan Cloos retweeted
Davide Paglieri (@PaglieriDavide)
A new challenger has entered the ring 🥉 This week’s entry on balrogai.com takes third place: Reka Flash 3, a 21B reasoning model from @RekaAILabs, dominates similarly sized reasoning models like DeepSeek-R1-Distill-Qwen 32B on BALROG’s toughest agentic tasks! 🧵
1 reply · 11 reposts · 46 likes · 17.2K views
Andrej Karpathy (@karpathy)
For friends of open source: imo the highest leverage thing you can do is help construct a high diversity of RL environments that help elicit LLM cognitive strategies. To build a gym of sorts. This is a highly parallelizable task, which favors a large community of collaborators.
315 replies · 821 reposts · 8.4K likes · 1.2M views
Paul Calcraft (@paul_cal)
The story of LLMs playing games, and what we know so far Tic Tac Toe, Chess, Minecraft, NYT Connections, Wordle, Pictionary, Connect 4, Codenames, Snake... 1/n
22 replies · 108 reposts · 1K likes · 248.7K views
Nathan Cloos retweeted
Davide Paglieri (@PaglieriDavide)
DeepSeek performed well where short-term reasoning and planning are key. 🧩 CoT traces showed strong intuitive reasoning, enough to solve the tricky “baba is ai” puzzle. Breaking “wall is stop” to reach the ball proved it can handle complex logic. ⚙️
1 reply · 1 repost · 8 likes · 1.1K views
Nathan Cloos (@nacloos)
Naming conventions with too few names led to consistency errors when comparing CKA implementations across papers. We iteratively refined our naming convention to resolve inconsistencies while keeping naming complexity low. (5/6)
1 reply · 0 reposts · 1 like · 292 views
Nathan Cloos (@nacloos)
Update on our similarity repository 🚨 More than 200 similarity measures across 32 papers are now registered! We'll also be presenting our work as an oral at the @NeurIPSConf @unireps workshop! (1/6)
2 replies · 6 reposts · 38 likes · 3.6K views