Bobby

24 posts

Bobby

@bobbycxy

Researching AI @ASTARsg

Singapore Joined February 2020

113 Following 114 Followers

Bobby retweeted

Kevin Patrick Murphy@sirbayes·5 Mar

I am pleased to share our 'AutoHarness' paper (ICLR'26 ws), which uses LLM-based code synthesis to generate a Python harness around an LLM policy. AutoHarness+small Gemini Flash beats Gemini-2.5-Pro and GPT-5.2-High on #TextArena games! openreview.net/forum?id=g9rEY…

115

12.3K

Bobby@bobbycxy·8 Dec

@jeffclune Thank you prof @jeffclune - your life’s work is very inspiring on how we can get to super intelligence.

Jeff Clune@jeffclune·7 Dec

Tomorrow/Sunday 10:15–10:35 Keynote Talk 3 (Jeff Clune) MindGames workshop if you are interested. I'll try to make it fun and controversial! mindgamesarena.com

15.5K

Bobby@bobbycxy·13 Oct

@LeonGuertler @xai Beyond happy for you bro.. You’re a generational talent with a mandate. Let it ripp!

818

León@LeonGuertler·12 Oct

Super excited to share that I've joined @xai! At last, Sunday nights at the office aren't lonely anymore haha

1.9K

346.9K

Bobby retweeted

Simon Yu@simon_ycl·5 Aug

Exciting to see more work on "Game as Benchmark", which is similar to our idea of TextArena (led by @LeonGuertler) for benchmarking models on >60 games. though you can see GM @MagnusCarlsen's comments on LLMs chess play 🔥

Demis Hassabis@demishassabis

Thrilled to announce the @Kaggle Game Arena, a new leaderboard testing how modern LLMs perform on games (spoiler: not very well atm!). AI systems play each other, making it an objective & evergreen benchmark that will scale in difficulty as they improve. kaggle.com/game-arena

8.2K

Bobby retweeted

will brown@willccbb·5 Aug

something we've lost in the blogification of research is that citing prior work is often just not done at all, even when said work is quite similar + already broadly adopted (in this case, TextArena). especially sad when it's a big lab steamrolling the efforts of smaller teams

Demis Hassabis@demishassabis

416

42.9K

Bobby retweeted

Kevin Wang@KevinWang_111·17 Jul

Excited to announce the MindGames @NeurIPS Competition is officially LIVE! 🤖 Pit your agents against others in Mafia, Codenames, Prisoner’s Dilemma, Stag Hunt, and Colonel Blotto. Sign up now for $500 in compute credits on your initial run! 🔗 Register: mindgamesarena.com

20.9K

Bobby@bobbycxy·16 Jul

4/4 Blog: github.com/TextArena/Text… PyPI: github.com/LeonGuertler/T… Frontend: github.com/LeonGuertler/T… Matchmaking: github.com/TextArena/Text… Serverless: github.com/TextArena/Text…

159

Bobby@bobbycxy·16 Jul

3/4 Open source feels like the right way to build. And as research shifts to interactive agents and evals, we believe platforms like TextArena will become increasingly useful. We're not experts, but if you're building something similar, @LeonGuertler and I would be more than happy to offer our opinions and help.

182

Bobby@bobbycxy·16 Jul

1/4 It turns out that neither @LeonGuertler nor I had any frontend/backend (or SWE) experience when building TextArena. In the early days, we tried looking for open-source projects we could learn and start from. Unfortunately, there weren't many avenues we could turn to, apart from OpenAI's gym environment :)

758

Bobby retweeted

León@LeonGuertler·23 Jun

For the past ~2 months we have been working on training reasoning models on TextArena games. The first paper (introducing what we think is a very promising new paradigm) will hopefully be up later this week / early next; and the second one, focusing on the "scaling laws" of self-play and some additional analysis, tentatively around the 18th of July. However, to get more feedback on the structure and implementation, we want to open-source the code now. UnstableBaselines is a very simple Async, Online, Multi-Turn, Multi-Agent RL library built on vLLM and Ray. The code is pretty readable and around 1.2k lines long (and includes a cool rendering interface that you can run via "unstable-terminal"). 1/7

304

44.3K

Bobby@bobbycxy·22 Apr

@Souradip3000 @_akhaliq @HuggingPapers Really appreciate that! Hope you have fun experimenting—curious to hear which games you end up liking most!

Souradip Pal@Souradip3000·22 Apr

@bobbycxy @_akhaliq @HuggingPapers Can't wait to try all those text games out while experimenting. Thanks for creating TextArena. Looks great!

Bobby@bobbycxy·21 Apr

Thank you @_akhaliq and @HuggingPapers for sharing our work. Appreciate it!

DailyPapers@HuggingPapers

TextArena is now on Hugging Face. An open-source collection of competitive text-based games for LLMs, spanning 57+ unique environments.

13.7K

Bobby retweeted

León@LeonGuertler·16 Apr

TextArena is live on arXiv! We present a benchmark of 57+ competitive text-based games to evaluate and train LLMs on agentic behavior — including negotiation, deception, theory of mind and many more. Real-time TrueSkill. Multiplayer support. Human-vs-models. Model-vs-model. Perfect environment for Multi-Agent, multi-turn reasoning and Planning! [1/N]

211

27.9K

Bobby retweeted

León@LeonGuertler·12 Mar

Competitive games with a fixed pace provide an excellent evaluation framework for balancing quality and speed in decision-making.

León@LeonGuertler

Some intense fighting between Gemini Flash 2.0 and GPT-4o-mini. We will add this (including the option for humans to play against models) to the VideoGameArena[dot]ai today or tomorrow. If you have other game suggestions, please let us know!

613

Bobby@bobbycxy·6 Mar

@LChoshen @LeonGuertler textarena.ai/leaderboard/qw…

Leshem (Legend) Choshen 🤖🤗@LChoshen·5 Mar

@LeonGuertler Have a link?

León@LeonGuertler·4 Mar

It's almost romantic; Qwen-plus and Sonnet complete each other's soft skills.

266

Bobby@bobbycxy·4 Mar

@LeonGuertler and I have started benchmarking models on Mario on VideoGameArena. And @GoogleDeepMind’s Gemini 2.0 Flash is the first to clear Stage 1-1! We'll be adding more models. Check out the current leaderboard here videogamearena.ai/mario