Bobby

24 posts

@bobbycxy

Researching AI @ASTARsg

Singapore · Joined February 2020
113 Following · 114 Followers
Bobby retweeted
Kevin Patrick Murphy @sirbayes
I am pleased to share our 'AutoHarness' paper (ICLR'26 workshop), which uses LLM-based code synthesis to generate a Python harness around an LLM policy. AutoHarness + small Gemini Flash beats Gemini-2.5-Pro and GPT-5.2-High on #TextArena games! openreview.net/forum?id=g9rEY…
Replies: 2 · Retweets: 14 · Likes: 115 · Views: 12.3K
Bobby @bobbycxy
@jeffclune Thank you, Prof. @jeffclune. Your life’s work on how we can get to superintelligence is very inspiring.
Replies: 0 · Retweets: 0 · Likes: 1 · Views: 54
Jeff Clune @jeffclune
Tomorrow (Sunday), 10:15–10:35: Keynote Talk 3 (Jeff Clune) at the MindGames workshop, if you are interested. I'll try to make it fun and controversial! mindgamesarena.com
Replies: 2 · Retweets: 7 · Likes: 46 · Views: 15.5K
Bobby @bobbycxy
@LeonGuertler @xai Beyond happy for you, bro. You’re a generational talent with a mandate. Let it rip!
Replies: 1 · Retweets: 0 · Likes: 4 · Views: 818
León @LeonGuertler
Super excited to share that I've joined @xai! At last, Sunday nights at the office aren't lonely anymore haha
Replies: 93 · Retweets: 23 · Likes: 1.9K · Views: 346.9K
Bobby retweeted
Simon Yu @simon_ycl
Exciting to see more work on "Game as Benchmark", which is similar to our idea of TextArena (led by @LeonGuertler) for benchmarking models on >60 games. Though you can see GM @MagnusCarlsen's comments on LLMs' chess play 🔥
Quoted post by Demis Hassabis @demishassabis:

Thrilled to announce the @Kaggle Game Arena, a new leaderboard testing how modern LLMs perform on games (spoiler: not very well atm!). AI systems play each other, making it an objective & evergreen benchmark that will scale in difficulty as they improve. kaggle.com/game-arena

Replies: 0 · Retweets: 9 · Likes: 60 · Views: 8.2K
Bobby retweeted
will brown @willccbb
something we've lost in the blogification of research is that citing prior work is often just not done at all, even when said work is quite similar + already broadly adopted (in this case, TextArena). especially sad when it's a big lab steamrolling the efforts of smaller teams
Quoted post by Demis Hassabis @demishassabis:

Thrilled to announce the @Kaggle Game Arena, a new leaderboard testing how modern LLMs perform on games (spoiler: not very well atm!). AI systems play each other, making it an objective & evergreen benchmark that will scale in difficulty as they improve. kaggle.com/game-arena

Replies: 12 · Retweets: 17 · Likes: 416 · Views: 42.9K
Bobby retweeted
Kevin Wang @KevinWang_111
Excited to announce the MindGames @NeurIPS Competition is officially LIVE! 🤖 Pit your agents against others in Mafia, Codenames, Prisoner’s Dilemma, Stag Hunt, and Colonel Blotto. Sign up now for $500 in compute credits on your initial run! 🔗 Register: mindgamesarena.com
Replies: 8 · Retweets: 18 · Likes: 80 · Views: 20.9K
Bobby @bobbycxy
3/4 Open source feels like the right way to build. And as research shifts to interactive agents and evals, we believe platforms like TextArena will become increasingly useful. We're not experts, but if you're building something similar, @LeonGuertler and I would be more than happy to offer our opinions and help.
Replies: 1 · Retweets: 0 · Likes: 3 · Views: 182
Bobby @bobbycxy
1/4 It turns out that neither @LeonGuertler nor I had any frontend/backend (or SWE) experience when building TextArena. In the early days, we looked for open-source projects we could learn from and build on. Unfortunately, there weren't many avenues we could turn to, apart from OpenAI's gym environment :)
Replies: 1 · Retweets: 1 · Likes: 13 · Views: 758
Bobby retweeted
León @LeonGuertler
For the past ~2 months we have been working on training reasoning models on TextArena games. The first paper (introducing what we think is a very promising new paradigm) will hopefully be up later this week / early next; and the second one, focusing on the "scaling laws" of self-play and some additional analysis, tentatively around the 18th of July. However, to get more feedback on the structure and implementation, we want to open-source the code now. UnstableBaselines is a very simple Async, Online, Multi-Turn, Multi-Agent RL library built on vLLM and Ray. The code is pretty readable and around 1.2k lines long (and includes a cool rendering interface that you can run via "unstable-terminal") 1/7
Replies: 2 · Retweets: 47 · Likes: 304 · Views: 44.3K
Bobby retweeted
León @LeonGuertler
TextArena is live on arXiv! We present a benchmark of 57+ competitive text-based games to evaluate and train LLMs on agentic behavior — including negotiation, deception, theory of mind and many more. Real-time TrueSkill. Multiplayer support. Human-vs-models. Model-vs-model. Perfect environment for Multi-Agent, multi-turn reasoning and Planning! [1/N]
Replies: 8 · Retweets: 41 · Likes: 211 · Views: 27.9K
León @LeonGuertler
It's almost romantic; Qwen-plus and Sonnet complete each other's soft skills.
Replies: 1 · Retweets: 1 · Likes: 4 · Views: 266
Bobby @bobbycxy
@LeonGuertler and I have started benchmarking models on Mario on VideoGameArena. And @GoogleDeepMind’s Gemini 2.0 Flash is the first to clear Stage 1-1! We'll be adding more models. Check out the current leaderboard here: videogamearena.ai/mario
Replies: 0 · Retweets: 1 · Likes: 5 · Views: 265