Maxim Saplin
@msmxm
23 posts · Joined October 2010
13 Following · 18 Followers
Maxim Saplin@msmxm·
GH Copilot (VS Code Insiders preview) has added context window stats... eventually. And what a discovery: GPT-5.2 gets just a 128K context window (out of the 272K the model supports)
[image attached]
Maxim Saplin retweeted
Chenguang Wang (hiring)@ChenguangWang·
♟️ Excited to share our work LLM Chess! It’s a clean, scalable benchmark showing that even today’s top LLMs still struggle with strategic reasoning and instruction-following in dynamic environments.
📄 Paper: arxiv.org/abs/2512.01992
🏆 Leaderboard: maxim-saplin.github.io/llm_chess/
💻 Code: github.com/maxim-saplin/l…
🎯 Why chess? Chess is the original AI challenge: strategic, long-horizon, and grounded. It’s also a clean test for LLMs: no contamination, no memorization, and difficulty scales with progress.
🔑 • 50+ models, including o3 @OpenAI, Gemini @Google, Claude @AnthropicAI, DeepSeek @deepseek_ai, Llama @Meta, and @Alibaba_Qwen, evaluated via agentic gameplay.
• Reasoning models do much better than non-reasoning ones, yet many still can’t beat random play.
• Top models reach ~758 Elo: good, but nowhere near strong humans.
🧑‍🤝‍🧑 Thank you, amazing collaborators @msmxm, @SaiKolasani1, @nrcrispino, @kylepmont, @matei_zaharia, @jaredq_, @Chi_Wang_!
📍 The work will also be presented at the NeurIPS FoRLM Workshop on Sun, Dec 7, 3:00–4:15pm PT in Upper Level Room 33ABC. Come chat with us and check out the live leaderboard!
[images attached]
Maxim Saplin retweeted
Nicholas Crispino@NRCrispino·
Excited to share our latest work, now on arXiv and at FoRLM @ NeurIPS'25! 🎉
Introducing **LLM Chess**: a benchmark for evaluating reasoning and instruction-following in LLMs through chess. LLMs now match experts in math & coding, but can they *reason* in dynamic, multi-step strategic environments? We tested 50+ models. The results? Many models struggle to beat an opponent making *random* moves, and even powerful reasoning models cannot beat a *weak, low-skilled opponent*.
Why chess? It's been the "drosophila of AI" since the 1950s, used as a measuring stick for AI progress and a testbed for planning, strategy, and long-horizon decision-making. Unlike static benchmarks that get contaminated or saturated, chess offers:
✅ Dynamic, stochastic gameplay
✅ Adjustable difficulty via engine skill
✅ Resistance to memorization
Our setup: LLMs play in an agentic environment, making moves through tool calls.
**Phase 1:** 50+ models play 30 games each vs a random agent, a simple test that many models *fail* due to instruction-following failures or poor play.
**Phase 2:** Top reasoning models face the Komodo Dragon engine at Elo settings from 250 to 1375, giving performance estimates grounded in the real world (tied to chess.com Elo).
Key findings for Phase 1:
♟️ Reasoning models crush non-reasoning ones: **45.4% vs 0.7%** win rate, with many models struggling to reach even a 50% win/loss ratio vs a random player
♟️ Instruction failures are **3× higher** in non-reasoning models (71.9% vs 24.4%)
♟️ Test-time scaling of reasoning effort boosts performance by up to **+20%**
Key findings for Phase 2:
📉 The best LLM we tested (o3-low) peaks at only **~758 Elo**. While LLMs match experts in math & coding, they play chess around the level of the average online player (~611 Elo on chess.com) and far below human masters (~2800 Elo).
🔄 LLM Chess is extensible. As models improve, we scale the difficulty. No saturation, no contamination.
Check it out and let us know what you think!
We are continually evaluating more models on the benchmark. Come and see us at the FoRLM workshop at 3:00-4:15pm on Sunday December 7th, 2025 @ Upper Level Room 33ABC at NeurIPS! 📄 Paper: arxiv.org/abs/2512.01992 🏆 Leaderboard: maxim-saplin.github.io/llm_chess/ 💻 Code: github.com/maxim-saplin/l… Huge thanks to @msmxm, @SaiKolasani1, @nrcrispino, @kylepmont, @matei_zaharia, @jaredq, @Chi_Wang_, @ChenguangWang 🙏
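To put the ~758 Elo figure in context, the standard Elo expected-score formula (my own illustration, not code from the benchmark) predicts how often a 758-rated player would score against the ~611-rated average online player mentioned above:

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Standard Elo expected score of player A against player B."""
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

# A ~758 Elo model vs. a ~611 Elo average online player:
e = expected_score(758, 611)
print(round(e, 2))  # 0.7 -> roughly a 70% expected score per game
```

So the gap between the best model and the average online player, while real, is far smaller than the gap to a ~2800-rated master, where the same formula gives an expected score near zero.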
[images attached]
Maxim Saplin@msmxm·
Putting it into perspective: Llama 3, released on April 18, 2024, was pre-trained on 15 trillion tokens. Llama 4 used 40T, almost a 3x increase.
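For the record, "almost a 3x increase" checks out (token counts taken from the tweet):

```python
llama3_tokens = 15e12  # Llama 3: ~15T pre-training tokens
llama4_tokens = 40e12  # Llama 4: ~40T pre-training tokens
print(round(llama4_tokens / llama3_tokens, 2))  # 2.67, i.e. almost 3x
```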
Maxim Saplin@msmxm·
AI saves hours by nailing down this sort of typo in your code:
[image attached]
Maxim Saplin@msmxm·
A quick speed test of a devbox, measuring the TypeScript build time:
git clone github.com/microsoft/Type…
cd TypeScript
npm install
time npm run build
A few results:
i5 13600KF OC, Desktop, WSL2: 16s
M4 Pro MBP 16: 19.2s
i7-8850H, Win, WSL2: 40.7s
i7-8850H, Win: 53.5s
i7-8850H, MBP: 52.9s
Jason Wei@_jasonwei·
Throughout my time at OpenAI I have found Hongyu Ren to be absolutely ruthless. No mercy for evaluation benchmarks whatsoever: o3-mini is 83% on AIME, 2000+ Codeforces Elo. Every o*-mini model is so performant and fast. Congrats @ren_hongyu @shengjia_zhao and crew!
Hongyu Ren@ren_hongyu

o3-mini is here! Together with @shengjia_zhao, @_kevinlu, @max_a_schwarzer, @ericmitchellai, @brian_zq, @sandersted and many others, we trained this efficient reasoning model, maximally compressing the intelligence of its big brothers o1/o3. The model is very good at hard math/coding/science questions at a fraction of the cost and latency, defining a new cost-efficient reasoning frontier.
The model supports three different reasoning levels; users can adjust the thinking time based on the use case. The longer the model thinks, the better its capability.
With o3-mini-low, we drastically reduce latency compared to o1-mini, achieving GPT-4o-level response latency. You can apply for early access to this model today to do safety testing! openai.com/index/early-ac…

Maxim Saplin@msmxm·
Oneteen onety one (HuggingFaceTB/SmolLM-135M)
[image attached]
Maxim Saplin@msmxm·
@bindureddy With HumanEval having been available on the internet for more than 3 years, I see no point in reporting this number: most models have likely memorized it multiple times over.
Bindu Reddy@bindureddy·
OpenAI just released some of its own benchmarks and evals showcasing GPT-4's clear dominance. The stand-out number is HumanEval, which measures the coding abilities of LLMs. GPT-4 blows everyone out of the water!!
They are also pretty snarky about Claude Opus: they claim the reported numbers differ from the numbers independently verified via the API 🤣🤣
Gemini is also weak in math and code, which we can confirm.
Net-net: for simple and fast calls, use a local LLM/Haiku. For hard calls, use GPT-4.
Thanks, OpenAI, for these evals, the open-source git repo, and generally being slightly more transparent 🙏🙏
[image attached]
Maxim Saplin@msmxm·
HuggingFace's dataset collection is a treasure...
[image attached]
Maxim Saplin@msmxm·
Sundman's general solution to the three-body problem would require at least 10^8,000,000 iterations to compute the coordinates of the moving planets. There are ~10^80 atoms in the known universe.
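The gap between those two numbers is hard to grasp; a quick sketch (exponents taken from the tweet) makes it concrete:

```python
iterations_exp = 8_000_000  # Sundman's series: ~10**8,000,000 iterations
atoms_exp = 80              # ~10**80 atoms in the known universe

# The iteration count exceeds the atom count by a factor of 10**(difference).
print(iterations_exp - atoms_exp)  # 7999920
```

In other words, even using every atom in the universe as a computer would leave the count short by nearly eight million orders of magnitude.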
Maxim Saplin@msmxm·
Interestingly, the people who recently handled my cassette player all began their inspection of the device by trying to open the lid to look inside and pulling out the cassette. Only after that did they start pushing buttons or listening to the sound.
[image attached]
Maxim Saplin@msmxm·
@HJR4711 @TheDotNetDev The guys who use SIMD or boost LLM inference performance via precision reduction will not agree ;)
Maxim Saplin@msmxm·
"BCG consultants solving business problems with OpenAI’s GPT-4 performed 23% worse than those without it, new study finds," the Fortune headline says. Yet the same study reports that "using GPT-4 for creative product innovation outperformed the control group (those who completed the task without GPT-4) by 40%".