Maxim Saplin
@msmxm
23 posts · Joined October 2010
13 Following · 18 Followers
Maxim Saplin@msmxm·
GH Copilot (VS Code Insiders preview) has added context window stats... eventually. And what a discovery: GPT-5.2 gets just a 128K context window (out of the 272K the model supports)
[image attached]
Maxim Saplin retweeted
Chenguang Wang (hiring)@ChenguangWang·
♟️ Excited to share our work LLM Chess! It’s a clean, scalable benchmark showing that even today’s top LLMs still struggle with strategic reasoning and instruction-following in dynamic environments.
📄 Paper: arxiv.org/abs/2512.01992
🏆 Leaderboard: maxim-saplin.github.io/llm_chess/
💻 Code: github.com/maxim-saplin/l…
🎯 Why chess? Chess is the original AI challenge: strategic, long-horizon, and grounded. It’s also a clean test for LLMs: no contamination, no memorization, and difficulty scales with progress.
🔑 • 50+ models, including o3 @OpenAI, Gemini @Google, Claude @AnthropicAI, DeepSeek @deepseek_ai, Llama @Meta, and @Alibaba_Qwen, evaluated via agentic gameplay.
• Reasoning models do much better than non-reasoning ones, yet many still can’t beat random play.
• Top models reach ~758 Elo: good, but nowhere near strong humans.
🧑‍🤝‍🧑 Thank you, amazing collaborators @msmxm, @SaiKolasani1, @nrcrispino, @kylepmont, @matei_zaharia, @jaredq_, @Chi_Wang_!
📍 The work will also be presented at the NeurIPS FoRLM Workshop on Sun, Dec 7, 3:00–4:15pm PT in Upper Level Room 33ABC. Come chat with us and check out the live leaderboard!
[images attached]
Maxim Saplin retweeted
Nicholas Crispino@NRCrispino·
Excited to share our latest work, now on arXiv and at FoRLM @ NeurIPS'25! 🎉
Introducing **LLM Chess**: a benchmark for evaluating reasoning and instruction-following in LLMs through chess. LLMs now match experts in math & coding, but can they *reason* in dynamic, multi-step strategic environments? We tested 50+ models. The results? Many models struggle to beat an opponent making *random* moves, and even powerful reasoning models cannot beat a *weak, low-skilled opponent*.
Why chess? It's been the "drosophila of AI" since the 1950s, used as a measuring stick for AI progress and a testbed for planning, strategy, and long-horizon decision-making. Unlike static benchmarks that get contaminated or saturated, chess offers:
✅ Dynamic, stochastic gameplay
✅ Adjustable difficulty via engine skill
✅ Resistance to memorization
Our setup: LLMs play in an agentic environment, making moves through tool calls.
**Phase 1:** 50+ models play 30 games each vs a random agent, a simple test that many models *fail* due to instruction-following failures or poor play.
**Phase 2:** Top reasoning models face the Komodo Dragon engine at Elo settings from 250 to 1375, giving performance estimates grounded in the real world (tied to chess.com Elo).
Key findings for Phase 1:
♟️ Reasoning models crush non-reasoning ones: **45.4% vs 0.7%** win rate, with many models struggling to reach even a 50% win/loss ratio vs a random player
♟️ Instruction failures are **3× higher** in non-reasoning models (71.9% vs 24.4%)
♟️ Test-time scaling of reasoning effort boosts performance by up to **+20%**
Key findings for Phase 2:
📉 The best LLM we tested (o3-low) peaks at only **~758 Elo**. While LLMs match experts in math & coding, they play chess around the level of the average online player (~611 Elo on chess.com) and far below human masters (~2800 Elo).
🔄 LLM Chess is extensible. As models improve, we scale the difficulty. No saturation, no contamination.
Check it out and let us know what you think!
We are continually evaluating more models on the benchmark. Come and see us at the FoRLM workshop at 3:00-4:15pm on Sunday December 7th, 2025 @ Upper Level Room 33ABC at NeurIPS! 📄 Paper: arxiv.org/abs/2512.01992 🏆 Leaderboard: maxim-saplin.github.io/llm_chess/ 💻 Code: github.com/maxim-saplin/l… Huge thanks to @msmxm, @SaiKolasani1, @nrcrispino, @kylepmont, @matei_zaharia, @jaredq, @Chi_Wang_, @ChenguangWang 🙏
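To put the ~758 Elo figure in context, the standard Elo expected-score formula (my own illustration, not code from the benchmark) predicts how often a 758-rated player would score against the ~611-rated average online player mentioned above:

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Standard Elo expected score of player A against player B."""
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

# A ~758 Elo model vs. a ~611 Elo average online player:
e = expected_score(758, 611)
print(round(e, 2))  # 0.7 -> roughly a 70% expected score per game
```

So the gap between the best model and the average online player, while real, is far smaller than the gap to a ~2800-rated master, where the same formula gives an expected score near zero.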
[images attached]
Maxim Saplin@msmxm·
Putting it into perspective: Llama 3, released on April 18, 2024, was pre-trained on 15 trillion tokens. Llama 4 used 40T, almost a 3x increase.
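For the record, "almost a 3x increase" checks out (token counts taken from the tweet):

```python
llama3_tokens = 15e12  # Llama 3: ~15T pre-training tokens
llama4_tokens = 40e12  # Llama 4: ~40T pre-training tokens
print(round(llama4_tokens / llama3_tokens, 2))  # 2.67, i.e. almost 3x
```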
Maxim Saplin@msmxm·
AI saves hours by nailing down this sort of typo in your code:
[image attached]
Maxim Saplin@msmxm·
A quick speed test of a devbox, measuring the TypeScript build time:
git clone github.com/microsoft/Type…
cd TypeScript
npm install
time npm run build
A few results:
i5 13600KF OC, Desktop, WSL2: 16s
M4 Pro MBP 16: 19.2s
i7-8850H, Win, WSL2: 40.7s
i7-8850H, Win: 53.5s
i7-8850H, MBP: 52.9s
Jason Wei@_jasonwei·
Throughout my time at OpenAI I have found Hongyu Ren to be absolutely ruthless. No mercy for evaluation benchmarks whatsoever: o3-mini is 83% on AIME, 2000+ Codeforces Elo. Every o*-mini model is so performant and fast. Congrats @ren_hongyu @shengjia_zhao and crew!
Hongyu Ren@ren_hongyu

o3-mini is here! Together with @shengjia_zhao, @_kevinlu, @max_a_schwarzer, @ericmitchellai, @brian_zq, @sandersted and many others, we trained this efficient reasoning model, maximally compressing the intelligence of its big brothers o1/o3. The model is very good at hard math/coding/science questions at a fraction of the cost and latency, defining a new cost-efficient reasoning frontier.
The model supports three different reasoning levels; users can adjust the thinking time based on the use case. The longer the model thinks, the better its capability.
With o3-mini-low, we drastically reduce latency compared to o1-mini, achieving GPT-4o-level response latency. You can apply for early access to this model today to do safety testing! openai.com/index/early-ac…

Maxim Saplin@msmxm·
Oneteen onety one (HuggingFaceTB/SmolLM-135M)
[image attached]
Maxim Saplin@msmxm·
@bindureddy With HumanEval having been available on the internet for more than 3 years, I see no point in reporting this number: most models have likely memorized it multiple times over.
Bindu Reddy@bindureddy·
OpenAI just released some of its own benchmarks and evals showcasing GPT-4's clear dominance. The stand-out number is HumanEval, which measures the coding abilities of LLMs. GPT-4 blows everyone out of the water!!
They are also pretty snarky about Claude Opus: they claim the reported numbers differ from the numbers independently verified via the API 🤣🤣
Gemini is also weak in math and code, which we can confirm.
Net-net: for simple and fast calls, use a local LLM/Haiku. For hard calls, use GPT-4.
Thanks, OpenAI, for these evals, the open-source git repo, and generally being slightly more transparent 🙏🙏
[image attached]
Maxim Saplin@msmxm·
HuggingFace's dataset collection is a treasure...
[image attached]
Maxim Saplin@msmxm·
Sundman's general solution to the three-body problem would require at least 10^8,000,000 iterations to compute the coordinates of the moving planets. There are ~10^80 atoms in the known universe.
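The gap between those two numbers is hard to grasp; a quick sketch (exponents taken from the tweet) makes it concrete:

```python
iterations_exp = 8_000_000  # Sundman's series: ~10**8,000,000 iterations
atoms_exp = 80              # ~10**80 atoms in the known universe

# The iteration count exceeds the atom count by a factor of 10**(difference).
print(iterations_exp - atoms_exp)  # 7999920
```

In other words, even using every atom in the universe as a computer would leave the count short by nearly eight million orders of magnitude.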
Maxim Saplin@msmxm·
Interestingly, the people who recently handled my cassette player all began their inspection of the device by trying to open the lid to look inside and pulling out the cassette. Only after that did they start pushing buttons or listening to the sound.
[image attached]
Maxim Saplin@msmxm·
@HJR4711 @TheDotNetDev The guys who use SIMD or boost LLM inference performance via precision reduction will not agree ;)
Maxim Saplin@msmxm·
"BCG consultants solving business problems with OpenAI’s GPT-4 performed 23% worse than those without it, new study finds," the Fortune headline says. Yet the same study reports that "using GPT-4 for creative product innovation outperformed the control group (those who completed the task without GPT-4) by 40%".