Brian Bartoldson

322 posts

@bartoldson

ML researcher

USA · Joined October 2016
583 Following · 613 Followers
Pinned Tweet
Brian Bartoldson@bartoldson·
At ICML 2024, we introduced the first scaling laws for adversarial training, improving the *train-time compute* for robustness tradeoff (arxiv.org/abs/2404.09349). At ICLR 2026, we show how to trade *test-time compute* for robustness efficiently. x.com/i/status/20374…
Brian Bartoldson@bartoldson

Give an LLM a spec: more reasoning ➡️ better spec satisfaction. Even on adversarially attacked data. But reasoning benefits fade if attacks are stronger (e.g. white-box or multimodal). Our hypothesis suggests reasoning can stop such attacks. Toy example in the video. 🧵

Brian Bartoldson@bartoldson·
@amolk @sumeetrm Depending on your definition of harness, I think this is close to (if not exactly) what we have. Our "restricted harness" setting allows RLMs to perform operations on the context variable, make sub-LLM calls, etc., but disallows directly coding a solution.
Amol Kelkar@amolk·
@bartoldson @sumeetrm Good point, so how about these 3 tracks? 1. LLM only - single LLM call (no RLM style behind the scene calls allowed) 2. Harness without code generation/execution 3. Harness with code generation/execution
Brian Bartoldson retweeted
Sumeet Motwani@sumeetrm·
LongCoT is adding two new leaderboards! Due to the interest in agents (particularly RLMs), we’re adding a “Restricted Harness” and an “Open Harness” leaderboard. GPT 5.2 RLM from our paper is SOTA on “Open Harness” at 25.12%. We expect tool-use SOTA to exceed this very soon! On “Open Harness”, we allow all tool use and code execution. On “Restricted Harness”, models may manage context, call subagents, etc., but may not write specific solver code (e.g. writing a BlocksWorld or Sudoku solver). We’re particularly excited about this leaderboard, as it allows agents to do their own context management, while sticking to LongCoT’s goal of testing models’ intrinsic reasoning capabilities.
Sumeet Motwani tweet media
Sumeet Motwani@sumeetrm

We’re releasing LongCoT, an incredibly hard benchmark to measure long-horizon reasoning capabilities over tens to hundreds of thousands of tokens. LongCoT consists of 2.5K questions across chemistry, math, chess, logic, and computer science. Frontier models score less than 10%🧵

Amol Kelkar@amolk·
@sumeetrm 2 ensures that all reasoning is done by the LLM. 1 allows external symbolic approaches, algorithms, etc.
Brian Bartoldson@bartoldson·
Finally, we note that LongCoT has two splits: LongCoT-mini (for fast open source research) and LongCoT (for frontier models). While some problems are naturally solvable by symbolic programs (e.g. some chess problems) and require many CoT reasoning steps, others are great tests for decomposition and maintaining important variables across steps (like mathematics and chemistry, which have composed, interdependent sub-problems). We suspect RLMs may be highly effective for all problem types.
Brian Bartoldson@bartoldson·
So, if you want to compare RLMs to no-tools baselines, we suggest ensuring that RLMs are performing this sort of reasoning assistance, rather than just writing Python code (to symbolically solve some problems) that makes it unnecessary for LLMs to reason through the steps of a problem. Concrete tips:
- Look at performance across domains. If chess performance is 80% and chemistry performance is near 0%, the RLM might be writing Python code to simulate chess moves rather than reasoning through the game states as a no-tools baseline does. Such simulations may be harder to code for LongCoT chemistry and mathematics questions.
- Given an LLM, when you see its RLM version boost performance, check the code executed by the RLM. Was the RLM decomposing the problem into simpler ones then providing these to the sub-LLM (this is okay!), or did it write code or import a library that solves the problem directly (LLMs w/out tools can’t do this so it won’t be a fair comparison)?
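The first tip can be sketched in a few lines of Python. This is only an illustration: the `(domain, is_correct)` record format, the helper names, the 0.5 gap threshold, and the example data are all assumptions made up here, not part of LongCoT or any RLM library.

```python
from collections import defaultdict

def accuracy_by_domain(results):
    """Map each domain to its accuracy, given (domain, is_correct) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for domain, is_correct in results:
        totals[domain] += 1
        correct[domain] += int(is_correct)
    return {d: correct[d] / totals[d] for d in totals}

def flag_skewed_domains(acc, gap=0.5):
    """Return domains scoring more than `gap` above the weakest domain."""
    if not acc:
        return []
    floor = min(acc.values())
    return sorted(d for d, a in acc.items() if a - floor > gap)

# Hypothetical results, invented for illustration only.
example = [("chess", True)] * 8 + [("chess", False)] * 2 + [("chemistry", False)] * 10
acc = accuracy_by_domain(example)
flagged = flag_skewed_domains(acc)
```

A flagged domain is not proof of offloading; it is a hint about which executed code to read first.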
Brian Bartoldson@bartoldson·
However, beyond just writing code, RLMs can decompose hard problems into simpler ones, and have the LLM perform intermediate reasoning steps. And this behavior is aligned with what we set out to test with LongCoT. Indeed, in the LongCoT paper, we show a second setting where we verbally ask an RLM (by modifying its system prompt) to avoid solving the problem primarily symbolically, saying it must instead only use code for problem decomposition. As Figure 7 shows, this restriction led to a drop from the 25.1% performance observed with a default RLM. There are likely better strategies than our prompt approach for keeping RLMs comparable to no-tools baselines, and we welcome innovation from the community here.
Brian Bartoldson@bartoldson·
Accordingly, tomorrow, we will begin tracking tool-enabled performance on a separate leaderboard at longcot.ai. Note that we expect this to be saturated much faster than the base leaderboard that doesn’t allow tools.
Brian Bartoldson@bartoldson·
The attention on LongCoT is great! It's far from solved (GPT 5.2 w/out tools gets 9.8%). Out-of-the-box, a GPT 5.2 RLM gets 25% (see Figure 7). Better prompting/training should push RLMs past this. Comparing RLMs to no-tool baselines? See our 🧵of tips x.com/sumeetrm/statu…
Sumeet Motwani@sumeetrm

We’re releasing LongCoT, an incredibly hard benchmark to measure long-horizon reasoning capabilities over tens to hundreds of thousands of tokens. LongCoT consists of 2.5K questions across chemistry, math, chess, logic, and computer science. Frontier models score less than 10%🧵

Brian Bartoldson@bartoldson·
First, some background: The LongCoT paper outlines that, in our primary evaluations, tool use is not allowed. Our main goal is to test whether models can reason through complex problems directly in their chain of thought. If a problem requires simulating the evolution of a chessboard state (e.g.), the LLM must do that in its CoT.
Gabriel Lespérance@GabLesperance·
@dosco @bartoldson same on a lot of the tasks I've tried the RLMs on. That said, i don't think sub-calls are the defining features of RLMs. imo the lift is in the symbolic reasoning / treating code as thinking substrate.
Gabriel Lespérance tweet media
Brian Bartoldson@bartoldson·
Exactly -- we evaluated untrained RLMs on LongCoT and think training to decompose tasks would boost performance. Two eval settings to consider: (1) Python may be used to avoid reasoning (dotted bars) (2) LLM/RLM does all the reasoning (solid bars) x.com/a1zhang/status…
Brian Bartoldson tweet media
alex zhang@a1zhang

Christopher Z. Cui@ccui9·
@raw_works Any chance you have the explicit token counts for the open models / would be willing to share the traces? 👀
Raymond Weitekamp@raw_works·
Ran Qwen3-8B (8.2B dense, open) on LongCoT-Mini. Vanilla: 0/507. dspy.RLM: 33/507 (6.5%). Same model. Same weights. No fine-tuning. The scaffold is doing 100% of the lifting. Context: leaderboard's smallest open MoE is GLM-4.7 at 358B total / 32B active params. Qwen3-8B is ~4x smaller by active params and ~44x smaller by total. A scaffolded 8B dense model matching a GLM-4.7-style overall number (5.9%) on a benchmark designed to reward long-horizon reasoning is the point of RLMs — decomposable problems don't need scale, they need a REPL.
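The size and accuracy ratios quoted in this post can be checked with a few lines of arithmetic. The numbers below are copied from the post itself; only the variable names are made up here.

```python
# Parameter counts in billions, as quoted in the post above.
qwen_params_b = 8.2    # Qwen3-8B, dense
glm_active_b = 32      # GLM-4.7, active parameters
glm_total_b = 358      # GLM-4.7, total parameters

active_ratio = glm_active_b / qwen_params_b   # roughly 4x
total_ratio = glm_total_b / qwen_params_b     # roughly 44x

# Reported dspy.RLM score on LongCoT-Mini: 33 correct out of 507.
rlm_accuracy_pct = 100 * 33 / 507             # roughly 6.5%
```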
Raymond Weitekamp@raw_works

ok so the default DSPy.RLM is literally going to destroy this benchmark before the end of the day. running now for sonnet 4.5... 🏆 Scoreboard (live) RLM: 90/94 (95.7%) Vanilla: 0/94 (0.0%) anyone want to pay for the opus run? 😉

Brian Bartoldson@bartoldson·
We found this offloading seems to happen less on some domains, like chemistry and math, which appear harder to offload to code. This is consistent with the similarity between the solid and dashed bars in the plot above.
Brian Bartoldson@bartoldson·
To illustrate setting (1): when I ran DSPy RLMs on a chess problem from LongCoT-mini, I noticed the sub-LLM tool (`llm_query`) wasn't called -- Python ended up doing the search directly as follows.
```
board = ChessBoard()
for i, move in enumerate(moves):
    board.make_move(move)
...
solution = board_to_fen(board)
```
You can obtain setting (2) by altering DSPy's RLM prompt to discourage such reasoning offloading. Even in this harder setting, I suspect RLMs can be successful (and training to decompose tasks may be critical here).
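One rough way to screen RLM traces for this pattern is to check whether the executed code ever calls `llm_query`. The sketch below assumes you have already extracted the executed code as a list of Python source strings; that trace format and the helper names are hypothetical, and only the standard-library `ast` module is used.

```python
import ast

def calls_function(code, name="llm_query"):
    """True if `code` contains a call to a function or method named `name`."""
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return False  # unparsable snippet: treat as "no call found"
    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            func = node.func
            if isinstance(func, ast.Name) and func.id == name:
                return True
            if isinstance(func, ast.Attribute) and func.attr == name:
                return True
    return False

def offloading_suspects(snippets, name="llm_query"):
    """Indices of snippets that never call `name` (worth reading by hand)."""
    return [i for i, code in enumerate(snippets) if not calls_function(code, name)]

# Hypothetical snippets, written for illustration only.
snippets = [
    "board = ChessBoard()\nfor move in moves:\n    board.make_move(move)\n",
    "answer = llm_query('What is the board state after these moves? ' + str(moves))\n",
]
suspects = offloading_suspects(snippets)
```

The code is only parsed, never run, so undefined names such as `ChessBoard` are harmless. A missing `llm_query` call is a heuristic signal, not a verdict: the flagged code still needs a manual read.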
Brian Bartoldson retweeted
Sumeet Motwani@sumeetrm·
We already do RLM evals on LongCoT (although our benchmark is intended for just models, not scaffolds). Your results in the main post are different from what you have in your comments and are with LongCoT-mini (x.com/raw_works/stat…). We're very excited about RLMs as a direction and are interested in seeing performance go up on our explicit horizon domains (Math/Chemistry/Computer Science).
Sumeet Motwani tweet media
Raymond Weitekamp@raw_works

almost done with the "mini" 🏆 Scoreboard @ 472 shared (28 RLM rows left) RLM 216/472 (45.8%) Vanilla 13/472 ( 2.8%)

Brian Bartoldson retweeted
λux@novasarc01·
what stands out to me from a research perspective is that LongCoT isolates compositional horizon failure rather than just benchmark hardness...the local steps are often tractable but performance collapses when those steps must be coordinated across long dependency graphs with planning, state maintenance and backtracking. i think that makes it much more scientifically valuable than another “hard reasoning” benchmark bcoz it cleanly exposes the gap between step-level competence and trajectory-level reasoning robustness.
Tanishq Mathew Abraham, Ph.D.@iScienceLuvr

LongCoT: Benchmarking Long-Horizon Chain-of-Thought Reasoning "We introduce LongCoT, a scalable benchmark of 2,500 expert-designed problems spanning chemistry, mathematics, computer science, chess, and logic to isolate and directly measure the long-horizon CoT reasoning capabilities of frontier models." "At release, the best models achieve <10% accuracy (GPT 5.2: 9.8%; Gemini 3 Pro: 6.1%) on LongCoT, revealing a substantial gap in current capabilities."

Brian Bartoldson@bartoldson·
@a_karvonen @TheNormanMu Also, at ICLR this year, we'll present work on the relationship between *test compute* scaling and robustness. Consistent with the above blog post, we study the timely problem of prompt injections and consider how reasoning can provide defenses. x.com/i/status/20375…
Brian Bartoldson@bartoldson

At ICML 2024, we introduced the first scaling laws for adversarial training, improving the *train-time compute* for robustness tradeoff (arxiv.org/abs/2404.09349). At ICLR 2026, we show how to trade *test-time compute* for robustness efficiently. x.com/i/status/20374…

Adam Karvonen@a_karvonen·
It would cost a ~GPT-4 training run to get a CIFAR-10 classifier with human-level robustness, or at least a 10-million-fold scale-up from a non-robust CIFAR-10 classifier! From @TheNormanMu's latest blog post.
Adam Karvonen tweet media