Brian Bartoldson (@bartoldson) - Twitter Profile

Pinned Tweet

At ICML 2024, we introduced the first scaling laws for adversarial training, improving the *train-time compute* for robustness tradeoff (arxiv.org/abs/2404.09349). At ICLR 2026, we show how to trade *test-time compute* for robustness efficiently. x.com/i/status/20374…

Brian Bartoldson@bartoldson

Give an LLM a spec: more reasoning ➡️ better spec satisfaction. Even on adversarially attacked data. But reasoning benefits fade if attacks are stronger (e.g. white-box or multimodal). Our hypothesis suggests reasoning can stop such attacks. Toy example in the video. 🧵

Brian Bartoldson@bartoldson·Apr 20

@amolk @sumeetrm Depending on your definition of harness, I think this is close to (if not exactly) what we have. Our "restricted harness" setting allows RLMs to perform operations on the context variable, make sub-LLM calls, etc., but disallows directly coding a solution.

Amol Kelkar@amolk·Apr 20

@bartoldson @sumeetrm Good point, so how about these 3 tracks?
1. LLM only: a single LLM call (no RLM-style behind-the-scenes calls allowed)
2. Harness without code generation/execution
3. Harness with code generation/execution

Brian Bartoldson retweeted

Sumeet Motwani@sumeetrm·Apr 20

LongCoT is adding two new leaderboards! Due to the interest in agents (particularly RLMs), we’re adding a “Restricted Harness” and an “Open Harness” leaderboard.

GPT 5.2 RLM from our paper is SOTA on “Open Harness” at 25.12%. We expect tool-use SOTA to exceed this very soon!

On “Open Harness”, we allow all tool-use and code execution. On “Restricted Harness”, models may manage context, call subagents, etc., but may not write specific solver code (e.g. writing a BlocksWorld or Sudoku solver). We’re particularly excited about this leaderboard, as it allows agents to do their own context management, while sticking to LongCoT’s goal of testing models’ intrinsic reasoning capabilities.

Sumeet Motwani@sumeetrm

We’re releasing LongCoT, an incredibly hard benchmark to measure long-horizon reasoning capabilities over tens to hundreds of thousands of tokens. LongCoT consists of 2.5K questions across chemistry, math, chess, logic, and computer science. Frontier models score less than 10%🧵
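The leaderboard rules announced above can be summarised as a small permission table. This is only a sketch of how the tracks differ, not anything from the LongCoT codebase: the `Track` class, its field names, the `eligible_tracks` helper, and the label "Base (no tools)" are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Track:
    """Hypothetical description of what a leaderboard track permits."""
    name: str
    context_management: bool  # may the agent manage its own context?
    sub_agent_calls: bool     # may it call sub-LLMs / subagents?
    solver_code: bool         # may it write problem-specific solver code?


# Permissions paraphrased from the announcement; the base track allows no tools.
TRACKS = [
    Track("Base (no tools)", False, False, False),
    Track("Restricted Harness", True, True, False),
    Track("Open Harness", True, True, True),
]


def eligible_tracks(manages_context, calls_sub_agents, writes_solver_code):
    """Return the names of tracks whose rules a run with these behaviours satisfies."""
    return [
        t.name
        for t in TRACKS
        if (t.context_management or not manages_context)
        and (t.sub_agent_calls or not calls_sub_agents)
        and (t.solver_code or not writes_solver_code)
    ]


# A run that decomposes via subagents but writes no solver code:
decomposing_run = eligible_tracks(True, True, False)
# A run that codes up a solver directly:
solver_run = eligible_tracks(True, True, True)
print(decomposing_run)
print(solver_run)
```

Under these assumptions, a run that writes a dedicated solver only fits the open track, which matches the distinction the announcement draws.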

Brian Bartoldson@bartoldson·Apr 20

@amolk @sumeetrm I like this suggestion. A potential issue: an LLM can code up a domain-specific symbolic algorithm using only general purpose tools in a REPL. For example, I applied an RLM to a LongCoT chess problem, and it coded then ran a chessboard simulator. x.com/i/status/20452…

Brian Bartoldson@bartoldson

To illustrate setting (1): when I ran DSPy RLMs on a chess problem from LongCoT-mini, I noticed the sub-LLM tool (`llm_query`) wasn't called -- Python ended up doing the search directly as follows.

```
board = ChessBoard()
for i, move in enumerate(moves):
    board.make_move(move)
...
solution = board_to_fen(board)
```

You can obtain setting (2) by altering DSPy's RLM prompt to discourage such reasoning offloading. Even in this harder setting, I suspect RLMs can be successful (and training to decompose tasks may be critical here).

Amol Kelkar@amolk·Apr 20

@sumeetrm 2 ensures that all reasoning is done by the LLM. 1 allows external symbolic approaches, algorithms, etc.

Brian Bartoldson retweeted

Sumeet Motwani@sumeetrm·Apr 19

@raw_works Important context x.com/bartoldson/sta…

Brian Bartoldson@bartoldson

The attention on LongCoT is great! It's far from solved (GPT 5.2 w/out tools gets 9.8%). Out-of-the-box, a GPT 5.2 RLM gets 25% (see Figure 7). Better prompting/training should push RLMs past this. Comparing RLMs to no-tool baselines? See our 🧵of tips x.com/sumeetrm/statu…

Brian Bartoldson@bartoldson·Apr 19

Finally, we note that LongCoT has two splits: LongCoT-mini (for fast open source research) and LongCoT (for frontier models). While some problems are naturally solvable by symbolic programs (e.g. some chess problems) and require many CoT reasoning steps, others are great tests for decomposition and maintaining important variables across steps (like mathematics and chemistry, which have composed, interdependent sub-problems). We suspect RLMs may be highly effective for all problem types.

Brian Bartoldson@bartoldson·Apr 19

So, if you want to compare RLMs to no-tools baselines, we suggest ensuring that RLMs are performing this sort of reasoning assistance, rather than just writing Python code (to symbolically solve some problems) that makes it unnecessary for LLMs to reason through the steps of a problem.

Concrete tips:
- Look at performance across domains. If chess performance is 80% and chemistry performance is near 0%, the RLM might be writing Python code to simulate chess moves rather than reasoning through the game states as a no-tools baseline does. Such simulations may be harder to code for LongCoT chemistry and mathematics questions.
- Given an LLM, when you see its RLM version boost performance, check the code executed by the RLM. Was the RLM decomposing the problem into simpler ones and then providing these to the sub-LLM (this is okay!), or did it write code or import a library that solves the problem directly (LLMs w/out tools can’t do this, so it won’t be a fair comparison)?
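The two tips above can be sketched as a quick audit over evaluation records. This is a minimal illustration under stated assumptions: the record format (`domain`, `correct`, `code`) and both helper functions are hypothetical, and the sample data is made up. Only the tool name `llm_query` comes from the thread. The substring check is a crude heuristic that narrows down which traces to read by hand; it does not replace reading them.

```python
from collections import defaultdict

# Hypothetical record format: one dict per evaluated problem.
# "code" holds the Python the RLM executed for that problem.
results = [
    {"domain": "chess", "correct": True, "code": "board = ChessBoard()\nboard.make_move(m)"},
    {"domain": "chess", "correct": True, "code": "state = llm_query(prompt)"},
    {"domain": "chemistry", "correct": False, "code": "step = llm_query(sub_problem)"},
]


def accuracy_by_domain(records):
    """Tip 1: return {domain: fraction correct} to spot lopsided domains."""
    totals, hits = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r["domain"]] += 1
        hits[r["domain"]] += int(r["correct"])
    return {d: hits[d] / totals[d] for d in totals}


def possibly_offloaded(records, tool_name="llm_query"):
    """Tip 2: return correct records whose code never mentions the sub-LLM tool."""
    return [r for r in records if r["correct"] and tool_name not in r["code"]]


per_domain = accuracy_by_domain(results)
flagged = possibly_offloaded(results)
print(per_domain)
print(len(flagged), "correct answer(s) without a sub-LLM call")
```

A large gap between domains, combined with many flagged traces in the high-scoring domain, would suggest the gain comes from symbolic code rather than from decomposition.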

Brian Bartoldson@bartoldson·Apr 19

However, beyond just writing code, RLMs can decompose hard problems into simpler ones, and have the LLM perform intermediate reasoning steps. And this behavior is aligned with what we set out to test with LongCoT. Indeed, in the LongCoT paper, we show a second setting where we verbally ask an RLM (by modifying its system prompt) to avoid solving the problem primarily symbolically, saying it must instead only use code for problem decomposition. As Figure 7 shows, this restriction led to a drop from the 25.1% performance observed with a default RLM. There are likely better strategies than our prompt approach for keeping RLMs comparable to no-tools baselines, and we welcome innovation from the community here.
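To make the "code only for decomposition" setting concrete, here is a toy sketch in which Python only splits the task and carries state, while every state update goes through the sub-LLM tool. The name `llm_query` appears in the thread, but its signature here is an assumption, and the body is a stub so the example runs without a model. It is not DSPy's API, and `solve_by_decomposition` is a hypothetical helper.

```python
def llm_query(prompt: str) -> str:
    """Stand-in for a sub-LLM call; a real harness would query a model here."""
    # Toy behaviour so the sketch runs: append the move to the running state.
    state, move = prompt.split("|")
    return f"{state} {move}".strip()


def solve_by_decomposition(moves):
    """Code splits the task into steps; each step's 'reasoning' is done by llm_query."""
    state = ""
    calls = 0
    for move in moves:
        # The state transition itself is delegated, not computed symbolically.
        state = llm_query(f"{state}|{move}")
        calls += 1
    return state, calls


final_state, n_calls = solve_by_decomposition(["e4", "e5", "Nf3"])
print(final_state, n_calls)
```

The contrast with the offloading case is that the loop never computes the next state itself, so the number of sub-LLM calls grows with the number of steps.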

Brian Bartoldson@bartoldson·Apr 19

Accordingly, tomorrow, we will begin tracking tool-enabled performance on a separate leaderboard at longcot.ai. Note that we expect this to be saturated much faster than the base leaderboard that doesn’t allow tools.

Brian Bartoldson@bartoldson·Apr 19

The attention on LongCoT is great! It's far from solved (GPT 5.2 w/out tools gets 9.8%). Out-of-the-box, a GPT 5.2 RLM gets 25% (see Figure 7). Better prompting/training should push RLMs past this. Comparing RLMs to no-tool baselines? See our 🧵of tips x.com/sumeetrm/statu…

Brian Bartoldson@bartoldson·Apr 19

RLMs can aid this LLM-based simulation (very well in theory), but (due to REPL access) they can also just write Python code that solves certain problems symbolically (without any LLM ever seeing intermediate reasoning steps). x.com/bartoldson/sta…

Brian Bartoldson@bartoldson·Apr 19

First, some background: The LongCoT paper outlines that, in our primary evaluations, tool use is not allowed. Our main goal is to test whether models can reason through complex problems directly in their chain of thought. If a problem requires simulating the evolution of a chessboard state (e.g.), the LLM must do that in its CoT.

Brian Bartoldson@bartoldson·Apr 18

@GabLesperance @dosco We want to avoid a particular kind of offloading. LongCoT evaluates models w/out Python use, but RLMs require Python. For a fair comparison, we prompt the RLM to not just code a solution. Otherwise, RLMs "offload" reasoning to code. E.g., see below x.com/bartoldson/sta…

Gabriel Lespérance@GabLesperance·Apr 18

@dosco @bartoldson same on a lot of the tasks I've tried the RLMs on. That said, i don't think sub-calls are the defining features of RLMs. imo the lift is in the symbolic reasoning / treating code as thinking substrate.

Brian Bartoldson@bartoldson·Apr 18

Exactly -- we evaluated untrained RLMs on LongCoT and think training to decompose tasks would boost performance. Two eval settings to consider: (1) Python may be used to avoid reasoning (dotted bars) (2) LLM/RLM does all the reasoning (solid bars) x.com/a1zhang/status…

alex zhang@a1zhang

Brian Bartoldson@bartoldson·Apr 18

@ccui9 @raw_works I'm also interested in seeing the successful traces. Here's a summary of a trace for a chess problem: x.com/i/status/20452…

Christopher Z. Cui@ccui9·Apr 17

@raw_works Any chance you have the explicit token counts for the open models / would be willing to share the traces? 👀

Raymond Weitekamp@raw_works·Apr 17

Ran Qwen3-8B (8.2B dense, open) on LongCoT-Mini. Vanilla: 0/507. dspy.RLM: 33/507 (6.5%). Same model. Same weights. No fine-tuning. The scaffold is doing 100% of the lifting. Context: leaderboard's smallest open MoE is GLM-4.7 at 358B total / 32B active params. Qwen3-8B is ~4x smaller by active params and ~44x smaller by total. A scaffolded 8B dense model matching a GLM-4.7-style overall number (5.9%) on a benchmark designed to reward long-horizon reasoning is the point of RLMs — decomposable problems don't need scale, they need a REPL.

Raymond Weitekamp@raw_works

ok so the default DSPy.RLM is literally going to destroy this benchmark before the end of the day. running now for sonnet 4.5...

🏆 Scoreboard (live)
RLM: 90/94 (95.7%)
Vanilla: 0/94 (0.0%)

anyone want to pay for the opus run? 😉
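The percentages and size ratios quoted in the two posts above can be checked with a few lines of arithmetic. All inputs are the figures as stated in the posts; the variable names are illustrative.

```python
# Figures as quoted in the posts above.
rlm_correct, mini_total = 33, 507      # Qwen3-8B with dspy.RLM on LongCoT-Mini
live_correct, live_total = 90, 94      # partial "live" scoreboard for the RLM run
qwen_params_b = 8.2                    # Qwen3-8B, dense, in billions
glm_total_b, glm_active_b = 358, 32    # GLM-4.7 total / active params, in billions

mini_accuracy_pct = 100 * rlm_correct / mini_total
live_accuracy_pct = 100 * live_correct / live_total
active_ratio = glm_active_b / qwen_params_b
total_ratio = glm_total_b / qwen_params_b

print(f"LongCoT-Mini accuracy: {mini_accuracy_pct:.1f}%")
print(f"Live scoreboard accuracy: {live_accuracy_pct:.1f}%")
print(f"Active-param ratio: {active_ratio:.1f}x, total-param ratio: {total_ratio:.1f}x")
```

The results agree with the rounded values in the posts (6.5%, 95.7%, roughly 4x and 44x).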

Brian Bartoldson@bartoldson·Apr 18

We found this offloading seems to happen less on some domains, like chemistry and math, which appear harder to offload to code. This is consistent with the similarity between the solid and dashed bars in the plot above.

Brian Bartoldson@bartoldson·Apr 18

To illustrate setting (1): when I ran DSPy RLMs on a chess problem from LongCoT-mini, I noticed the sub-LLM tool (`llm_query`) wasn't called -- Python ended up doing the search directly as follows.

```
board = ChessBoard()
for i, move in enumerate(moves):
    board.make_move(move)
...
solution = board_to_fen(board)
```

You can obtain setting (2) by altering DSPy's RLM prompt to discourage such reasoning offloading. Even in this harder setting, I suspect RLMs can be successful (and training to decompose tasks may be critical here).

Brian Bartoldson retweeted

Sumeet Motwani@sumeetrm·Apr 16

We already do RLM evals on LongCoT (although our benchmark is intended for just models, not scaffolds). Your results in the main post are different from what you have in your comments and are with LongCoT-mini (x.com/raw_works/stat…). We're very excited about RLMs as a direction and are interested in seeing performance go up on our explicit horizon domains (Math/Chemistry/Computer Science).

Raymond Weitekamp@raw_works

almost done with the "mini"

🏆 Scoreboard @ 472 shared (28 RLM rows left)
RLM 216/472 (45.8%)
Vanilla 13/472 (2.8%)

Brian Bartoldson retweeted

λux@novasarc01·Apr 16

what stands out to me from a research perspective is that LongCoT isolates compositional horizon failure rather than just benchmark hardness...the local steps are often tractable but performance collapses when those steps must be coordinated across long dependency graphs with planning, state maintenance and backtracking. i think that makes it much more scientifically valuable than another “hard reasoning” benchmark bcoz it cleanly exposes the gap between step-level competence and trajectory-level reasoning robustness.

Tanishq Mathew Abraham, Ph.D.@iScienceLuvr

LongCoT: Benchmarking Long-Horizon Chain-of-Thought Reasoning "We introduce LongCoT, a scalable benchmark of 2,500 expert-designed problems spanning chemistry, mathematics, computer science, chess, and logic to isolate and directly measure the long-horizon CoT reasoning capabilities of frontier models." "At release, the best models achieve <10% accuracy (GPT 5.2: 9.8%; Gemini 3 Pro: 6.1%) on LongCoT, revealing a substantial gap in current capabilities."

Brian Bartoldson@bartoldson·Apr 16

@a_karvonen @TheNormanMu Also, at ICLR this year, we'll present work on the relationship between *test compute* scaling and robustness. Consistent with the above blog post, we study the timely problem of prompt injections and consider how reasoning can provide defenses. x.com/i/status/20375…

Brian Bartoldson@bartoldson·Apr 16

@a_karvonen @TheNormanMu Thanks for sharing this post! Here's the relevant tweet thread if you want more information/visuals for the scaling laws used to produce this estimate. x.com/i/status/18141…

Brian Bartoldson@bartoldson

Our scaling laws suggest that synthetic data quality, dataset size, and model size all benefit robustness. They also predict NN robustness stops improving around 90% (left side of figure). Corroborating this limit, we find humans have a peak of ~90% (right side of figure). 3/n

Adam Karvonen@a_karvonen·Apr 16

It would cost a ~GPT-4 training run to get a human-level robustness CIFAR-10 classifier, or at least a 10 million times scale-up from a non-robust CIFAR-10 classifier! From @TheNormanMu 's latest blog post.
