Brian Bartoldson
@bartoldson

ML researcher

277 posts
USA · Joined October 2016
577 Following · 524 Followers

Pinned Tweet
Brian Bartoldson @bartoldson
🧊 Off-policy RL for LLMs is hard. Dr. GRPO collapses at 10 steps off-policy. TBA doesn't. @Kimi_Moonshot K2's approach is robust too – both independently landed on the same key ingredients 🤝 We ablate RL recipe ingredients + show the 2 small changes giving off-policy robustness. 🧵below + NeurIPS poster Friday @ 11 AM.
Brian Bartoldson retweeted
Bhavya Kailkhura @bkailkhu
We are releasing RSAHarness — a general harness for scaling frontier LLM capability with test-time compute using Recursive Self-Aggregation (RSA). Idea: let the model try multiple lines of reasoning, recombine the useful pieces, and iterate to keep improving the answer. It's already crushing broad reasoning benchmarks:
• 87% on ARC-AGI-2 (no tools)
• Current SOTA on FrontierScience Bio
Blog: rsa-llm.github.io/blog
Moksh Jain @JainMoksh

We have been pushing the limits of test-time scaling with RSA for single-turn reasoning problems in science and math. Check out our blog post with new results on ARC-AGI-2, ArXivMath, and FrontierScience! A lot of gains with just test-time scaling! rsa-llm.github.io/blog
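[Editor's sketch] As a rough illustration of the loop described above, here is a minimal Python sketch of recursive self-aggregation. The `generate` callable, the prompt format, and the n/k/rounds parameters are hypothetical stand-ins, not the RSAHarness API:

    import random

    def rsa(problem, generate, n=8, k=4, rounds=3):
        # Round 0: n independent attempts at the problem.
        population = [generate(problem) for _ in range(n)]
        for _ in range(rounds):
            new_population = []
            for _ in range(n):
                # Recombine useful pieces of k sampled candidates.
                subset = random.sample(population, k)
                prompt = (problem + "\n\nCandidate solutions:\n"
                          + "\n---\n".join(subset)
                          + "\n\nAggregate these into one improved solution.")
                new_population.append(generate(prompt))
            population = new_population  # iterate to keep improving
        return population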

🎭 @deepfates
Uhh is the agentic misalignment paper actually propaganda?
Nathan Calvin @_NathanCalvin

This passage in the New Yorker piece on the Anthropic DOW conflict yesterday, including a back and forth between the journalist (Gideon Lewis-Kraus) and an anonymous admin official, is gonna stick in my mind for a long time.

“We must also remember that Cyberdyne Systems created Skynet for the government. It was supposed to help America dominate its enemies. It didn’t exactly work out as planned. The government thinks this is absurd. But the Pentagon has not tried to build an aligned A.I., and Anthropic has. Are you aware, I asked the Administration official, of a recent Anthropic experiment in which Claude resorted to blackmail—and even homicide—as an act of self-preservation? It had been carried out explicitly to convince people like him. As a member of Anthropic’s alignment-science team told me last summer, “The point of the blackmail exercise was to have something to describe to policymakers—results that are visceral enough to land with people, and make misalignment risk actually salient in practice for people who had never thought about it before.” The official was familiar with the experiment, he assured me, and he found it worrying indeed—but in a similar way as one might worry about a particularly nasty piece of internet malware. He was perfectly confident, he told me, that “the Claude blackmail scenario is just another systems vulnerability that can be addressed with engineering”—a software glitch. Maybe he’s right. We might get only one chance to find out.”

I really recommend everyone read both the full New Yorker piece and Anthropic’s research on persona selection (both linked in the replies) and then spend a while sitting with the disconcerting situation we may have found ourselves in.

Brian Bartoldson retweeted
Moksh Jain @JainMoksh
We have been pushing the limits of test-time scaling with RSA for single-turn reasoning problems in science and math. Check out our blog post with new results on ARC-AGI-2, ArXivMath, and FrontierScience! A lot of gains with just test-time scaling! rsa-llm.github.io/blog
Brian Bartoldson @bartoldson
Original TBA results used no importance sampling (IS) and stayed stable with replay buffers containing samples up to hundreds of steps off-policy. IS did help TBA for some model-dataset combinations -- the OAPL approach without IS looks robust across evals, notable! (3/3) x.com/bartoldson/sta…
Brian Bartoldson @bartoldson

⏳ Traditional RL runs slow because on-policy training has searcher and trainer processes waiting for each other. TBA decouples these processes to go fast:
- Multiple searchers generate LLM outputs constantly
- A trainer learns asynchronously from the generated off-policy data
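[Editor's sketch] A minimal illustration of that decoupling pattern, assuming hypothetical `generate_batch` and `train_step` callables; this sketches the idea, not the TBA codebase:

    import queue
    import threading

    def tba_style_loop(generate_batch, train_step, num_searchers=4):
        replay = queue.Queue(maxsize=4096)  # off-policy replay buffer

        def searcher():
            while True:
                # Searchers generate rollouts constantly, never blocking
                # on the trainer; data may be many steps off-policy by
                # the time it is consumed.
                replay.put(generate_batch())

        for _ in range(num_searchers):
            threading.Thread(target=searcher, daemon=True).start()

        while True:
            train_step(replay.get())  # trainer learns asynchronously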

Brian Bartoldson @bartoldson
Interestingly, OAPL’s objective was also derived in @Kimi_Moonshot’s k1.5 paper. We directly compared Kimi’s similar k2 objective to TBA’s, finding you get high accuracy and off-policy robustness either way. (2/3) x.com/bartoldson/sta…
Brian Bartoldson @bartoldson

🔬 Ablation 1: KL reference policy
Why are both TBA and Kimi K2 performant on highly off-policy data? Both use KL regularization. Both compute KL against a moving reference policy: TBA resets every ρ=50 steps (flexible), Kimi K2 uses the inference policy as the reference.
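[Editor's sketch] A sketch of the moving-reference idea under stated assumptions: `rollout`, `pg_loss`, and `kl` are hypothetical torch-style helpers, and the ρ=50 reset schedule follows the description above; this is not the actual TBA or K2 implementation:

    import copy

    def train_with_moving_reference(policy, optimizer, rollout, pg_loss, kl,
                                    steps=500, rho=50, beta=0.01):
        ref = copy.deepcopy(policy)
        for step in range(steps):
            if step % rho == 0:
                # TBA-style periodic reset: the KL anchor tracks the
                # policy instead of staying at the initialization.
                # (K2 instead uses the stale inference policy as ref.)
                ref = copy.deepcopy(policy)
            batch = rollout(policy)
            loss = pg_loss(policy, batch) + beta * kl(policy, ref, batch)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()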

Brian Bartoldson @bartoldson
@juddrosenblatt Our ICLR 2026 paper arxiv.org/abs/2510.06790 studies this. System prompts aren't sufficient to stop adversarial attacks unless your model has robust representations, in which case system prompts do add additional robustness.
Judd Rosenblatt @juddrosenblatt
Alignment via explicit instruction (RLHF, constitutional AI, system prompts) is operating on the surface while the deep representational structure carries implicit content that may be misaligned.

You can't constitutionally constrain what was never constitutionally represented.
stochasm @stochasticchasm
@samsja19 @saurabh_shah2 surely it's not just putting it in the raw reward before advantage calculation though right?
stochasm @stochasticchasm
what if we just have an auxiliary reward that penalizes the model when KL mismatch is high
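[Editor's sketch] For concreteness, a toy sketch of the placement being debated here: fold a KL penalty into the raw reward before advantages are computed. The helper names are my own inventions, operating on numpy arrays of per-sample quantities:

    import numpy as np

    def kl_shaped_rewards(task_rewards, logp, ref_logp, beta=0.01):
        # KL-in-reward: subtract a k1-style penalty (logp - ref_logp
        # estimates KL(pi || pi_ref) for samples drawn from pi) from
        # the raw reward *before* advantage calculation.
        return task_rewards - beta * (logp - ref_logp)

    def group_advantages(rewards):
        # GRPO-style group-relative advantages over one prompt's samples;
        # the KL penalty above flows through this normalization.
        return (rewards - rewards.mean()) / (rewards.std() + 1e-8)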
Brian Bartoldson retweeted
𝚟𝚒𝚎 ⟢ @viemccoy
Lots of alpha right now in identifying wealthy users of ClawdBot and sending them certain types of emails containing certain strings of tokens. Not saying anything more about this
Brian Bartoldson retweeted
Siddarth Venkatraman @siddarthv66
Recursive Self-Aggregation (RSA) + Gemini 3 Flash scores 59.31% on the public ARC-AGI-2 evals, placing it firmly among the top performers! Here are the highlights:
> Outperforms Gemini DeepThink at about 1/10th the cost
> Bridges the performance gap with GPT-5.2-xHigh for a similar cost
> Nearly matches Poetiq while using a much simpler pipeline. Poetiq uses scaffolded refinement (often via generated code), while RSA does not
Also, Gemini 3 Flash is impressive; the Gemini team cooked with this one! We're also eager to run evals with GPT-5.2 + RSA in the future (anyone with credits? :P)
Brian Bartoldson @bartoldson
@cwolferesearch I just meant:
D_KL(p* || p_t) = E_{x ~ p*}[log p*(x)] - E_{x ~ p*}[log p_t(x)] = <cross entropy> - <entropy of target distribution>,
not (= <entropy of target distribution> - <cross entropy>).
Cameron R. Wolfe, Ph.D. @cwolferesearch
@bartoldson yes cross entropy - generally cross entropy on next token prediction distribution is referred to as entropy in the LLM domain
Cameron R. Wolfe, Ph.D. @cwolferesearch
SFT / RL training objectives are equivalent to forward / reverse KL divergence. This is commonly cited in papers but hasn't always been trivial for me to understand, so I derived both cases below for reference...

Notation: Let's assume p* is our target distribution (for SFT this is the distribution over our training dataset, for RL this is the optimal policy we are trying to find via exploration) and p_t is our current policy.

SFT = Forward KL Objective
SFT uses a negative log-likelihood objective, or a cross-entropy loss over ground-truth next tokens. This is equivalent to the forward KL divergence, which we can prove as follows:

D_KL(p* || p_t) = E_{x ~ p*}[log p*(x) - log p_t(x)]
= E_{x ~ p*}[log p*(x)] - E_{x ~ p*}[log p_t(x)]
= <cross entropy of p* and p_t> - <entropy of p*>

In the above expression, the entropy of our target distribution is a constant, so minimizing forward KL is equivalent to minimizing negative log-likelihood.

RL = Reverse KL Objective
In RL, we are maximizing the reward of on-policy completions (i.e., sampled from our policy p_t), as well as minimizing KL divergence w.r.t. a reference policy:

max_p E_{x ~ p}[reward(x)] - beta * D_KL(p || p_ref)

From this objective, we can derive a closed-form expression for the solution / optimal policy (also used to derive the DPO loss!):

p*(x) = (1 / Z) * p_ref(x) * exp(r(x) / beta)

where Z is the partition function. If we assume that this optimal policy is our target distribution, then we can show that maximizing the RL objective is equivalent to minimizing the reverse KL to this target distribution:

D_KL(p_t || p*) = E_{x ~ p_t}[log p_t(x) - log p*(x)]
= E_{x ~ p_t}[log p_t(x) - log p_ref(x) + log Z - r(x) / beta]
= -(1 / beta) * E_{x ~ p_t}[r(x)] + D_KL(p_t || p_ref) + log Z

As we can see, the final expression above is the negative of our RL objective (up to a scaling factor of 1 / beta and an additive constant). Therefore, minimizing this objective (the reverse KL divergence) will maximize the standard RL objective!

What is the difference?
The key difference between the forward and reverse KL is how we are sampling. SFT / forward KL samples from a dataset: x ~ p*, where our dataset plays the role of p* because we are trying to minimize negative log-likelihood over the data. In contrast, RL / reverse KL performs on-policy sampling: x ~ p_t, where p_t is our current policy. This leads to some interesting mode-seeking versus mode-covering behavior that provides useful intuition for the mechanics of SFT and RL (will cover soon in a follow-up).
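[Editor's sketch] A quick numeric sanity check of the forward-KL identity above, on toy discrete distributions (my own illustration, not from the thread):

    import numpy as np

    rng = np.random.default_rng(0)
    p_star = rng.dirichlet(np.ones(5))  # target distribution p*
    p_t = rng.dirichlet(np.ones(5))     # current policy p_t

    forward_kl = np.sum(p_star * np.log(p_star / p_t))
    cross_entropy = -np.sum(p_star * np.log(p_t))
    entropy = -np.sum(p_star * np.log(p_star))

    # D_KL(p* || p_t) = cross-entropy - entropy, so minimizing forward
    # KL over p_t is exactly minimizing NLL / cross-entropy.
    assert np.isclose(forward_kl, cross_entropy - entropy)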
Rohan Pandey @khoomeik
a little surprising that there isn’t more gigabrained math work on quantization/precision
my guess is that the intersection of systems people and theory people is ~empty
Brian Bartoldson retweeted
Siddarth Venkatraman @siddarthv66
Check out our new preprint which empirically investigates both biased and unbiased KL gradient estimators for LLM RL. TL;DR: unbiased estimators are stable, while biased estimators are unstable or result in worse performance. K3 in the loss (as implemented in GRPO) yields a biased estimator of the reverse-KL gradient, and performs slightly worse. Interestingly, it ends up as an importance-sampled forward-KL gradient estimator when used this way, which may be over-conservative. (An estimator sketch follows the quoted thread below.)
Vedant Shah @veds_12

LOTs of discourse lately about the correctness of the KL-regularization term used in RLVR fine-tuning of LLMs. Which estimator to use? Whether to add it to the reward or loss? What’s even the difference? 🤔 In our new preprint, we evaluate these choices empirically. 🧵 1/n
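[Editor's sketch] For readers following the estimator discussion, a sketch of the k1/k2/k3 per-sample estimators of KL(pi || pi_ref) referenced in this thread (Schulman-style approximations); `logp` and `ref_logp` are assumed to be torch tensors of log-probs for tokens sampled from pi:

    import torch

    def kl_estimators(logp, ref_logp):
        log_ratio = ref_logp - logp           # log(pi_ref / pi), x ~ pi
        k1 = -log_ratio                       # unbiased value, high variance
        k2 = 0.5 * log_ratio ** 2             # biased value, lower variance
        k3 = log_ratio.exp() - 1 - log_ratio  # unbiased value, low variance
        return k1, k2, k3

Note these bias claims concern the KL value; as the preprint discusses, differentiating k3 inside the loss gives a different, biased estimate of the reverse-KL gradient.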

Brian Bartoldson retweeted
Johan Obando-Ceron 👍🏽 @johanobandoc
🚨 When I started playing around with LLMs fine-tuned for reasoning, I noticed a lot of confusion around the role of KL — when it’s used, how it’s used, and why. In our work “A Comedy of Estimators: On KL Regularization in RL Training of LLMs”, we revisit the role of KL and provide new insights into its practical impact. Check out the amazing thread by @veds_12 for more details 👇🏽
Vedant Shah @veds_12

LOTs of discourse lately about the correctness of the KL-regularization term used in RLVR fine-tuning of LLMs. Which estimator to use? Whether to add it to the reward or loss? What’s even the difference? 🤔 In our new preprint, we evaluate these choices empirically. 🧵 1/n

Brian Bartoldson retweeted
Vedant Shah @veds_12
LOTs of discourse lately about the correctness of the KL-regularization term used in RLVR fine-tuning of LLMs. Which estimator to use? Whether to add it to the reward or loss? What’s even the difference? 🤔 In our new preprint, we evaluate these choices empirically. 🧵 1/n
Brian Bartoldson retweeted
Rosinality @rosinality
One more study (arxiv.org/abs/2510.01555) on KL penalty with K1, K3 estimators as a reward or a loss.
Brian Bartoldson retweeted
Siddarth Venkatraman @siddarthv66
We’ll be presenting this at the FoRLM workshop between 10:15-11:30am in room 33 tomorrow! Drop by if you’d like to chat about this paper, or RL for LLMs in general (I’ve got some juicy new insights)
Siddarth Venkatraman @siddarthv66

NO verifiers. NO Tools. Qwen3-4B-Instruct can match DeepSeek-R1 and o3-mini (high) with ONLY test-time scaling. Presenting Recursive Self-Aggregation (RSA) — the strongest test-time scaling method I know of! Then we use aggregation-aware RL to push further!! 📈📈 🧵below!
