Valentin Thomas
@_valthomas
technical person @cohere, PhD from Mila. Formerly @layer6AI, @deepmind. Interested in RL, reasoning, ICL.
Joined July 2016 · 1.1K Following · 308 Followers · 99 posts
Valentin Thomas retweeted
Neil Zeghidour @neilzegh:
Me defending my O(n^3) solution to the coding interviewer.
423 replies · 5.1K reposts · 49.7K likes · 4M views
Valentin Thomas retweeted
Jonathan Gorard @getjonwithit:
Like @davidbessis and others, I think that Hinton is wrong. To explain why, let me tell you a brief story. About a decade ago, in 2017, I developed an automated theorem-proving framework that was ultimately integrated into Mathematica (see: youtube.com/watch?v=mMaid2…) (1/15)
vitrupo @vitrupo:

Geoffrey Hinton says mathematics is a closed system, so AIs can play it like a game. They can pose problems to themselves, test proofs, and learn from what works, without relying on human examples. “I think AI will get much better at mathematics than people, maybe in the next 10 years or so.”

124 replies · 436 reposts · 2.5K likes · 746.4K views
Valentin Thomas @_valthomas:
@konstmish @clashluke A while ago, for many small networks (CNNs, MLPs), we had found that the gradient second moment, the Fisher, and the Hessian tended to align pretty early in training.
1 reply · 0 reposts · 1 like · 63 views
Valentin Thomas @_valthomas:
@tensor_rotator @F_Vaggi @TacoCohen @dwarkesh_sp Actually, the value function can be a very poor baseline for reducing the variance of the gradient: in a simple 2-arm bandit with rewards 1 and 0 and (sigmoid) probabilities p and 1-p, the value function would be p, while the variance-reducing baseline is 1-p! So those are anticorrelated!
1 reply · 0 reposts · 1 like · 52 views
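The bandit claim above is easy to verify numerically. A minimal sketch, assuming the parametrization p = sigmoid(theta), arm 1 giving reward 1 with policy probability p, and arm 2 giving reward 0 (these conventions are my assumptions, not spelled out in the tweet):

```python
import numpy as np

# REINFORCE gradient samples w.r.t. the logit theta (p = sigmoid(theta)):
#   arm 1 drawn (prob p):     (1 - b) * d/dtheta log p       = (1 - b) * (1 - p)
#   arm 2 drawn (prob 1 - p): (0 - b) * d/dtheta log (1 - p) = (0 - b) * (-p)
def grad_variance(p, b):
    g1 = (1 - b) * (1 - p)   # gradient sample if arm 1 is drawn
    g2 = (0 - b) * (-p)      # gradient sample if arm 2 is drawn
    mean = p * g1 + (1 - p) * g2            # = p(1-p), independent of b
    second = p * g1**2 + (1 - p) * g2**2
    return second - mean**2

p = 0.8
var_value_baseline = grad_variance(p, b=p)        # baseline = value function = p
var_optimal_baseline = grad_variance(p, b=1 - p)  # baseline = 1 - p

print(var_value_baseline, var_optimal_baseline)   # 0.0576 vs ~0.0
```

With the baseline 1-p, both gradient samples coincide, so the variance is exactly zero in this toy case, while the value-function baseline leaves substantial variance.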
Dwarkesh Patel @dwarkesh_sp:
How does backprop work with RL? The virtue of backprop is that it updates EACH individual parameter in proportion to how much wiggling it affects the loss. This is only possible if you know how changing each parameter affects the loss function.

But of course with RL this is not the case: the environment (and the reward it produces) is a whole separate system. You don’t have some continuous differentiable function which tells you how much wiggling each parameter affects the probability of falling off a cliff.

The solutions are quite clever! Here are some ways to come up with a differentiable proxy for reward:

Policy gradient methods: You can’t differentiate the reward with respect to the network. But you can differentiate the probabilities of different actions/tokens suggested by the network. So just make the loss = the (sum of negative log) probabilities WEIGHTED by the reward. Loss is higher when reward is lower, so the model learns to output tokens which lead to higher reward at higher probability.

Q-learning: Again, reward is not differentiable with respect to the network. But you know what is? The network’s prediction of the reward. And you can update it based on how wrong that prediction was. Now that you can predict what actions will lead to what reward, your policy can simply just be to take the highest expected reward actions.
25 replies · 69 reposts · 1.1K likes · 91.8K views
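The policy-gradient recipe described above fits in a few lines of numpy. A minimal sketch, assuming a toy 3-action softmax "policy" and a hand-picked reward vector (both are my inventions for illustration); for clarity it uses the exact expected update rather than sampled actions:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

theta = np.zeros(3)                  # logits for 3 actions
rewards = np.array([0.0, 1.0, 0.2])  # black-box reward per action (the "environment")
lr = 0.5

for _ in range(200):
    probs = softmax(theta)
    grad = np.zeros(3)
    for a in range(3):               # exact expectation over actions;
        score = -probs.copy()        # in practice you would sample a ~ probs
        score[a] += 1.0              # d/dtheta log pi(a) = onehot(a) - probs
        grad += probs[a] * rewards[a] * score
    theta += lr * grad               # ascend E[reward * grad log pi]

probs = softmax(theta)
print(probs.argmax())  # the reward-1 action wins
```

The loss being descended is exactly the reward-weighted negative log probability from the tweet: actions that led to higher reward get their probability pushed up.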
Valentin Thomas @_valthomas:
@DimitrisPapail I see, it's because you don't use a baseline, so the update for non-valid tokens is 0, right? Do you think you can generally get rid of the baseline?
0 replies · 0 reposts · 0 likes · 58 views
Dimitris Papailiopoulos @DimitrisPapail:
Another interesting observation: performing SGD on the cross-entropy loss on your text corpus is equivalent to REINFORCE, i.e., on-policy policy gradient, with the binary reward "Did my model generate text from the corpus?"
Dimitris Papailiopoulos @DimitrisPapail:

Why is cross-entropy a good loss for language pretraining? Caveat: this is all known, btw; interestingly, even though there are many viewpoints and intuitions on "why x-ent", they can all be arrived at from a single starting point. Here's a simple first-principles derivation that doesn't assume anything about the data distribution. It comes from a very reasonable operational requirement :)

"I want my model to sound intelligent", but we can't measure that, so we ask "I want my model to sound like a human". Although we have access to all texts ever written, we can't quite measure that either, so we instead ask "I want my model to be as likely as possible to generate one of the texts ever written". Or more bluntly: "I want my model to memorize the training data."

Consider this thought experiment. Given a dataset S of all text ever written by humans, we perform independent trials for each "text" in S:
Sample: "sample text" from our model Pr( ; W)
Check: did "sample text" exactly match the original?
Note: we do not condition on anything! We just ask, of all the stuff the model could generate, did we get "text"?

Define success as the event E = "all per-sample checks succeed". The probability of E is the product of the probabilities assigned to the correct ground truth by your model W:
Pr(E) = Π_{text in S} Pr(text; W)
Maximizing log Pr(E) over W gives you the cross-entropy objective.

How do you optimize this with SGD? Sample a text from the corpus, compute grad log Pr(token | prefix) for every prefix of the text, and update the model.

What's elegant is that this simultaneously:
1) Minimizes the description length of the data under model P( ; W) (compression view)
2) Minimizes KL divergence to the true distribution, if one exists (though we never assumed one)
3) Implements maximum likelihood estimation

The derivation is straightforward and well-known, but it highlights something important: cross-entropy emerges naturally from wanting exact reproduction of the training data.

P.S. You could have instead asked to maximize Pr(text generated by the model is in the ground truth). Interestingly, optimizing this can lead to mode collapse, since an optimal solution is to always predict a single piece of text from the corpus. Yet the gradients again look like x-entropy but with a multiplying factor, i.e., Pr(text; W) grad log Pr(text; W)

21 replies · 13 reposts · 257 likes · 75.1K views
Valentin Thomas retweeted
Yunhao (Robin) Tang @robinphysics:
Maybe to one's surprise, taking KL estimates as `kl_loss` to minimize does *not* enforce the KL. This implementation, however, is quite common in open source RL repos and recent research papers. In short: grad of an unbiased KL estimate is not an unbiased estimate of KL grad.
15 replies · 54 reposts · 662 likes · 71K views
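The pitfall is easy to verify on a small categorical distribution. A minimal sketch, assuming a 3-way softmax policy pi_theta and a fixed reference p_ref (my notation, not from the thread): the expected gradient of the per-sample estimate log pi(x) - log p_ref(x), with x drawn from pi and no gradient through the sampling, is identically zero, while the true grad of KL(pi || p_ref) is not.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

theta = np.array([1.0, 0.0, -1.0])   # policy logits
pi = softmax(theta)
p_ref = np.array([0.2, 0.5, 0.3])    # fixed reference distribution

# d/dtheta_j log pi(x_i) = delta_ij - pi_j  (softmax score function)
naive = np.zeros(3)       # E_{x~pi}[ grad (log pi(x) - log p_ref(x)) ]
true_grad = np.zeros(3)   # grad KL(pi || p_ref)
for i in range(3):
    score = -pi.copy()
    score[i] += 1.0
    naive += pi[i] * score   # expectation of the score: sums to zero identically
    # true gradient: sum_x grad pi(x) * log(pi(x)/p_ref(x));
    # the "+1" term from the product rule cancels since sum_x grad pi(x) = 0
    true_grad += pi[i] * score * np.log(pi[i] / p_ref[i])

print(naive)      # ~[0, 0, 0]: minimizing the estimate does nothing in expectation
print(true_grad)  # nonzero: the actual KL gradient
```

In other words, an unbiased estimate of the KL value becomes a useless (zero-mean) estimate of its gradient once you differentiate it while ignoring that the samples came from pi; correct implementations keep the score-function term or differentiate an expression whose expectation has the right gradient.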
Valentin Thomas retweeted
Vahid Balazadeh @vahidbalazadeh:
Can neural networks learn to map observational datasets directly onto causal effects? YES! Introducing CausalPFN, a foundation model trained on simulated data that learns to do in-context heterogeneous causal effect estimation, based on prior-fitted networks (PFNs). Joint work with @Layer6AI & @hamid_R_kamkar w/ @_valthomas, Jeremy Ma, Benson Li, Jesse C. Cresswell, & @rahulgk
📝 ArXiv: arxiv.org/abs/2506.07918
🔗 Code: github.com/vdblm/CausalPF…
🗣 Oral paper @ ICML SIM workshop 🧵[1/7]
3 replies · 11 reposts · 35 likes · 3.6K views
Valentin Thomas @_valthomas:
@leloykun Isn't that just a bias you can fold into the learning rate? I'm not sure it matters at all compared to having a non-constant bias of the return (by using a value function, for instance).
1 reply · 0 reposts · 1 like · 316 views
leloy! @leloykun:
I'm not sure if someone has already pointed this out, but Dr. GRPO still has a bias that is more pronounced the smaller the group size is. To make it unbiased, simply multiply Dr. GRPO's A_i by the correction term N/(N-1). With this, you'll get LOOP (Leave-One-Out Proximal Policy Optimization). And if you also remove PPO's clipping, you'll get RLOO (REINFORCE Leave-One-Out).
Zichen Liu @zzlccc:

🪂Understanding R1-Zero-Like Training: A Critical Perspective * DeepSeek-V3-Base already exhibits "Aha moment" before RL-tuning?? * The ever-increasing output length in RL-tuning might be due to a BIAS in GRPO?? * Getting GRPO Done Right, we achieve a 7B AIME sota! 🧵 📜Full details: github.com/sail-sg/unders… 🛠️Code: github.com/sail-sg/unders…

11 replies · 63 reposts · 383 likes · 74.2K views
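The algebra behind the N/(N-1) correction is a one-liner to check (a sketch; the reward vector below is made up): subtracting the leave-one-out mean of the other rewards is exactly the mean-centered advantage rescaled by N/(N-1).

```python
import numpy as np

rewards = np.array([1.0, 0.0, 0.0, 1.0, 1.0])  # rewards of a group of N rollouts
N = len(rewards)

# Mean-centered advantage: each sample's own reward leaks into its baseline.
a_centered = rewards - rewards.mean()

# Leave-one-out (RLOO/LOOP-style) advantage: baseline excludes sample i.
a_loo = np.array([r - np.delete(rewards, i).mean()
                  for i, r in enumerate(rewards)])

# Identity: A_loo = N/(N-1) * A_centered, so the correction exactly
# removes the self-inclusion bias, and it matters most for small N.
assert np.allclose(a_loo, N / (N - 1) * a_centered)
print(a_centered, a_loo)
```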
Valentin Thomas @_valthomas:
@Nils_Reimers @jxmnop I think the question is about the ratio between the FF dim and the transformer dim, even for the same parameter count.
0 replies · 0 reposts · 0 likes · 123 views
Nils Reimers @Nils_Reimers:
Solving problems in higher-dimensional spaces can be easier for NNs. E.g., linearly separating ([in_feat], class) pairs ([-2], 1), ([0], 0), ([2], 1) isn't possible. But if you cast them with a kernel [x] -> [x, x^2] to a higher dimension, it becomes an easy problem. The same has been shown for NNs: approximating certain functions is extremely difficult with narrow NNs.
5 replies · 0 reposts · 103 likes · 9.3K views
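The lifting example in the tweet can be checked directly (a sketch; the particular linear threshold w = (0, 1), b = -2 is my choice, any plane separating x^2 = 4 from x^2 = 0 works):

```python
import numpy as np

# 1-D points that are NOT linearly separable:
# label 1 at x = -2 and x = 2, label 0 at x = 0.
X = np.array([-2.0, 0.0, 2.0])
y = np.array([1, 0, 1])

# Lift to 2-D with the feature map x -> (x, x^2).
phi = np.stack([X, X**2], axis=1)

# A single linear threshold now separates the classes: x^2 - 2 > 0.
w, b = np.array([0.0, 1.0]), -2.0
pred = (phi @ w + b > 0).astype(int)

assert (pred == y).all()
print(pred)  # [1 0 1]
```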
dr. jack morris @jxmnop:
does anyone have a good explanation for why the MLP in transformers has to project the representation up to a much larger dimensionality, then back down again? i’m trying to figure out why all the weight matrices in a Transformer can’t just be square
103 replies · 31 reposts · 717 likes · 129.3K views
Valentin Thomas @_valthomas:
@y0b1byte And I totally forgot, but it leads to an additional -π(τ1) ∇log π term for the negative sample. So RL pushes down the log prob of negative samples but doesn't push up the log prob of positives as much. In contrast, SFT pushes up / copies positive examples.
0 replies · 0 reposts · 1 like · 46 views
Valentin Thomas @_valthomas:
@y0b1byte So you also didn't add a baseline; if you do, its value is π(τ1), leading to (1 - π(τ1)) ∇log π(τ1) for the gradient. So there's an additional saturation effect, which can also help with exploration.
1 reply · 0 reposts · 3 likes · 717 views
yobibyte @y0b1byte:
RL/RLHF/LLM folks, is my reasoning correct? If we have two trajectories with sparse rewards (one traj with 0, one traj with 1), a single REINFORCE update step is equivalent to SFT with cross-entropy on the good trajectory with reward 1. Effectively, both of the methods want to go towards a policy that gives the probability of one to a good trajectory.
18 replies · 22 reposts · 345 likes · 52.4K views
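The equivalence in the tweet can be verified for a softmax "policy" over the two trajectories (a sketch; the 2-logit parametrization is my assumption): with rewards {1, 0} and no baseline, one REINFORCE step summed over both trajectories produces exactly the cross-entropy (SFT) gradient on the good trajectory.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

theta = np.array([0.3, -0.7])   # logits over the two trajectories
pi = softmax(theta)
rewards = np.array([1.0, 0.0])  # index 0 = good trajectory (reward 1)

def grad_log_pi(i):
    g = -pi.copy()
    g[i] += 1.0                 # d/dtheta log pi(tau_i) = onehot(i) - pi
    return g

# One REINFORCE update over both trajectories, no baseline:
reinforce_grad = sum(rewards[i] * grad_log_pi(i) for i in range(2))

# SFT / cross-entropy gradient on the good trajectory alone:
sft_grad = grad_log_pi(0)

assert np.allclose(reinforce_grad, sft_grad)
```

The zero-reward trajectory contributes nothing without a baseline, which is exactly why the two updates coincide; adding a baseline (as in the replies above) changes the picture.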
Valentin Thomas @_valthomas:
@FSchaipp That's a very interesting question. I had worked on second-order methods, the Fisher, and some ADMM stuff a while ago. It was kind of an open secret among the optimization researchers I knew that it didn't generalize as well. Would love to see it confirmed or debunked!
0 replies · 0 reposts · 1 like · 58 views
Valentin Thomas retweeted
Fabian Schaipp @FSchaipp:
Optimization hyperparameters (LR, schedule, weight decay) do not affect loss-to-loss scaling of LLMs (which could be seen as a proxy for generalization). ☄️ Unclear: how about different optimizers (Shampoo, ScheduleFree...)? Plots from this paper: arxiv.org/pdf/2502.12120
4 replies · 9 reposts · 89 likes · 6.3K views
Valentin Thomas retweeted
Surya Ganguli @SuryaGanguli:
*Every single* cure for a disease ultimately flowed from basic exploratory research. Stopping basic research is like stopping the mountain rains and expecting rivers of cures to still flow. Examples:
1) studying the saliva of the Gila monster -> GLP-1s
2) studying fungi -> the first statins
3) mRNA biology -> gene therapy for spinal atrophy
4) studying bacterial genetics -> CRISPR gene therapies
5) studies of nuclear magnetic resonance -> MRI scans
This list can go on and on, not only in biology but in all aspects of technology... e.g.
6) curvature of spacetime -> GPS
7) quantum mechanics -> semiconductors
8) electromagnetism -> fiber optics -> internet
...
Andrew D. Huberman, Ph.D. @hubermanlab:

As a taxpayer (irrespective of whether you’re a scientist), would you be in favor of more of the @NIH budget going to fund efforts to solve specific diseases at the expense of basic exploratory research? Which diseases?

175 replies · 1.5K reposts · 10K likes · 742.8K views
Valentin Thomas retweeted
Ben (no treats) @andersonbcdefg:
see this decoder only autoregressive transformer? that's right. it goes in the neurosymbolic AI hole
20 replies · 29 reposts · 806 likes · 25.2K views
Valentin Thomas retweeted
wh @nrehiew_:
R1 is a ...(checks notes)... neurosymbolic architecture?
46 replies · 15 reposts · 588 likes · 95.5K views
Valentin Thomas retweeted
KZ @kzSlider:
Damn, triple-homicide in one day. SAEs really taking a beating recently
10 replies · 35 reposts · 364 likes · 143.6K views