Valentin Thomas
@_valthomas
technical person @cohere, PhD from Mila. Formerly @layer6AI, @deepmind. Interested in RL, reasoning, ICL.
Joined July 2016 · 1.1K Following · 308 Followers · 99 posts
Valentin Thomas retweeted
Neil Zeghidour @neilzegh:
Me defending my O(n^3) solution to the coding interviewer.
423 replies · 5.1K reposts · 49.7K likes · 4M views
Valentin Thomas retweeted
Jonathan Gorard @getjonwithit:
Like @davidbessis and others, I think that Hinton is wrong. To explain why, let me tell you a brief story. About a decade ago, in 2017, I developed an automated theorem-proving framework that was ultimately integrated into Mathematica (see: youtube.com/watch?v=mMaid2…) (1/15)
vitrupo @vitrupo:

Geoffrey Hinton says mathematics is a closed system, so AIs can play it like a game. They can pose problems to themselves, test proofs, and learn from what works, without relying on human examples. “I think AI will get much better at mathematics than people, maybe in the next 10 years or so.”

124 replies · 436 reposts · 2.5K likes · 746.4K views
Valentin Thomas @_valthomas:
@konstmish @clashluke A while ago, for many small networks (CNNs, MLPs), we had found that the gradient second moment, the Fisher, and the Hessian tended to align pretty early in training.
1 reply · 0 reposts · 1 like · 63 views
Valentin Thomas @_valthomas:
@tensor_rotator @F_Vaggi @TacoCohen @dwarkesh_sp Actually, the value function can be a very poor baseline for reducing the variance of the gradient: in a simple 2-arm bandit with rewards 1 and 0 and (sigmoid) probabilities p and 1-p, the value function would be p, while the variance-reducing baseline is 1-p! So those are anticorrelated!
1 reply · 0 reposts · 1 like · 52 views
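The bandit claim above is easy to verify numerically. A minimal sketch, assuming the parametrization p = sigmoid(theta), arm 1 giving reward 1 with policy probability p, and arm 2 giving reward 0 (these conventions are my assumptions, not spelled out in the tweet):

```python
import numpy as np

# REINFORCE gradient samples w.r.t. the logit theta (p = sigmoid(theta)):
#   arm 1 drawn (prob p):     (1 - b) * d/dtheta log p       = (1 - b) * (1 - p)
#   arm 2 drawn (prob 1 - p): (0 - b) * d/dtheta log (1 - p) = (0 - b) * (-p)
def grad_variance(p, b):
    g1 = (1 - b) * (1 - p)   # gradient sample if arm 1 is drawn
    g2 = (0 - b) * (-p)      # gradient sample if arm 2 is drawn
    mean = p * g1 + (1 - p) * g2            # = p(1-p), independent of b
    second = p * g1**2 + (1 - p) * g2**2
    return second - mean**2

p = 0.8
var_value_baseline = grad_variance(p, b=p)        # baseline = value function = p
var_optimal_baseline = grad_variance(p, b=1 - p)  # baseline = 1 - p

print(var_value_baseline, var_optimal_baseline)   # 0.0576 vs ~0.0
```

With the baseline 1-p, both gradient samples coincide, so the variance is exactly zero in this toy case, while the value-function baseline leaves substantial variance.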
Dwarkesh Patel @dwarkesh_sp:
How does backprop work with RL? The virtue of backprop is that it updates EACH individual parameter in proportion to how much wiggling it affects the loss. This is only possible if you know how changing each parameter affects the loss function.

But of course with RL this is not the case: the environment (and the reward it produces) is a whole separate system. You don’t have some continuous differentiable function which tells you how much wiggling each parameter affects the probability of falling off a cliff.

The solutions are quite clever! Here are some ways to come up with a differentiable proxy for reward:

Policy gradient methods: You can’t differentiate the reward with respect to the network. But you can differentiate the probabilities of different actions/tokens suggested by the network. So just make the loss = the (sum of negative log) probabilities WEIGHTED by the reward. Loss is higher when reward is lower, so the model learns to output tokens which lead to higher reward at higher probability.

Q-learning: Again, reward is not differentiable with respect to the network. But you know what is? The network’s prediction of the reward. And you can update it based on how wrong that prediction was. Now that you can predict what actions will lead to what reward, your policy can simply just be to take the highest expected reward actions.
25 replies · 69 reposts · 1.1K likes · 91.8K views
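The policy-gradient recipe described above fits in a few lines of numpy. A minimal sketch, assuming a toy 3-action softmax "policy" and a hand-picked reward vector (both are my inventions for illustration); for clarity it uses the exact expected update rather than sampled actions:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

theta = np.zeros(3)                  # logits for 3 actions
rewards = np.array([0.0, 1.0, 0.2])  # black-box reward per action (the "environment")
lr = 0.5

for _ in range(200):
    probs = softmax(theta)
    grad = np.zeros(3)
    for a in range(3):               # exact expectation over actions;
        score = -probs.copy()        # in practice you would sample a ~ probs
        score[a] += 1.0              # d/dtheta log pi(a) = onehot(a) - probs
        grad += probs[a] * rewards[a] * score
    theta += lr * grad               # ascend E[reward * grad log pi]

probs = softmax(theta)
print(probs.argmax())  # the reward-1 action wins
```

The loss being descended is exactly the reward-weighted negative log probability from the tweet: actions that led to higher reward get their probability pushed up.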
Valentin Thomas @_valthomas:
@DimitrisPapail I see, it's because you don't use a baseline, so the update for non-valid tokens is 0, right? Do you think you can generally get rid of the baseline?
0 replies · 0 reposts · 0 likes · 58 views
Dimitris Papailiopoulos @DimitrisPapail:
Another interesting observation: performing SGD on the cross-entropy loss on your text corpus is equivalent to REINFORCE, i.e., on-policy policy gradient, with the binary reward "Did my model generate text from the corpus?"
Dimitris Papailiopoulos @DimitrisPapail:

Why is cross-entropy a good loss for language pretraining? Caveat: this is all known, btw; interestingly, even though there are many viewpoints and intuitions on "why x-ent", they can all be arrived at from a single starting point. Here's a simple first-principles derivation that doesn't assume anything about the data distribution. It comes from a very reasonable operational requirement :)

"I want my model to sound intelligent", but we can't measure that, so we ask "I want my model to sound like a human". Although we have access to all texts ever written, we can't quite measure that either, so we instead ask "I want my model to be as likely as possible to generate one of the texts ever written". Or more bluntly: "I want my model to memorize the training data."

Consider this thought experiment. Given a dataset S of all text ever written by humans, we perform independent trials for each "text" in S:
Sample: "sample text" from our model Pr( ; W)
Check: did "sample text" exactly match the original?
Note: we do not condition on anything! We just ask, of all the stuff the model could generate, did we get "text"?

Define success as the event E = "all per-sample checks succeed". The probability of E is the product of the probabilities assigned to the correct ground truth by your model W:
Pr(E) = Π_{text in S} Pr(text; W)
Maximizing log Pr(E) over W gives you the cross-entropy objective.

How do you optimize this with SGD? Sample a text from the corpus, compute grad log Pr(token | prefix) for every prefix of the text, and update the model.

What's elegant is that this simultaneously:
1) Minimizes the description length of the data under model P( ; W) (compression view)
2) Minimizes KL divergence to the true distribution, if one exists (though we never assumed one)
3) Implements maximum likelihood estimation

The derivation is straightforward and well-known, but it highlights something important: cross-entropy emerges naturally from wanting exact reproduction of the training data.

P.S. You could have instead asked to maximize Pr(text generated by the model is in the ground truth). Interestingly, optimizing this can lead to mode collapse, since an optimal solution is to always predict a single piece of text from the corpus. Yet the gradients again look like x-entropy but with a multiplying factor, i.e., Pr(text; W) grad log Pr(text; W)

21 replies · 13 reposts · 257 likes · 75.1K views
Valentin Thomas retweeted
Yunhao (Robin) Tang @robinphysics:
Maybe to one's surprise, taking KL estimates as `kl_loss` to minimize does *not* enforce the KL. This implementation, however, is quite common in open source RL repos and recent research papers. In short: grad of an unbiased KL estimate is not an unbiased estimate of KL grad.
15 replies · 54 reposts · 662 likes · 71K views
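The pitfall is easy to verify on a small categorical distribution. A minimal sketch, assuming a 3-way softmax policy pi_theta and a fixed reference p_ref (my notation, not from the thread): the expected gradient of the per-sample estimate log pi(x) - log p_ref(x), with x drawn from pi and no gradient through the sampling, is identically zero, while the true grad of KL(pi || p_ref) is not.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

theta = np.array([1.0, 0.0, -1.0])   # policy logits
pi = softmax(theta)
p_ref = np.array([0.2, 0.5, 0.3])    # fixed reference distribution

# d/dtheta_j log pi(x_i) = delta_ij - pi_j  (softmax score function)
naive = np.zeros(3)       # E_{x~pi}[ grad (log pi(x) - log p_ref(x)) ]
true_grad = np.zeros(3)   # grad KL(pi || p_ref)
for i in range(3):
    score = -pi.copy()
    score[i] += 1.0
    naive += pi[i] * score   # expectation of the score: sums to zero identically
    # true gradient: sum_x grad pi(x) * log(pi(x)/p_ref(x));
    # the "+1" term from the product rule cancels since sum_x grad pi(x) = 0
    true_grad += pi[i] * score * np.log(pi[i] / p_ref[i])

print(naive)      # ~[0, 0, 0]: minimizing the estimate does nothing in expectation
print(true_grad)  # nonzero: the actual KL gradient
```

In other words, an unbiased estimate of the KL value becomes a useless (zero-mean) estimate of its gradient once you differentiate it while ignoring that the samples came from pi; correct implementations keep the score-function term or differentiate an expression whose expectation has the right gradient.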
Valentin Thomas retweeted
Vahid Balazadeh @vahidbalazadeh:
Can neural networks learn to map observational datasets directly onto causal effects? YES! Introducing CausalPFN, a foundation model trained on simulated data that learns to do in-context heterogeneous causal effect estimation, based on prior-fitted networks (PFNs). Joint work with @Layer6AI & @hamid_R_kamkar w/ @_valthomas, Jeremy Ma, Benson Li, Jesse C. Cresswell, & @rahulgk
📝 ArXiv: arxiv.org/abs/2506.07918
🔗 Code: github.com/vdblm/CausalPF…
🗣 Oral paper @ ICML SIM workshop 🧵[1/7]
3 replies · 11 reposts · 35 likes · 3.6K views
Valentin Thomas @_valthomas:
@leloykun Isn't that just a bias you can fold into the learning rate? I'm not sure it matters at all compared to having a non-constant bias of the return (by using a value function, for instance).
1 reply · 0 reposts · 1 like · 316 views
leloy! @leloykun:
I'm not sure if someone has already pointed this out, but Dr. GRPO still has a bias that is more pronounced the smaller the group size is. To make it unbiased, simply multiply Dr. GRPO's A_i by the correction term N/(N-1). With this, you'll get LOOP (Leave-One-Out Proximal Policy Optimization). And if you also remove PPO's clipping, you'll get RLOO (REINFORCE Leave-One-Out).
Zichen Liu @zzlccc:

🪂Understanding R1-Zero-Like Training: A Critical Perspective * DeepSeek-V3-Base already exhibits "Aha moment" before RL-tuning?? * The ever-increasing output length in RL-tuning might be due to a BIAS in GRPO?? * Getting GRPO Done Right, we achieve a 7B AIME sota! 🧵 📜Full details: github.com/sail-sg/unders… 🛠️Code: github.com/sail-sg/unders…

11 replies · 63 reposts · 383 likes · 74.2K views
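The algebra behind the N/(N-1) correction is a one-liner to check (a sketch; the reward vector below is made up): subtracting the leave-one-out mean of the other rewards is exactly the mean-centered advantage rescaled by N/(N-1).

```python
import numpy as np

rewards = np.array([1.0, 0.0, 0.0, 1.0, 1.0])  # rewards of a group of N rollouts
N = len(rewards)

# Mean-centered advantage: each sample's own reward leaks into its baseline.
a_centered = rewards - rewards.mean()

# Leave-one-out (RLOO/LOOP-style) advantage: baseline excludes sample i.
a_loo = np.array([r - np.delete(rewards, i).mean()
                  for i, r in enumerate(rewards)])

# Identity: A_loo = N/(N-1) * A_centered, so the correction exactly
# removes the self-inclusion bias, and it matters most for small N.
assert np.allclose(a_loo, N / (N - 1) * a_centered)
print(a_centered, a_loo)
```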
Valentin Thomas @_valthomas:
@Nils_Reimers @jxmnop I think the question is about the ratio between the FF dim and the transformer dim, even for the same parameter count.
0 replies · 0 reposts · 0 likes · 123 views
Nils Reimers @Nils_Reimers:
Solving problems in higher-dimensional spaces can be easier for NNs. E.g., linearly separating ([in_feat], class) pairs ([-2], 1), ([0], 0), ([2], 1) isn't possible. But if you cast them with a kernel [x] -> [x, x^2] to a higher dimension, it becomes an easy problem. The same has been shown for NNs: approximating certain functions is extremely difficult with narrow NNs.
5 replies · 0 reposts · 103 likes · 9.3K views
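The lifting example in the tweet can be checked directly (a sketch; the particular linear threshold w = (0, 1), b = -2 is my choice, any plane separating x^2 = 4 from x^2 = 0 works):

```python
import numpy as np

# 1-D points that are NOT linearly separable:
# label 1 at x = -2 and x = 2, label 0 at x = 0.
X = np.array([-2.0, 0.0, 2.0])
y = np.array([1, 0, 1])

# Lift to 2-D with the feature map x -> (x, x^2).
phi = np.stack([X, X**2], axis=1)

# A single linear threshold now separates the classes: x^2 - 2 > 0.
w, b = np.array([0.0, 1.0]), -2.0
pred = (phi @ w + b > 0).astype(int)

assert (pred == y).all()
print(pred)  # [1 0 1]
```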
dr. jack morris @jxmnop:
does anyone have a good explanation for why the MLP in transformers has to project the representation up to a much larger dimensionality, then back down again? i’m trying to figure out why all the weight matrices in a Transformer can’t just be square
103 replies · 31 reposts · 717 likes · 129.3K views
Valentin Thomas @_valthomas:
@y0b1byte And I totally forgot, but it leads to an additional -π(τ1) ∇log π term for the negative sample. So RL pushes down the log prob of negative samples but doesn't push up the log prob of positives as much. In contrast, SFT pushes up / copies positive examples.
0 replies · 0 reposts · 1 like · 46 views
Valentin Thomas @_valthomas:
@y0b1byte So you also didn't add a baseline; if you do, its value is π(τ1), leading to (1 - π(τ1)) ∇log π(τ1) for the gradient. So there's an additional saturation effect, which can also help with exploration.
1 reply · 0 reposts · 3 likes · 717 views
yobibyte @y0b1byte:
RL/RLHF/LLM folks, is my reasoning correct? If we have two trajectories with sparse rewards (one traj with 0, one traj with 1), a single REINFORCE update step is equivalent to SFT with cross-entropy on the good trajectory with reward 1. Effectively, both of the methods want to go towards a policy that gives the probability of one to a good trajectory.
18 replies · 22 reposts · 345 likes · 52.4K views
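The equivalence in the tweet can be verified for a softmax "policy" over the two trajectories (a sketch; the 2-logit parametrization is my assumption): with rewards {1, 0} and no baseline, one REINFORCE step summed over both trajectories produces exactly the cross-entropy (SFT) gradient on the good trajectory.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

theta = np.array([0.3, -0.7])   # logits over the two trajectories
pi = softmax(theta)
rewards = np.array([1.0, 0.0])  # index 0 = good trajectory (reward 1)

def grad_log_pi(i):
    g = -pi.copy()
    g[i] += 1.0                 # d/dtheta log pi(tau_i) = onehot(i) - pi
    return g

# One REINFORCE update over both trajectories, no baseline:
reinforce_grad = sum(rewards[i] * grad_log_pi(i) for i in range(2))

# SFT / cross-entropy gradient on the good trajectory alone:
sft_grad = grad_log_pi(0)

assert np.allclose(reinforce_grad, sft_grad)
```

The zero-reward trajectory contributes nothing without a baseline, which is exactly why the two updates coincide; adding a baseline (as in the replies above) changes the picture.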
Valentin Thomas @_valthomas:
@FSchaipp That's a very interesting question. I had worked on second-order methods, the Fisher, and some ADMM stuff a while ago. It was kind of an open secret among the optimization researchers I knew that it didn't generalize as well. Would love to see it confirmed or debunked!
0 replies · 0 reposts · 1 like · 58 views
Valentin Thomas retweeted
Fabian Schaipp @FSchaipp:
Optimization hyperparameters (LR, schedule, weight decay) do not affect loss-to-loss scaling of LLMs (which could be seen as a proxy for generalization). ☄️ Unclear: how about different optimizers (Shampoo, ScheduleFree...)? Plots from this paper: arxiv.org/pdf/2502.12120
4 replies · 9 reposts · 89 likes · 6.3K views
Valentin Thomas retweeted
Surya Ganguli @SuryaGanguli:
*Every single* cure for a disease ultimately flowed from basic exploratory research. Stopping basic research is like stopping the mountain rains and expecting rivers of cures to still flow. Examples:
1) studying the saliva of the Gila monster -> GLP-1s
2) studying fungi -> the first statins
3) mRNA biology -> gene therapy for spinal atrophy
4) studying bacterial genetics -> CRISPR gene therapies
5) studies of nuclear magnetic resonance -> MRI scans
This list can go on and on, not only in biology but in all aspects of technology... e.g.
6) curvature of spacetime -> GPS
7) quantum mechanics -> semiconductors
8) electromagnetism -> fiber optics -> internet
...
Andrew D. Huberman, Ph.D. @hubermanlab:

As a taxpayer (irrespective of whether you’re a scientist), would you be in favor of more of the @NIH budget going to fund efforts to solve specific diseases at the expense of basic exploratory research? Which diseases?

175 replies · 1.5K reposts · 10K likes · 742.8K views
Valentin Thomas retweeted
Ben (no treats) @andersonbcdefg:
see this decoder only autoregressive transformer? that's right. it goes in the neurosymbolic AI hole
20 replies · 29 reposts · 806 likes · 25.2K views
Valentin Thomas retweeted
wh @nrehiew_:
R1 is a ...(checks notes)... neurosymbolic architecture?
46 replies · 15 reposts · 588 likes · 95.5K views
Valentin Thomas retweeted
KZ @kzSlider:
Damn, triple-homicide in one day. SAEs really taking a beating recently
10 replies · 35 reposts · 364 likes · 143.6K views