Sidhant Thole

140 posts

@SPThole

ai@fidelityinvestments, ai researcher

Joined December 2023
281 Following, 13 Followers
Sidhant Thole reposted
Dwarkesh Patel@dwarkesh_sp·
Recently met @srush_nlp and he started giving me an impromptu lecture on how targeted on-policy self-distillation works. I asked him if I could record it on my iPhone. The basic idea is this: if the model made a mistake at some point in the rollout (for example, calling a tool that doesn't exist), we want to discourage this specific error, but we don't want to just learn from the final reward, because it's a very noisy signal spread out over the whole trajectory. So we have another model read this trajectory and figure out where the error was made. It simply inserts some hint tokens into the part of the trajectory right above where the mistake was made. Now with these injected hint tokens, have the model run a forward pass. You're not having to regenerate a new rollout - aka no new decode required. The hint causes the model to assign lower probabilities to the error tokens. You then train the original model to match these new probabilities, teaching it to downweight that specific mistake.
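A minimal sketch of the loss this describes, assuming the two forward passes have already produced per-position logits. The function names, the toy two-token vocabulary, and the choice of KL(hinted || original) as the matching objective are all illustrative assumptions, not details from the lecture.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def self_distillation_loss(student_logits, hinted_logits, error_positions):
    """Mean KL(hinted || student) over the positions flagged as the mistake.

    student_logits: logits from the model on the original rollout.
    hinted_logits:  logits from the same model after hint tokens were
                    inserted above the mistake (treated as a fixed target).
    error_positions: hypothetical indices of the error tokens.
    """
    losses = []
    for t in error_positions:
        teacher = softmax(hinted_logits[t])
        student = softmax(student_logits[t])
        losses.append(kl_divergence(teacher, student))
    return sum(losses) / len(losses)

# Toy vocabulary: index 0 = call to a nonexistent tool, index 1 = valid call.
student_logits = [[2.0, 0.0], [0.5, 0.5]]   # original pass prefers the bad call
hinted_logits = [[-1.0, 1.0], [0.5, 0.5]]   # hinted pass downweights it
loss = self_distillation_loss(student_logits, hinted_logits, error_positions=[0])
```

Only position 0 contributes here. Position 1 has identical distributions in both passes, so it would add nothing to the loss, which is the sense in which the signal is targeted rather than spread over the whole trajectory.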
Sidhant Thole@SPThole·
Did you run an ablation where the baseline NanoGPT model is given the same parameter budget increase (e.g., +7M parameters distributed across layers) as Parallax? Since the proposed attention is present in every layer, this would be a cleaner comparison and help disentangle architectural improvements from gains due to additional model capacity.
Yifei Zuo@YifeiZuoX·
Very impressive results from Min Li and @Haoxiang__Wang: simply swapping Attention for Parallax reaches 2880 steps with the SOAP-H optimizer, beating the latest SOTA record on modded-nanogpt (@kellerjordan0) with no hyperparameter tuning.

A few observations:
- Parallax is uniformly stronger than Softmax Attention across all records.
- Optimizers don't transfer to Parallax with the same magnitude, which confirms the optimizer–architecture interaction from the Parallax paper.
- The cleanest modifications often transfer best; records built on heavy tuning transfer less reliably.

These are preliminary results, I believe both the Parallax architecture and the optimizer side have room to improve. Code is open-sourced below, give it a try.

Code: github.com/Yifei-Zuo/modd…
Kernel: github.com/Yifei-Zuo/Para…
Paper: arxiv.org/abs/2605.29157
Sidhant Thole reposted
Nilin@nilinabra·
My thinking was to control weight norm without needing to tune weight decay. WD takes effect as the norms get near an equilibrium. Radial brake compresses the outward gradient component and takes effect immediately. It also affects the condition number differently than WD. nilin.github.io/radial-brake/
Keller Jordan@kellerjordan0

Modded-NanoGPT optimization result #29 (2026/05/11): @nilinabra has achieved a new step-count record of 2990 (40-step improvement) by halving the growth rate of the L2-norm of the hidden matrix parameters. This result is better than the previous record with a p-value of 4e-5.
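One way to read "compresses the outward gradient component" is sketched below. This is a guess at the mechanism, not the code behind the record: the function name `radial_brake`, the `brake` factor, and the flat-list representation of a parameter are all hypothetical. The step is split into a radial part (along the weights) and a tangential part, and only a radial part that points outward is scaled down.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def radial_brake(weights, update, brake=0.5):
    """Scale down only the part of `update` that pushes `weights` outward.

    weights, update: flat lists of floats; `update` is the step that will
    be added to `weights`.
    brake: hypothetical factor applied to the outward radial component.
    """
    norm_sq = dot(weights, weights)
    if norm_sq == 0.0:
        return list(update)
    radial_coeff = dot(update, weights) / norm_sq  # > 0 means the norm grows
    if radial_coeff <= 0.0:
        return list(update)  # inward or purely tangential steps are untouched
    # Remove (1 - brake) of the outward radial part; keep the tangential part.
    return [u - (1.0 - brake) * radial_coeff * w for u, w in zip(update, weights)]

w = [3.0, 4.0]
step = [1.0, 0.0]
braked = radial_brake(w, step)
```

With `brake=0.5` the radial coefficient of the step drops from 0.12 to 0.06 in this example, so the first-order growth of the L2 norm is halved while the tangential motion is unchanged. Unlike weight decay, the effect does not depend on the norm having approached an equilibrium.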

Sidhant Thole reposted
Peter H. Diamandis, MD@PeterDiamandis·
The thing nobody tells you about exponential change is that it feels like nothing is happening right up until the moment everything happens at once.
Sidhant Thole@SPThole·
Haven’t read the paper yet, so this is just a mental model. I wonder if “capacity” can be thought of as a representational budget—the amount of space the model has to encode distinct features and behaviors. In that framing, a neuron/feature budget refers to the number of neurons and feature directions available to represent information. More width → more neurons and potentially more independent feature directions. Small models force more features or tasks to share the same representations, increasing feature interference (where different features compete for or overlap in the same neurons and directions); larger models have enough representational room for even rare features to maintain distinct representations. A related way to think about this is through the Wx model: the weight matrix W maps inputs into the model’s representation space, and increasing width expands that space, allowing W to allocate more distinct directions to different features. Under this view, wider models can separate features more cleanly in the Wx representation, reducing interference and making it easier to preserve specialized or infrequent features. Curious how much this aligns with the paper’s findings versus where the results suggest a different mechanism. Will read it!
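A toy illustration of the Wx framing, assuming each feature gets a random unit direction (a column of W) and measuring interference as the mean squared cosine similarity between distinct directions. The random-direction setup and the helper names are assumptions made for this sketch, not results from the paper.

```python
import math
import random

def random_unit_vector(dim, rng):
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def mean_interference(num_features, width, seed=0):
    """Average squared cosine similarity between distinct feature directions."""
    rng = random.Random(seed)
    feats = [random_unit_vector(width, rng) for _ in range(num_features)]
    total, pairs = 0.0, 0
    for i in range(num_features):
        for j in range(i + 1, num_features):
            cos = sum(a * b for a, b in zip(feats[i], feats[j]))
            total += cos * cos
            pairs += 1
    return total / pairs

# Same 32 features packed into a narrow and a wide representation space.
narrow = mean_interference(num_features=32, width=8)
wide = mean_interference(num_features=32, width=128)
```

For random directions the expected squared cosine is roughly 1/width, so `wide` comes out much smaller than `narrow`: the same features overlap less when there are more dimensions to spread them across.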
Christopher Potts@ChrisGPotts·
We take for granted that larger models are better than smaller ones, but why is this so? Our new paper, led by Jing Huang and @EkdeepL, traces this to a data-induced competition for resources (neurons), using formal analysis, idealized tasks, and real pretraining.
Vivek@itsreallyvivek·
>20 >dropped out >moving to london this july >joining anthropic >$187k base on my first full-time role >grateful, slightly terrified, mostly excited ask me anything.
Sidhant Thole@SPThole·
@mohitwt_ You missed the correction and sampling-based speculative decoding.
Sidhant Thole@SPThole·
Trained or expert humans also do variation and evaluation, but over time they develop a taste for what will and won't work, and that taste becomes a prior over which experiments they select. AI can produce lots of variation and, given an environment in which to evaluate itself, can do wonders, but taste is what is still missing. Maybe the crucial part here is something like memory: not memory as we see it in current systems, but the taste or overall learnings accumulated so far, whether from the task at hand or from tasks done in the past.
hallerite@hallerite·
@RichardSSutton I just don't understand how one can seriously believe this in 2026. LLMs are not chat bots anymore. They are agents. They interact with the world and are trained with the very Reinforcement Learning that you have written a seminal textbook about.
Richard Sutton@RichardSSutton·
A new and possibly controversial perspective: In this video, I explain the sense in which generative AI trained by supervised learning is incapable of making novel discoveries. youtu.be/K5LAFEjTlBA The text of the speech: AI Creativity and Discovery Good day ladies and gentlemen. I regret that I am unable to be with you all today to engage in a back-and-forth discussion, but I am nevertheless pleased to be able to share with you, via this recording, some high-level thoughts about the current and future state of artificial intelligence, and in particular about AI’s relationship to science and mathematics, which is, as I understand it, the central focus of this meeting and of the SAIR Foundation. I would like to start with an old joke; I am sure you have heard it before. It is the one about the researcher whose work is being evaluated, and the review comes back, and says “This work is both novel and good. Unfortunately, the parts that are good are not novel, and the parts that are novel are not good.” My first point about AI is that this assessment applies exactly to large parts of AI as we know it today. Not all of today’s AI, but a large part of it. Pretty much all of what we mean by “Generative AI”---which includes large language models, and the images and video models, and even the new methods for learning world models. All of these AIs take large numbers of examples and produce a “model” which behaves similar to the examples, that is, which generates text like people, or images like artists or nature, and videos like we find on the internet. Don’t get me wrong, Generative AI can be extremely useful. No doubt about that. But the assessment of the joke still applies. These systems can produce output that is both novel and good, but not at the same time. In many ways this is just absolutely not a problem. When we ask an AI for an answer from the internet, or to summarize a document, we don’t want it to be novel. 
We are happy if the quality of the answer, the goodness, comes from the source material—from the people who wrote the document or the articles on the internet. If the AI’s answer is novel it means it is going beyond the source material, adding something beyond it. This is what we call “hallucinations”. In most cases, we don’t like it when the AI makes something up, when it adds something novel. One exception, of course, is when we are looking not for facts or reality, but for fiction and entertainment. We might ask for a bedtime story for a child, or an image based on existing images on the internet but which is nevertheless different and distinct from them. In these cases, it is never easy for us to know how creative the AI is actually being, as we do not know how close the AI’s story, poem, or image is to the source material. In a real practical sense we can not know this because the internet is too big, the possible sources that the AI may draw upon are too numerous. When we ask for a fiction or novelty, the AI can give it to us because its processing is in part stochastic. Every decision can go multiple ways and will go different ways and produce a different trajectory every time. The trajectory can be random—and thus novel—or it can be based on the training data—and thus “good” because the training data is good, sourced from people or reality. Thus, the trajectory is either novel or good—based on randomness or based on data—but never both at the same time. Really, I think it is okay if the output of Generative AI is never good and novel at the same time. For the researcher in the joke this is a devastating criticism, but for most things it is not, and for Generative AI it is not. Generative AI is meant to be a mimic. This is what supervised learning is for. Generative AI can be extremely useful, even when it just mimics, if it is faster, or cheaper, or smaller, or more customizable, or more copy-able, than the thing being mimicked. 
It is okay if Generative AI cannot be both novel and good at the same time. It is still a transformative technology. But it is a limitation. And remember we are here to use AI for science and mathematics, and for these areas the assessment of the reviewer in the joke is devastating. For these areas we need true creativity and discovery. Generative AI—or Mimicking AI—will never get us there. For these we need something more, and indeed we have something more in other parts of AI. We have many AI systems which can give us more. We have AlphaGo with its world-changing move 37, or AlphaZero with its brilliant original chess-playing style. We have GT-Sophy that drives simulated racecars better than any human. We have AlphaFold and AlphaProof and Claude-Code, which have brought true advances in science, mathematics, and programming. We have RL-Lyft which optimizes the assignment of cars to passengers in the ride-hailing business. All these systems have found things that are both novel and good. And, truth be told, some language models have been augmented in ways that make them more than Generative AI based on supervised learning. All these systems have some additional features that make them capable of true creativity and true discovery. It is important for us to recognize what this is—and that it is not present in ordinary, garden-variety Generative AI. It is something that can not come from just supervised learning, from learning from examples. What is it? Well, it is a simple thing, a commonsense thing. It is not new. We have many names for it, but unfortunately none of them are very good names. I will call it Discovery. Basically, Discovery is just the idea of trying many things and seeing which of them work, then keeping those that worked the best. Evolution by natural selection works this way. The scientific method works this way. And just ordinary life and learning works this way. We try things and remember what works. What could be more obvious? 
In this behavioral case, psychology has two names for it— “instrumental learning” and “operant conditioning”—and in machine learning it is what we mean by “reinforcement learning”. We also see the idea of Discovery in planning and combinatorial search—anything that involves the idea of “generate and test”. The essence of Discovery is to combine three steps: 1. Variation, 2. Evaluation, and 3. Selective retention. Of course, I am not the first to say this. I am not the first to point out that this combination of steps is key to science, to evolution by natural selection, and to animal behavior. I think particularly of papers by Donald Campbell, by Daniel Dennett, and by Gary Cziko. What is new in my remarks is to directly relate the idea of Discovery to modern AI to help us see that it is not present in supervised learning or Generative AI—in particular, that Discovery is not present in backpropagation or gradient descent. Let me say explicitly what is missing from Generative AI. As we have remarked, these systems do have a stochastic aspect, so they do generate a variety of trajectories and behavior. What is missing is the Evaluation step. The generator was pre-trained by supervised learning, leaving no way at runtime to Evaluate what it generates. And of course without Evaluation there can be no Selective retention, and thus no Discovery. The variation can bring novelty, but without evaluation there is no Discovery, and arguably, no creativity. That is, I would say that creativity requires that the new things generated be Evaluated. Without evaluation, and retention of the best, there is nothing created. The novelty flickers into existence but, if its value is unrecognized, it flickers away and is lost. In many cases, Evaluation is done by people to make a discovery. As when we have Generative AI make many pictures for us, and then we pick the one that we like the best. The human+AI system completes the discovery. 
In many other cases, the Evaluation comes from a clear objective. Some moves lead to checkmate, some steps lead to a proof, some actions result in high reward, some genotypes make more copies, some theories explain the data better. Some prefer the Variation step to be called Blind variation, where “blind” here means that it is uninformed, a shot in the dark. It does not need to be completely uninformed; a good scientist does not select theories to test at random. But neither can it be completely informed and determined. There must be some uncertainty about where the answer lies in order for there to be a discovery. In practice, the variation is partly informed and partly blind, but it is the blind part that corresponds to the discovery. Now let us briefly go all the way to modern deep learning, to the backpropagation algorithm. At first it might seem that backpropagation is incapable of discovery because it is deterministic and thus incapable of variation. But this is not correct. The weight updates of backprop are deterministic, but the weights are initialized to small random values. The random initialization is often downplayed, but in fact it is a necessary form of variation; it must be done properly to get good performance. In backprop this Variation is done once, at network initialization, so its effect is temporary, and later the network may lose its ability to learn. This is the weakness of deep learning that is alleviated with a new algorithm that my group presented in Nature a couple of years ago. Our “continual backpropagation” made one small change: every so often a less-used neuron would be re-initialized to small random weights. This allows the variation to continue and plasticity to be retained. Although there is much more to be said about Creativity and Discovery, this is the key point: they are more than supervised learning, more than pattern recognition, more than prediction, and more than world modeling. 
Those things are important, but they alone will not bring us to discovery. Discovery requires Evaluation from a person or from an explicit goal, and only in the latter case will we attain full autonomy. So that is my call to arms. If we want the full power of AI scientists, then we should share the goals with them so they can create, evaluate, discover, and in these ways fully participate in achieving the goals. Let’s be bold! Let’s fully automate Creativity and Discovery!
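The three steps named in the speech (variation, evaluation, selective retention) can be sketched as a plain generate-and-test loop. This is only an illustration of the pattern on a one-dimensional toy objective; the function names, the step size, and the objective itself are assumptions made for the example.

```python
import random

def discover(evaluate, start, steps=200, step_size=0.5, seed=0):
    """Generate-and-test loop: vary, evaluate, keep the best so far."""
    rng = random.Random(seed)
    best, best_score = start, evaluate(start)
    for _ in range(steps):
        # 1. Variation: a partly blind perturbation of the current best.
        candidate = best + rng.uniform(-step_size, step_size)
        # 2. Evaluation: score the candidate against an explicit goal.
        score = evaluate(candidate)
        # 3. Selective retention: keep it only if it is better.
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score

def objective(x):
    """Toy goal with a single optimum at x = 3."""
    return -(x - 3.0) ** 2

best, score = discover(objective, start=0.0)
```

Removing step 2 leaves only random drift: the loop would still generate novel candidates but would have no basis for keeping any of them, which is the gap the speech attributes to purely generative systems.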
Sidhant Thole@SPThole·
Greedy decoding can simply replace mismatches with the target argmax. For sampling, drawing from the full target distribution after rejection would double count (for lack of a better word) probability mass already allocated through accepted draft tokens, so speculative decoding samples from the residual instead.
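The double-counting argument can be checked numerically. The sketch below uses the standard accept rule min(1, p(x)/q(x)) with p as the target (base) distribution and q as the draft, and computes the exact distribution of the emitted token under both resampling rules. The three-token distributions and function names are made up for the example.

```python
import random

def residual_distribution(p, q):
    """Normalized max(0, p - q): where the target has more mass than the draft."""
    raw = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(raw)
    if total == 0.0:          # only when p == q, in which case nothing is rejected
        return list(p)
    return [r / total for r in raw]

def speculative_step(p, q, rng):
    """Sample one token: draft from q, accept with min(1, p/q), else use residual."""
    x = rng.choices(range(len(q)), weights=q)[0]
    if rng.random() < min(1.0, p[x] / q[x]):
        return x
    return rng.choices(range(len(p)), weights=residual_distribution(p, q))[0]

def output_distribution(p, q, resample_from_residual=True):
    """Exact distribution of the emitted token under either resampling rule."""
    accept = [min(pi, qi) for pi, qi in zip(p, q)]   # q(x) * min(1, p(x)/q(x))
    reject_prob = 1.0 - sum(accept)
    fallback = residual_distribution(p, q) if resample_from_residual else p
    return [a + reject_prob * f for a, f in zip(accept, fallback)]

p = [0.6, 0.3, 0.1]   # target (base) model
q = [0.2, 0.5, 0.3]   # draft model

exact = output_distribution(p, q)                                # matches p
naive = output_distribution(p, q, resample_from_residual=False)  # does not
```

With these numbers the accepted mass is min(p, q) = [0.2, 0.3, 0.1]. Resampling from the residual adds the missing 0.4 entirely to token 0 and recovers p. Resampling from p instead gives about [0.44, 0.42, 0.14], which over-weights the tokens the draft had already over-proposed.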
varun@varunneal·
Interview question: when rejecting a draft model’s token q(x), why don’t we resample from the base model’s distribution p(x), but rather from the residual p(x)-q(x)?
Muyu He@HeMuyu0327·
In our paper, we also find another interesting angle to see how much deep attention layers hate to compute from what is in their residual stream: If you learn coefficients for standard value vectors in final attention layers, they will be driven toward 0 (Fig 1, bottom curve), but if you learn coefficients for context-free value vectors corresponding to token IDs, they will be scaled up to 10x (Fig 1, top curves). The former is very surprising because it seems that the model is driving attention outputs to essentially 0, just to get rid of those value vectors, which degrades performance relative to the baseline. Again, there must be something deep about this, and we really want to find out! Paper: github.com/RiddleHe/nanoc…
Larry Dial@classiclarryd

In modded-nanogpt we also found that the last couple attention layers hate interacting with the final prediction MLPs. So we work around it with a cached activation from earlier. In the attention residuals paper, Kimi doesn't explicitly mention it, but you can see from their chart that the final attn layers don't engage with the final outputs. So I think there is something fundamental going on here.

Sidhant Thole reposted
Prof. Feynman@ProfFeynman·
whatever you decide to do, make sure it makes you happy.
Sidhant Thole@SPThole·
Jagged intelligence makes me test a model on my own task first, and the opinion I form is limited to that task. But I break the task into general capabilities such as summarization, creative writing, or inference, and carry that understanding over when evaluating a newer task, as a prior only, not as the decision-making criterion.
Tibo@thsottiaux·
Do you still trust benchmarks or do you just listen to your friends? What makes you try a new model?
Sidhant Thole@SPThole·
@juleslogs Well, it eventually helps optimize the model architecture too, so yes, it does make models smarter.
jules@juleslogs·
one of the most interesting areas in AI right now: MECHANISTIC INTERPRETABILITY the goal isn't to make models smarter. it's to REVERSE ENGINEER how they think.
Sidhant Thole@SPThole·
Steering could serve, as a rough proxy, as a good dataset for training a smaller model to detect where a mistake is likely. Frontier labs might have huge amounts of this kind of data: humans generally steer the agent when they see something is off. It could also be a proxy for where the agent should pause and ask a question.
λux@novasarc01

i’m increasingly convinced that the best agent evals will come from mining real agent failure traces. my view is that every failed trace contains a potential eval but not in its raw form. raw traces are messy, long and too specific. the research problem is to distill them into clean reproducible tests. the pipeline i’m interested in is (which i'm currently working on): failure trace → failure attribution → earliest divergence point → minimal reproducible state → targeted eval → regression suite this turns trace data from passive observability into an active improvement loop. like can we extract the exact decision point where the agent should have behaved differently? and can we convert that into an eval that catches the same failure class in the future? i guess this matters because most agent failures are trajectory-level failures and not just output-level failures. personally i think this is much more realistic than relying only on hand-written benchmarks (imo they should look more like failure memory systems). hand-written evals encode what we think agents will fail on. traces encode what agents actually failed on. also once you have the mechanism, you can mutate the trace into variants. that is basically fuzzing for agents.
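A rough sketch of the middle of that pipeline (earliest divergence point, minimal state, targeted eval), assuming a trace is just a list of action names and that a reference trace showing the desired behavior is available. Both assumptions, along with the `TargetedEval` type and function names, are hypothetical simplifications; real traces would need failure attribution first.

```python
from dataclasses import dataclass

@dataclass
class TargetedEval:
    state: list            # steps to replay before the decision point
    bad_action: str        # what the agent did in the failed trace
    expected_action: str   # what the reference trace did instead

def earliest_divergence(failed, reference):
    """Index of the first step where the two traces differ, or None."""
    for i, (a, b) in enumerate(zip(failed, reference)):
        if a != b:
            return i
    return None

def trace_to_eval(failed, reference):
    """Turn a failed trace into a test pinned to the earliest divergence."""
    i = earliest_divergence(failed, reference)
    if i is None:
        return None
    return TargetedEval(state=failed[:i], bad_action=failed[i],
                        expected_action=reference[i])

def passes(eval_case, agent_policy):
    """Replay the state and check the agent avoids the recorded failure."""
    return agent_policy(eval_case.state) == eval_case.expected_action

failed = ["search", "open_doc", "call_missing_tool", "answer"]
reference = ["search", "open_doc", "call_calculator", "answer"]
case = trace_to_eval(failed, reference)
```

A regression suite would then be a list of such cases, and mutating `state` before replay is one place where the fuzzing idea could plug in.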
