Tiberiu Mușat
75 posts

Tiberiu Mușat
@Tiberiu_Musat_
Trying to figure out how AI works 🔍🧠 Currently at @ETH Zurich, previously @EPFL 🇨🇭 LLMs, interpretability, emergence, grokking 🤖
Zurich, Switzerland · Joined February 2025
952 Following · 754 Followers

@Tiberiu_Musat_ For the first two, don't you run into trouble because your activations and parameters are finite float precision, meaning the positional encoding in the attention mechanism doesn't work anymore past a certain size? Or are we macgyvering some kind of different access mechanism?

Why does deep learning generalize? What does weight decay really do? Can algorithmic information theory address these questions?
In my latest preprint, I give a proof that the minimum neural weight norm matches the minimum program length (aka Kolmogorov Complexity), up to a logarithmic factor. In other words, the neural network with the smallest possible weight norm (that fits the data) must encode the shortest program (that fits the data).
The result only holds for fixed-precision neural nets: infinite precision nets can store infinite information with finite (small) weights.
arxiv.org/abs/2605.10878
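The fixed-precision caveat can be illustrated with a toy sketch. This is not the construction from the preprint, only an assumption-laden illustration: a single exact (unbounded-precision) number in [0, 1) can hold an arbitrarily long bit string, while a 64-bit float keeps only about 53 significant bits. The helper names `encode_bits` and `decode_bits` are hypothetical.

```python
from fractions import Fraction

def encode_bits(bits):
    """Pack a bit string into one number in [0, 1) as a binary fraction."""
    value = Fraction(0)
    for i, b in enumerate(bits, start=1):
        value += Fraction(int(b), 2 ** i)
    return value

def decode_bits(value, n):
    """Read back the first n binary digits of a number in [0, 1)."""
    out = []
    for _ in range(n):
        value *= 2
        bit = int(value >= 1)
        out.append(str(bit))
        value -= bit
    return "".join(out)

message = "10" * 40                      # 80 bits of "data"

exact = encode_bits(message)             # unbounded precision: one small "weight"
rounded = float(exact)                   # fixed precision: 64-bit float

recovered_exact = decode_bits(exact, len(message))
recovered_float = decode_bits(rounded, len(message))

print(recovered_exact == message)        # exact arithmetic round-trips all 80 bits
print(recovered_float == message)        # the float has dropped the trailing bits
```

Both stored values are smaller than 1, so the magnitude of the number says nothing about how much it encodes unless precision is bounded.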


@BushnaqLucius The proof works for looped neural architectures that can access an unbounded tape. Examples include chain-of-thought transformers, looped transformers with large context, neural computers, etc.

@Tiberiu_Musat_ Actually wait I'm confused, the paper seems to say the result holds for vanilla K-complexity, not K-complexity under some kind of memory bound. How does that work memory-wise when the mlp has a fixed finite width?

@BushnaqLucius I agree, implicit biases usually induce a similar prior without explicit regularization.

@Tiberiu_Musat_ Nice. Notably, the proof is actually about the number of non-zero parameters in the network. So this solution would also be favoured in the training prior on account of its size in the loss landscape, not just because of explicit weight regularisation.

@Andres_Nava_12 @MatthieuWyart Very interesting! Seems related to arxiv.org/abs/2601.19208 @SharonYixuanLi

Very happy to share my first first-author preprint with @MatthieuWyart.
Meaning is hierarchical: dog → mammal → animal.
This hierarchy appears as geometry in LLM embeddings. But where does that geometry come from?
We show that word co-occurrence statistics are sufficient to induce it.
arxiv.org/abs/2605.23821

Tiberiu Mușat retweeted

Mechanistic interpretability aspires to be the biology of deep learning. @KuninDaniel and @learning_mech say that an emerging theory of deep learning they and their team call 🛠️ learning mechanics 🛠️ will be the physics.
Tiberiu Mușat retweeted

LLM pretraining may follow a hidden curriculum.
Across 9 open models, component skills tend to emerge before composite skills in a predictable order.
The order is legible enough to predict held-out task trajectories.
What this tells you is that a loss curve dashboard is not enough when training models.
We need training milestone evals too, so you know where in the curriculum the model is.





