Maissam Barkeshli

683 posts

Maissam Barkeshli

@MBarkeshli

Visiting Researcher @ Meta FAIR. Professor of Physics @ University of Maryland & Joint Quantum Institute. Previously @ Berkeley, MIT, Stanford, Microsoft Station Q

University of Maryland, College Park · Joined December 2011
395 Following · 2.9K Followers
Pinned Tweet
Maissam Barkeshli @MBarkeshli ·
An absolutely incredible, highly interconnected web of ideas connecting some of the most important discoveries of late twentieth century physics and mathematics. This is an extremely abridged, biased history (1970-2010) with many truly ground-breaking works still not mentioned:
Maissam Barkeshli retweeted
Surya Ganguli @SuryaGanguli ·
Our new paper "Deriving neural scaling laws from the statistics of natural language" arxiv.org/abs/2602.07488, led by @Fraccagnetta & @AllanRaventos w/ Matthieu Wyart, makes a breakthrough! For the very first time, we can predict data-limited neural scaling law exponents from first principles, using the structure of natural language itself.

If you give us two properties of your natural language dataset:
1) How the conditional entropy of the next token decays with conditioning length.
2) How pairwise token correlations decay with time separation.
Then we can give you the exponent of the neural scaling law (loss versus amount of data) through a simple formula!

The key idea is that as you increase the amount of training data, models can look further back into the past to predict, and as long as they do this well, the conditional entropy of the next token, conditioned on all tokens up to this data-dependent prediction horizon, completely governs the loss. This gives us our simple formula for the neural scaling law!
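As a toy illustration (not from the paper), the two corpus statistics named in the tweet can be estimated directly from token counts. The function names, the simple n-gram estimator, and the miniature corpus below are illustrative assumptions, not the paper's method.

```python
# Hypothetical sketch: estimate (1) conditional entropy of the next token vs. context
# length and (2) pairwise token correlation vs. separation, from a toy corpus.
import math
from collections import Counter

def conditional_entropy(tokens, context_len):
    """H(next token | previous context_len tokens), estimated from n-gram counts."""
    ctx_counts, joint_counts = Counter(), Counter()
    for i in range(context_len, len(tokens)):
        ctx = tuple(tokens[i - context_len:i])
        ctx_counts[ctx] += 1
        joint_counts[ctx + (tokens[i],)] += 1
    n = sum(joint_counts.values())
    h = 0.0
    for joint, c in joint_counts.items():
        p_joint = c / n
        p_ctx = ctx_counts[joint[:-1]] / n
        h -= p_joint * math.log2(p_joint / p_ctx)
    return h

def pair_correlation(tokens, sep, target):
    """Covariance of the indicator of `target` at positions t and t+sep."""
    ind = [1.0 if t == target else 0.0 for t in tokens]
    mean = sum(ind) / len(ind)
    pairs = [ind[i] * ind[i + sep] for i in range(len(ind) - sep)]
    return sum(pairs) / len(pairs) - mean ** 2

corpus = ("the cat sat on the mat and the dog sat on the rug " * 200).split()
for L in (1, 2, 3):
    print(f"H(next | {L} tokens) ~ {conditional_entropy(corpus, L):.3f} bits")
for d in (1, 2, 4, 8):
    print(f"corr at separation {d} for 'the': {pair_correlation(corpus, d, 'the'):.4f}")
```

On a real corpus one would then fit how the first quantity decays with context length and how the second decays with separation; the tweet says these two decay laws determine the data-scaling exponent.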
Maissam Barkeshli @MBarkeshli ·
@GoonGarrett No, we have no understanding so far. I think it is relatively robust to model size but we didn’t do a careful study.
Garrett Goon @GoonGarrett ·
@MBarkeshli This is cool. Do you have some analytic understanding of the ~sqrt(m) context-length scaling law? Missed it if so. Curious how/if that scaling is sensitive to model size, as well
Maissam Barkeshli @MBarkeshli ·
Our ICLR 2026 paper shows how transformers can learn pseudo-random numbers. We demonstrate successful in-context prediction of pseudo-random sequences from permuted congruential generators (PCGs), which are used in practice in NumPy. We successfully attacked PCGs with moduli up to 2^22. Surprisingly, the transformer can learn the sequence even when only one bit is output from the hidden state.

We found that curriculum learning is essential for these problems. We also found novel structures in the embedding layers: the model spontaneously clusters numbers according to how their bit strings transform under rotations.
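For readers unfamiliar with PCGs, here is a minimal sketch of a PCG-style generator at the small modulus scale mentioned above: an LCG state update followed by a permuted, truncated output. The multiplier, increment, and permutation details are illustrative assumptions, not the parameters studied in the paper or used by NumPy.

```python
# Hypothetical sketch of a PCG-style generator with a small modulus (2^22).
MOD_BITS = 22                          # the largest modulus scale mentioned in the tweet
MOD = 1 << MOD_BITS
MULT, INC = 747796405, 2891336453      # arbitrary odd multiplier / increment for the demo

def pcg_steps(state, n, out_bits=1):
    """Advance an LCG n times; each step emits a permuted slice of the hidden state."""
    outs = []
    for _ in range(n):
        state = (state * MULT + INC) % MOD          # LCG hidden-state update
        x = state ^ (state >> (MOD_BITS // 2))      # xorshift mixing of high bits
        rot = state >> (MOD_BITS - 3)               # top 3 bits choose a rotation
        x = ((x >> rot) | (x << (MOD_BITS - rot))) % MOD
        outs.append(x >> (MOD_BITS - out_bits))     # reveal only the top out_bits bits
    return outs, state

# A training sequence for in-context prediction might look like:
seq, _ = pcg_steps(state=12345, n=16, out_bits=1)
print(seq)   # e.g. a 16-token binary sequence the transformer must continue
```

With out_bits=1 only a single bit of the 22-bit hidden state is revealed per step, which is the regime the tweet calls surprising for the transformer to crack.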
Maissam Barkeshli @MBarkeshli ·
Scaling laws in AI: where do they come from? The discovery of neural scaling laws several years ago showed that the loss decreases predictably as a power law in model size, amount of data, and compute. But why? And what sets the exponents of the power law?

The most popular explanation is that the dataset already has power-law correlations in it (for example, power laws are prevalent in natural language corpora, e.g. Zipf's law), which translate to power laws in the loss.

We studied transformers performing next-token prediction on sequences coming from random walks on random graphs, where the data has no power-law correlations. Nevertheless, after training the model, we observed power laws in the loss that look similar to those found in natural language. For example, we show results from a random walk on an Erdős-Rényi graph with 8K edges and 50K nodes.

This challenges existing explanations, since this dataset of random walks falls outside the assumptions made in existing models of scaling laws. Going forward, we need explanations of scaling laws based on the expressivity and learnability of discrete data, where there is no data manifold, and which do not require the data to already have power laws built in.

We also found a setting where we could tune the complexity of a language dataset by starting with a bigram model and gradually dialing up complexity until we get to natural language. This allowed us to track how the exponents of the scaling laws change with complexity.
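As a concrete illustration of this kind of dataset, here is a hedged sketch of generating random-walk token sequences on an Erdős-Rényi graph. The graph sizes, function names, and sampling details are illustrative assumptions, far smaller than the 50K-node experiment described above.

```python
# Hypothetical sketch: token sequences from random walks on an Erdős-Rényi graph.
import random

def erdos_renyi(n_nodes, n_edges, seed=0):
    """Sample a simple undirected graph with a fixed number of edges."""
    rng = random.Random(seed)
    adj = {v: set() for v in range(n_nodes)}
    while sum(len(s) for s in adj.values()) // 2 < n_edges:
        u, v = rng.randrange(n_nodes), rng.randrange(n_nodes)
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    return adj

def random_walk(adj, length, seed=0):
    """One walk = one training sequence; each node id plays the role of a token."""
    rng = random.Random(seed)
    nodes_with_edges = [v for v, nbrs in adj.items() if nbrs]
    walk = [rng.choice(nodes_with_edges)]
    for _ in range(length - 1):
        walk.append(rng.choice(sorted(adj[walk[-1]])))
    return walk

graph = erdos_renyi(n_nodes=500, n_edges=800)
print(random_walk(graph, length=32))   # a 32-token sequence for next-token prediction
```

By construction the walk is Markovian, so the dataset has no built-in long-range power-law correlations, which is what makes the observed power-law loss curves surprising.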
Zohar Komargodski @ZoharKo ·
@MBarkeshli @BSeradjeh I mean it is also not surprising at all: Hamas and Hezbollah, which the pro-Palestine crowd supported for months, are funded by the Ayatollah.
Zohar Komargodski @ZoharKo ·
Over 2000 protesters were killed, apparently 😭 But note that there are no demonstrations on university campuses, no daily UN meetings, no petitions, and the keyboard freedom fighters that we all got to know so well on this platform are very quiet. nytimes.com/2026/01/10/wor…
Zohar Komargodski @ZoharKo ·
@MBarkeshli @BSeradjeh I just saw a demonstration in NYC pro Hamas and pro Ayatollah. These go together, very unsurprisingly, as one funds the other.
Maissam Barkeshli @MBarkeshli ·
@BSeradjeh @ZoharKo Well, almost no Iranians in the US have protested in this case, whereas many protested during Israel-Gaza, so that can’t be true.
Maissam Barkeshli @MBarkeshli ·
@ZoharKo But I think a lot of those activities were aimed at trying to affect foreign policy. In this case there is no foreign policy to try to change. The lack of media coverage is more concerning to me
Maissam Barkeshli retweeted
Surya Ganguli @SuryaGanguli ·
We have 14 survey lectures for our @SimonsFdn Collaboration on the Physics of Learning and Neural Computation! All videos available at: physicsoflearning.org/webinar-series

Here is the list:
@zdeborova: Attention-based models and how to solve them using tools from quadratic networks and matrix denoising
@KempeLab: Recent lessons from LLM reasoning
@MBarkeshli: Sharpness dynamics in neural network training
@KrzakalaF: How Do Neural Networks Learn Simple Functions with Gradient Descent?
Michael Douglas: Mathematics, Economics and AI
Yuhai Tu: Towards a Physics-based Theoretical Foundation for Deep Learning: Stochastic Learning Dynamics and Generalization
@SuryaGanguli: An analytic theory of creativity for convolutional diffusion models
Eva Silverstein: Hamiltonian dynamics for stabilizing neural simulation-based inference
@adnarim066: Generation with Unified Diffusion
Bernd Rosenow: Random matrix analysis of neural networks: distinguishing noise from learned information
@jhhalverson: Neural networks and conformal field theory
@KempeLab: Synthetic data: friend or foe in the age of scaling
@WyartMatthieu: Learning hierarchical representations with deep architectures
@CPehlevan: Mean-field theory of deep network learning dynamics and applications to neural scaling laws
Maissam Barkeshli retweeted