Jasper Gilley

3.3K posts


@0xjasper

MTS @yutori_ai | the greatest art is yet to be created

San Francisco, CA · Joined December 2013
550 Following · 949 Followers
Jasper Gilley@0xjasper·
# Why RL on high-dimensional data recreates Zipfian grokking dynamics

Suppose a model's learned representations form a manifold M of intrinsic dimensionality k_model, and the task has intrinsic dimensionality k_task. If k_model ~= k_task (e.g., tasks like math and coding), then policy gradients will tend to be on-manifold. But if k_model << k_task (e.g., open-ended reasoning, creative writing), RL recreates Zipfian grokking dynamics.

We expect advantages to be heavy-tailed because they are created by sequential composition, where good choices (actions, next-token predictions, etc.) compound on each other. The highest-advantage rollouts are therefore the ones that compounded the most atypical choices, pushing their activations furthest from M. And because arbitrary displacements in high-dimensional space are increasingly likely to be orthogonal to any given manifold, these large displacements are overwhelmingly off-manifold.

Off-manifold updates are tantamount to memorization because they reject the directions the model has already chosen to compress the data. So the highest-advantage rollouts lead to the most memorization, and they also carry the most gradient weight. Therefore, RL in high-dimensional space is likely to create the "soft ceiling on abstraction quality" I refer to in the article :)
Jasper Gilley@0xjasper

x.com/i/article/2031…
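Not from the post itself, but a minimal numpy sketch of its two load-bearing claims, with made-up parameters: per-step gains that compound multiplicatively yield heavy-tailed advantages, and a random displacement in high-dimensional space puts almost none of its energy on a k-dimensional subspace standing in for M.

```python
import numpy as np

rng = np.random.default_rng(0)

d, k = 1024, 16        # ambient activation dim vs. manifold dim (hypothetical)
n_rollouts, T = 5000, 50

# Claim 1: advantages built by sequential composition are heavy-tailed.
# Per-step log-gains add, so the total advantage exp(sum) is log-normal-ish,
# with a far heavier tail than any single step's.
log_gains = rng.normal(0.0, 0.3, size=(n_rollouts, T))
advantages = np.exp(log_gains.sum(axis=1))
print("p99 / median advantage:", np.percentile(advantages, 99) / np.median(advantages))

# Claim 2: random displacements are nearly orthogonal to a low-dim manifold.
basis, _ = np.linalg.qr(rng.normal(size=(d, k)))       # orthonormal basis for "M"
v = rng.normal(size=(n_rollouts, d))
v /= np.linalg.norm(v, axis=1, keepdims=True)
on_manifold = np.linalg.norm(v @ basis, axis=1) ** 2   # fraction of energy on M
print("mean on-manifold fraction:", on_manifold.mean(), "vs k/d =", k / d)
```

The first print comes out around a hundred and the second around 0.016: the biggest advantages are extreme outliers, and essentially all of a random update's energy lands off-manifold.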

You Jiacheng@YouJiacheng·
@cloneofsimo I wonder what will happen if we train for 100 epochs with 100 TPP data.
Jasper Gilley@0xjasper·
@RoyalCities Most exciting work going on in music AI in a long time! You rock, keep it up
RoyalCities@RoyalCities·
As just some random dude, seeing my model hit top trending on Hugging Face is kinda insane to me. It’s been about a week, so here’s what worked, what didn’t, and what’s coming next 👇 (short 📜- spoilers I’m not done)
Jasper Gilley@0xjasper·
@_ueaj Right! So it seems like perhaps the simplest explanation for at least part of the phenomenon of scaling laws is the heavy-tailed nature of the things we're training on?
ueaj@_ueaj·
well my point is that *everything* has Zipf distributions (or more precisely, Pareto distributions). Hierarchical network connections are also distributed like this, and occur everywhere (neurons in the brain, internet hyperlinks, computer networks, food chains, etc.). The density distribution of matter in our universe is also a power law. So is the distribution of atomic masses of compounds. Literally *everything* at sufficient scale in the universe is organized like this
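For what it's worth, the rank-frequency signature these tweets are gesturing at is easy to reproduce from scratch; a self-contained check with synthetic data (the exponent is chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample from a power law and check Zipf's rank-frequency relation:
# log(frequency) should fall roughly linearly in log(rank).
a = 1.5
samples = rng.zipf(a, size=1_000_000)
_, counts = np.unique(samples, return_counts=True)
freqs = np.sort(counts)[::-1]          # frequency of the r-th most common value

ranks = np.arange(1, len(freqs) + 1)
slope, _ = np.polyfit(np.log(ranks[:500]), np.log(freqs[:500]), 1)
print("fitted rank-frequency slope:", slope)  # ≈ -a for this sampler
```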
Jasper Gilley@0xjasper·
@_ueaj I mean yeah but images, user behavior, audio etc. also happen to have Zipf-like distributions. You could do similar papers to the above in those domains too
ueaj@_ueaj·
Actually scaling laws apply to images, video, audio, embeddings, recsys, agentic behavior, etc. It's not just language.

IMO there are 2 even more fundamental laws of physics, deeper than the literal QM/GR rules:
1. hierarchical organization
2. emergent complexity

The deep neural network is the perfect inductive bias for emergent complexity, as the circuits deeper in the model can be composed into higher-level ones, i.e., they affect the hypothesis space of higher layers, which is exactly how emergent complexity works. This is why DNNs work in virtually every field at scale: it's the only truly universal inductive bias
Jasper Gilley@0xjasper·
@_ueaj I imagine it owes to the natural distributions of natural language! No reason it'd make a difference whether you're modeling it or meta-modeling it arxiv.org/abs/2602.07488
ueaj@_ueaj·
I can first-principles explain why scale works everywhere, but it's cool to see it validated experimentally
Hiranmay Darshane@hdarshane·
Amazing post that deserves much more attention... Confirms the intuition that your data distribution acts as a generalisation pressure/regulariser through amazingly clean toy experiments... Really recommend reading it.
Jasper Gilley@0xjasper

x.com/i/article/2031…

Joyousguard@truththrulove·
@WilliamShatner Original Trek was written as a conversation. New Trek is written as a lecture.
William Shatner@WilliamShatner·
During the first airing of my Star Trek series, where a kiss was objectionable, many southern stations pulled the episode & condemned the show. Using today’s vernacular it would absolutely be called “woke DEI crap” because it went against the “norms” of society for its time. Not a lot seems to have changed.🤷🏼😑
Jasper Gilley@0xjasper·
@_TobiasLee x being good enough depends on models having a privileged view of their own internals
Lei Li@_TobiasLee·
continual learning for agents can be explained in one line: Y = Wx
- change x: memory.md
- change W: weight updates, e.g., LoRA
changing x is sufficiently good now
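A toy numpy rendering of the two knobs in the Y = Wx framing; the memory vector and the low-rank shapes below are illustrative stand-ins, not anyone's actual system:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r = 16, 8, 2

W = rng.normal(size=(d_out, d_in))     # frozen weights
x = rng.normal(size=d_in)              # current context

# Knob 1: change x. Blend retrieved memory into the input
# (a stand-in for putting memory.md in context).
memory = rng.normal(size=d_in)
y_memory = W @ (x + 0.5 * memory)

# Knob 2: change W. Add a low-rank (LoRA-style) update B @ A.
A = 0.1 * rng.normal(size=(r, d_in))
B = 0.1 * rng.normal(size=(d_out, r))
y_lora = (W + B @ A) @ x

print(y_memory.shape, y_lora.shape)    # same interface either way: Y = Wx
```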
Jasper Gilley@0xjasper·
@hdarshane agreed on all counts! It definitely feels like a poorly-motivated/poorly-executed test of generalization. But you would probably still concede that they did show something like "the model's programming reasoning circuitry is somewhat entangled with its Python circuitry", right?
Hiranmay Darshane@hdarshane·
@0xjasper It's not in good spirit to shackle generalisation that happens via those vectors, thus reducing total generalisation capabilities, and to then say something like "these models fail to generalise" in the Gary Marcus dialect or whatever
Jasper Gilley@0xjasper·
@Noahpinion IMO the proper term for this sort of thinking is more 'leftist' than 'progressive'
Jasper Gilley@0xjasper·
@rabrg @eshear You can compress lots of uninteresting things without actually learning anything though!
Emmett Shear@eshear·
Learning is as much about effectively forgetting noise as it is about remembering signal.
Tenobrus@tenobrus·
i can't fucking believe this is really our roster man. what the fuck did we do to deserve this
Jasper Gilley@0xjasper·
I suspect that the path to superintelligence lies through the sort of thing that could be called "gradient aesthetics." If you've already learned from all text ever written, the only way to get more signal is to think harder about the best data you have.
Noah Smith 🐇🇺🇸🇺🇦🇹🇼@Noahpinion·
The reason AI still can't write well is that it writes what IT wants to say, not what YOU want to say. Writing is thinking. I expect this to be fixable, but not via typical "scale to AGI" approaches.
Natasha Jaques@natashajaques

The paper I’ve been most obsessed with lately is finally out: nbcnews.com/tech/tech-news…! Check out this beautiful plot: it shows how much LLMs distort human writing when making edits, compared to how humans would revise the same content.

We take a dataset of human-written essays from 2021, before the release of ChatGPT. We compare how people revise draft v1 -> v2 given expert feedback with how an LLM revises the same v1 given the same feedback. This enables a counterfactual comparison: how much does the LLM alter the essay compared to what the human was originally intending to write?

We find LLMs consistently induce massive distortions, even changing the actual meaning and conclusions argued for.
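The tweet doesn't say how distortion is scored; purely as an illustration of the counterfactual setup it describes, here is one hypothetical way to operationalize it (the embedding space and all names are my stand-ins, not the paper's metric):

```python
import numpy as np

def distortion(v1_emb, v2_human_emb, v2_llm_emb):
    """How far the LLM's revision lands from the human's own v2,
    relative to how far the human actually moved from v1."""
    human_move = np.linalg.norm(v2_human_emb - v1_emb)
    llm_offset = np.linalg.norm(v2_llm_emb - v2_human_emb)
    return llm_offset / max(human_move, 1e-8)

# Toy usage with random stand-ins for text embeddings:
rng = np.random.default_rng(0)
v1, v2_human, v2_llm = (rng.normal(size=384) for _ in range(3))
print(distortion(v1, v2_human, v2_llm))
```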

roon@tszzl·
@Noahpinion “using gpt 5 mini” every. damn. time