Jasper Gilley

3.3K posts


@0xjasper

MTS @yutori_ai | the greatest art is yet to be created

San Francisco, CA · Joined December 2013
550 Following · 949 Followers
Jasper Gilley@0xjasper·
# Why RL on high-dimensional data recreates Zipfian grokking dynamics

Suppose a model's learned representations form a manifold M of intrinsic dimensionality k_model, and the task has intrinsic dimensionality k_task. If k_model ~= k_task (e.g., tasks like math and coding), then policy gradients will tend to be on-manifold. But if k_model << k_task (e.g., open-ended reasoning, creative writing), RL recreates Zipfian grokking dynamics.

We expect advantages to be heavy-tailed because they are created by sequential composition, where good choices (actions, next-token predictions, etc.) compound on each other. The highest-advantage rollouts are therefore the ones that compounded the most atypical choices, pushing their activations furthest from M. And because arbitrary displacements in high-dimensional space are increasingly likely to be orthogonal to any given manifold, these large displacements are overwhelmingly off-manifold.

Off-manifold updates are tantamount to memorization because they reject the directions the model has already chosen to compress the data. So the highest-advantage rollouts lead to the most memorization, and they also carry the most gradient weight. Therefore, RL in high-dimensional space is likely to create the "soft ceiling on abstraction quality" I refer to in the article :)
Jasper Gilley@0xjasper

x.com/i/article/2031…
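Not from the post itself, but a minimal numpy sketch of its two load-bearing claims, with made-up parameters: per-step gains that compound multiplicatively yield heavy-tailed advantages, and a random displacement in high-dimensional space puts almost none of its energy on a k-dimensional subspace standing in for M.

```python
import numpy as np

rng = np.random.default_rng(0)

d, k = 1024, 16        # ambient activation dim vs. manifold dim (hypothetical)
n_rollouts, T = 5000, 50

# Claim 1: advantages built by sequential composition are heavy-tailed.
# Per-step log-gains add, so the total advantage exp(sum) is log-normal-ish,
# with a far heavier tail than any single step's.
log_gains = rng.normal(0.0, 0.3, size=(n_rollouts, T))
advantages = np.exp(log_gains.sum(axis=1))
print("p99 / median advantage:", np.percentile(advantages, 99) / np.median(advantages))

# Claim 2: random displacements are nearly orthogonal to a low-dim manifold.
basis, _ = np.linalg.qr(rng.normal(size=(d, k)))       # orthonormal basis for "M"
v = rng.normal(size=(n_rollouts, d))
v /= np.linalg.norm(v, axis=1, keepdims=True)
on_manifold = np.linalg.norm(v @ basis, axis=1) ** 2   # fraction of energy on M
print("mean on-manifold fraction:", on_manifold.mean(), "vs k/d =", k / d)
```

The first print comes out around a hundred and the second around 0.016: the biggest advantages are extreme outliers, and essentially all of a random update's energy lands off-manifold.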

You Jiacheng@YouJiacheng·
@cloneofsimo I wonder what will happen if we train for 100 epochs with 100 TPP data.
Jasper Gilley@0xjasper·
@RoyalCities Most exciting work going on in music AI in a long time! You rock, keep it up
RoyalCities@RoyalCities·
As just some random dude, seeing my model hit top trending on Hugging Face is kinda insane to me. It’s been about a week, so here’s what worked, what didn’t, and what’s coming next 👇 (short 📜- spoilers I’m not done)
Jasper Gilley@0xjasper·
@_ueaj Right! So it seems like perhaps the simplest explanation for at least part of the phenomenon of scaling laws is the heavy-tailed nature of the things we're training on?
ueaj@_ueaj·
well my point is that *everything* has Zipf distributions (or more precisely, Pareto distributions). Hierarchical network connections are also distributed like this, and occur everywhere (neurons in the brain, internet hyperlinks, computer networks, food chains, etc.). The density distribution of matter in our universe is also a power law. So is the distribution of atomic masses of compounds. Literally *everything* at sufficient scale in the universe is organized like this
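For what it's worth, the rank-frequency signature these tweets are gesturing at is easy to reproduce from scratch; a self-contained check with synthetic data (the exponent is chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample from a power law and check Zipf's rank-frequency relation:
# log(frequency) should fall roughly linearly in log(rank).
a = 1.5
samples = rng.zipf(a, size=1_000_000)
_, counts = np.unique(samples, return_counts=True)
freqs = np.sort(counts)[::-1]          # frequency of the r-th most common value

ranks = np.arange(1, len(freqs) + 1)
slope, _ = np.polyfit(np.log(ranks[:500]), np.log(freqs[:500]), 1)
print("fitted rank-frequency slope:", slope)  # ≈ -a for this sampler
```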
Jasper Gilley@0xjasper·
@_ueaj I mean yeah but images, user behavior, audio etc. also happen to have Zipf-like distributions. You could do similar papers to the above in those domains too
ueaj@_ueaj·
Actually scaling laws apply to images, video, audio, embeddings, recsys, agentic behavior, etc. It's not just language.

IMO there are 2 even more fundamental laws of physics, deeper than the literal QM/GR rules:
1. hierarchical organization
2. emergent complexity

The deep neural network is the perfect inductive bias for emergent complexity, as the circuits deeper in the model can be composed into higher-level ones, i.e., they affect the hypothesis space of higher layers, which is exactly how emergent complexity works. This is why DNNs work in virtually every field at scale: it's the only truly universal inductive bias
Jasper Gilley@0xjasper·
@_ueaj I imagine it owes to the natural distributions of natural language! No reason it'd make a difference whether you're modeling it or meta-modeling it arxiv.org/abs/2602.07488
ueaj@_ueaj·
I can first-principles explain why scale works everywhere, but it's cool to see it validated experimentally
Hiranmay Darshane@hdarshane·
Amazing post that deserves much more attention... Confirms the intuition that your data distribution acts as a generalisation pressure/regulariser through amazingly clean toy experiments... Really recommend reading it.
Jasper Gilley@0xjasper

x.com/i/article/2031…

Joyousguard@truththrulove·
@WilliamShatner Original Trek was written as a conversation. New Trek is written as a lecture.
William Shatner@WilliamShatner·
During the first airing of my Star Trek series, where a kiss was objectionable, many southern stations pulled the episode & condemned the show. Using today’s vernacular it would absolutely be called “woke DEI crap” because it went against the “norms” of society for its time. Not a lot seems to have changed.🤷🏼😑
Jasper Gilley@0xjasper·
@_TobiasLee x being good enough depends on models having a privileged view of their own internals
Lei Li@_TobiasLee·
continual learning for agents can be explained in one line: Y = Wx
- change x: memory.md
- change W: weight updates, e.g., LoRA
changing x is sufficiently good now
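A toy numpy rendering of the two knobs in the Y = Wx framing; the memory vector and the low-rank shapes below are illustrative stand-ins, not anyone's actual system:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r = 16, 8, 2

W = rng.normal(size=(d_out, d_in))     # frozen weights
x = rng.normal(size=d_in)              # current context

# Knob 1: change x. Blend retrieved memory into the input
# (a stand-in for putting memory.md in context).
memory = rng.normal(size=d_in)
y_memory = W @ (x + 0.5 * memory)

# Knob 2: change W. Add a low-rank (LoRA-style) update B @ A.
A = 0.1 * rng.normal(size=(r, d_in))
B = 0.1 * rng.normal(size=(d_out, r))
y_lora = (W + B @ A) @ x

print(y_memory.shape, y_lora.shape)    # same interface either way: Y = Wx
```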
Jasper Gilley@0xjasper·
@hdarshane agreed on all counts! It definitely feels like a poorly-motivated/poorly-executed test of generalization. But you would probably still concede that they did show something like "the model's programming reasoning circuitry is somewhat entangled with its Python circuitry", right?
Hiranmay Darshane@hdarshane·
@0xjasper It's not in good spirit to shackle generalisation that happens via those vectors, thus reducing total generalisation capabilities, and to then say something like "these models fail to generalise" in the Gary Marcus dialect or whatever
Jasper Gilley@0xjasper·
@Noahpinion IMO the proper term for this sort of thinking is more 'leftist' than 'progressive'
Jasper Gilley@0xjasper·
@rabrg @eshear You can compress lots of uninteresting things without actually learning anything though!
Emmett Shear@eshear·
Learning is as much about effectively forgetting noise as it is about remembering signal.
Tenobrus@tenobrus·
i can't fucking believe this is really our roster man. what the fuck did we do to deserve this
Jasper Gilley@0xjasper·
I suspect that the path to superintelligence lies through the sort of thing that could be called "gradient aesthetics." If you've already learned from all text ever written, the only way to get more signal is to think harder about the best data you have.
Noah Smith 🐇🇺🇸🇺🇦🇹🇼@Noahpinion·
The reason AI still can't write well is that it writes what IT wants to say, not what YOU want to say. Writing is thinking. I expect this to be fixable, but not via typical "scale to AGI" approaches.
Natasha Jaques@natashajaques

The paper I’ve been most obsessed with lately is finally out: nbcnews.com/tech/tech-news…! Check out this beautiful plot: it shows how much LLMs distort human writing when making edits, compared to how humans would revise the same content.

We take a dataset of human-written essays from 2021, before the release of ChatGPT. We compare how people revise draft v1 -> v2 given expert feedback with how an LLM revises the same v1 given the same feedback. This enables a counterfactual comparison: how much does the LLM alter the essay compared to what the human was originally intending to write?

We find LLMs consistently induce massive distortions, even changing the actual meaning and conclusions argued for.
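The tweet doesn't say how distortion is scored; purely as an illustration of the counterfactual setup it describes, here is one hypothetical way to operationalize it (the embedding space and all names are my stand-ins, not the paper's metric):

```python
import numpy as np

def distortion(v1_emb, v2_human_emb, v2_llm_emb):
    """How far the LLM's revision lands from the human's own v2,
    relative to how far the human actually moved from v1."""
    human_move = np.linalg.norm(v2_human_emb - v1_emb)
    llm_offset = np.linalg.norm(v2_llm_emb - v2_human_emb)
    return llm_offset / max(human_move, 1e-8)

# Toy usage with random stand-ins for text embeddings:
rng = np.random.default_rng(0)
v1, v2_human, v2_llm = (rng.normal(size=384) for _ in range(3))
print(distortion(v1, v2_human, v2_llm))
```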

roon@tszzl·
@Noahpinion “using gpt 5 mini” every. damn. time