Fern

2.2K posts

@hi_tysam

Neural network speedrunner and community-funded open source researcher. Set the CIFAR-10 record several times. Say hi!

Joined January 2023
221 Following · 3K Followers
Pinned Tweet
Fern
Fern@hi_tysam·
New NanoGPT training speed record: 3.28 FineWeb val loss in 3.17 minutes on 8xH100
Previous record (recreation): 3.32 minutes
Lots of changes!
- New token-dependent lm_head bias
- Fused several ops
- Multi-GPU grad bugfix
- Steps 1390 -> 1350
- Semi-ortho init
- (More in thread)
[image]
12 replies · 26 reposts · 262 likes · 38.3K views
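The pinned record lists "semi-ortho init" among the changes; below is a minimal sketch of what a semi-orthogonal init can look like in PyTorch. The record's actual implementation may differ, and the check tolerance is arbitrary.

```python
import torch

@torch.no_grad()
def semi_orthogonal_init_(weight: torch.Tensor, gain: float = 1.0) -> torch.Tensor:
    """Sketch of a semi-orthogonal init (assumed reading of the tweet): fill a 2D
    weight with a semi-orthogonal matrix, so the smaller of W @ W.T / W.T @ W is
    (gain^2 times) the identity."""
    torch.nn.init.orthogonal_(weight, gain=gain)   # handles non-square weights already
    return weight

# quick check on a non-square weight: rows come out orthonormal
W = semi_orthogonal_init_(torch.empty(64, 256))
assert torch.allclose(W @ W.T, torch.eye(64), atol=1e-4)
```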
kalomaze
kalomaze@kalomaze·
well technically yes but what is usually logged is the "raw norm", and the clipping is applied globally before either of those things, no?
Lucas Beyer (bl16)@giffmana

@kalomaze No you are mixing things! What's usually clipped is the gradient norm. What the formula above computes (approximates) is the norm of the parameter update (before lr and wd). These are very different things.

2 replies · 0 reposts · 30 likes · 11.2K views
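The formula being discussed lives in an image that isn't reproduced here, so the following is only an illustrative sketch of the distinction Lucas is drawing, using plain Adam (function name and hyperparameters are mine, not from the thread): the gradient norm is what clipping usually acts on, while the update norm is the size of m_hat / (sqrt(v_hat) + eps) before lr and weight decay are applied.

```python
import torch

def grad_vs_update_norm(grad, exp_avg, exp_avg_sq, step,
                        beta1=0.9, beta2=0.95, eps=1e-8):
    """Contrast the gradient norm (what clipping acts on) with the norm of the
    Adam update direction (the step before lr and weight decay)."""
    grad_norm = grad.norm()

    # standard Adam moment updates, for reference
    exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
    exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
    m_hat = exp_avg / (1 - beta1 ** step)
    v_hat = exp_avg_sq / (1 - beta2 ** step)
    update_norm = (m_hat / (v_hat.sqrt() + eps)).norm()

    return grad_norm.item(), update_norm.item()
```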
Fern
Fern@hi_tysam·
@evaninwords @main_horse @kalomaze @stochasticchasm @clashluke Yeah, defo for the input. Then the norms of each vector should be ~0.7-1.3ish after it's done, IIRC. I think there was some attempt to get a better estimator for this to reduce the number of steps, but this seemed to be the computationally cheapest, IIRC
0 replies · 0 reposts · 1 like · 109 views
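For context on the ~0.7-1.3 range: a sketch of the quintic Newton-Schulz orthogonalization used in Muon-style optimizers. The coefficients are copied from the public Muon/modded-nanogpt code and should be treated as an assumption here; the iteration only approximately pushes singular values to 1, which is where the spread comes from.

```python
import torch

@torch.no_grad()
def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5, eps: float = 1e-7):
    """Approximately orthogonalize a 2D matrix via a quintic Newton-Schulz iteration,
    driving its singular values toward 1 (only roughly, hence ~0.7-1.3)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + eps)            # scale so the spectral norm is <= 1
    transposed = G.size(0) > G.size(1)
    if transposed:
        X = X.T                         # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X
```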
Fern
Fern@hi_tysam·
@stochasticchasm @main_horse @kalomaze Yeah. The idea is pretty unique. Could see it being especially helpful for e.g. RL when you do really want constrained updates to your network
0 replies · 0 reposts · 1 like · 69 views
stochasm
stochasm@stochasticchasm·
@hi_tysam @main_horse @kalomaze I imagine they picked less stable hyperparams to show that this method is stable in cases where it would otherwise be unstable, but yeah not sure
1 reply · 0 reposts · 0 likes · 82 views
Fern
Fern@hi_tysam·
@evaninwords @main_horse @kalomaze @stochasticchasm @clashluke Yes, same for the layerwise norm-scaling opts that were spicy, oh gosh that's 8 years ago now noooooo. The unit-norm bit does range from ~0.7-1.3ish, but I think the coefficients used for NS mean the vast majority are 1 and below (a lot in the upper 0.7s, IIRC)
1 reply · 0 reposts · 2 likes · 98 views
Evan Walters
Evan Walters@evaninwords·
@hi_tysam @main_horse @kalomaze @stochasticchasm Yeah I like that ZClip idea @clashluke once shared with me. Interestingly, if you use muon, momentum is normalized to unit norm layerwise anyway so pre-optimizer clipping won’t really matter (it’ll just change what goes into momentum a little bit).
1 reply · 0 reposts · 5 likes · 101 views
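To make the point above concrete, a simplified sketch (not the full Muon optimizer; the exact SVD here stands in for the Newton-Schulz iteration sketched earlier): because the update is the orthogonalized momentum, its layerwise norm is essentially fixed, so clipping the incoming gradient only changes what feeds the momentum buffer.

```python
import torch

def muon_style_step(momentum: torch.Tensor, grad: torch.Tensor, beta: float = 0.95):
    """One simplified Muon-like step: accumulate the (possibly clipped) gradient into
    momentum, then return the orthogonalized momentum as the update."""
    momentum.mul_(beta).add_(grad)             # grad, clipped or not, only feeds the buffer
    U, _, Vh = torch.linalg.svd(momentum, full_matrices=False)
    update = U @ Vh                            # all singular values set to 1
    return update                              # Frobenius norm = sqrt(min(m, n)), independent of grad scale
```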
Fern
Fern@hi_tysam·
@stochasticchasm @main_horse @kalomaze Seems interesting! The loss curves of the method and the baseline look pretty sus, but the method is unique, so I'm curious what the actual value is
[image]
2 replies · 0 reposts · 3 likes · 1.2K views
Fern
Fern@hi_tysam·
@main_horse @kalomaze @stochasticchasm i'm not a fan of grad clipping, i think it's a monstrous bandaid that doesn't change underlying problems (only hides them), but: imo, if it happens, it should be @ per-example grads pre-aggregation, optionally percentile-based (e.g. 0.03% outliers under a Gaussian assumption, etc.)
2 replies · 1 repost · 9 likes · 1.1K views
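A rough sketch of the suggestion above (the function name, the flattened-gradient layout, and the percentile handling are my assumptions): clip each example's gradient to a percentile-based norm threshold before the gradients are averaged into the batch gradient.

```python
import torch

def clip_per_example_grads(per_example_grads: torch.Tensor, pct: float = 99.7):
    """per_example_grads: (batch, num_params) flattened per-example gradients.
    Shrink only the outlier rows (above the pct-th percentile of norms), then average."""
    norms = per_example_grads.norm(dim=1)                    # (batch,)
    threshold = torch.quantile(norms, pct / 100.0)
    scale = (threshold / (norms + 1e-6)).clamp(max=1.0)      # only shrink outliers
    clipped = per_example_grads * scale.unsqueeze(1)
    return clipped.mean(dim=0)                               # aggregated batch gradient
```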
main
main@main_horse·
@kalomaze @stochasticchasm if you want to leak an ongoing prime pretraining run, feel free. otherwise we're just talking priors at this point
1 reply · 0 reposts · 6 likes · 1.2K views
Fern
Fern@hi_tysam·
@ID_AA_Carmack @actualhog Agreed, details would be great! As a speedrunner, v curious about the opts/diffs here. Is there a good high-level summary of what you did / anything clever / any suspicions you might have about it? Curious to chat more.
0 replies · 0 reposts · 2 likes · 2.3K views
John Carmack
John Carmack@ID_AA_Carmack·
@actualhog Details, please! Reaching human-level performance on all the Atari games in an hour would indeed be SOTA.
14 replies · 5 reposts · 685 likes · 54.2K views
actual hog
actual hog@actualhog·
one nn learns every atari game at once in realtime from scratch in one hour on a 4090. 56 seconds of gpu time per game. no pause or reset or memory peeking, just 60fps color images. i was an rl skeptic two weeks ago. now i don't know. just trained this today.
36 replies · 68 reposts · 1.2K likes · 167.9K views
Fern
Fern@hi_tysam·
@cloneofsimo (that said, the extreme discontinuity there for the minimum value is surprising to me, that's really neat!)
0 replies · 0 reposts · 0 likes · 82 views
Fern
Fern@hi_tysam·
@cloneofsimo yes, this is not at all surprising! i've tweeted about this before, it's not the beta1 or beta2s that matter as much as their ratios (this is a fun problem to work out why, but i can give the answer if you want!)
1 reply · 0 reposts · 0 likes · 124 views
John Carmack
John Carmack@ID_AA_Carmack·
I recently learned about Cayley transforms. Similar to how you can parameterize a 3x3 rotation matrix by 3 Euler angles or a 4-element quaternion, Cayley transforms allow you to parameterize an N-dimensional rotation matrix with just N*(N-1)/2 unique values in a skew-symmetric matrix, saving more than half the parameters and guaranteeing that the matrix will always be orthogonal. Unfortunately, the transformation involves a linalg.solve() or pinverse(), so it gets slow with thousands of dimensions. Still, I am happy to have this in my mental toolbox now! en.wikipedia.org/wiki/Cayley_tr…
80 replies · 87 reposts · 1.9K likes · 183.8K views
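A minimal sketch of the parameterization described above: pack n*(n-1)/2 free values into the upper triangle of a skew-symmetric matrix A, then map it to an orthogonal Q via the Cayley transform, paying the linear solve mentioned in the tweet. The check tolerance is arbitrary.

```python
import torch

def cayley_rotation(theta: torch.Tensor, n: int) -> torch.Tensor:
    """Map n*(n-1)/2 free parameters to an n x n orthogonal matrix
    Q = (I + A)^{-1} (I - A), where A is skew-symmetric (A.T == -A)."""
    A = torch.zeros(n, n, dtype=theta.dtype)
    iu = torch.triu_indices(n, n, offset=1)
    A[iu[0], iu[1]] = theta
    A = A - A.T                                   # skew-symmetric
    I = torch.eye(n, dtype=theta.dtype)
    return torch.linalg.solve(I + A, I - A)       # the O(n^3) solve mentioned above

# quick check: Q.T @ Q should be ~identity
Q = cayley_rotation(torch.randn(3), n=3)          # 3*(3-1)/2 = 3 free values
assert torch.allclose(Q.T @ Q, torch.eye(3), atol=1e-4)
```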
Fern
Fern@hi_tysam·
@ID_AA_Carmack also some room for Householder-like reflections here due to the uniqueness of elements, a la a lot of the work done by songlin yang (also iiuc, could be wrong)
1 reply · 0 reposts · 0 likes · 375 views
Fern
Fern@hi_tysam·
@ID_AA_Carmack the reduction makes sense since the degrees of freedom drop by 1 for every additional value you add due to the orthogonality (iiuc). curious if this could be done in a computation-efficient way w/ newton-schulz
1 reply · 0 reposts · 4 likes · 714 views
Fern
Fern@hi_tysam·
@AlxSp_ @_arohan_ It's like DeltaNet, but instead of minimizing the residual between adjacent state values inside of a block, we use an entire network to do it. They do go with a linear readout projection instead of using distance-conditional stuff or an MLP; either is fine, i pref dist.
0 replies · 0 reposts · 2 likes · 137 views
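For reference, a sketch of the standard DeltaNet-style delta-rule state update being compared against (single head, single timestep; shapes and names are my choice, not from the thread):

```python
import torch

def delta_rule_step(S: torch.Tensor, k: torch.Tensor, v: torch.Tensor, beta: float):
    """Delta-rule state update: S is (d_v, d_k), k is (d_k,), v is (d_v,).
    Nudge the state so its readout for key k moves toward value v, i.e. take one
    gradient-like step on the residual ||S @ k - v||^2."""
    residual = v - S @ k                     # current prediction error for this key
    return S + beta * torch.outer(residual, k)
```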
Alex Speicher
Alex Speicher@AlxSp_·
@hi_tysam @_arohan_ How much do you think the hierarchical part of the architecture actually improves it? Seems like injecting the input embed into each recurrent block stabilizes the training a lot already
1 reply · 0 reposts · 1 like · 46 views
Fern
Fern@hi_tysam·
@YouJiacheng Not terribly, it's the introduction of a linear readout head that biases the latent space towards either ignoring it or collapsing; you need (and can definitely use) something much more flexible that doesn't trade off performance, IMO. Otherwise it's just wasting params/flops 👍
[image]
0 replies · 0 reposts · 8 likes · 268 views
You Jiacheng
You Jiacheng@YouJiacheng·
@hi_tysam they said the latents need many steps to converge?
1 reply · 0 reposts · 3 likes · 413 views
Fern
Fern@hi_tysam·
btw, one flaw of HRMs is the readout q_head will either cause representational collapse, or be ignored, or something in between. what you really should be doing instead is curve-fitting on the abs of the cosine distance of successive vectors to determine halting, or something similar
Guan Wang@makingAGI

🚀Introducing Hierarchical Reasoning Model🧠🤖 Inspired by the brain's hierarchical processing, HRM delivers unprecedented reasoning power on complex tasks like ARC-AGI and expert-level Sudoku using just 1k examples, no pretraining or CoT! Unlock the next AI breakthrough with neuroscience. 🌟 📄Paper: arxiv.org/abs/2506.21734 💻Code: github.com/sapientinc/HRM

2 replies · 2 reposts · 100 likes · 8.5K views
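A minimal version of the halting rule proposed above (the tolerance and window are my assumptions, and a fuller implementation would fit a curve to the whole distance sequence rather than thresholding its tail):

```python
import torch
import torch.nn.functional as F

def should_halt(latents: list[torch.Tensor], tol: float = 1e-3, window: int = 3) -> bool:
    """Track |1 - cos(z_t, z_{t+1})| between successive latent states and stop
    once the last few distances have flattened out, instead of learning a q_head."""
    if len(latents) < window + 1:
        return False
    dists = [
        (1.0 - F.cosine_similarity(a.flatten(), b.flatten(), dim=0)).abs().item()
        for a, b in zip(latents[-(window + 1):-1], latents[-window:])
    ]
    return max(dists) < tol
```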
Fern
Fern@hi_tysam·
@_arohan_ definitely having something that is latent-space structure invariant is the way to go IMO, by a long shot
1 reply · 0 reposts · 2 likes · 145 views
Fern
Fern@hi_tysam·
@_arohan_ yes, definitely! curious coincidence, just posted (15 minutes before your post) that the halting criterion for HRMs is ill-suited, and that you need to treat them like ODEs to get the most out of them x.com/hi_tysam/statu…
Fern@hi_tysam

btw, one flaw of HRMs is the readout q_head will either cause representational collapse, or be ignored, or something in between. what you really should be doing instead is curve-fitting on the abs of the cosine distance of successive vectors to determine halting, or something similar

1 reply · 0 reposts · 2 likes · 389 views
Fern
Fern@hi_tysam·
this is because you can formulate the inner one as an ODE-like process. you can also use Kalman filters. see you in 1-2 years!
0 replies · 0 reposts · 13 likes · 660 views
Fern
Fern@hi_tysam·
(And this is work sponsored by @natfriedman and @danielgross through the AI research grant a number of years ago; many thanks to them for their support and for helping this work get out there!)
0 replies · 0 reposts · 4 likes · 359 views
Fern
Fern@hi_tysam·
Something that this work handles too is state expansion -- though there are different modules at each timestep, doing this seems to work quite well. The time-isotropic version used in HRMs allows for weight reuse, though there are so many ways you can slice or dice it.
1 reply · 0 reposts · 2 likes · 357 views
Fern
Fern@hi_tysam·
If you're wondering how it's possible for HRMs to learn without full gradient backprop... I did this in 2023! The gradient never passes beyond each block, yet it learns CIFAR10 very well (93.11%). I had a feeling it was going to be really impactful, great to see it in HRMs!
Fern@hi_tysam

You don't need to backprop through discrete samples to learn an effective network. Introducing an architecture that achieves an impressive 93.11% on CIFAR10 just by predicting its own future state. This intends to be one key step in replacing RL w/ cross-entropy objectives. 🧵

1 reply · 1 repost · 12 likes · 683 views
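A toy illustration of only the "gradient never passes beyond each block" part: inputs are detached at every block boundary and each block gets a purely local loss. The block shapes, sizes, and the classification loss here are placeholders, not the 2023 architecture, which instead trains each block to predict its own future state.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BlockwiseLocalNet(nn.Module):
    """Stack of blocks where backprop stops at every block boundary and each block
    is trained only by its own local head."""
    def __init__(self, dim: int, num_blocks: int, num_classes: int):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.GELU()) for _ in range(num_blocks)
        )
        self.heads = nn.ModuleList(
            nn.Linear(dim, num_classes) for _ in range(num_blocks)
        )

    def forward(self, x: torch.Tensor, target: torch.Tensor):
        total_loss = x.new_zeros(())
        logits = None
        for block, head in zip(self.blocks, self.heads):
            x = block(x.detach())                              # gradient stops at the boundary
            logits = head(x)
            total_loss = total_loss + F.cross_entropy(logits, target)  # purely local objective
        return logits, total_loss
```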