Fern

2.2K posts

@hi_tysam

Neural network speedrunner and community-funded open source researcher. Set the CIFAR-10 record several times. Say hi!

Joined January 2023
221 Following · 3K Followers
Pinned Tweet
Fern
Fern@hi_tysam·
New NanoGPT training speed record: 3.28 FineWeb val loss in 3.17 minutes on 8xH100
Previous record (recreation): 3.32 minutes
Lots of changes!
- New token-dependent lm_head bias
- Fused several ops
- Multi-GPU grad bugfix
- Steps 1390 -> 1350
- Semi-ortho init
- (More in thread)
[image]
12 replies · 26 reposts · 262 likes · 38.3K views
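The pinned record lists "semi-ortho init" among the changes; below is a minimal sketch of what a semi-orthogonal init can look like in PyTorch. The record's actual implementation may differ, and the check tolerance is arbitrary.

```python
import torch

@torch.no_grad()
def semi_orthogonal_init_(weight: torch.Tensor, gain: float = 1.0) -> torch.Tensor:
    """Sketch of a semi-orthogonal init (assumed reading of the tweet): fill a 2D
    weight with a semi-orthogonal matrix, so the smaller of W @ W.T / W.T @ W is
    (gain^2 times) the identity."""
    torch.nn.init.orthogonal_(weight, gain=gain)   # handles non-square weights already
    return weight

# quick check on a non-square weight: rows come out orthonormal
W = semi_orthogonal_init_(torch.empty(64, 256))
assert torch.allclose(W @ W.T, torch.eye(64), atol=1e-4)
```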
kalomaze
kalomaze@kalomaze·
well technically yes but what is usually logged is the "raw norm", and the clipping is applied globally before either of those things, no?
Lucas Beyer (bl16)@giffmana

@kalomaze No you are mixing things! What's usually clipped is the gradient norm. What the formula above computes (approximates) is the norm of the parameter update (before lr and wd). These are very different things.

2 replies · 0 reposts · 30 likes · 11.2K views
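The formula being discussed lives in an image that isn't reproduced here, so the following is only an illustrative sketch of the distinction Lucas is drawing, using plain Adam (function name and hyperparameters are mine, not from the thread): the gradient norm is what clipping usually acts on, while the update norm is the size of m_hat / (sqrt(v_hat) + eps) before lr and weight decay are applied.

```python
import torch

def grad_vs_update_norm(grad, exp_avg, exp_avg_sq, step,
                        beta1=0.9, beta2=0.95, eps=1e-8):
    """Contrast the gradient norm (what clipping acts on) with the norm of the
    Adam update direction (the step before lr and weight decay)."""
    grad_norm = grad.norm()

    # standard Adam moment updates, for reference
    exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
    exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
    m_hat = exp_avg / (1 - beta1 ** step)
    v_hat = exp_avg_sq / (1 - beta2 ** step)
    update_norm = (m_hat / (v_hat.sqrt() + eps)).norm()

    return grad_norm.item(), update_norm.item()
```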
Fern
Fern@hi_tysam·
@evaninwords @main_horse @kalomaze @stochasticchasm @clashluke Yeah, defo for the input. Then the norms of each vector should be ~0.7-1.3ish after it's done, IIRC. I think there was some attempt to get a better estimator for this to reduce the number of steps, but this seemed to be the computationally cheapest, IIRC
0 replies · 0 reposts · 1 like · 109 views
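For context on the ~0.7-1.3 range: a sketch of the quintic Newton-Schulz orthogonalization used in Muon-style optimizers. The coefficients are copied from the public Muon/modded-nanogpt code and should be treated as an assumption here; the iteration only approximately pushes singular values to 1, which is where the spread comes from.

```python
import torch

@torch.no_grad()
def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5, eps: float = 1e-7):
    """Approximately orthogonalize a 2D matrix via a quintic Newton-Schulz iteration,
    driving its singular values toward 1 (only roughly, hence ~0.7-1.3)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + eps)            # scale so the spectral norm is <= 1
    transposed = G.size(0) > G.size(1)
    if transposed:
        X = X.T                         # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X
```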
Fern
Fern@hi_tysam·
@stochasticchasm @main_horse @kalomaze Yeah. The idea is pretty unique. Could see it being especially helpful for e.g. RL when you do really want constrained updates to your network
0 replies · 0 reposts · 1 like · 69 views
stochasm
stochasm@stochasticchasm·
@hi_tysam @main_horse @kalomaze I imagine they picked less stable hyperparams to show that this method is stable in cases where it would otherwise be unstable, but yeah not sure
1 reply · 0 reposts · 0 likes · 82 views
Fern
Fern@hi_tysam·
@evaninwords @main_horse @kalomaze @stochasticchasm @clashluke Yes, same for the layerwise norm-scaling opts that were spicy, oh gosh that's 8 years ago now noooooo. The unit-norm bit does range from ~0.7-1.3ish, but I think the coefficients used for NS mean the vast majority are 1 and below (a lot in the upper 0.7s, IIRC)
1 reply · 0 reposts · 2 likes · 98 views
Evan Walters
Evan Walters@evaninwords·
@hi_tysam @main_horse @kalomaze @stochasticchasm Yeah I like that ZClip idea @clashluke once shared with me. Interestingly, if you use muon, momentum is normalized to unit norm layerwise anyway so pre-optimizer clipping won’t really matter (it’ll just change what goes into momentum a little bit).
1 reply · 0 reposts · 5 likes · 101 views
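To make the point above concrete, a simplified sketch (not the full Muon optimizer; the exact SVD here stands in for the Newton-Schulz iteration sketched earlier): because the update is the orthogonalized momentum, its layerwise norm is essentially fixed, so clipping the incoming gradient only changes what feeds the momentum buffer.

```python
import torch

def muon_style_step(momentum: torch.Tensor, grad: torch.Tensor, beta: float = 0.95):
    """One simplified Muon-like step: accumulate the (possibly clipped) gradient into
    momentum, then return the orthogonalized momentum as the update."""
    momentum.mul_(beta).add_(grad)             # grad, clipped or not, only feeds the buffer
    U, _, Vh = torch.linalg.svd(momentum, full_matrices=False)
    update = U @ Vh                            # all singular values set to 1
    return update                              # Frobenius norm = sqrt(min(m, n)), independent of grad scale
```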
Fern
Fern@hi_tysam·
@stochasticchasm @main_horse @kalomaze Seems interesting! The loss curves of the method and the baseline look pretty sus, but the method is unique, so I'm curious what the actual value is
[image]
2 replies · 0 reposts · 3 likes · 1.2K views
Fern
Fern@hi_tysam·
@main_horse @kalomaze @stochasticchasm i'm not a fan of grad clipping, i think it's a monstrous bandaid that doesn't change underlying problems (only hides them), but: imo, if it happens, it should be @ per-example grads pre-aggregation, optionally percentile-based (e.g. 0.03% outliers under a Gaussian assumption, etc.)
2 replies · 1 repost · 9 likes · 1.1K views
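A rough sketch of the suggestion above (the function name, the flattened-gradient layout, and the percentile handling are my assumptions): clip each example's gradient to a percentile-based norm threshold before the gradients are averaged into the batch gradient.

```python
import torch

def clip_per_example_grads(per_example_grads: torch.Tensor, pct: float = 99.7):
    """per_example_grads: (batch, num_params) flattened per-example gradients.
    Shrink only the outlier rows (above the pct-th percentile of norms), then average."""
    norms = per_example_grads.norm(dim=1)                    # (batch,)
    threshold = torch.quantile(norms, pct / 100.0)
    scale = (threshold / (norms + 1e-6)).clamp(max=1.0)      # only shrink outliers
    clipped = per_example_grads * scale.unsqueeze(1)
    return clipped.mean(dim=0)                               # aggregated batch gradient
```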
main
main@main_horse·
@kalomaze @stochasticchasm if you want to leak an ongoing prime pretraining run, feel free. otherwise we're just talking priors at this point
1 reply · 0 reposts · 6 likes · 1.2K views
Fern
Fern@hi_tysam·
@ID_AA_Carmack @actualhog Agreed, details would be great! As a speedrunner, v curious about the opts/diffs here. Is there a good high-level summary of what you did / anything clever / any suspicions you might have about it? Curious to chat more.
0 replies · 0 reposts · 2 likes · 2.3K views
John Carmack
John Carmack@ID_AA_Carmack·
@actualhog Details, please! Reaching human-level performance on all the Atari games in an hour would indeed be SOTA.
14 replies · 5 reposts · 685 likes · 54.2K views
actual hog
actual hog@actualhog·
one nn learns every atari game at once in realtime from scratch in one hour on a 4090. 56 seconds of gpu time per game. no pause or reset or memory peeking, just 60fps color images. i was an rl skeptic two weeks ago. now i don't know. just trained this today.
36 replies · 68 reposts · 1.2K likes · 167.9K views
Fern
Fern@hi_tysam·
@cloneofsimo (that said, the extreme discontinuity there for the minimum value is surprising to me, that's really neat!)
0 replies · 0 reposts · 0 likes · 82 views
Fern
Fern@hi_tysam·
@cloneofsimo yes, this is not at all surprising! i've tweeted about this before, it's not the beta1 or beta2s that matter as much as their ratios (this is a fun problem to work out why, but i can give the answer if you want!)
1 reply · 0 reposts · 0 likes · 124 views
John Carmack
John Carmack@ID_AA_Carmack·
I recently learned about Cayley transforms. Similar to how you can parameterize a 3x3 rotation matrix by 3 Euler angles or a 4-element quaternion, Cayley transforms allow you to parameterize an N-dimensional rotation matrix with just N*(N-1)/2 unique values in a skew-symmetric matrix, saving more than half the parameters and guaranteeing that the matrix will always be orthogonal. Unfortunately, the transformation involves a linalg.solve() or pinverse(), so it gets slow with thousands of dimensions. Still, I am happy to have this in my mental toolbox now! en.wikipedia.org/wiki/Cayley_tr…
80 replies · 87 reposts · 1.9K likes · 183.8K views
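A minimal sketch of the parameterization described above: pack n*(n-1)/2 free values into the upper triangle of a skew-symmetric matrix A, then map it to an orthogonal Q via the Cayley transform, paying the linear solve mentioned in the tweet. The check tolerance is arbitrary.

```python
import torch

def cayley_rotation(theta: torch.Tensor, n: int) -> torch.Tensor:
    """Map n*(n-1)/2 free parameters to an n x n orthogonal matrix
    Q = (I + A)^{-1} (I - A), where A is skew-symmetric (A.T == -A)."""
    A = torch.zeros(n, n, dtype=theta.dtype)
    iu = torch.triu_indices(n, n, offset=1)
    A[iu[0], iu[1]] = theta
    A = A - A.T                                   # skew-symmetric
    I = torch.eye(n, dtype=theta.dtype)
    return torch.linalg.solve(I + A, I - A)       # the O(n^3) solve mentioned above

# quick check: Q.T @ Q should be ~identity
Q = cayley_rotation(torch.randn(3), n=3)          # 3*(3-1)/2 = 3 free values
assert torch.allclose(Q.T @ Q, torch.eye(3), atol=1e-4)
```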
Fern
Fern@hi_tysam·
@ID_AA_Carmack also some room for Householder-like reflections here due to the uniqueness of elements, a la a lot of the work done by songlin yang (also iiuc, could be wrong)
1 reply · 0 reposts · 0 likes · 375 views
Fern
Fern@hi_tysam·
@ID_AA_Carmack the reduction makes sense since the degrees of freedom drop by 1 for every additional value you add due to the orthogonality (iiuc). curious if this could be done in a computation-efficient way w/ newton-schulz
1 reply · 0 reposts · 4 likes · 714 views
Fern
Fern@hi_tysam·
@AlxSp_ @_arohan_ It's like DeltaNet, but instead of minimizing the residual between adjacent state values inside of a block, we use an entire network to do it. They do go with a linear readout projection instead of using distance-conditional stuff or an MLP; either is fine, i pref dist.
0 replies · 0 reposts · 2 likes · 137 views
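For reference, a sketch of the standard DeltaNet-style delta-rule state update being compared against (single head, single timestep; shapes and names are my choice, not from the thread):

```python
import torch

def delta_rule_step(S: torch.Tensor, k: torch.Tensor, v: torch.Tensor, beta: float):
    """Delta-rule state update: S is (d_v, d_k), k is (d_k,), v is (d_v,).
    Nudge the state so its readout for key k moves toward value v, i.e. take one
    gradient-like step on the residual ||S @ k - v||^2."""
    residual = v - S @ k                     # current prediction error for this key
    return S + beta * torch.outer(residual, k)
```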
Alex Speicher
Alex Speicher@AlxSp_·
@hi_tysam @_arohan_ How much do you think the hierarchical part of the architecture actually improves it? Seems like injecting the input embed into each recurrent block stabilizes the training a lot already
1 reply · 0 reposts · 1 like · 46 views
Fern
Fern@hi_tysam·
@YouJiacheng Not terribly, it's the introduction of a linear readout head that biases the latent space towards either ignoring it or collapsing; you need (and can definitely use) something much more flexible that doesn't trade off performance, IMO. Otherwise it's just wasting params/flops 👍
[image]
0 replies · 0 reposts · 8 likes · 268 views
You Jiacheng
You Jiacheng@YouJiacheng·
@hi_tysam they said the latents need many steps to converge?
1 reply · 0 reposts · 3 likes · 413 views
Fern
Fern@hi_tysam·
btw, one flaw of HRMs is the readout q_head will either cause representational collapse, or be ignored, or something in between. what you really should be doing instead is curve-fitting on the abs of the cosine distance of successive vectors to determine halting, or something similar
Guan Wang@makingAGI

🚀Introducing Hierarchical Reasoning Model🧠🤖 Inspired by the brain's hierarchical processing, HRM delivers unprecedented reasoning power on complex tasks like ARC-AGI and expert-level Sudoku using just 1k examples, no pretraining or CoT! Unlock the next AI breakthrough with neuroscience. 🌟 📄Paper: arxiv.org/abs/2506.21734 💻Code: github.com/sapientinc/HRM

2 replies · 2 reposts · 100 likes · 8.5K views
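A minimal version of the halting rule proposed above (the tolerance and window are my assumptions, and a fuller implementation would fit a curve to the whole distance sequence rather than thresholding its tail):

```python
import torch
import torch.nn.functional as F

def should_halt(latents: list[torch.Tensor], tol: float = 1e-3, window: int = 3) -> bool:
    """Track |1 - cos(z_t, z_{t+1})| between successive latent states and stop
    once the last few distances have flattened out, instead of learning a q_head."""
    if len(latents) < window + 1:
        return False
    dists = [
        (1.0 - F.cosine_similarity(a.flatten(), b.flatten(), dim=0)).abs().item()
        for a, b in zip(latents[-(window + 1):-1], latents[-window:])
    ]
    return max(dists) < tol
```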
Fern
Fern@hi_tysam·
@_arohan_ definitely having something that is latent-space structure invariant is the way to go IMO, by a long shot
1 reply · 0 reposts · 2 likes · 145 views
Fern
Fern@hi_tysam·
@_arohan_ yes, definitely! curious coincidence, just posted (15 minutes before your post) that the halting criterion for HRMs is ill-suited, and that you need to treat them like ODEs to get the most out of them x.com/hi_tysam/statu…
Fern@hi_tysam

btw, one flaw of HRMs is the readout q_head will either cause representational collapse, or be ignored, or something in between. what you really should be doing instead is curve-fitting on the abs of the cosine distance of successive vectors to determine halting, or something similar

1 reply · 0 reposts · 2 likes · 389 views
Fern
Fern@hi_tysam·
this is because you can formulate the inner one as an ODE-like process. you can also use Kalman filters. see you in 1-2 years!
0 replies · 0 reposts · 13 likes · 660 views
Fern
Fern@hi_tysam·
(And this is work sponsored by @natfriedman and @danielgross through the AI research grant a number of years ago; many thanks to them for their support and for helping this work get out there!)
0 replies · 0 reposts · 4 likes · 359 views
Fern
Fern@hi_tysam·
Something that this work handles too is state expansion -- though there are different modules at each timestep, doing this seems to work quite well. The time-isotropic version used in HRMs allows for weight reuse, though there are so many ways you can slice or dice it.
1 reply · 0 reposts · 2 likes · 357 views
Fern
Fern@hi_tysam·
If you're wondering how it's possible for HRMs to learn without full gradient backprop... I did this in 2023! The gradient never passes beyond each block, yet it learns CIFAR10 very well (93.11%). I had a feeling it was going to be really impactful, great to see it in HRMs!
Fern@hi_tysam

You don't need to backprop through discrete samples to learn an effective network. Introducing an architecture that achieves an impressive 93.11% on CIFAR10 just by predicting its own future state. This intends to be one key step in replacing RL w/ cross-entropy objectives. 🧵

1 reply · 1 repost · 12 likes · 683 views
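A toy illustration of only the "gradient never passes beyond each block" part: inputs are detached at every block boundary and each block gets a purely local loss. The block shapes, sizes, and the classification loss here are placeholders, not the 2023 architecture, which instead trains each block to predict its own future state.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BlockwiseLocalNet(nn.Module):
    """Stack of blocks where backprop stops at every block boundary and each block
    is trained only by its own local head."""
    def __init__(self, dim: int, num_blocks: int, num_classes: int):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.GELU()) for _ in range(num_blocks)
        )
        self.heads = nn.ModuleList(
            nn.Linear(dim, num_classes) for _ in range(num_blocks)
        )

    def forward(self, x: torch.Tensor, target: torch.Tensor):
        total_loss = x.new_zeros(())
        logits = None
        for block, head in zip(self.blocks, self.heads):
            x = block(x.detach())                              # gradient stops at the boundary
            logits = head(x)
            total_loss = total_loss + F.cross_entropy(logits, target)  # purely local objective
        return logits, total_loss
```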