Ethan
@torchcompiled
12.2K posts

trying to feel the magic. cofounder at @leonardoai directing research at @canva

SF - sydney - florida · Joined April 2022
868 Following · 9.4K Followers

Pinned Tweet
Ethan@torchcompiled·
personally I feel like the inflection point was early 2022. The sweet spot where CLIP-guided diffusion was just taking off, forcing unconditional models to be conditional through a strange patchwork of CLIP evaluating slices of the canvas at a time. It was like improv, always trying to riff off mistakes and sitting right at the fine line between interesting and incoherent.
EPROM@eprombeats

Image synthesis used to look so good. These are from 2021. I feel like this was an inflection point, and the space has metastasized into something abhorrent today (Grok, etc). Even with no legible representational forms, there was so much possibility in these images.

29 replies · 41 reposts · 709 likes · 236.8K views
Ethan@torchcompiled·
Have you ever gotten tired of boring plain linear layers and wanted a more complex function? We find that attaching low rank nonlinear residual functions can significantly accelerate pretraining, with an identified variant, CosNet, consistently observing 20+% wallclock speedup!
15 replies · 42 reposts · 244 likes · 29.4K views
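The tweet doesn't give the exact CosNet formulation, so the following is only a guessed sketch of "a low-rank nonlinear residual attached to a linear layer": y = W x + U cos(V x) with rank r ≪ d. All shapes, inits, and names here are assumptions, not the paper's actual design.

```python
import math
import random

random.seed(0)
d, r = 8, 2  # hidden dim and low rank, r << d (toy sizes)

# Dense weight plus a low-rank nonlinear residual: y = W x + U cos(V x).
W = [[random.gauss(0, d ** -0.5) for _ in range(d)] for _ in range(d)]
V = [[random.gauss(0, d ** -0.5) for _ in range(d)] for _ in range(r)]  # down-projection
U = [[0.0] * r for _ in range(d)]  # up-projection zero-initialized: residual is inert at init

def matvec(M, x):
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

def cosnet_layer(x):
    base = matvec(W, x)                           # plain linear path
    hidden = [math.cos(h) for h in matvec(V, x)]  # bounded nonlinearity in rank-r space
    resid = matvec(U, hidden)
    return [b + s for b, s in zip(base, resid)]

x = [random.gauss(0, 1) for _ in range(d)]
y = cosnet_layer(x)
# With U zero-initialized, the layer matches the plain linear map exactly.
assert y == matvec(W, x)
```

Zero-initializing the up-projection is a common trick for residual adapters (as in LoRA) so the network starts as a plain linear layer and the nonlinear path is learned; whether CosNet does this is an assumption here.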
Igor Kotenkov@stalkermustang·
Lol, my friend has just suggested OAI will be testing their Automated AI research interns on these challenges (announced as a series, not just 1). Great idea, if true
Igor Kotenkov@stalkermustang

OpenAI is about to launch "Parameter Golf" challenge, $1M in compute grants
> train the best language model that fits in a 16MB artifact and trains in under 10 minutes on 8xH100s, evaluated by compression on the FineWeb validation set (tokenizer-agnostic, bits per byte). github.com/openai/paramet…
inspired by @karpathy's NanoGPT speedrunning
The challenge runs from March 18th to April 30th. In June, they plan to hire a small cohort of early-career researchers, targeting current undergraduate students and recent graduates, including Olympiad medalists and elite competitors.

2 replies · 5 reposts · 121 likes · 17.5K views
lito@litocoen·
people messaging me if this is real
crazy how much of a blindspot australia is for many people
i have lived in a bunch of places but nothing beats australia in terms of quality of life
there’s high trust, people are friendly, lots of natural resources, developed economy, far from conflict, incredible weather
it’s hard not to be happy down here
82 replies · 24 reposts · 464 likes · 52.8K views
Ethan@torchcompiled·
if this is an issue for asymmetry in the LLM head, would we expect it to similarly apply to the up matrices of FFN? Paper mentions softmax affecting the rank of the representation is a factor here, but curious if activation functions could play a similar role.
Nathan Godey@nthngdy

🧵New paper: "Lost in Backpropagation: The LM Head is a Gradient Bottleneck" The output layer of LLMs destroys 95-99% of your training signal during backpropagation, and this significantly slows down pretraining 👇

2 replies · 2 reposts · 35 likes · 4.9K views
Ethan@torchcompiled·
@code_star From my understanding it effectively is DenseNet to start, plus a dynamic selection method for choosing residuals at a given layer
0 replies · 0 reposts · 3 likes · 259 views
Ethan@torchcompiled·
@wickedbrok I guess why not fuse the alpha into the down layer?
1 reply · 0 reposts · 1 like · 30 views
Brok@wickedbrok·
quick fix, my phrasing got cut off. The initialization regime is what dictates the optimal LR for LoRA. To your point, I don't think that's recommended; you might want to use both alpha and lr, as alpha directly scales the weights and lr the updates. So a high lr can be costly, and again you include alpha to find the sweet spot.
1 reply · 0 reposts · 2 likes · 63 views
Ethan@torchcompiled·
LoRA needs a higher learning rate, but why? A general rule of thumb is that we use smaller learning rates for larger models and vice versa, and LoRA may follow that pattern. But what governs this? At least in part, it may be related to the initialization and the range of values we use for a given matrix to ensure it preserves the variance of hidden states in expectation. Smaller matrices are initialized with a larger standard deviation (or bounds, if uniform init). Specifically, Xavier init considers the square root of the average of the input and output dimensions. LoRAs, where R<
3 replies · 6 reposts · 96 likes · 6.9K views
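The init-scale argument in the tweet can be made concrete with the Glorot/Xavier formula, std = sqrt(2 / (fan_in + fan_out)), which depends on the average of the two dimensions. The dims below are arbitrary illustrative choices:

```python
import math

d, r = 4096, 16  # example model width and LoRA rank, r << d

def xavier_std(fan_in, fan_out):
    # Glorot normal init: std = sqrt(2 / (fan_in + fan_out))
    return math.sqrt(2.0 / (fan_in + fan_out))

full_std = xavier_std(d, d)       # a full d x d weight matrix
lora_down_std = xavier_std(d, r)  # a LoRA down-projection, d -> r

# Because r << d drags the average dimension down, the adapter's
# init std comes out noticeably larger than the full matrix's.
assert lora_down_std > full_std
```

This shows the mechanism the tweet gestures at: the smaller adapter matrix is initialized with larger values, consistent with adapters living in a different effective-scale regime than full weights.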
Ethan@torchcompiled·
@wickedbrok Ah I generally ditch the alpha for just higher lr
1 reply · 0 reposts · 2 likes · 40 views
Brok@wickedbrok·
the reason the down weight is random (non-zero) and the up weight is zero is to allow LoRA to act as FullFT early on without corrupting the residual weight space; and if B is non-zero, learning is disabled (I show that by sampling B weights from a Gaussian distribution). So early on, gradients for A are zero where the "steering" is happening to learn the downstream task. So, to push A out of its slow phase, we need to scale B, and to scale B, we have to scale the LR (or whatever scale factor affects it, like alpha). So generally, a higher LR is needed; alpha also compensates for that, thus the initialization regime, which really includes whatever scales the adapters' weights (can be lr, alpha, init method, rank). Wanted to explain why this might indirectly reveal a singularity in the scaling laws governing everything but my response is already too wordy :)
1 reply · 0 reposts · 2 likes · 48 views
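The zero-init argument can be checked in a one-dimensional toy (all scalars here are made up for illustration): with y = b·(a·x), the gradient for the down weight a is proportional to b, so it vanishes at init and only grows as b moves — which is why scaling b's effective step size (lr or alpha) pushes a out of its slow phase.

```python
# Toy 1-D "LoRA": y = b * (a * x), with a the down weight (random init)
# and b the up weight (zero init). All values are illustrative.
a, b = 0.7, 0.0
x, g = 1.5, 2.0  # input and upstream gradient dL/dy

grad_a = b * x * g  # dL/da: proportional to b, so exactly zero at init
grad_b = a * x * g  # dL/db: proportional to a*x, nonzero at init

assert grad_a == 0.0
assert grad_b != 0.0
```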
Ethan@torchcompiled·
Ah nice! The closest I can think of is LoRA+, which I use all the time now, but I don’t think it dictates choosing lr based on size so much as that the up weight's lr should be higher than the down weight's. That makes sense, as the common init for LoRA is Kaiming for the down weight and zeros for the up weight, and given the massive difference between rank and dim I think that's a better choice than averaging dimensions. The other theory I have, that’s very vibes-based, is that it may be more stable/fluid in the face of gradient noise to let one weight handle most of the adjustment while the other moves slowly
1 reply · 0 reposts · 2 likes · 218 views
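A minimal sketch of the LoRA+ recipe mentioned above — separate learning rates for the down and up weights — using the same scalar toy; the 16x ratio and all values are arbitrary illustrative choices, not a prescription from the paper.

```python
# Toy SGD with two learning rates: the zero-initialized up weight b
# gets a higher lr than the randomly initialized down weight a.
lr_down, lr_up = 1e-4, 16e-4   # 16x ratio chosen only for illustration

a, b = 0.7, 0.0          # down (random init), up (zero init)
x, g = 1.5, 2.0          # input, upstream gradient dL/dy for y = b * a * x

for _ in range(3):
    grad_a = b * x * g   # zero until b moves away from zero
    grad_b = a * x * g
    a -= lr_down * grad_a
    b -= lr_up * grad_b

# b carries most of the early adjustment while a barely moves,
# matching the "let one weight handle most of the adjustment" intuition.
assert abs(b) > abs(a - 0.7)
```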
Brok@wickedbrok·
That was literally my research for a month. I could converge to some interesting findings on minimal setups but lack compute to generalize. I also think this work reveals the singularity of scaling laws. Would be amazing to partner with someone to dig deeper into this. github.com/Brokttv/Lora-W…
1 reply · 0 reposts · 5 likes · 323 views
Ethan@torchcompiled·
@rudzinskimaciej I am pretty interested in this direction. In practice I’m not sure how feasible it is and if you’d be paying more for memory access than flops saved
1 reply · 0 reposts · 1 like · 8 views
Rudzinski Maciej@rudzinskimaciej·
I'm asking not due to the assumed shape of the response and overflows, as you are 100% right, but because newer GPUs are going in this direction. In theory, well-prepared LUTs could approximate these operations well, if we normalize properly, which is easier with a LUT since it can output categories rather than real values. At this level of precision it makes a bit more sense; the LUT can in a sense auto-normalize for free. And as you might need more layers to fully use the gain of int4 (obvious), it should give even bigger gains with your method, as we want residuals in a lower-precision regime. Nevertheless, superb paper. I'm all for nonlinearities of this kind and will find a nice method (post) that would gain from it in a moment, as what you did gets even more interesting in the case of looping layers
2 replies · 0 reposts · 0 likes · 23 views
Ethan@torchcompiled·
@rudzinskimaciej I think the precision here is fairly important given the typical range of outputs, IIUC high range dtypes can be important to avoid overflow during dot product reductions in matmuls. Cosine is fully bounded -1/+1 so if we could, I’d put my budget into mantissa
1 reply · 0 reposts · 0 likes · 17 views
Rudzinski Maciej@rudzinskimaciej·
@torchcompiled Shouldn't we create them with int4 in mind? The possibility space, if you take constraints into account, is not that large and is enumerable - then it could be a LUT
1 reply · 0 reposts · 0 views · 23 views
Ethan@torchcompiled·
@paws4puzzles It’s a handwavey summarization, but effectively the trig functions are curved everywhere. One thing that definitely is not great is if the derivative is too large, we generally want some lipschitz constraint. Even doing cos(2x) as an initialization did hurt in my experience.
0 replies · 0 reposts · 1 like · 34 views
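The Lipschitz concern above is easy to quantify: cos(kx) has derivative -k·sin(kx), so initializing with cos(2x) doubles the worst-case slope. A quick numeric check, with the grid and k chosen arbitrarily:

```python
import math

k = 2.0  # frequency multiplier, as in cos(2x)
xs = [i / 1000 for i in range(-5000, 5000)]

# d/dx cos(k x) = -k sin(k x); its magnitude is bounded by k,
# so the Lipschitz constant grows linearly with the frequency.
max_slope = max(abs(-k * math.sin(k * x)) for x in xs)

assert abs(max_slope - k) < 1e-3
```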
Puzzle Paws@paws4puzzles·
@torchcompiled More nonlinear = better? Intriguing hypothesis. Would bet there's a ceiling somewhere before overhead eats the gains.
1 reply · 0 reposts · 1 like · 32 views
Ethan@torchcompiled·
@francoisfleuret I don’t think this is a fair comparison. That would be valid if the corpus/database was solely the works of Shakespeare, but realistically our corpus is effectively all language ever produced and searchable via semantic queries that relate to the content
0 replies · 0 reposts · 2 likes · 214 views
Ethan@torchcompiled·
@DdelAlamo I want to say there was a Unet approach made entirely from transformer blocks, where instead of additive skip connections you’d perform cross attention to the residuals
0 replies · 0 reposts · 1 like · 295 views
Diego del Alamo@DdelAlamo·
So I can't say I've ever seen residual cross-attention before (where the final representations attend to earlier representations of the input data); is there any literature on when and where to use this?
Rishabh Anand@rishabh16_

🚨 New preprint!!! Introducing Zatom-1, a multi-modal generative foundation model for 3D small molecules and materials that operates fully in ambient space. Its embeddings are also useful for downstream molecular predictive tasks (properties, MLIPs, etc). 1/n

4 replies · 5 reposts · 66 likes · 11.5K views