Ethan

12.3K posts

@torchcompiled

trying to feel the magic. research at @canva | cofounder at @leonardoai

sydney - florida - SF · Joined April 2022
869 Following · 9.6K Followers
Pinned Tweet
Ethan
Ethan@torchcompiled·
Personally I feel like the inflection point was early 2022. The sweet spot where CLIP-guided diffusion was just taking off, forcing unconditional models to be conditional through a strange patchwork of CLIP evaluating slices of the canvas at a time. It was like improv, always trying to riff off mistakes and sitting right on the fine line between interesting and incoherent.
[4 images]
EPROM@eprombeats

Image synthesis used to look so good. These are from 2021. I feel like this was an inflection point, and the space has metastasized into something abhorrent today (Grok, etc). Even with no legible representational forms, there was so much possibility in these images.

29 · 41 · 707 · 239.4K
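A minimal, hedged sketch of the "patchwork" CLIP guidance described in the post above: score random crops of the current denoised estimate against a text prompt and nudge an otherwise unconditional sampler with the similarity gradient. It assumes OpenAI's CLIP package (imported as clip); the prompt, crop settings, and the surrounding diffusion loop are placeholders, and CLIP's usual input normalization is skipped for brevity.

```python
import torch
import torch.nn.functional as F
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)
with torch.no_grad():
    text_feat = clip_model.encode_text(clip.tokenize(["a misty forest at dawn"]).to(device))
text_feat = F.normalize(text_feat.float(), dim=-1)

def clip_guidance_grad(x, num_cuts=16, cut_size=224):
    """x: (1, 3, H, W) current denoised estimate in [0, 1], e.g. 512x512."""
    x = x.detach().requires_grad_(True)
    h, w = x.shape[-2:]
    crops = []
    for _ in range(num_cuts):  # the "patchwork": CLIP only ever sees slices of the canvas
        size = int(torch.randint(cut_size // 2, min(h, w), ()).item())
        top = int(torch.randint(0, h - size + 1, ()).item())
        left = int(torch.randint(0, w - size + 1, ()).item())
        crop = x[..., top:top + size, left:left + size]
        crops.append(F.interpolate(crop, (cut_size, cut_size), mode="bilinear"))
    batch = torch.cat(crops, dim=0).type(clip_model.dtype)
    img_feat = F.normalize(clip_model.encode_image(batch).float(), dim=-1)
    sim = (img_feat @ text_feat.T).mean()   # average prompt similarity over the crops
    (grad,) = torch.autograd.grad(sim, x)
    return grad   # add `scale * grad` to the estimate inside the unconditional sampling loop
```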
Ethan
Ethan@torchcompiled·
@NicholasBardy The idea of directly optimizing FID is interesting, but going a step further and showing how it compares across other embedding choices, which still have only subtle effects on quality yet can reduce FID, demonstrates pretty solidly the fallibility of the metric.
0 · 0 · 3 · 87
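As a small illustration of that point (not from the exchange above): FID is just the Fréchet distance between Gaussian fits of embedded features, so the number only has meaning relative to whichever embedding model produced those features (Inception vs CLIP vs DINOv2, etc.). The feature arrays below are assumed (N, D) outputs of some extractor, with N comfortably larger than D.

```python
import numpy as np
from scipy import linalg

def fid_from_features(real_feats: np.ndarray, fake_feats: np.ndarray) -> float:
    mu_r, mu_f = real_feats.mean(0), fake_feats.mean(0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_f = np.cov(fake_feats, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_r @ cov_f, disp=False)
    covmean = covmean.real          # sqrtm can return tiny imaginary parts
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))

# toy check: same formula, but swap the feature extractor and the "FID" changes meaning
rng = np.random.default_rng(0)
feats_a = rng.normal(size=(5000, 64))
feats_b = rng.normal(loc=0.1, size=(5000, 64))
print(fid_from_features(feats_a, feats_b))
```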
Ethan
Ethan@torchcompiled·
A few thoughts for looped transformers, both some ideas and reuse of the closely related universal transformer literature.
- Probably the biggest one, and something I contend with a bit: we observe (Remarkable Robustness of LLMs: Stages of Inference) that the last few transformer layers have a very specific role in preparing hidden states for readout at the LM head, acting more as a filter and somewhat reducing the richness of representations previously built up. It could be worth setting aside N transformer layers with entirely unique weights on the output side, separate from the loop.
- This may also be relevant to having a couple of unique input layers outside the loop as well.
- Norm scale/bias params can easily be unique per layer per loop; this is cheap possible additional expression.
- On loops past the first, weights can be made slightly unique/adapted by using LoRAs; this has been done with UT.
- MoEUT effectively added MoE to universal transformers; doing the same for looped transformers may allow more computational pathways across each loop.
- The x0 residual addition might be worth some scrutiny, for two reasons. Firstly, it's a static addition: depending on how helpful its contribution is and how much we'd like to weigh it, the residual stream may have to upsize/downsize its norm rather than having control over x0's magnitude. It could be worth either some form of learnable scaling or projection of x0. Secondly, x0 should effectively represent the preceding token. It feels like it'd be more helpful to have a somewhat processed feature, like taking a representation from the middle of the network on the first loop or similar?
- Universal transformer originally suggested timestep-style conditioning to tell the model which loop it's on, which may nicely adapt computation. Could imagine an embedding table per loop/layer condition or a more complex function.
- Universal transformer originally proposed early halting, which I think is what Elastic Looped Transformers is going after.
- Experimenting with some weights not being shared across loops, or partially shared, i.e. the same MLP up-projection for intervals of 2 loops. Some operations might get away with parameter reuse more so than others.
1 · 2 · 30 · 1.7K
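A rough PyTorch sketch of a few of the ideas in the post above; the module and parameter names are mine rather than from any of the cited papers, and all shapes/hyperparameters are placeholders. It shows unique "prelude"/"coda" blocks outside the shared loop, per-loop LayerNorm parameters, a loop-index embedding as timestep-style conditioning, and a learnable scale on the x0 residual.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, d, heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
        self.n1, self.n2 = nn.LayerNorm(d), nn.LayerNorm(d)

    def forward(self, x):
        h = self.n1(x)
        a, _ = self.attn(h, h, h, need_weights=False)
        x = x + a
        return x + self.mlp(self.n2(x))

class LoopedTransformer(nn.Module):
    def __init__(self, d=512, heads=8, n_in=2, n_loop=4, n_out=2, loops=4):
        super().__init__()
        self.prelude = nn.ModuleList(Block(d, heads) for _ in range(n_in))       # unique input layers
        self.loop_blocks = nn.ModuleList(Block(d, heads) for _ in range(n_loop)) # weights shared across loops
        self.coda = nn.ModuleList(Block(d, heads) for _ in range(n_out))         # unique output/readout layers
        self.loops = loops
        self.loop_emb = nn.Embedding(loops, d)            # "which loop am I on" conditioning
        self.per_loop_norm = nn.ModuleList(nn.LayerNorm(d) for _ in range(loops))  # cheap per-loop params
        self.x0_scale = nn.Parameter(torch.ones(1))       # learnable weight on the x0 residual

    def forward(self, x):                                 # x: (B, T, d) embedded tokens
        for blk in self.prelude:
            x = blk(x)
        x0 = x                                            # a somewhat processed input feature
        for i in range(self.loops):
            x = x + self.loop_emb(torch.tensor(i, device=x.device))
            for blk in self.loop_blocks:
                x = blk(x)
            x = self.per_loop_norm[i](x) + self.x0_scale * x0
        for blk in self.coda:
            x = blk(x)
        return x

h = LoopedTransformer()(torch.randn(2, 16, 512))  # -> (2, 16, 512)
```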
Ethan
Ethan@torchcompiled·
@_ueaj @_arohan_ @HessianFree @YouJiacheng I tried mixing AdEMAMix with Muon a while back, granted at the paper's suggested hparams for Adam (b3=0.9999, a=8.0); tried blending both before or after NS but didn't find much of a benefit.
1 · 0 · 1 · 96
ueaj
ueaj@_ueaj·
I think multiscale muon might be a better approximation of this paper. Instead of taking a separate inner step, you have multiple momentum buffers running at different speeds. This way the upper momentum buffers get the part of the gradient that is common to both steps, which is then extrapolated out to produce the update. It's not exactly the same because the outer momentum loop is run every iteration, just with an appropriately lower lr and higher momentum. A kind of continuous version of nexus?
[image]
Rosinality@rosinality

The concept of closeness, meaning the distance between the local optima of the training and task distributions. If we optimize this, it would be possible to achieve better loss on OOD tasks while having the same pretraining loss. Maybe a bit close to meta-learning? As the closeness relies on the task distribution, it depends on the mixing of data sources during pretraining.

1 · 8 · 48 · 4.5K
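A hedged sketch of the multiscale idea as described above (my reading, not ueaj's code): several momentum buffers run every step at different timescales, their weighted sum is orthogonalized the way Muon would. Exact SVD-based orthogonalization stands in for Newton-Schulz just to keep the sketch short; the buffer count, betas, and weights are placeholders.

```python
import torch

def orthogonalize(m: torch.Tensor) -> torch.Tensor:
    u, _, vh = torch.linalg.svd(m, full_matrices=False)
    return u @ vh

class MultiscaleMuon:
    def __init__(self, param, lr=0.02, scales=((0.95, 1.0), (0.995, 0.3))):
        self.param, self.lr = param, lr
        self.scales = scales                                # (momentum, relative weight) per buffer
        self.bufs = [torch.zeros_like(param) for _ in scales]

    @torch.no_grad()
    def step(self, grad: torch.Tensor):
        update = torch.zeros_like(grad)
        for buf, (beta, weight) in zip(self.bufs, self.scales):
            buf.mul_(beta).add_(grad, alpha=1 - beta)       # every buffer runs every iteration,
            update += weight * buf                          # the slow one just moves more smoothly
        self.param -= self.lr * orthogonalize(update)

# toy usage on a random 2d weight
w = torch.randn(64, 64)
opt = MultiscaleMuon(w)
opt.step(torch.randn_like(w))
```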
Ethan
Ethan@torchcompiled·
From my understanding it's a difference of masking the input vs masking the loss, which is materially different given that masked input positions are removed as information for the model to attend to.
0 · 0 · 5 · 521
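A tiny PyTorch illustration of the distinction: masking the loss leaves the tokens visible to attention and merely stops gradients at those positions, while masking the input removes them as information the model can attend to at all. Shapes and values are toy placeholders.

```python
import torch
import torch.nn.functional as F

logits = torch.randn(1, 6, 100)            # (batch, seq, vocab) from some model
targets = torch.randint(0, 100, (1, 6))

# 1) Loss masking: the tokens are still in the input/attention, we just don't train on them.
loss_targets = targets.clone()
loss_targets[:, :3] = -100                 # ignore_index: no gradient from the first 3 positions
loss = F.cross_entropy(logits.flatten(0, 1), loss_targets.flatten(), ignore_index=-100)

# 2) Input masking: hide the first 3 positions from attention itself, so the model
#    cannot condition on them at all (information removed, not just untrained).
q = k = v = torch.randn(1, 1, 6, 16)       # (batch, heads, seq, head_dim)
visible = torch.ones(6, 6, dtype=torch.bool)
visible[:, :3] = False                     # for SDPA's boolean mask, True means "may attend"
out = F.scaled_dot_product_attention(q, k, v, attn_mask=visible)
```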
Ethan
Ethan@torchcompiled·
Ambient diffusion was a really cool paper in this space. Given corrupted data, corrupt it even further, such that the model can't infer what was genuine corruption vs what was synthetically added. In expectation you recover the distribution. And there are similar claimed benefits on avoiding memorization.
[2 images]
Massimiliano Viola@massiviola01

Training a diffusion model has always been synonymous with one idea: add some noise to an image, then learn to remove it. Since we know where we started and where we ended up, the natural thing to do is to ask the model to recover the signal everywhere. But is it really needed?

3 · 9 · 109 · 12.5K
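A rough paraphrase of the further-corruption trick for masking-style corruption (my sketch, not the Ambient Diffusion authors' code, and written as a single restoration step rather than the full diffusion parameterization): the model only ever sees the extra-corrupted input, but is supervised on every pixel that was actually observed, so it cannot tell genuine corruption from the synthetically added kind and is forced to inpaint. The model signature and masking scheme here are assumptions.

```python
import torch
import torch.nn.functional as F

def ambient_masking_loss(model, x_corrupt, mask, extra_drop=0.2):
    """x_corrupt: (B, C, H, W) images with missing pixels already zeroed.
    mask: 1 where a pixel was actually observed in the data, 0 where it was missing."""
    extra = (torch.rand_like(mask) > extra_drop).float() * mask   # hide a further random subset
    pred = model(x_corrupt * extra, extra)       # the model can't tell which zeros are "genuine"
    # supervise on every originally observed pixel, including the ones we just hid,
    # so copying the input isn't enough and the model learns to inpaint
    return F.mse_loss(pred * mask, x_corrupt * mask)

# toy usage with a stand-in "model" that just returns its input
x = torch.rand(2, 3, 32, 32)
mask = (torch.rand(2, 1, 32, 32) > 0.3).float().expand_as(x)
loss = ambient_masking_loss(lambda inp, m: inp, x * mask, mask)
```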
Yuxin Fang
Yuxin Fang@CV_novel_plume·
I’ve run a lot of experiments on Muon and its variants, and I’d bet that in this setting, the Muon baseline will be very hard to beat.
Keller Jordan@kellerjordan0

Modded-NanoGPT Optimization Benchmark

Hundreds of neural network optimizers have been proposed in the literature, recently including dozens citing Muon: MARS, SWAN, REG, ADANA, Newton-Muon, TrasMuon, AdaMuon, HTMuon, COSMOS, Conda, ASGO, SAGE, and Magma, to name a few. The majority of this innovation is happening in the public research community. But the community currently lacks a widely accepted, easily accessible way to compare and make sense of the deluge of methods. As a result, promising new ideas get buried, and spurious results go unchallenged.

To help address these issues, I'm releasing a new optimization benchmark. It's designed for maximum simplicity and speed: just a single file containing ~350 lines of plain PyTorch, which can complete a baseline LM training within 20 minutes of booting up a fresh 8xH100 machine. It also works with {1,2,4}xH100 or A100. These attributes make the new benchmark more accessible than any prior work.

The rules are simple: the optimization algorithm can be changed arbitrarily, with the goal being to minimize the number of training steps needed to reach 3.28 val loss on FineWeb (this is the same target loss as in the main speedrun). Modifying the architecture or dataloader, on the other hand, is not allowed. Wallclock time is unlimited, in order to give a fair chance to optimizers which would need kernel work or larger scale to become wallclock-efficient. Like the main NanoGPT speedrun, submissions are open, and new results will be publicly broadcast.

Beyond just improving the step count record, another goal of the benchmark is to collaboratively produce well-tuned baselines for as many optimizers as possible. For example, any improvement to the benchmark's best hyperparameters for AdamW would be considered a worthwhile new result.

This benchmark is not intended to be the final measure of optimizer quality across all domains. Convenient shared experimental infrastructure which covers the full space of possibilities -- across varying batch size, tokens per parameter, model scale, epoch count, and architecture -- is desirable, but far beyond the current status quo. This benchmark is only meant to be one step towards that goal.

To start the benchmark off, I've spent ~20 runs tuning baselines for Muon and AdamW. From time to time over the next few weeks, I'll add another optimizer from the literature, with my best effort at finding good hyperparameters. Researchers interested in neural network optimization are invited to join in by picking an optimizer and giving it a try on the benchmark. All optimizers are welcome, and even runs that don't necessarily have the best hyperparameters are desirable additions to the repo, because each new run adds to the collective knowledge.

9 · 3 · 62 · 17.8K
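This is not the actual benchmark file; just a toy harness illustrating the rule set described above: model, data, and target loss stay fixed, only the optimizer (and its hyperparameters) may change, and the score is the number of steps needed to reach the target. All names and numbers below are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def steps_to_target(make_optimizer, target_loss=0.1, max_steps=2000, seed=0):
    torch.manual_seed(seed)                               # identical model + data for every entry
    model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 1))
    x = torch.randn(1024, 32)
    y = torch.sin(x[:, :1]) + 0.1 * x[:, 1:2]             # a fixed, learnable target
    opt = make_optimizer(model.parameters())              # the only thing a submission may change
    for step in range(1, max_steps + 1):
        loss = F.mse_loss(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
        if loss.item() <= target_loss:
            return step
    return max_steps

print("adamw:", steps_to_target(lambda p: torch.optim.AdamW(p, lr=3e-3)))
print("sgd  :", steps_to_target(lambda p: torch.optim.SGD(p, lr=3e-2, momentum=0.9)))
```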
Ethan
Ethan@torchcompiled·
@tianylin Definitely could, it does increase communication load, but it should follow the same comms path that momentum does?
1 · 0 · 1 · 203
Ethan
Ethan@torchcompiled·
@YouJiacheng Not exactly clear how the problems stated can't be reduced to faster hot memory and/or larger storage, which seem to be very general chip improvements.
1 · 0 · 5 · 381
Ethan
Ethan@torchcompiled·
What if we could do Muon with just one Newton step while also achieving better loss?
[image]
6 · 24 · 223 · 23.5K
Ethan
Ethan@torchcompiled·
In the end we see not only lower loss but a solid amount of time spared on the optimizer step.
[2 images]
0 · 0 · 13 · 720
Ethan
Ethan@torchcompiled·
This skips the math specifics but covers most of the important details. But I recommend checking out the algorithm! It's a pretty neat property of symmetric matrices.
[2 images]
1 · 0 · 8 · 980
Ethan
Ethan@torchcompiled·
@CevherLIONS @_arohan_ It’s a bit ironic, because most discussed optimizers do some sparse approximation of the second moment except for Muon
1 · 0 · 3 · 431
Volkan Cevher
Volkan Cevher@CevherLIONS·
Nice thread. I would avoid calling these “second-order.” They are better understood as first-order methods with non-Euclidean geometry (i.e., different norms / preconditioners). Much of the recent progress is computational (Newton–Schulz, distributed / low-rank), not conceptual but extremely impactful. A unified perspective here: arxiv.org/abs/2511.11163
rohan anil@_arohan_

All the hullabaloo around Muon made me look at @jxbz's beautiful paper's appendix arxiv.org/pdf/2409.20325

It has proper intellectual credit for how we ended up at Muon, which is the Newton-Schulz iteration as a fast impl of Shampoo in the b2=0.0 case if you didn't have access to a fast eigh.

Funnily, a first version of SOAP is in the appendix of my 2020 paper. N=2, it's clear that research which goes into the appendix of papers is usually good for writing another paper.

There was further improvement by the community around finding better ways to speed up the Newton-Schulz iteration. There was a nerdsnipe for some msft researchers to make a distributed version via low-rank approx. Only after discussion on X did these researchers find more commonality between Shampoo and Xi Lin's work on PSGD (he has been solo authoring SOTA methods until @HessianFree joined him), which most of the community was finding hard to understand (as most people in labs and the community aren't trained to be good at difficult linear algebra).

I think the OG is @CevherLIONS spectral SGD work. It also could be that Schmidhuber and Yann figured this out 20 years ago and we didn't read their papers.

6 · 13 · 182 · 22K
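For reference, the kind of Newton-Schulz iteration being discussed: a few matmul-only steps that approximately orthogonalize the momentum/gradient matrix, i.e. the "fast impl without a fast eigh" point above. The quintic coefficients below follow the public Muon/modded-nanogpt implementation; treat this as a sketch rather than the canonical code.

```python
import torch

def newton_schulz_orthogonalize(G, steps=5, eps=1e-7):
    """Approximate msign(G) (the U @ V^T factor of the SVD) using only matmuls."""
    a, b, c = 3.4445, -4.7750, 2.0315          # quintic coefficients from the public Muon code
    X = G.float() / (G.norm() + eps)           # normalize so the spectrum lies in the basin
    transposed = G.size(0) > G.size(1)
    if transposed:
        X = X.T                                # work in the short-and-wide orientation
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * (A @ A)
        X = a * X + B @ X
    return X.T if transposed else X

W = newton_schulz_orthogonalize(torch.randn(128, 256))
print((W @ W.T - torch.eye(128)).norm())       # small-ish: the result is only approximately orthogonal
```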
Ethan
Ethan@torchcompiled·
Feels a lot like unresolved/late-resolved latent noise that manifests as more high-frequency details. I don't entirely remember what I was tinkering with at the time, but on a few occasions I've run into this effect. I wanna say it was something along the lines of attempting a bilinear latent upscale in the middle of the diffusion process, injecting noise to "heal" the raw interpolation of latents, and following up with high CFG.
0 · 0 · 1 · 65
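A guess at the kind of procedure described above, not the actual experiment: bilinearly upscale the latent partway through sampling, re-noise it to an intermediate level to "heal" the raw interpolation, then let the sampler finish with high CFG. The shapes, the noise level, and the schedule are placeholders.

```python
import torch
import torch.nn.functional as F

def upscale_and_renoise(latent, scale=2.0, sigma=0.5):
    """latent: (B, C, H, W) partially denoised latent."""
    up = F.interpolate(latent, scale_factor=scale, mode="bilinear", align_corners=False)
    return up + sigma * torch.randn_like(up)   # re-injected noise masks interpolation artifacts

z = torch.randn(1, 4, 64, 64)                  # e.g. an SD-style latent midway through sampling
z = upscale_and_renoise(z)                     # -> (1, 4, 128, 128); resume sampling from here
```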
Ollin Boer Bohan
Ollin Boer Bohan@madebyollin·
@max_pe2002 @gabeeegoooh @SwayStar123 Hmm, have all of your artifacty images been from multi-image chats? It looks like there's a strong bias towards texture/feature copying across images within a chat thread: x.com/madebyollin/st…
Ollin Boer Bohan@madebyollin

@JiaweiYang118 Some of the ChatGPT Images 2 weirdness is texture leakage from context images (see this thread reddit.com/r/ChatGPT/comm…) You can visualize this by asking for three unrelated images in the same chat (note how the flower texture persists in the circled area).

2 · 0 · 1 · 143
Max
Max@max_pe2002·
What do we think about these gpt image 2 artifacts? Do you guys think the decoder is trained as a GAN/LDM/PixelDM?
[3 images]
3 · 0 · 5 · 418