Ethan
@torchcompiled
12.2K posts

trying to feel the magic. cofounder at @leonardoai directing research at @canva

SF - sydney - florida · Joined April 2022
868 Following · 9.4K Followers

Pinned Tweet
Ethan@torchcompiled·
personally I feel like the inflection point was early 2022. The sweet spot where CLIP-guided diffusion was just taking off, forcing unconditional models to be conditional through a strange patchwork of CLIP evaluating slices of the canvas at a time. It was like improv, always trying to riff off mistakes and sitting right at the fine line between interesting and incoherent.
EPROM@eprombeats

Image synthesis used to look so good. These are from 2021. I feel like this was an inflection point, and the space has metastasized into something abhorrent today (Grok, etc). Even with no legible representational forms, there was so much possibility in these images.

29 replies · 41 reposts · 709 likes · 236.8K views
Ethan@torchcompiled·
Have you ever gotten tired of boring plain linear layers and wanted a more complex function? We find that attaching low rank nonlinear residual functions can significantly accelerate pretraining, with an identified variant, CosNet, consistently observing 20+% wallclock speedup!
15 replies · 42 reposts · 244 likes · 29.4K views
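The tweet doesn't give the exact CosNet formulation, so the following is only a guessed sketch of "a low-rank nonlinear residual attached to a linear layer": y = W x + U cos(V x) with rank r ≪ d. All shapes, inits, and names here are assumptions, not the paper's actual design.

```python
import math
import random

random.seed(0)
d, r = 8, 2  # hidden dim and low rank, r << d (toy sizes)

# Dense weight plus a low-rank nonlinear residual: y = W x + U cos(V x).
W = [[random.gauss(0, d ** -0.5) for _ in range(d)] for _ in range(d)]
V = [[random.gauss(0, d ** -0.5) for _ in range(d)] for _ in range(r)]  # down-projection
U = [[0.0] * r for _ in range(d)]  # up-projection zero-initialized: residual is inert at init

def matvec(M, x):
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

def cosnet_layer(x):
    base = matvec(W, x)                           # plain linear path
    hidden = [math.cos(h) for h in matvec(V, x)]  # bounded nonlinearity in rank-r space
    resid = matvec(U, hidden)
    return [b + s for b, s in zip(base, resid)]

x = [random.gauss(0, 1) for _ in range(d)]
y = cosnet_layer(x)
# With U zero-initialized, the layer matches the plain linear map exactly.
assert y == matvec(W, x)
```

Zero-initializing the up-projection is a common trick for residual adapters (as in LoRA) so the network starts as a plain linear layer and the nonlinear path is learned; whether CosNet does this is an assumption here.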
Igor Kotenkov@stalkermustang·
Lol, my friend has just suggested OAI will be testing their Automated AI research interns on these challenges (announced as a series, not just 1). Great idea, if true
Igor Kotenkov@stalkermustang

OpenAI is about to launch "Parameter Golf" challenge, $1M in compute grants
> train the best language model that fits in a 16MB artifact and trains in under 10 minutes on 8xH100s, evaluated by compression on the FineWeb validation set (tokenizer-agnostic, bits per byte). github.com/openai/paramet…
inspired by @karpathy's NanoGPT speedrunning
The challenge runs from March 18th to April 30th. In June, they plan to hire a small cohort of early-career researchers, targeting current undergraduate students and recent graduates, including Olympiad medalists and elite competitors.

2 replies · 5 reposts · 121 likes · 17.5K views
lito@litocoen·
people messaging me if this is real
crazy how much of a blindspot australia is for many people
i have lived in a bunch of places but nothing beats australia in terms of quality of life
there’s high trust, people are friendly, lots of natural resources, developed economy, far from conflict, incredible weather
it’s hard not to be happy down here
82 replies · 24 reposts · 464 likes · 52.8K views
Ethan@torchcompiled·
if this is an issue for asymmetry in the LLM head, would we expect it to similarly apply to the up matrices of FFN? Paper mentions softmax affecting the rank of the representation is a factor here, but curious if activation functions could play a similar role.
Nathan Godey@nthngdy

🧵New paper: "Lost in Backpropagation: The LM Head is a Gradient Bottleneck" The output layer of LLMs destroys 95-99% of your training signal during backpropagation, and this significantly slows down pretraining 👇

2 replies · 2 reposts · 35 likes · 4.9K views
Ethan@torchcompiled·
@code_star From my understanding it effectively is DenseNet to start, plus a dynamic selection method for choosing residuals at a given layer
0 replies · 0 reposts · 3 likes · 259 views
Ethan@torchcompiled·
@wickedbrok I guess why not fuse the alpha into the down layer?
1 reply · 0 reposts · 1 like · 30 views
Brok@wickedbrok·
quick fix, my phrasing got cut off. The initialization regime is what dictates the optimal LR for LoRA. To your point, I don't think that's recommended; you might want to use both alpha and lr, as alpha directly scales the weights and lr the updates. So a high lr can be costly, and again you include alpha to find the sweet spot.
1 reply · 0 reposts · 2 likes · 63 views
Ethan@torchcompiled·
LoRA needs a higher learning rate, but why? A general rule of thumb is that we use smaller learning rates for larger models and vice versa, and LoRA may follow that pattern. But what governs this? At least in part, it may be related to the initialization and the range of values we use for a given matrix to ensure it preserves the variance of hidden states in expectation. Smaller matrices are initialized with a larger standard deviation (or bounds, if uniform init). Specifically, Xavier init considers the square root of the average of the input and output dimensions. LoRAs, where R<
3 replies · 6 reposts · 96 likes · 6.9K views
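The init-scale argument in the tweet can be made concrete with the Glorot/Xavier formula, std = sqrt(2 / (fan_in + fan_out)), which depends on the average of the two dimensions. The dims below are arbitrary illustrative choices:

```python
import math

d, r = 4096, 16  # example model width and LoRA rank, r << d

def xavier_std(fan_in, fan_out):
    # Glorot normal init: std = sqrt(2 / (fan_in + fan_out))
    return math.sqrt(2.0 / (fan_in + fan_out))

full_std = xavier_std(d, d)       # a full d x d weight matrix
lora_down_std = xavier_std(d, r)  # a LoRA down-projection, d -> r

# Because r << d drags the average dimension down, the adapter's
# init std comes out noticeably larger than the full matrix's.
assert lora_down_std > full_std
```

This shows the mechanism the tweet gestures at: the smaller adapter matrix is initialized with larger values, consistent with adapters living in a different effective-scale regime than full weights.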
Ethan@torchcompiled·
@wickedbrok Ah I generally ditch the alpha for just higher lr
1 reply · 0 reposts · 2 likes · 40 views
Brok@wickedbrok·
the reason the down weight is random (non-zero) and the up weight is zero is to allow LoRA to act as FullFT early on without corrupting the residual weight space; and if B is non-zero, learning is disabled (I show that by sampling B weights from a Gaussian distribution). So early on, gradients for A are zero where the "steering" is happening to learn the downstream task. So, to push A out of its slow phase, we need to scale B, and to scale B, we have to scale the LR (or whatever scale factor affects it, like alpha). So generally, a higher LR is needed; alpha also compensates for that, thus the initialization regime, which really includes whatever scales the adapters' weights (can be lr, alpha, init method, rank). Wanted to explain why this might indirectly reveal a singularity in the scaling laws governing everything but my response is already too wordy :)
1 reply · 0 reposts · 2 likes · 48 views
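The zero-init argument can be checked in a one-dimensional toy (all scalars here are made up for illustration): with y = b·(a·x), the gradient for the down weight a is proportional to b, so it vanishes at init and only grows as b moves — which is why scaling b's effective step size (lr or alpha) pushes a out of its slow phase.

```python
# Toy 1-D "LoRA": y = b * (a * x), with a the down weight (random init)
# and b the up weight (zero init). All values are illustrative.
a, b = 0.7, 0.0
x, g = 1.5, 2.0  # input and upstream gradient dL/dy

grad_a = b * x * g  # dL/da: proportional to b, so exactly zero at init
grad_b = a * x * g  # dL/db: proportional to a*x, nonzero at init

assert grad_a == 0.0
assert grad_b != 0.0
```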
Ethan@torchcompiled·
Ah nice! The closest I can think of is LoRA+, which I use all the time now, but I don’t think it dictates choosing lr based on size so much as that the up weight's lr should be higher than the down weight's. That makes sense, as the common init for LoRA is Kaiming for the down weight and zeros for the up weight, and given the massive difference between rank and dim I think that's a better choice than averaging dimensions. The other theory I have, that’s very vibes-based, is that it may be more stable/fluid in the face of gradient noise to let one weight handle most of the adjustment while the other moves slowly
1 reply · 0 reposts · 2 likes · 218 views
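A minimal sketch of the LoRA+ recipe mentioned above — separate learning rates for the down and up weights — using the same scalar toy; the 16x ratio and all values are arbitrary illustrative choices, not a prescription from the paper.

```python
# Toy SGD with two learning rates: the zero-initialized up weight b
# gets a higher lr than the randomly initialized down weight a.
lr_down, lr_up = 1e-4, 16e-4   # 16x ratio chosen only for illustration

a, b = 0.7, 0.0          # down (random init), up (zero init)
x, g = 1.5, 2.0          # input, upstream gradient dL/dy for y = b * a * x

for _ in range(3):
    grad_a = b * x * g   # zero until b moves away from zero
    grad_b = a * x * g
    a -= lr_down * grad_a
    b -= lr_up * grad_b

# b carries most of the early adjustment while a barely moves,
# matching the "let one weight handle most of the adjustment" intuition.
assert abs(b) > abs(a - 0.7)
```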
Brok@wickedbrok·
That was literally my research for a month. I could converge to some interesting findings on minimal setups but lack compute to generalize. I also think this work reveals the singularity of scaling laws. Would be amazing to partner with someone to dig deeper into this. github.com/Brokttv/Lora-W…
1 reply · 0 reposts · 5 likes · 323 views
Ethan@torchcompiled·
@rudzinskimaciej I am pretty interested in this direction. In practice I’m not sure how feasible it is and if you’d be paying more for memory access than flops saved
1 reply · 0 reposts · 1 like · 8 views
Rudzinski Maciej@rudzinskimaciej·
I'm asking not due to the assumed shape of the response and overflows, as you are 100% right, but because newer GPUs are going in this direction. In theory, well-prepared LUTs could approximate these operations well, if we normalize properly, which is easier with a LUT since it can output categories rather than real values. At this level of precision it makes a bit more sense; the LUT can in a sense auto-normalize for free. And as you might need more layers to fully use the gain of int4 (obvious), it should give even bigger gains with your method, as we want residuals in a lower-precision regime. Nevertheless, superb paper. I'm all for nonlinearities of this kind and will find a nice method (post) that would gain from it in a moment, as what you did gets even more interesting in the case of looping layers
2 replies · 0 reposts · 0 likes · 23 views
Ethan@torchcompiled·
@rudzinskimaciej I think the precision here is fairly important given the typical range of outputs, IIUC high range dtypes can be important to avoid overflow during dot product reductions in matmuls. Cosine is fully bounded -1/+1 so if we could, I’d put my budget into mantissa
1 reply · 0 reposts · 0 likes · 17 views
Rudzinski Maciej@rudzinskimaciej·
@torchcompiled Shouldn't we create them with int4 in mind? The possibility space, if you take constraints into account, is not that large and is enumerable - then it could be a LUT
1 reply · 0 reposts · 0 views · 23 views
Ethan@torchcompiled·
@paws4puzzles It’s a handwavey summarization, but effectively the trig functions are curved everywhere. One thing that definitely is not great is if the derivative is too large, we generally want some lipschitz constraint. Even doing cos(2x) as an initialization did hurt in my experience.
0 replies · 0 reposts · 1 like · 34 views
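The Lipschitz concern above is easy to quantify: cos(kx) has derivative -k·sin(kx), so initializing with cos(2x) doubles the worst-case slope. A quick numeric check, with the grid and k chosen arbitrarily:

```python
import math

k = 2.0  # frequency multiplier, as in cos(2x)
xs = [i / 1000 for i in range(-5000, 5000)]

# d/dx cos(k x) = -k sin(k x); its magnitude is bounded by k,
# so the Lipschitz constant grows linearly with the frequency.
max_slope = max(abs(-k * math.sin(k * x)) for x in xs)

assert abs(max_slope - k) < 1e-3
```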
Puzzle Paws@paws4puzzles·
@torchcompiled More nonlinear = better? Intriguing hypothesis. Would bet there's a ceiling somewhere before overhead eats the gains.
1 reply · 0 reposts · 1 like · 32 views
Ethan@torchcompiled·
@francoisfleuret I don’t think this is a fair comparison. That would be valid if the corpus/database was solely the works of Shakespeare, but realistically our corpus is effectively all language ever produced and searchable via semantic queries that relate to the content
0 replies · 0 reposts · 2 likes · 214 views
Ethan@torchcompiled·
@DdelAlamo I want to say there was a Unet approach made entirely from transformer blocks, where instead of additive skip connections you’d perform cross attention to the residuals
0 replies · 0 reposts · 1 like · 295 views
Diego del Alamo@DdelAlamo·
So I can't say I've ever seen residual cross-attention before (where the final representations attend to earlier representations of the input data); is there any literature on when and where to use this?
Rishabh Anand@rishabh16_

🚨 New preprint!!! Introducing Zatom-1, a multi-modal generative foundation model for 3D small molecules and materials that operates fully in ambient space. Its embeddings are also useful for downstream molecular predictive tasks (properties, MLIPs, etc). 1/n

4 replies · 5 reposts · 66 likes · 11.5K views