Georgios Vlassis (@gvlassis98) - Twitter Profile

Georgios Vlassis retweeted

Amir Joudaki@AmirJoudaki·Apr 25

Neural nets don’t just forget. Sometimes, after long training, they lose the ability to learn at all. In our #ICLR2026 poster, we model Loss of Plasticity as gradient dynamics trapped in invariant manifolds: 🔴 frozen units, 🔵 cloned units. The video makes the traps visible.

Georgios Vlassis retweeted

Dan Alistarh@DAlistarh·Mar 25

Speedrunning GPT-2 is now routine thanks to @karpathy. But can we speedrun GPT3-175B? We attempted to match accuracy on a <$10K budget; while we didn't quite reach it, our first results show that quality data, engineering, and native FP4 can get close. Details in 🧵

Georgios Vlassis retweeted

Saleh Ashkboos@AshkboosSaleh·Jan 26

Our study on #optimizers and #quantization has been accepted at #ICLR26 @iclr_conf

Saleh Ashkboos@AshkboosSaleh

Happy to share our new study on the interaction between #optimizers and #quantization! We show how optimizer choice affects quantized model quality and why outlier-based metrics (like Kurtosis and MMR) often fail to predict performance. Paper: arxiv.org/pdf/2509.23500 [1/5]

Saleh Ashkboos@AshkboosSaleh·Sep 30

Happy to share our new study on the interaction between #optimizers and #quantization! We show how optimizer choice affects quantized model quality and why outlier-based metrics (like Kurtosis and MMR) often fail to predict performance. Paper: arxiv.org/pdf/2509.23500 [1/5]

Georgios Vlassis retweeted

Accepted papers at TMLR@TmlrPub·Sep 28

A thorough reproduction and evaluation of $\mu$P. Georgios Vlassis, David Belius, Volodymyr Fomichov. Action editor: Anastasios Kyrillidis. openreview.net/forum?id=AFxEd… #hyperparameters #parameters #yang2021tuning

Georgios Vlassis@gvlassis98·Oct 4

@JamesWhate89993 @_arohan_ However, when you use Shampoo, you never actually use (1) or (2). And, in practice, the behavior that you get is very different from Muon's, both in terms of loss and in terms of error propagation (e.g., see my figure above).

Georgios Vlassis@gvlassis98·Oct 4

@JamesWhate89993 @_arohan_ E.g., there are two conditions under which Shampoo is exactly the same as Muon: 1) if you assume the one-sided version with β2=0, or 2) if you use the two-sided version with an exponent of 1/4 instead of 1/2.
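
A minimal numerical sketch of the two conditions above. It rests on assumptions the thread does not spell out: both preconditioners are built from a single gradient G with no accumulation (so L = G Gᵀ and R = Gᵀ G), and "Muon" is idealized as the exact orthogonal factor U Vᵀ of G rather than its Newton-Schulz approximation. The helper `inv_root` and the test matrix are hypothetical.

```python
import numpy as np

def inv_root(mat, p):
    """Return mat^(-1/p) for a symmetric positive-definite matrix (hypothetical helper)."""
    eigvals, eigvecs = np.linalg.eigh(mat)
    # Scale each eigenvector column by eigval^(-1/p), then rotate back.
    return (eigvecs * eigvals ** (-1.0 / p)) @ eigvecs.T

rng = np.random.default_rng(0)
# Square, well-conditioned stand-in for a gradient (assumption: full rank).
G = rng.standard_normal((4, 4)) + 4.0 * np.eye(4)

# Idealized Muon direction: the orthogonal factor U V^T of G = U S V^T.
U, _, Vt = np.linalg.svd(G)
ortho = U @ Vt

# Single-step (no accumulation) Shampoo statistics.
L = G @ G.T
R = G.T @ G

one_sided = inv_root(L, 2) @ G                   # L^(-1/2) G
two_sided = inv_root(L, 4) @ G @ inv_root(R, 4)  # L^(-1/4) G R^(-1/4)

print(np.allclose(one_sided, ortho, atol=1e-8))
print(np.allclose(two_sided, ortho, atol=1e-8))
```

Once the statistics are accumulated over many steps, L and R no longer share singular vectors with the current gradient, so the cancellation in this sketch no longer goes through.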

James MMatrix@JamesWhate89993·Oct 2

Interesting result: AdamW has strong performance in terms of quantized model quality, outperforming SOAP/Scion/Muon etc. It would be interesting to verify whether this observation is correct and whether it holds at larger scale, as I thought Adam-trained models were harder to quantize.

Saleh Ashkboos@AshkboosSaleh

Happy to share our new study on the interaction between #optimizers and #quantization! We show how optimizer choice affects quantized model quality and why outlier-based metrics (like Kurtosis and MMR) often fail to predict performance. Paper: arxiv.org/pdf/2509.23500 [1/5]

Georgios Vlassis@gvlassis98·Oct 3

@JamesWhate89993 Which makes sense if you realize that networks trained with different optimizers might propagate noise differently. A nice visualization of this is Figure 3 (X might compress this).

Georgios Vlassis@gvlassis98·Oct 3

@JamesWhate89993 Nevertheless, to me, the most interesting observation is that the max-to-median ratio (MMR) of the activations, which is used in a lot of quantization studies, is a bad predictor of quantization performance when you use different optimizers.
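
For readers unfamiliar with the metric, here is a minimal sketch of a max-to-median ratio. It assumes the ratio is taken over the absolute values of a flattened activation tensor; the paper's exact definition (e.g. per layer or per token) may differ, and the function name and sample values are hypothetical.

```python
import numpy as np

def max_to_median_ratio(acts):
    """Max-to-median ratio of |activations| (assumed definition, see lead-in)."""
    a = np.abs(np.asarray(acts, dtype=np.float64)).ravel()
    return a.max() / np.median(a)

# Toy activations with a single large outlier (made-up values).
acts = np.array([1.0, -2.0, 3.0, -4.0, 100.0])
mmr = max_to_median_ratio(acts)
print(mmr)
```

A single outlier dominates the numerator, which is why the metric is commonly read as a proxy for how hard a tensor is to quantize.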

Georgios Vlassis@gvlassis98·Oct 3

@evaninwords @HessianFree Nevertheless, we are meeting Omead later today to see which newer version of PSGD we should try. If you have any input/feedback/suggestions we would be more than happy to hear it too :).

Georgios Vlassis@gvlassis98·Oct 3

@evaninwords @HessianFree ii) Most of the quantization error actually comes from activation quantization, not weight quantization. In our paper, we find that the spectral norm of the weights (which we link to quantization error propagation) barely changes after INT4 weight quantization for all optimizers.
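
A rough sketch of how one might measure the effect described above on a toy matrix. It assumes symmetric, per-tensor, round-to-nearest INT4 quantization and a random Gaussian weight matrix; the paper's actual quantizer and weights are not given in this thread, so the function name and setup are hypothetical and the printed number says nothing about the paper's results.

```python
import numpy as np

def quantize_int4_symmetric(w):
    """Per-tensor symmetric round-to-nearest INT4 (assumed scheme, not the paper's)."""
    scale = np.abs(w).max() / 7.0            # map the largest magnitude to level 7
    q = np.clip(np.round(w / scale), -8, 7)  # signed 4-bit integer levels
    return q, scale

rng = np.random.default_rng(0)
# Toy weight matrix with variance 1/fan_in (stand-in for a trained layer).
W = rng.standard_normal((256, 256)) / np.sqrt(256)

q, scale = quantize_int4_symmetric(W)
W_q = q * scale  # dequantized weights

# Spectral norm = largest singular value (ord=2 for a 2-D array).
sn = np.linalg.norm(W, 2)
sn_q = np.linalg.norm(W_q, 2)
rel_change = abs(sn_q - sn) / sn
print(sn, sn_q, rel_change)
```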

Evan Walters@evaninwords·Oct 2

Very interesting paper out of ETH! The most interesting takeaway for me is that different optimizers result in distinct error propagation signatures throughout the model after quantization.

Saleh Ashkboos@AshkboosSaleh

Happy to share our new study on the interaction between #optimizers and #quantization! We show how optimizer choice affects quantized model quality and why outlier-based metrics (like Kurtosis and MMR) often fail to predict performance. Paper: arxiv.org/pdf/2509.23500 [1/5]

Georgios Vlassis@gvlassis98·Oct 3

@evaninwords IMO the logical next step would be to design an optimizer/architecture/method that explicitly takes that into account. If you go through the maths in section 3.2, you will see that the quantity of interest is the "gain".

Georgios Vlassis@gvlassis98·Oct 3

@evaninwords Hello Evan! Glad you like the idea! I completely agree that the most interesting finding is that the quantization error propagation profiles are different.

Georgios Vlassis@gvlassis98·Oct 3

@HessianFree @AshkboosSaleh Nevertheless, we are in talks with Omead about testing a newer version of PSGD :).

Georgios Vlassis@gvlassis98·Oct 3

@HessianFree @AshkboosSaleh From what I saw, it just reuses most of Evan Walters' code (github.com/evanatyourserv…), which is in turn based on Xi-Lin's original PSGD repo.
