Wu Lin

102 posts

@LinYorker

Postdoctoral fellow at @VectorInst. ML PhD at UBC. Mathematical and computational structures for ML. Geometric and algebraic methods.

Toronto, Canada · Joined August 2020
40 Following · 283 Followers

Pinned Tweet
Wu Lin @LinYorker
#ICML2024 Can We Remove the Square-Root in Adaptive Methods? arxiv.org/abs/2402.03496 Root-free (RF) methods are better on CNNs and competitive on Transformers compared to root-based methods (AdamW). Removing the root makes matrix methods faster: root-free Shampoo in BFloat16. /1
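For intuition only, here is a minimal sketch of the distinction the tweet refers to: a root-based diagonal adaptive step (RMSProp/Adam-style, dividing by the square root of a second-moment estimate) versus a root-free variant that divides by the second-moment estimate itself. This is not the paper's algorithm; the function name, hyperparameters, and the `root_free` flag are illustrative assumptions.

```python
import numpy as np

def adaptive_step(param, grad, v, lr=1e-3, beta2=0.999, eps=1e-8, root_free=False):
    """One diagonal adaptive step; root_free=True drops the square root.

    Illustrative sketch only, not the method from arxiv.org/abs/2402.03496.
    """
    v = beta2 * v + (1.0 - beta2) * grad**2          # second-moment accumulator
    if root_free:
        param = param - lr * grad / (v + eps)        # divide by v itself (no root)
    else:
        param = param - lr * grad / (np.sqrt(v) + eps)  # RMSProp/Adam-style root
    return param, v
```

In practice the two variants need different learning-rate scales; the sketch only shows where the square root enters the update.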
Wu Lin @LinYorker
This work builds on my ICML 2019 paper (with @MarkSchmidtUBC and @EmtiyazKhan), extending a variational Bayes-based geometric framework to modern NN optimization. It can be used to design methods for Bayesian inference, numerical optimization, and gradient-free optimization.
Wu Lin @LinYorker
Within an information-geometric framework, we reconnect Shampoo/SOAP with both classical quasi-Newton ideas and Gaussian whitening, and develop practical methods that naturally handle tensor-valued weights in language model pre-training. arxiv.org/abs/2509.03378 (opt-ml workshop)
Wu Lin reposted
Runa Eschenhagen @runame_
1/9 In practice, the Shampoo optimizer crucially relies on several heuristics. In our NeurIPS 2025 spotlight paper, we investigate the role of learning rate grafting and infrequent preconditioner updates in Shampoo by decomposing its preconditioner. arxiv.org/abs/2506.03595
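As a rough illustration of learning rate grafting, one of the heuristics mentioned above: the update direction comes from one optimizer (e.g. Shampoo's preconditioned gradient) while its per-step magnitude is borrowed from another (e.g. Adam). This is a minimal sketch under that reading, not the paper's implementation; both inputs are assumed to be update vectors already produced by the respective optimizers.

```python
import numpy as np

def grafted_update(shampoo_update, adam_update):
    """Learning-rate grafting sketch: Shampoo's direction, Adam's step size.

    Illustrative only; real implementations typically graft per parameter block.
    """
    direction = shampoo_update / (np.linalg.norm(shampoo_update) + 1e-12)
    return np.linalg.norm(adam_update) * direction
```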
Wu Lin reposted
Thomas Möllenhoff @tmoellenhoff
Are you LoRA fine-tuning LLMs and looking for easy ways to get improvements in accuracy? And also Bayesian uncertainty on top for free? Then check out our recent work, accepted at the @neurips24fitml workshop! arxiv.org/abs/2411.04421
Wu Lin @LinYorker
Natural gradient descent: (steepest) gradient descent under the norm induced by the Fisher matrix. yorkerlin.github.io/posts/2021/10/… Riemannian gradient descent (with geodesic retraction): gradient descent in Riemannian normal coordinates.
Quoting Frank Nielsen @FrnkNlsn:

At the maximum likelihood estimator, key property: observed Fisher information = Fisher information.
From the 2nd-order Taylor expansion of the likelihood:
- likelihood curvature = Fisher information
- radius of the osculating circle = variance of the MLE for large sample sizes
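A minimal numerical sketch of the definition above: a natural-gradient step preconditions the ordinary gradient by the inverse Fisher matrix. The quadratic loss, the stand-in Fisher matrix, and the step size below are invented purely for illustration.

```python
import numpy as np

# Toy quadratic loss L(theta) = 0.5 * theta^T A theta with a fixed "Fisher" F;
# both matrices are made up for this illustration.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
F = np.array([[4.0, 0.5], [0.5, 1.0]])    # stand-in for the Fisher information matrix
theta = np.array([1.0, -1.0])
lr = 0.1

grad = A @ theta                           # Euclidean gradient
natural_grad = np.linalg.solve(F, grad)    # F^{-1} grad: steepest descent in the Fisher norm
theta = theta - lr * natural_grad
```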
Wu Lin @LinYorker
Some hardcore theory people complain that "second-order" methods in DL do not have a superlinear convergence rate. At the same time, they are happy to consider SGD a first-order method with only a sublinear rate.
typedfemale @typedfemale
i've noticed many (including myself) say "second-order" in machine learning to refer to a set of optimizers (FOOF, Shampoo) that... don't actually use second-order information? idk it's weird
Wu Lin reposted
Elad Hazan @HazanPrinceton
My talk on spectral transformers, given at the Princeton workshop on learning in dynamical systems, is now online: youtube.com/watch?v=D_NwH5…
Wu Lin reposted
Frank Nielsen @FrnkNlsn
Lin's (1991) definition of the Jensen-Shannon divergence, JS(p,q) = (KL(p:(p+q)/2) + KL(q:(p+q)/2))/2, is a *variational divergence*: JS(p,q) = min_c (KL(p:c) + KL(q:c))/2, where the optimum is attained at c = (p+q)/2. It was defined as the information radius by Sibson (1969). 👉mdpi.com/1099-4300/23/4…
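A quick numerical check of the variational form quoted above: minimizing (KL(p:c) + KL(q:c))/2 over c recovers the mixture (p+q)/2, and the minimum value equals JS(p,q). The discrete distributions and the grid search below are arbitrary choices for illustration.

```python
import numpy as np

def kl(p, q):
    """Discrete KL divergence; assumes strictly positive entries."""
    return np.sum(p * np.log(p / q))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.1, 0.5, 0.4])
m = 0.5 * (p + q)

js = 0.5 * (kl(p, m) + kl(q, m))           # Lin (1991) definition

# Brute-force the variational form over a grid of distributions c on the simplex.
best = np.inf
for a in np.linspace(0.01, 0.98, 98):
    for b in np.linspace(0.01, 0.99 - a, 98):
        c = np.array([a, b, 1.0 - a - b])
        best = min(best, 0.5 * (kl(p, c) + kl(q, c)))

print(js, best)   # the two values agree up to grid resolution, minimized near c = m
```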
Wu Lin @LinYorker
We can use NGD to obtain many methods: natural evolution strategies (CMA-ES, NES), natural policy gradients, natural-gradient variational inference, exact Bayesian inference on conjugate models, Newton's method, root-free adaptive methods, and Riemannian GD on submanifolds.