Pierre Foret

84 posts

@Foret_p

QR at Citadel Securities. Ex Google AI resident.

New York, USA · Joined October 2015
233 Following · 407 Followers

Pinned Tweet
Pierre Foret@Foret_p·
Introducing SAM: An easy-to-use algorithm derived by connecting PAC-Bayesian bounds and the geometry of the loss landscape. Achieves SOTA on benchmark image tasks (0.3% error on CIFAR-10, 3.9% on CIFAR-100) and drastically improves label noise robustness. arxiv.org/abs/2010.01412
6 replies · 38 retweets · 152 likes
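The two-step update behind SAM can be sketched in a few lines of NumPy. This is an illustrative sketch of the idea, not the paper's reference implementation; `rho` and `lr` are arbitrary toy values:

```python
import numpy as np

def sam_step(w, grad_fn, rho=0.05, lr=0.1):
    """One SAM update: first ascend to an approximate worst-case point
    within a rho-ball around w, then descend using the gradient there."""
    g = grad_fn(w)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)  # adversarial perturbation
    return w - lr * grad_fn(w + eps)             # gradient at the perturbed weights

# toy quadratic loss L(w) = 0.5 * ||w||^2, so the gradient is w itself
w = np.array([1.0, -2.0])
w_next = sam_step(w, lambda x: x)
```

On this toy loss the update still shrinks the weights toward the (flat) minimum at the origin, just using the gradient evaluated slightly uphill.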
Pierre Foret retweeted
Maksym Andriushchenko@maksym_andr·
Excited to share our #ICML2022 paper "Towards Understanding Sharpness-Aware Minimization"! Why does m-sharpness matter in m-SAM? Can we explain the benefits of m-SAM on simple models? Which other interesting properties does m-SAM show? Paper: arxiv.org/abs/2206.06232 🧵1/n
4 replies · 32 retweets · 196 likes
Pierre Foret retweeted
Hossein Mobahi@TheGradient·
Are you a strong PhD student interested in doing cutting-edge research at @GoogleAI? I have an opening for a student researcher position to explore open problems and extensions of Sharpness-Aware Minimization (SAM) w/ @bneyshabur. Please refer to tinyurl.com/4nfarsvt.
4 replies · 22 retweets · 118 likes
Pierre Foret@Foret_p·
@_arohan_ @TheGradient Indeed, not syncing the perturbations is pretty critical to SAM's success (see the section about M-sharpness in the paper)
0 replies · 0 retweets · 1 like
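The "not syncing the perturbations" point can be sketched as follows: in m-SAM, each sub-batch (one per replica) computes its own epsilon from its local gradient instead of all replicas sharing one synced perturbation. A toy NumPy sketch, assuming `grad_fns` is a list of per-sub-batch gradient functions (hypothetical names, not the paper's code):

```python
import numpy as np

def msam_grad(w, grad_fns, rho=0.05):
    """m-SAM sketch: each sub-batch computes its own, unsynced perturbation
    epsilon from its local gradient; the resulting sharpness-aware gradients
    are then averaged, as in ordinary data parallelism."""
    total = np.zeros_like(w)
    for grad_fn in grad_fns:                          # one per replica / sub-batch
        g = grad_fn(w)
        eps = rho * g / (np.linalg.norm(g) + 1e-12)   # local perturbation
        total += grad_fn(w + eps)                     # gradient at this replica's perturbed point
    return total / len(grad_fns)

# two toy sub-batches with different targets: L_i(w) = 0.5 * ||w - t_i||^2
t1, t2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
g = msam_grad(np.zeros(2), [lambda w: w - t1, lambda w: w - t2])
```

A single synced epsilon would instead be computed from the averaged gradient; the per-sub-batch version penalizes sharpness at m-sized granularity, which is the m-sharpness the tweets above refer to.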
Pierre Foret retweeted
Aran Komatsuzaki@arankomatsuzaki·
Sharpness-Aware Minimization Improves Language Model Generalization SAM substantially improves performance on SuperGLUE, GLUE, Web Questions, Natural Questions, Trivia QA, and TyDiQA by encouraging convergence to flatter minima w/ minimal overhead. arxiv.org/abs/2110.08529
1 reply · 16 retweets · 89 likes
Pierre Foret@Foret_p·
@thanhnguyentang @matthen2 If each particle is independent, each particle probably only needs to keep the random seed used to generate the path increments
1 reply · 0 retweets · 3 likes
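The seed trick can be sketched concretely: store only each particle's RNG seed, and deterministically regenerate its full path on demand once a winner needs to be traced back. A toy 2-D random walk (`replay_path` is a hypothetical helper, not code from the thread):

```python
import numpy as np

MOVES = np.array([[0, 1], [0, -1], [1, 0], [-1, 0]])  # the four cardinal steps

def replay_path(seed, n_steps):
    """Deterministically regenerate a particle's random walk from its seed,
    so per-particle path storage is just one integer."""
    rng = np.random.default_rng(seed)
    steps = rng.integers(0, 4, size=n_steps)
    return np.cumsum(MOVES[steps], axis=0)

# store only the winning particle's seed; replaying yields the identical path
path = replay_path(seed=42, n_steps=1000)
same = replay_path(seed=42, n_steps=1000)
```

This trades memory for recomputation: O(1) storage per particle, at the cost of re-running the walk for the one particle that wins.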
Thanh Nguyen-Tang@thanhnguyentang·
@matthen2 Each particle must carry path information so that a winning one can be traced back. It seems implausible (especially since the particles do not share information) to keep path info for a huge number of particles until a winning one is determined.
1 reply · 0 retweets · 0 likes
Matt Henderson@matthen2·
the dumbest way to solve a maze? simulate a gas of thousands of particles diffusing from the start point, until one particle reaches the exit. trace back the winning particle
404 replies · 4K retweets · 30.1K likes
Pierre Foret retweeted
AK@_akhaliq·
When Vision Transformers Outperform ResNets without Pretraining or Strong Data Augmentations pdf: arxiv.org/pdf/2106.01548… abs: arxiv.org/abs/2106.01548 +5.3% and +11.0% top-1 accuracy on ImageNet for ViT-B/16 and Mixer-B/16, with the simple Inception-style preprocessing
7 replies · 124 retweets · 474 likes
Pierre Foret retweeted
Olivier Grisel@ogrisel·
Interesting empirical study of the geometry of the loss landscape of Vision Transformers and MLP-Mixers, and of the critical impact of Sharpness-Aware Minimization (SAM) for those architectures.
AK@_akhaliq

When Vision Transformers Outperform ResNets without Pretraining or Strong Data Augmentations pdf: arxiv.org/pdf/2106.01548… abs: arxiv.org/abs/2106.01548 +5.3% and +11.0% top-1 accuracy on ImageNet for ViT-B/16 and Mixer-B/16, with the simple Inception-style preprocessing

0 replies · 6 retweets · 28 likes
Pierre Foret retweeted
Hossein Mobahi@TheGradient·
Excited to see that Sharpness-Aware Minimization (the SAM optimizer) we proposed recently (w/ @Foret_p @bneyshabur and Kleiner) is becoming a persistent component of recent state-of-the-art records 😇
AK@_akhaliq

Drawing Multiple Augmentation Samples Per Image During Training Efficiently Decreases Test Error pdf: arxiv.org/pdf/2105.13343… abs: arxiv.org/abs/2105.13343 ImageNet SOTA of 86.8% top-1 accuracy after just 34 epochs of training with an NFNet-F5 using the SAM optimizer

0 replies · 7 retweets · 39 likes
Pierre Foret retweeted
KDnuggets@kdnuggets·
We don’t need to worry about #Overfitting anymore? Sharpness-Aware Minimization seeks parameters that lie in neighborhoods having uniformly low loss; this results in a min-max optimization problem solved with efficient gradient descent #MachineLearning buff.ly/38VJTOf
0 replies · 14 retweets · 21 likes
Pierre Foret@Foret_p·
@RisingSayak Great stuff! Is this syncing epsilon across replicas? On a TPU (8 chips for this one, I think?) I would expect the benefits of SAM to be amplified by not syncing epsilon across the devices (one perturbation per sub-batch). Could be a cool improvement if that's not already the case
Pierre Foret@Foret_p·
@imos You can of course emulate this on a single device with gradient accumulation, but it becomes tedious and the wall-clock time might suffer (although NFNet's using a subset of the batch to compute the SAM epsilon is a great trick)
1 reply · 0 retweets · 0 likes
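The NFNet trick mentioned in the tweet above (computing the SAM epsilon from only a subset of the batch) might look like this in NumPy. `grad_fn(w, batch)` is a hypothetical per-batch gradient function and the subset fraction is an arbitrary choice, not values from the NFNet paper:

```python
import numpy as np

def sam_grad_subset(w, grad_fn, batch, subset_frac=0.25, rho=0.05):
    """Compute the SAM perturbation from a sub-batch only (cheap),
    then take the sharpness-aware gradient on the full batch."""
    k = max(1, int(len(batch) * subset_frac))
    g_sub = grad_fn(w, batch[:k])                        # cheap gradient just for epsilon
    eps = rho * g_sub / (np.linalg.norm(g_sub) + 1e-12)
    return grad_fn(w + eps, batch)                       # full-batch gradient at perturbed weights

# toy loss L(w) = 0.5 * ||w - mean(batch)||^2  =>  grad = w - mean(batch)
batch = np.arange(8.0)
g = sam_grad_subset(np.zeros(1), lambda w, b: w - b.mean(), batch)
```

Compared with vanilla SAM this saves roughly one full forward/backward pass per step, at the cost of a noisier perturbation direction.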
いもす@imos·
@Foret_p SAM is very impressive to me! Can I ask why SAM is used only in the largest model of pre-trained NFNet models? I guess that SAM behaves like finding an ensemble solution efficiently and needs more parameters to represent it. Do you have any observations?
2 replies · 0 retweets · 1 like
Pierre Foret@Foret_p·
@imos So SAM on TPU minimizes m-sharpness for a small m, which leads to the biggest boosts. That's why I assume we will mostly see SAM applied to larger nets that require TPUs or multiple GPUs, where it really shines. 3/3
Pierre Foret@Foret_p·
@imos SAM usually works well for smaller models, but the best results are obtained when using a lot of data parallelism (see the section about m-sharpness in the SAM paper). Because the largest nets are trained on a lot of TPU chips, each chip computes epsilon for only a few samples... 2/3
1 reply · 0 retweets · 1 like
Pierre Foret retweeted