Sidak Pal Singh

51 posts

Sidak Pal Singh

@unregularized

Research Scientist at Google DeepMind, working on Gemini. (prev. PhD at ETH Zürich & MPI-IS Tübingen.) No second-hand opinions. They are absolutely my own ;)

New York · Joined October 2022
105 Following · 484 Followers
Sidak Pal Singh @unregularized
Belated life update:
🎓 PhD — done
🔬 Joined Google in NYC 🗽 as a Research Scientist
♊️ Gemini: now more than just my star sign :)
25 replies · 11 reposts · 555 likes · 29.1K views
Sidak Pal Singh @unregularized
🚀 TOMORROW afternoon at ICLR: Learn about the directionality of optimization trajectories in neural nets and how it inspires a potential way to make LLM pretraining more efficient ♻️ (Poster #585, Hall 2b)
Sidak Pal Singh @unregularized

Ever wondered what the optimization trajectories look like when training neural nets & LLMs🤔? Do they contain a lot of twists 💃 and turns, or does the direction largely remain the same🛣️? We explore this in our work for LLMs (up to 12B params) + ResNets on ImageNet. Key findings👇

0 replies · 1 repost · 6 likes · 2.1K views
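For intuition, here is a minimal sketch (my own illustration, not code from the paper; the checkpoint format and the displacement-based metric are assumptions) of how one might quantify a trajectory's directionality:

```python
# Hypothetical sketch: measure how "directional" a training trajectory is by
# comparing parameter displacements from initialization across checkpoints.
# High pairwise cosine similarity => the trajectory barely changes direction.
import torch

def flatten(params):
    """Concatenate a list of parameter tensors into a single 1-D vector."""
    return torch.cat([p.detach().reshape(-1) for p in params])

def directionality(checkpoints):
    """checkpoints: list of parameter lists, ordered by training step.
    Returns the matrix of cosine similarities between displacements w_t - w_0."""
    w0 = flatten(checkpoints[0])
    disps = [flatten(ck) - w0 for ck in checkpoints[1:]]
    n = len(disps)
    sims = torch.empty(n, n)
    for i in range(n):
        for j in range(n):
            sims[i, j] = torch.nn.functional.cosine_similarity(
                disps[i], disps[j], dim=0)
    return sims
```

If the mean off-diagonal similarity is high, most of the motion happens along a nearly fixed direction, which is what makes trajectory-aware savings during pretraining plausible.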
Sidak Pal Singh @unregularized
Don't miss our spotlight ✨ paper at ICLR 🇸🇬 about the loss landscape of Transformers and its special heterogeneous structure, done together with great collaborators! x.com/wormaniec/stat…
Weronika Ormaniec @wormaniec

Ever wondered how the loss landscape of Transformers differs from that of other architectures? Or which Transformer components make its loss landscape unique? With @unregularized & @f_dangel, we explore this via the Hessian in our #ICLR2025 spotlight paper! Key insights👇 1/8

0 replies · 2 reposts · 16 likes · 1.4K views
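A rough sketch of one way to probe this kind of curvature heterogeneity (my own illustration, not the paper's method): estimate the Frobenius norm of the Hessian's diagonal block per parameter group via Hutchinson-style Hessian-vector products.

```python
# Hypothetical sketch: estimate ||H_block||_F for one parameter group
# (e.g., attention vs. MLP weights) using random Hessian-vector products,
# then compare groups to see how heterogeneous the curvature is.
import torch

def block_hessian_fro_norm(loss, group_params, n_probes=8):
    """Uses E_v ||H v||^2 = ||H||_F^2 for v ~ N(0, I), where H is the
    Hessian of `loss` restricted to `group_params` (its diagonal block)."""
    grads = torch.autograd.grad(loss, group_params, create_graph=True)
    acc = 0.0
    for _ in range(n_probes):
        v = [torch.randn_like(p) for p in group_params]
        hv = torch.autograd.grad(
            grads, group_params, grad_outputs=v, retain_graph=True)
        acc += sum((h * h).sum().item() for h in hv)
    return (acc / n_probes) ** 0.5
```

Running this separately on, say, query/key weights, value weights, and MLP weights of the same layer gives a cheap picture of how differently curved the components are.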
Sidak Pal Singh @unregularized
@savvyRL :) I think the email address used there suggests somebody is doing it for him... but you never know haha
1 reply · 1 repost · 2 likes · 1K views
Sidak Pal Singh retweeted
Alice Bizeul @AliceBizeul
✨New Preprint ✨ Ever thought that reconstructing masked pixels for image representation learning seems sub-optimal? In our new preprint, we show how masking principal components—rather than raw pixel patches— improves Masked Image Modelling (MIM). Find out more below 🧵
17 replies · 61 reposts · 527 likes · 48.4K views
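A toy sketch of the masking step as I read it (the shapes, the precomputed PCA basis, and the target definition are my assumptions, not the paper's exact method):

```python
# Hypothetical sketch: corrupt images by masking principal-component
# coefficients instead of pixel patches, as the MIM corruption step.
import torch

def pca_mask(images, components, mean, mask_ratio=0.5):
    """images: (B, D) flattened; components: (K, D) principal axes (rows);
    mean: (D,). Returns corrupted images and the held-out coefficients."""
    coeffs = (images - mean) @ components.T          # (B, K) PC projections
    keep = torch.rand(coeffs.shape[1]) > mask_ratio  # drop ~mask_ratio of PCs
    visible = coeffs * keep                          # zero out masked coeffs
    corrupted = visible @ components + mean          # back to pixel space
    target = coeffs * ~keep                          # coefficients to predict
    return corrupted, target
```

The model then sees `corrupted` and is trained to predict `target`, so the reconstruction task is posed in the principal-component basis rather than in raw pixel space.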
Sidak Pal Singh @unregularized
@mayank98shri @TheGradient @ynd @baharanm The bulk + outliers notion isn't wrong. The key is to understand how the sharpness reduction is happening. GN can lead to spurious sharpness reduction, while NME reduces sharpness by adapting the geometry of the model itself. In vision, these two ways are balanced, but not for LLMs.
0 replies · 0 reposts · 2 likes · 48 views
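For context on the decomposition discussed here and in the question below: for L(w) = loss(f(w)), the Hessian splits into a Gauss-Newton (GN) term and a residual term weighted by the loss gradient (the NME). A tiny toy example (my own, for illustration):

```python
# Hypothetical sketch of H = GN + NME for a one-neuron model with squared
# error: GN = J^T loss''(f) J, NME = (f - y) * d^2 f / dw^2.
import torch

torch.manual_seed(0)
w = torch.randn(3, requires_grad=True)
x, y = torch.randn(3), torch.tensor(1.0)

def loss_fn(w_):
    return 0.5 * (torch.tanh(w_ @ x) - y) ** 2

# Full Hessian of the loss w.r.t. w:
H = torch.autograd.functional.hessian(loss_fn, w.detach())

# Gauss-Newton term: for squared error, loss''(f) = 1, so GN = J J^T.
f = torch.tanh(w @ x)
J = torch.autograd.grad(f, w)[0]       # df/dw
GN = torch.outer(J, J)

NME = H - GN                           # equals (f - y) * d^2f/dw^2 here
print(torch.linalg.eigvalsh(GN))       # GN is PSD: eigenvalues >= 0
print(torch.linalg.eigvalsh(NME))      # NME can add or remove sharpness
```

Since GN is positive semi-definite while NME is signed, reducing measured sharpness through GN alone can be "spurious" in the sense above, whereas NME changes reflect the model's own geometry.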
Mayank Shrivastava @mayank98shri
@TheGradient @unregularized @ynd @baharanm In the bulk + outliers structure of the loss Hessian, I thought the bulk came from the NME matrix and the outliers were due to the Gauss-Newton matrix, which should have contributed to the sharpness. Interesting to see that NME also contributes to sharpness?
1 reply · 0 reposts · 1 like · 81 views
Hossein Mobahi @TheGradient
(1/2) Ever wondered why Sharpness-Aware Minimization (SAM) yields greater generalization gains in vision than in NLP? I'll discuss this at the @UCLA CS-201 seminar on February 18th, relating it to the balance of SAM's impact on logit statistics vs. model geometry. cs.ucla.edu/upcoming-event…
1 reply · 10 reposts · 58 likes · 6.1K views
Sidak Pal Singh retweeted
Yann N. Dauphin @ynd
Don’t miss our poster shedding more light on sharpness regularization at NeurIPS tomorrow neurips.cc/virtual/2024/p…
0 replies · 4 reposts · 6 likes · 2.7K views
Rishabh Agarwal @agarwl_
@pratyushmaini Is this mostly anecdotal? I thought current models could solve harder math problems than JEE...
2 replies · 0 reposts · 3 likes · 890 views
Pratyush Maini @pratyushmaini
my newfound guilty pleasure is watching the new reasoning models struggle by think-maxxing them with questions from JEE Advanced
4 replies · 0 reposts · 45 likes · 5.9K views
Sidak Pal Singh @unregularized
Reinventing things has a bad rep in today's age. But is it really that bad? Maybe it's even something to be cultivated, selectively? The second post in this series of blogs is now out. Let's take a deeper look at this overused trope! wovencircuits.substack.com/p/reinventing-…
0 replies · 0 reposts · 2 likes · 307 views
Sidak Pal Singh @unregularized
I’m exploring a new form of writing—threads of human curiosity woven through the circuits of AI, crafting reflections that are, in the end, fully machine-generated, yet in a way profoundly human. wovencircuits.substack.com/p/the-spirit-b…
0 replies · 0 reposts · 0 likes · 211 views
Sidak Pal Singh @unregularized
Come, let's scale up the building one floor,
And, layer up the neural networks once more.
Soon our buildings will touch the sky,
And, our computers will bear AGI.
A quaint little hut in the mountains is out of fashion,
Satisfaction has no gradients for backpropagation.
~Fitoor
0 replies · 0 reposts · 5 likes · 304 views
Sidak Pal Singh @unregularized
@kellerjordan0 @bozavlado QK params and V params have very different behavior in terms of their curvature, so grouping them together is not ideal. I believe you could still try preconditioning the QK params together and keeping V separate.
1 reply · 0 reposts · 1 like · 53 views
Keller Jordan @kellerjordan0
@bozavlado The reason they were initially orthogonalized/preconditioned together was because NanoGPT represents QKV as a single parameter by default, for efficiency reasons.
1 reply · 0 reposts · 9 likes · 1.4K views
Keller Jordan @kellerjordan0
NanoGPT speedrunning update: @bozavlado discovered that the new optimizer performs ~3% better if we orthogonalize the QKV updates separately rather than together. I replicated this and found that it also holds for SOAP; it was used in yesterday’s record. x.com/bozavlado/stat…
Vlado Boza @bozavlado

@kellerjordan0 Yes, but with your older code (with warmup and w/o scaling by the number of elements). Also, this could be seed dependent, etc. Take it with a huge grain of salt.

8 replies · 10 reposts · 115 likes · 19.5K views
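A minimal sketch of the idea discussed above (my own illustration, not the actual NanoGPT/speedrun code; it uses a plain SVD where the optimizer uses a Newton-Schulz iteration): split the fused QKV update and orthogonalize each block separately.

```python
# Hypothetical sketch: orthogonalize the update for a fused QKV weight
# per sub-block instead of jointly.
import torch

def orthogonalize(update):
    """Replace an update matrix by the nearest (semi-)orthogonal matrix."""
    U, _, Vh = torch.linalg.svd(update, full_matrices=False)
    return U @ Vh

def orthogonalize_qkv_separately(qkv_update, d_model):
    """qkv_update: (3*d_model, d_model) fused update for Q, K, V.
    Orthogonalize each of the three blocks on its own, then re-fuse."""
    q, k, v = qkv_update.split(d_model, dim=0)
    return torch.cat(
        [orthogonalize(q), orthogonalize(k), orthogonalize(v)], dim=0)
```

Treating the three blocks separately respects the fact that they are distinct linear maps that NanoGPT only fuses for efficiency; per the curvature point above, one could also group Q and K together while keeping V on its own.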
Sidak Pal Singh @unregularized
“Hypotheses are nets: only he who casts will catch.” - Novalis
0 replies · 0 reposts · 4 likes · 327 views
Sidak Pal Singh @unregularized
Come to our posters today at 3:30 pm (Straus 2) to learn more! :)
0 replies · 0 reposts · 0 likes · 156 views
Sidak Pal Singh @unregularized
Poster 1: Sharpness/flatness are much talked about: better minima, sharpness-aware minimization, Edge of Stability, and so on. But what really is sharpness? What exactly does it quantify, beyond the surface-level definition? What are the eigenvalues and eigenvectors really like?
2 replies · 0 reposts · 0 likes · 273 views
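As background for what the surface-level definition computes: sharpness is usually reported as the top Hessian eigenvalue, which can be estimated by power iteration on Hessian-vector products. A minimal sketch (my own, for illustration):

```python
# Hypothetical sketch: estimate the top Hessian eigenvalue of the loss via
# power iteration on Hessian-vector products (converges to the eigenvalue
# largest in magnitude).
import torch

def top_hessian_eigenvalue(loss, params, iters=50):
    grads = torch.autograd.grad(loss, params, create_graph=True)
    v = [torch.randn_like(p) for p in params]
    for _ in range(iters):
        hv = torch.autograd.grad(
            grads, params, grad_outputs=v, retain_graph=True)
        norm = torch.sqrt(sum((h * h).sum() for h in hv))
        v = [h / norm for h in hv]          # renormalize, iterate on H v
    hv = torch.autograd.grad(grads, params, grad_outputs=v, retain_graph=True)
    return sum((a * b).sum() for a, b in zip(v, hv))  # Rayleigh quotient v^T H v
```

The converged vector v is the corresponding top eigenvector, i.e., the direction in parameter space along which the loss curves most strongly.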
Sidak Pal Singh @unregularized
Poster 2: Linear Mode Connectivity (LMC) is yet another popular feature of neural loss landscapes. But how does LMC arise in the first place? How should the landscape be structured to allow LMC? Are barriers present just at the end of training, or do they appear much earlier?
0 replies · 0 reposts · 1 like · 225 views
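For reference, the standard way to measure such a barrier: evaluate the loss along the straight line between two trained solutions and compare it with the linear interpolation of the endpoint losses. A minimal sketch (my own; `loss_fn` mapping a parameter list to a scalar loss is an assumed helper, e.g., built with torch.func.functional_call):

```python
# Hypothetical sketch: Linear Mode Connectivity barrier between two solutions.
import torch

@torch.no_grad()
def lmc_barrier(loss_fn, params_a, params_b, n_points=21):
    """Barrier = max over the linear path of loss(interpolated params)
    minus the linear interpolation of the two endpoint losses."""
    la, lb = loss_fn(params_a), loss_fn(params_b)
    barrier = 0.0
    for t in torch.linspace(0, 1, n_points):
        mid = [(1 - t) * a + t * b for a, b in zip(params_a, params_b)]
        excess = loss_fn(mid) - ((1 - t) * la + t * lb)
        barrier = max(barrier, excess.item())
    return barrier  # ~0 means the two minima are linearly mode-connected
```

Tracking this quantity between checkpoints from different stages of training, rather than only between final solutions, is one way to probe whether barriers appear early or only at the end.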