Sidak Pal Singh

51 posts

Sidak Pal Singh

@unregularized

Research Scientist at Google DeepMind, working on Gemini. (prev. PhD at ETH Zürich & MPI-IS Tübingen.) No second-hand opinions. They are absolutely my own ;)

New York Inscrit le Ekim 2022

105 Abonnements484 Abonnés

Tweet épinglé

Sidak Pal Singh@unregularized·26 Tem

📢I'll be presenting two posters, at #ICML2024 HiLD workshop (Straus 2) today (assuming no further ✈️ delays): - Closed form of the Hessian spectrum for some neural networks openreview.net/forum?id=gW30R… - Landscaping Linear Mode Connectivity openreview.net/forum?id=OSNMq…

English

3.5K

Sidak Pal Singh@unregularized·18 Kas

quite good actually. coding’s so smooth! x.com/GoogleDeepMind…

Google DeepMind@GoogleDeepMind

This is Gemini 3: our most intelligent model that helps you learn, build and plan anything. It comes with state-of-the-art reasoning capabilities, world-leading multimodal understanding, and enables new agentic coding experiences. 🧵

English

166

Sidak Pal Singh@unregularized·2 Ağu

sure the numbers are great, but screw that! it’s your phd, try wild and original ideas even if you fail once or twice at least. twitter.com/gabriberton/st…

Gabriele Berton@gabriberton

A few numbers from my PhD: 8 first-author top-conference (CVPR/ICCV/ECCV) papers 100% acceptance rate per paper 80% acceptance rate per submission 1 invited long talk at CVPR tutorial 5 top-conf demos (acceptance rate 100% vs ~30% average) ~2k GitHub stars

English

903

Sidak Pal Singh@unregularized·13 Tem

when you go beyond linear mode connectivity, interesting things happen 😮👇 x.com/Theus__A/statu…

Alexander Theus@TheusResearch

1/ 🚨 New paper alert! 🚨 We explore a key question in deep learning: Can independently trained Transformers be linearly connected in weight space — without a loss barrier? Yes — if you uncover their rich symmetries. 📄 arXiv: arxiv.org/abs/2506.22712

English

601

Sidak Pal Singh@unregularized·18 Haz

Belated life update: 🎓 PhD — done 🔬 Joined Google in NYC 🗽as a Research Scientist ♊️ Gemini: now more than just my star sign :)

English

555

29.1K

Sidak Pal Singh@unregularized·24 Nis

🚀 TOMORROW afternoon at ICLR: Learn about the directionality of optimization trajectories in neural nets and how it inspires a potential way to make LLM pretraining more efficient ♻️ (Poster# 585, hall 2b)

Sidak Pal Singh@unregularized

Ever wondered how the optimization trajectories are like when training neural nets & LLMs🤔? Do they contain a lot of twists 💃 and turns, or does the direction largely remain the same🛣️? We explore this in our work for LLMs (upto 12B params) + ResNets on ImageNet. Key findings👇

English

2.1K

Sidak Pal Singh@unregularized·22 Nis

Don't miss out our spotlight ✨paper at ICLR 🇸🇬 about the loss landscape of Transformers and their special heterogeneous structure, done together with great collaborators! x.com/wormaniec/stat…

Weronika Ormaniec@wormaniec

Ever wondered how the loss landscape of Transformers differs from that of other architectures? Or which Transformer components make its loss landscape unique? With @unregularized & @f_dangel, we explore this via the Hessian in our #ICLR2025 spotlight paper! Key insights👇 1/8

English

1.4K

Sidak Pal Singh@unregularized·19 Şub

@savvyRL :) I think the email address used there seems to suggest somebody doing it for him.. but you never know haha

English

Rosanne Liu@savvyRL·19 Şub

The most notable thing about DeepSeek papers is that the CEO Wenfeng Liang always uploads them himself

DeepSeek@deepseek_ai

🚀 Introducing NSA: A Hardware-Aligned and Natively Trainable Sparse Attention mechanism for ultra-fast long-context training & inference! Core components of NSA: • Dynamic hierarchical sparse strategy • Coarse-grained token compression • Fine-grained token selection 💡 With optimized design for modern hardware, NSA speeds up inference while reducing pre-training costs—without compromising performance. It matches or outperforms Full Attention models on general benchmarks, long-context tasks, and instruction-based reasoning. 📖 For more details, check out our paper here: arxiv.org/abs/2502.11089

English

380

36.7K

Sidak Pal Singh retweeté

Alice Bizeul@AliceBizeul·12 Şub

✨New Preprint ✨ Ever thought that reconstructing masked pixels for image representation learning seems sub-optimal? In our new preprint, we show how masking principal components—rather than raw pixel patches— improves Masked Image Modelling (MIM). Find out more below 🧵

English

527

48.4K

Sidak Pal Singh@unregularized·6 Şub

@mayank98shri @TheGradient @ynd @baharanm The bulk + outliers notion isn't wrong. The key is to understand how sharpness reduction is happening. GN can lead to spurious sharpness reduction; while NME reduces sharpness through adapting the geometry of the model itself. In vision, these ways are balanced, but not for LLMs.

English

Mayank Shrivastava@mayank98shri·6 Şub

@TheGradient @unregularized @ynd @baharanm In the bulk + outliers structure of the loss Hessian , I thought that the bulk came from the NME matrix and the outliers were due to the Gauss-Newton matrix which should have contributed to the sharpness, interesting to see that NME also contributes to sharpness?

English

Hossein Mobahi@TheGradient·6 Şub

(1/2) Ever wondered why Sharpness-Aware Minimization (SAM) yields greater generalization gains in vision than in NLP? I'll discuss this at @UCLA CS-201 seminar February 18th, relating it to the balance of SAM's impact on logit statistics vs model geometry. cs.ucla.edu/upcoming-event…

English

6.1K

Sidak Pal Singh retweeté

Yann N. Dauphin@ynd·10 Ara

Don’t miss our poster shedding more light on sharpness regularization at NeurIPS tomorrow neurips.cc/virtual/2024/p…

English

2.7K

Sidak Pal Singh@unregularized·23 Kas

@agarwl_ @pratyushmaini It would be interesting to see the rank these models get haha. Predictions? :)

English

Rishabh Agarwal@agarwl_·21 Kas

@pratyushmaini Is this mostly anecdotal? I thought current models can solve harder math problems than JEE ..

English

890

Pratyush Maini@pratyushmaini·21 Kas

my new found guilty pleasure is watching the new reasoning models struggle by think-maxxing them with questions from JEE Advanced

English

5.9K

Sidak Pal Singh@unregularized·21 Kas

Reinventing things has a bad rep in today's age. But is it really that bad? Maybe it's something to be even cultivated, like selectively? The second post in this series of blogs is now out. Let's have a deeper look at this overused trope! wovencircuits.substack.com/p/reinventing-…

English

307

Sidak Pal Singh@unregularized·17 Kas

I’m exploring a new form of writing—threads of human curiosity woven through the circuits of AI, crafting reflections that are, in the end, fully machine-generated, yet in a way profoundly human. wovencircuits.substack.com/p/the-spirit-b…

English

211

Sidak Pal Singh@unregularized·26 Eki

Come, let's scale up the building one floor, And, layer up the neural networks once more. Soon our buildings will touch the sky, And, our computers will bear AGI. A quaint little hut in the mountains is out of fashion, Satisfaction has no gradients for backpropagation. ~Fitoor

English

304

Sidak Pal Singh@unregularized·10 Eki

@kellerjordan0 @bozavlado QK params and V params have very different behavior in terms of their curvature. So grouping them together is not ideal. I believe you could still try preconditioning QK params together, and keeping V separate.

English

Keller Jordan@kellerjordan0·10 Eki

@bozavlado The reason they were initially orthogonalized/preconditioned together was because NanoGPT represents QKV as a single parameter by default, for efficiency reasons.

English

1.4K

Keller Jordan@kellerjordan0·10 Eki

NanoGPT speedrunning update: @bozavlado discovered that the new optimizer performs ~3% better if we orthogonalize the QKV updates separately rather than together. I replicated this and found that it also holds for SOAP; it was used in yesterday’s record. x.com/bozavlado/stat…

Vlado Boza@bozavlado

@kellerjordan0 Yes, but with your older code (with warmup and w/o scaling by number of elements). Also this could be seed dependent, etc. Take it with very huge grain of salt.

English

115

19.5K

Sidak Pal Singh@unregularized·7 Eki

At this paper count, recalling all the paper names would already be a big feat :) twitter.com/peter_richtari…

Peter Richtarik@peter_richtarik

Source: papercopilot.com/paper-list/neu…

English

344

Sidak Pal Singh@unregularized·2 Eki

“Hypotheses are nets: only he who casts will catch.” - Novalis

English

327

Sidak Pal Singh@unregularized·7 Ağu

At last some attempts to change the status quo: authors with three or more papers are obligated to review for ICLR x.com/PreetumNakkira…

Preetum Nakkiran@PreetumNakkiran

Review requirements! (And 10pg limit!)

English

601

Sidak Pal Singh@unregularized·26 Tem

Come to our posters today at 3:30 pm (Straus 2) to know more! :)

English

156

Sidak Pal Singh@unregularized·26 Tem

Poster 1: Sharpness/Flatness are much talked about: better minima, Sharpness aware minimization, Edge-of-Stability, and so on. But what really is sharpness? What exactly does it quantify, besides the surface-level definition? How are the eigenvalues and eigenvectors really like?

English

273

Sidak Pal Singh@unregularized·26 Tem

English

3.5K

Sidak Pal Singh@unregularized·26 Tem

Poster 2: Linear Mode Connectivity (LMC) is yet another popular feature of neural loss landscapes. But how does LMC arise in the first place? How should the landscape be structured to allow LMC? Are barriers present just at the end, or do they start much early?

English

225

Découvrir

@savvyRL @mayank98shri @TheGradient @ynd @baharanm @UCLA @agarwl_ @pratyushmaini