Neil Mallinar

110 posts


@nmallinar

PhD student @ UCSD, prior to that: Research Intern @ Google Research & MSR NE, Research Engineer at Pryon Inc & IBM Watson.

New York, NY · Joined June 2009
659 Following · 275 Followers
Neil Mallinar@nmallinar·
Super excited to share that we have an Oral presentation for this paper next week at ICML! It will be on Tuesday at 10am (Oral 1E) in West Ballroom D, I'll be presenting 4th at 10:45am :) Our poster will be on Wednesday at 11am and I encourage you to stop by and chat!
Neil Mallinar@nmallinar·
@matthistory Maybe they are going to do a heist together, or sing karaoke! I cannot wait to find out
Miss Mineragua@emanouks·
Three's a coven, four's a crowd
Neil Mallinar retweeted
amirhesam abedsoltan@Amirhesam_A·
Two generalization regimes in ICL: (1) context-scaling, where performance improves with more in-context examples, and (2) task-scaling, where performance improves with more pre-training tasks. While MLPs show task-scaling but not context-scaling, arxiv.org/abs/2410.12783
Neil Mallinar@nmallinar·
Consider my beautiful day uninterrupted 🥲 Alas the research work calls me back
Neil Mallinar@nmallinar·
@thdbui @pfau Anyway I enjoyed your paper and would love to get a chance to discuss these topics further sometime and hear more about your observations!
Neil Mallinar@nmallinar·
@thdbui @pfau Another difference we see compared to grokking in low-rank settings like k-parity is that the circulant features we learn for modular arithmetic (MA) are full rank! It wasn't obvious to us that you could do MA with kernels, as the MA experiments we've seen all still use neural nets
Neil Mallinar@nmallinar·
@avishvj In a hole in the ground there lived a kernel...
Neil Mallinar retweeted
Daniel Beaglehole@dbeagleholeCS·
Iterating kernel ridgeless regression with AGOP computation groks modular arithmetic… and this grokking is remarkably similar to the phenomenon in neural networks. I found these results very surprising!
Neil Mallinar@nmallinar

Grokking modular arithmetic is widely studied for the seemingly unique emergent abilities of neural networks. Instead, we find that iteratively solving a kernel machine and estimating the Average Gradient Outer Product (AGOP) recovers this phenomenon identically:

Neil Mallinar@nmallinar·
In our setting, grokking appears to occur solely due to feature learning. We decouple from neural architectures and gradient-descent based optimization by using kernels equipped with feature learning through AGOP and find many of the same phenomena as observed in neural networks.
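The iterate-then-estimate loop described above can be sketched in a few lines of numpy: alternately solve a (near-)ridgeless kernel regression, then replace the kernel's feature matrix with the AGOP of the fitted predictor. This toy uses a quadratic kernel K_M(x, z) = (xᵀMz)² and a low-rank regression target rather than the thread's modular-arithmetic setup; the target, hyperparameters, and trace normalization are illustrative assumptions, not the paper's exact recipe.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 10
X = rng.normal(size=(n, d))
v = np.zeros(d)
v[0] = v[1] = 1.0                    # hidden low-rank direction (toy assumption)
y = (X @ v) ** 2                     # quadratic target depending only on x . v

def quad_kernel(X1, X2, M):
    # K_M(x, z) = (x^T M z)^2
    return (X1 @ M @ X2.T) ** 2

M = np.eye(d)
for _ in range(3):
    # 1) (near-)ridgeless kernel regression with the current feature matrix M
    K = quad_kernel(X, X, M)
    alpha = np.linalg.solve(K + 1e-6 * np.eye(n), y)
    # 2) AGOP of the fitted predictor f(x) = sum_i alpha_i (x^T M x_i)^2,
    #    whose gradient is grad f(x) = 2 * sum_i alpha_i (x^T M x_i) M x_i
    S = X @ M @ X.T                  # S[j, i] = x_j^T M x_i
    G = 2 * (S * alpha) @ X @ M      # row j = grad f(x_j)
    M = G.T @ G / n                  # the AGOP becomes the new feature matrix
    M /= np.trace(M)                 # normalize scale (a common stabilizer)

w = np.linalg.eigh(M)[1][:, -1]      # top eigenvector of the learned features
```

After a few iterations the learned feature matrix M concentrates on the hidden direction v, which is the feature-learning effect the tweet attributes to AGOP.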
Neil Mallinar@nmallinar·
By the relation between circulant matrices and the Discrete Fourier Transform, we theoretically show that a quadratic kernel equipped with circulant features implements the same generalizing solution as neural networks - the Fourier Multiplication Algorithm found in prior work.
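The circulant–DFT relation behind this claim is easy to check numerically: every circulant matrix is diagonalized by the DFT, and circularly convolving two one-hot vectors adds their indices mod p, which is exactly the Fourier-multiplication route to modular addition. A small self-check (the modulus p and the indices a, b are arbitrary choices):

```python
import numpy as np

p = 7
c = np.random.default_rng(0).normal(size=p)
# circulant matrix with first column c: C[j, k] = c[(j - k) % p]
C = np.array([[c[(j - k) % p] for k in range(p)] for j in range(p)])

# the DFT diagonalizes any circulant: F C F^{-1} = diag(fft(c))
F = np.fft.fft(np.eye(p))
D = F @ C @ np.linalg.inv(F)
assert np.allclose(D, np.diag(np.fft.fft(c)))

# modular addition as Fourier multiplication:
# circular convolution of one-hot(a) and one-hot(b) is one-hot((a + b) % p)
a, b = 3, 6
ea, eb = np.eye(p)[a], np.eye(p)[b]
conv = np.fft.ifft(np.fft.fft(ea) * np.fft.fft(eb)).real
# conv peaks at index (3 + 6) % 7 == 2
```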
Neil Mallinar@nmallinar·
The relationship between the NFM and neural network AGOP has been noted in prior work. arxiv.org/abs/2212.13881 In settings where weight decay, or trace(NFM), induces grokking we find that AGOP regularization, or trace(AGOP), does the same.
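For concreteness: for a one-hidden-layer ReLU net f(x) = aᵀσ(Wx), the trace of the first-layer NFM WᵀW is exactly the weight-decay penalty ‖W‖²_F, while the trace of the AGOP is the mean squared input-gradient norm, so penalizing the latter is the feature-space analogue of penalizing the former. A minimal numpy sketch of the two traces (the toy net and shapes are illustrative, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(0)
d, h, n = 8, 16, 100
W = rng.normal(size=(h, d))          # first-layer weights
a = rng.normal(size=h)               # output weights
X = rng.normal(size=(n, d))

# Neural Feature Matrix of the first layer: NFM = W^T W
nfm = W.T @ W
trace_nfm = np.trace(nfm)            # equals ||W||_F^2, the weight-decay penalty

# AGOP of f(x) = a^T relu(W x): grad_x f = W^T (a * 1[Wx > 0])
pre = X @ W.T                        # (n, h) preactivations
grads = ((pre > 0) * a) @ W          # (n, d): row j is grad f(x_j)
agop = grads.T @ grads / n
trace_agop = np.trace(agop)          # mean squared gradient norm of f
```

In a training loop, adding λ·trace_agop to the loss would play the role the tweet describes for λ·trace_nfm (i.e., weight decay).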
Neil Mallinar@nmallinar·
As before, initializing features in the neural network using a random circulant dramatically reduces the time-to-generalization.
Neil Mallinar@nmallinar·
We additionally find that instantiating features using random circulant matrices leads to generalization in standard Gaussian and quadratic kernels, suggesting that no additional structure beyond a general, asymmetric circulant is necessary to solve modular arithmetic.
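A "general, asymmetric circulant" is just a matrix whose rows are cyclic shifts of one random vector; instantiating it as a fixed feature map x → Mx inside a standard kernel is straightforward. A sketch of that construction (the Gaussian-kernel bandwidth and the x → Mx convention here are assumptions, not the paper's exact setup):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 11
c = rng.normal(size=p)
# random circulant: entry M[j, k] = c[(j - k) % p], so row j+1 is row j shifted by one
M = np.array([[c[(j - k) % p] for k in range(p)] for j in range(p)])
assert not np.allclose(M, M.T)                  # asymmetric in general
assert np.allclose(M[1], np.roll(M[0], 1))      # rows are cyclic shifts

def gaussian_circulant_kernel(X1, X2, M, bw=4.0):
    # standard Gaussian kernel applied to circulant-transformed inputs x -> M x
    XM, ZM = X1 @ M.T, X2 @ M.T
    d2 = ((XM[:, None, :] - ZM[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * bw ** 2))

X = rng.normal(size=(5, p))
K = gaussian_circulant_kernel(X, X, M)           # symmetric PSD Gram matrix
```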
Neil Mallinar@nmallinar·
The progress measures of circulant deviation and AGOP alignment tend to steadily improve in the early iterations of neural networks as well, suggesting that feature learning is taking place in spite of unchanging test loss and accuracy.
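One of these progress measures is directly computable: the shift matrices form an orthogonal basis for circulants, so the Frobenius projection of any square matrix onto the circulants simply averages each wrapped diagonal, and "circulant deviation" can be read as the relative distance to that projection. A sketch under that reading (the paper's exact definition and normalization may differ):

```python
import numpy as np

def nearest_circulant(A):
    # Frobenius projection onto circulants: average each wrapped diagonal,
    # c[r] = mean_j A[j, (j - r) % p], then rebuild C[j, k] = c[(j - k) % p]
    p = A.shape[0]
    c = np.array([np.mean([A[j, (j - r) % p] for j in range(p)]) for r in range(p)])
    return np.array([[c[(j - k) % p] for k in range(p)] for j in range(p)])

def circulant_deviation(A):
    # relative Frobenius distance from A to its circulant projection:
    # 0 for an exact circulant, close to 1 for an unstructured matrix
    return np.linalg.norm(A - nearest_circulant(A)) / np.linalg.norm(A)
```

Tracked over training iterations, this quantity decreasing while test loss is still flat is the kind of hidden progress the tweet describes.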