Daniel Beaglehole

229 posts


@dbeagleholeCS

ML PhD @UCSanDiego. Deep learning from first principles (+ applications) Topics include xRFM, AGOP, Colonel Blotto

Joined August 2021
552 Following · 411 Followers
Daniel Beaglehole retweeted
Shubhendu Trivedi @_onionesque
The idea of using locality has appeared from time to time in the multi-index literature. Here is a nice operationalization, using a "local expected gradient outer product (EGOP)"-based continuously varying index subspace. arxiv.org/abs/2601.07061
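
A minimal sketch of what a local EGOP estimate could look like, assuming the standard definition E[∇f(x)∇f(x)ᵀ] made local with kernel weights around a query point; the Gaussian weighting, bandwidth, and function names here are illustrative, not taken from the linked paper.

```python
import numpy as np

def local_egop(grads, X, x0, bandwidth=1.0, k=2):
    """Sketch of a local expected gradient outer product (EGOP) estimate.

    grads : (n, d) gradients of the prediction function at the samples X
    X     : (n, d) samples
    x0    : (d,) query point at which the local index subspace is estimated
    Returns the top-k eigenvectors of the locally weighted EGOP.
    """
    # Gaussian weights centered at the query point (illustrative choice)
    w = np.exp(-np.sum((X - x0) ** 2, axis=1) / (2 * bandwidth ** 2))
    w /= w.sum()
    # locally weighted average of gradient outer products: sum_i w_i g_i g_i^T
    M = (grads * w[:, None]).T @ grads
    eigvals, eigvecs = np.linalg.eigh(M)            # ascending order
    return eigvecs[:, ::-1][:, :k]                  # top-k directions = local index subspace
```
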
Damek @damekdavis
New paper studies when spectral gradient methods (e.g., Muon) help in deep learning: 1. We identify a pervasive form of ill-conditioning in DL: post-activation matrices have low stable rank. 2. We then explain why spectral methods can perform well despite this. Long thread
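
The "low stable rank" observation is easy to probe numerically. A minimal sketch, assuming stable rank means the usual ‖A‖_F² / ‖A‖₂²:

```python
import numpy as np

def stable_rank(A):
    """Stable rank ||A||_F^2 / ||A||_2^2: close to 1 for nearly rank-one matrices,
    up to min(A.shape) for well-conditioned ones."""
    fro2 = np.linalg.norm(A, "fro") ** 2
    spec = np.linalg.norm(A, 2)        # largest singular value
    return fro2 / spec ** 2

# e.g., collect a post-activation matrix H (n_samples x width) from a layer and
# inspect stable_rank(H); a value much smaller than the width indicates the
# ill-conditioning described above.
```
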
Daniel Beaglehole @dbeagleholeCS
@damekdavis Seems really satisfying! Does the intuition carry over to Adam? Also have you tried training kernel ridge regression?
Damek @damekdavis
# Improving training by starting SpecGD later

By the way, knowing this nr/st speedup condition suggests a natural idea: run a few steps of gradient descent before switching to SpecGD. In experiments, nr(G) along the GD trajectory is usually large. Testing this, starting SpecGD from iteration 3 or from the peak nr along GD does help.

In random feature models, we can actually justify this strategy. Indeed, we show that although nr(G) can be O(1) at the first step of gradient descent, it becomes Θ(d) after a single step. Moreover, it stays that large for at least Θ(d) steps (as is visible in the figure, where d = 200). This means we can indeed use the two-step strategy:
1. Take gradient descent steps until nr(G) gets large.
2. Once it's large, take a SpecGD (or Muon-style) step and enjoy an Ω(d) speedup over GD.

So the nr > st condition seems useful in random feature models. The question is how to generalize it to MLPs and transformers.
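
A minimal sketch of the two-step strategy described in the thread, under two assumptions that are not spelled out here: that nr(G) denotes the nuclear-to-spectral norm ratio ‖G‖_* / ‖G‖₂, and that the SpecGD step replaces the gradient by UVᵀ from its SVD (as in Muon). The threshold, learning rate, and function names are illustrative.

```python
import numpy as np

def nr(G):
    # rank-like ratio of the gradient matrix; assumed here to be ||G||_* / ||G||_2
    s = np.linalg.svd(G, compute_uv=False)
    return s.sum() / s[0]

def spectral_step(G):
    # Muon-style direction: drop singular values, keep singular vectors (G -> U V^T)
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

def train(W, grad_fn, lr=0.1, steps=100, threshold=10.0):
    """Two-phase sketch: plain GD until nr(G) exceeds a threshold, then SpecGD."""
    use_spectral = False
    for _ in range(steps):
        G = grad_fn(W)
        if not use_spectral and nr(G) > threshold:
            use_spectral = True        # switch once the gradient is effectively high-rank
        W = W - lr * (spectral_step(G) if use_spectral else G)
    return W
```
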
Daniel Beaglehole @dbeagleholeCS
Really nice to see independent validation of xRFM on tabular data. On a benchmark of 300 datasets, xRFM outperforms all gradient boosted trees (XGBoost, CatBoost, LightGBM, etc.) and is in line with the top 1-2 neural networks (such as TabPFNv2). arxiv.org/abs/2407.00956
Daniel Beaglehole @dbeagleholeCS
Our method combines tree-based spatial partitions with feature-learning kernel machines (tabular-optimized Recursive Feature Machine models, arxiv.org/abs/2508.10053).
Damek @damekdavis
Update: we were able to close the gap between neural networks and reweighted kernel methods on sparse hierarchical functions with hypercube data. Interestingly the kernel methods outperform carefully tuned networks in our tests.
Damek @damekdavis

we wrote a paper about learning 'sparse' and 'hierarchical' functions with data-dependent kernel methods. you just 'iteratively reweight' the coordinates by the gradients of the prediction function. typically 5 iterations suffice.

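
A minimal sketch of the "iteratively reweight the coordinates by the gradients of the prediction function" recipe quoted above, using a diagonally weighted Gaussian kernel and kernel ridge regression; the kernel choice, ridge value, and normalization are illustrative stand-ins, not the paper's exact procedure.

```python
import numpy as np

def fit_reweighted_krr(X, y, iters=5, sigma=1.0, ridge=1e-3):
    """Iteratively reweighted kernel ridge regression (RFM-style, diagonal AGOP).

    Fit KRR with per-coordinate weights, set the weights from the average squared
    gradient of the fitted predictor, and repeat for a few rounds.
    """
    n, d = X.shape
    w = np.ones(d)
    for _ in range(iters):
        # weighted Gaussian kernel: K_ij = exp(-sum_k w_k (x_ik - x_jk)^2 / (2 sigma^2))
        D = ((X[:, None, :] - X[None, :, :]) ** 2 * w).sum(-1)
        K = np.exp(-D / (2 * sigma ** 2))
        alpha = np.linalg.solve(K + ridge * np.eye(n), y)

        # gradient of f(x) = sum_i alpha_i k(x, x_i), evaluated at each training point
        diff = X[:, None, :] - X[None, :, :]                       # (n, n, d)
        grads = (-(K * alpha[None, :])[:, :, None] * diff * w / sigma ** 2).sum(axis=1)

        # reweight coordinates by their average squared gradient (AGOP diagonal)
        w = (grads ** 2).mean(axis=0)
        w /= w.sum() + 1e-12
        w *= d                     # keep the scale comparable to the unweighted kernel
    return w                       # learned coordinate weights; refit KRR with them for the final predictor
```
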
Daniel Beaglehole @dbeagleholeCS
@damekdavis I'd be curious to see if other kernels from our xRFM paper can help btw. The generalized $L_p^q$ kernels like $k(x,z) = \exp(-\|x-z\|_p^q)$ for $0 < p < q$ improve performance quite a bit. I suspect they help learning sparse coordinates as they break rotational invariance.
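
For concreteness, a minimal sketch of the generalized $L_p^q$ kernel from the tweet; the bandwidth parameter and the dense pairwise computation are illustrative additions.

```python
import numpy as np

def lpq_kernel(X, Z, p, q, bandwidth=1.0):
    """Generalized L_p^q kernel matrix: K_ij = exp(-(||x_i - z_j||_p ** q) / bandwidth).

    Because distances are built coordinate-wise from an L_p norm, the kernel is not
    rotationally invariant, unlike the Gaussian kernel.
    """
    diff = np.abs(X[:, None, :] - Z[None, :, :]) ** p      # (n, m, d) coordinate-wise terms
    dist_p = diff.sum(axis=-1) ** (1.0 / p)                 # (n, m) pairwise p-norm distances
    return np.exp(-(dist_p ** q) / bandwidth)
```
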
Edward Milsom @edward_milsom
Excited to announce I'll be starting this September 2025 as a Lecturer (Assistant Professor) at the University of Bath! I will continue my research on deep learning foundations, and am open to ideas for collaborations. (Pictured: Bath. Not pictured: University of Bath)
Daniel Beaglehole @dbeagleholeCS
Our method combines RFMs with tree-based splits to achieve log-linear scaling in the number of samples (basically linear, but we use a median computation at every tree node).
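
A minimal sketch of why median splits give log-linear scaling: splitting at the median keeps the two halves balanced, so the tree has depth O(log n) and each level touches every sample once, while the kernel model (e.g., an RFM) is only fit on small leaves. The split direction, leaf-size cutoff, and leaf model here are illustrative, not xRFM's exact choices.

```python
import numpy as np

def build_tree(X, y, max_leaf=2000, fit_leaf=None):
    """Recursive median-split tree; each leaf holds a model fit on at most max_leaf points."""
    if len(X) <= max_leaf:
        return {"leaf": True, "model": fit_leaf(X, y) if fit_leaf else None}
    j = np.argmax(X.var(axis=0))          # split along the highest-variance coordinate
    t = np.median(X[:, j])                # median split keeps the two halves balanced
    left = X[:, j] <= t
    if left.all() or not left.any():      # degenerate split: stop and fit a leaf
        return {"leaf": True, "model": fit_leaf(X, y) if fit_leaf else None}
    return {
        "leaf": False, "dim": j, "thresh": t,
        "lo": build_tree(X[left], y[left], max_leaf, fit_leaf),
        "hi": build_tree(X[~left], y[~left], max_leaf, fit_leaf),
    }
```
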
Daniel Beaglehole retweeted
Stat.ML Papers @StatMLPapers
xRFM: Accurate, scalable, and interpretable feature learning models for tabular data ift.tt/P3pDd9f
vik @vikhyatk
ok you guys have convinced me that pip is bad. trying out conda
Daniel Beaglehole retweeted
Neil Mallinar @nmallinar
Super excited to share that we have an Oral presentation for this paper next week at ICML! It will be on Tuesday at 10am (Oral 1E) in West Ballroom D; I'll be presenting 4th at 10:45am :) Our poster will be on Wednesday at 11am and I encourage you to stop by and chat!
Daniel Beaglehole retweeted
Weixuan Wang @WeixuanWang66
🚨 What if you could hijack any LLM's brain using external expert models? ExpertSteer does exactly that! 🧠⚡ Meet ExpertSteer: a breakthrough that lets you inject expert knowledge into any LLM, guiding its responses without updating model parameters.