Calc Consulting

12.5K posts

Calc Consulting

@CalcCon

Calculation Consulting is a boutique consultancy that specializes in machine learning, AI, and data science

San Francisco · Joined January 2013
3.2K Following · 4.3K Followers
Zhepei Wei @weizhepei
💡Finding 1: RLVR weight updates are extremely low-rank. A simple rank-1 SVD approximation of the weight deltas already recovers most RLVR performance across training. 🧵[3/n]
[image]
2 replies · 1 repost · 5 likes · 438 views
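A minimal sketch of that rank-1 reconstruction, assuming the base and RLVR-trained weights are available as numpy arrays (my own illustration, not the paper's code):

```python
import numpy as np

def rank1_update(W_base, W_rlvr):
    """Approximate an RLVR checkpoint by keeping only the top
    singular direction of the weight delta: W ~ W_base + s1 * u1 v1^T."""
    delta = W_rlvr - W_base
    U, S, Vt = np.linalg.svd(delta, full_matrices=False)
    return W_base + S[0] * np.outer(U[:, 0], Vt[0])

# If the finding holds, one direction carries most of the delta's energy:
# norm(rank1_update(Wb, Wr) - Wr) << norm(Wb - Wr)
```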
Zhepei Wei @weizhepei
😢RLVR is powerful but expensive. 🤯Imagine using <20% of the RLVR training while achieving 100% of the performance. Sounds surprising? We show that minimal RLVR training is enough to know where training is going, and to predict future checkpoints at no training cost! 📃tinyurl.com/minimal-rlvr 🧵[1/n]
[image]
3 replies · 26 reposts · 162 likes · 12.9K views
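One way to read "predict future checkpoints at no training cost" (my interpretation only; the paper's actual procedure may differ): if updates stay in a fixed low-rank direction, a later checkpoint can be extrapolated by rescaling the early delta:

```python
import numpy as np

def extrapolate_checkpoint(W0, W_early, scale):
    """Hypothetical extrapolation: push further along the dominant
    (low-rank) direction established by early RLVR training."""
    delta = W_early - W0
    U, S, Vt = np.linalg.svd(delta, full_matrices=False)
    # keep only the leading singular direction and rescale it
    return W0 + scale * S[0] * np.outer(U[:, 0], Vt[0])
```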
Calc Consulting reposted
Deep philosophy @DeepPhilo_HQ
[image]
5 replies · 44 reposts · 222 likes · 2.9K views
BURKOV @burkov
@CalcCon Check the arXiv page: the license isn't CC, so you granted only arXiv the license to host your paper. I, for example, cannot host it on ChapterPal.
3 replies · 0 reposts · 0 likes · 52 views
Calc Consulting @CalcCon
Detecting overfitting in Neural Networks during long-horizon grokking using Random Matrix Theory. Hari K. Prakash, Charles H. Martin. arxiv.org/abs/2605.12394
15 replies · 6 reposts · 76 likes · 23.2K views
BURKOV @burkov
@CalcCon Not a CC license -> impossible to copy and host elsewhere.
2 replies · 0 reposts · 0 likes · 130 views
naokiss @naokiss
Which album are you going with? 🅰️ Animalize 🅱️ Asylum
[image]
28 replies · 2 reposts · 30 likes · 1.1K views
Calc Consulting @CalcCon
Finally, once you understand all of this, you will see that it is straightforward to show that the layers of a NN can readily be described using the Wilsonian Renormalization Group, via a simple power-counting argument. And you can even offer a reason for why Muon works. In our next paper…
0 replies · 0 reposts · 1 like · 100 views
Calc Consulting @CalcCon
Finally, this helps explain why WeightWatcher α ≈ 2 is often associated with optimal performance. α ≈ 2 marks the critical boundary where layers are strongly correlated enough to encode useful structure, but not so heavy-tailed that they become dominated by unstable, non-self-averaging fluctuations.
[image]
0 replies · 0 reposts · 3 likes · 114 views
Calc Consulting @CalcCon
This is the same reason that weightwatcher α < 2 is a signal of potential overfitting! Let 𝐗 be the covariance matrix of the layer weight matrix 𝐖: 𝐗 = (1/N)𝐖ᵀ𝐖. Diagonalize it, and fit the tail of the eigenvalue spectrum to a power law with exponent α. If α < 2, the fitted spectral tail is in the infinite-variance regime. This suggests that the layer may be dominated by extreme, non-self-averaging correlations rather than stable, well-distributed learned structure.
0 replies · 0 reposts · 4 likes · 101 views
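A minimal sketch of this check, assuming a numpy weight matrix and using a simple Hill-style tail estimator (illustrative only; the weightwatcher tool fits the power-law tail more carefully):

```python
import numpy as np

def tail_alpha(W, k_frac=0.1):
    """Estimate the power-law tail exponent alpha of the eigenvalue
    spectrum of X = (1/N) W^T W from the top k eigenvalues."""
    N, M = W.shape
    X = (W.T @ W) / N                      # layer correlation matrix
    evals = np.linalg.eigvalsh(X)          # ascending eigenvalues
    evals = evals[evals > 1e-12]
    k = max(2, int(k_frac * len(evals)))   # size of the spectral tail
    tail = np.sort(evals)[-k:]
    # continuous power-law MLE: alpha = 1 + k / sum(log(lambda / lambda_min))
    return 1.0 + k / np.sum(np.log(tail / tail[0]))

# Columns scaled by heavy-tailed factors push alpha below 2:
W = np.random.randn(1024, 512) @ np.diag(1 + np.random.pareto(1.5, 512))
print(tail_alpha(W))
```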
Calc Consulting @CalcCon
While everyone is thinking about how to bound the generalization error, they never ask: what does it mean to be unbounded? Traps cause overfitting because they have non-vanishing variance. And it is impossible to form a bound of any kind on a system if the variance does not vanish as the system grows.
[image]
0 replies · 0 reposts · 6 likes · 115 views
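A toy numpy illustration of the distinction (my own example, not from the thread): for a light-tailed observable, the variance of the sample mean vanishes as the system grows (self-averaging); for an infinite-variance power law with tail exponent below 2, it does not:

```python
import numpy as np

rng = np.random.default_rng(0)

def var_of_mean(sampler, n, trials=1000):
    """Variance of the sample mean across many independent systems of size n."""
    return np.var([sampler(n).mean() for _ in range(trials)])

for n in (100, 1000, 10000):
    light = var_of_mean(lambda m: rng.normal(size=m), n)
    heavy = var_of_mean(lambda m: rng.pareto(1.5, size=m), n)  # infinite variance
    print(f"n={n:6d}  gaussian={light:.2e}  pareto(1.5)={heavy:.2e}")
# The gaussian column shrinks like 1/n; the heavy-tailed column does not settle.
```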
Calc Consulting @CalcCon
More broadly, we find that some foundation-scale LLMs, like the OpenAI GPT OSS 20b and 120b, have an unusual number of Correlation Traps. They may be inadvertently overfitting their training data in potentially harmful ways.
[image]
0 replies · 0 reposts · 3 likes · 89 views
Calc Consulting @CalcCon
Traps can be harmful or benign. They can be distinguished using a simple Jensen-Shannon Divergence Ablation Test. The test is easy to run (a sketch of steps 2 and 3 follows below):
1) Remove the trap from the model and replace it with a random vector
2) Pass random inputs through the 'ablated' model and the original model
3) Measure the Jensen-Shannon Divergence (JSD) between the output logits of both models
The classification is:
• Harmful trap: replacement changes the logits and improves or hurts the test accuracy.
• Benign trap: replacement has a negligible effect on the test accuracy.
[image]
0 replies · 0 reposts · 3 likes · 65 views
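A minimal sketch of steps 2 and 3, assuming a HuggingFace-style causal LM whose forward pass returns .logits, and an ablated copy prepared beforehand by whatever trap-replacement step 1 uses (hypothetical here):

```python
import torch
import torch.nn.functional as F
from scipy.spatial.distance import jensenshannon  # returns sqrt(JSD)

@torch.no_grad()
def ablation_jsd(model, ablated_model, vocab_size, n_samples=64, seq_len=32):
    """Mean Jensen-Shannon divergence between the next-token logits of
    the original and trap-ablated models on random token inputs."""
    x = torch.randint(0, vocab_size, (n_samples, seq_len))
    p = F.softmax(model(x).logits[:, -1, :], dim=-1)          # original
    q = F.softmax(ablated_model(x).logits[:, -1, :], dim=-1)  # ablated
    # jensenshannon works column-wise on probability vectors (scipy >= 1.7)
    return (jensenshannon(p.numpy().T, q.numpy().T) ** 2).mean()
```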
Calc Consulting @CalcCon
Traps tend to be either highly localized or highly delocalized. These are analogous to phenomena that arise in the physics of phase transitions: a localized trap is like Anderson localization, while a delocalized trap is like a Bose–Einstein condensate, and is also very similar to what happens in the Curie–Weiss mean-field model of magnetization.
0 replies · 0 reposts · 3 likes · 63 views
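One standard way to quantify localized vs. delocalized (my addition; the thread doesn't name a measure) is the inverse participation ratio of the trap's eigenvector, which is about 1/N for delocalized modes and O(1) for localized ones:

```python
import numpy as np

def inverse_participation_ratio(v):
    """IPR of an eigenvector: sum_i |v_i|^4 after normalization.
    ~1/N => weight spread over all components (delocalized);
    ~O(1) => weight concentrated on a few components (localized)."""
    v = v / np.linalg.norm(v)
    return np.sum(np.abs(v) ** 4)

N = 1000
localized = np.zeros(N); localized[:3] = 1.0     # spike on three coordinates
delocalized = np.ones(N)                         # uniform spread
print(inverse_participation_ratio(localized))    # ~0.33
print(inverse_participation_ratio(delocalized))  # 1/N = 0.001
```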
Calc Consulting @CalcCon
But why do Correlation Traps cause the test accuracy to drop? The MP distribution is the self-averaging baseline. Traps violate the baseline; they are non-self-averaging. This tends to make the test error non-self-averaging as well. That is, the test error cannot be bounded.
[image]
0 replies · 0 reposts · 3 likes · 64 views
Calc Consulting @CalcCon
What is a Correlation Trap? And why do they cause the test accuracy to drop? If we randomize a layer weight matrix elementwise, W → rand(W), the matrix should now behave as if its elements are i.i.d. In particular, the eigenvalues of rand(W) should obey the Marchenko-Pastur (MP) law of Random Matrix Theory (RMT), to within finite-size Tracy-Widom (TW) fluctuations. A Correlation Trap is a large eigenvalue appearing well beyond the right edge of the MP+TW baseline.
[image]
0 replies · 0 reposts · 3 likes · 84 views
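A minimal numpy sketch of this test (my own construction; a fixed buffer stands in for the Tracy-Widom fluctuation scale):

```python
import numpy as np

def correlation_trap_eigenvalues(W, buffer=1.10):
    """Flag eigenvalues of X = (1/N) W^T W beyond the Marchenko-Pastur
    bulk edge lambda_+ = sigma^2 (1 + sqrt(Q))^2, with Q = M/N."""
    N, M = W.shape
    sigma2 = np.var(W)                            # elementwise variance
    lam_plus = sigma2 * (1 + np.sqrt(M / N)) ** 2
    evals = np.linalg.eigvalsh((W.T @ W) / N)
    return evals[evals > buffer * lam_plus]

# Random matrix plus a planted rank-1 correlation (a synthetic "trap"):
rng = np.random.default_rng(0)
W = rng.normal(size=(1000, 500))
W += 0.15 * np.outer(rng.normal(size=1000), rng.normal(size=500))
print(correlation_trap_eigenvalues(W))            # one eigenvalue past the edge
```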
Calc Consulting @CalcCon
Detecting Correlation Traps is straightforward; the weightwatcher tool does it for you automatically.
[image]
0 replies · 0 reposts · 3 likes · 81 views
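For instance, a minimal usage sketch of the open-source weightwatcher tool (the randomize option analyzes each layer's elementwise-randomized baseline, which is what exposes Correlation Traps; exact DataFrame columns may vary by version):

```python
import weightwatcher as ww
import torchvision.models as models

model = models.resnet18(weights="IMAGENET1K_V1")
watcher = ww.WeightWatcher(model=model)

# randomize=True also builds the rand(W) spectrum for each layer,
# flagging spikes beyond the randomized MP baseline
details = watcher.analyze(randomize=True)

print(details[["layer_id", "alpha"]])  # alpha < 2 flags potential overfitting
print(watcher.get_summary(details))
```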