Calc Consulting

12.5K posts

Calc Consulting

@CalcCon

Calculation Consulting is a boutique consultancy that specializes in machine learning, AI, and data science

San Francisco · Joined January 2013
3.2K Following · 4.3K Followers
Zhepei Wei @weizhepei
💡Finding 1: RLVR weight updates are extremely low-rank. A simple rank-1 SVD approximation of the weight deltas already recovers most RLVR performance across training. 🧵[3/n]
[image]
2 replies · 1 repost · 5 likes · 438 views
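A minimal sketch of that rank-1 reconstruction, assuming the base and RLVR-trained weights are available as numpy arrays (my own illustration, not the paper's code):

```python
import numpy as np

def rank1_update(W_base, W_rlvr):
    """Approximate an RLVR checkpoint by keeping only the top
    singular direction of the weight delta: W ~ W_base + s1 * u1 v1^T."""
    delta = W_rlvr - W_base
    U, S, Vt = np.linalg.svd(delta, full_matrices=False)
    return W_base + S[0] * np.outer(U[:, 0], Vt[0])

# If the finding holds, one direction carries most of the delta's energy:
# norm(rank1_update(Wb, Wr) - Wr) << norm(Wb - Wr)
```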
Zhepei Wei @weizhepei
😢RLVR is powerful but expensive. 🤯Imagine using <20% of the RLVR training while achieving 100% of the performance. Sounds surprising? We show that minimal RLVR training is enough to know where training is going, and to predict future checkpoints at no training cost! 📃tinyurl.com/minimal-rlvr 🧵[1/n]
[image]
3 replies · 26 reposts · 162 likes · 12.9K views
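One way to read "predict future checkpoints at no training cost" (my interpretation only; the paper's actual procedure may differ): if updates stay in a fixed low-rank direction, a later checkpoint can be extrapolated by rescaling the early delta:

```python
import numpy as np

def extrapolate_checkpoint(W0, W_early, scale):
    """Hypothetical extrapolation: push further along the dominant
    (low-rank) direction established by early RLVR training."""
    delta = W_early - W0
    U, S, Vt = np.linalg.svd(delta, full_matrices=False)
    # keep only the leading singular direction and rescale it
    return W0 + scale * S[0] * np.outer(U[:, 0], Vt[0])
```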
Calc Consulting reposted
Deep philosophy @DeepPhilo_HQ
[image]
5 replies · 44 reposts · 222 likes · 2.9K views
BURKOV @burkov
@CalcCon Check the arXiv page: the license isn't CC, so you granted only arXiv the license to host your paper. I, for example, cannot host it on ChapterPal.
3 replies · 0 reposts · 0 likes · 52 views
Calc Consulting @CalcCon
Detecting overfitting in Neural Networks during long-horizon grokking using Random Matrix Theory. Hari K. Prakash, Charles H. Martin. arxiv.org/abs/2605.12394
15 replies · 6 reposts · 76 likes · 23.2K views
BURKOV @burkov
@CalcCon Not a CC license -> impossible to copy and host elsewhere.
2 replies · 0 reposts · 0 likes · 130 views
naokiss @naokiss
Which album are you going with? 🅰️ Animalize 🅱️ Asylum
[image]
28 replies · 2 reposts · 30 likes · 1.1K views
Calc Consulting @CalcCon
Finally, once you understand all of this, you will see that it is straightforward to show that the layers of a NN can readily be described using the Wilsonian Renormalization Group, via a simple power-counting argument. And you can even offer a reason for why Muon works. In our next paper…
0 replies · 0 reposts · 1 like · 100 views
Calc Consulting @CalcCon
Finally, this helps explain why WeightWatcher α ≈ 2 is often associated with optimal performance. α ≈ 2 marks the critical boundary where layers are strongly correlated enough to encode useful structure, but not so heavy-tailed that they become dominated by unstable, non-self-averaging fluctuations.
[image]
0 replies · 0 reposts · 3 likes · 114 views
Calc Consulting @CalcCon
This is the same reason that weightwatcher α < 2 is a signal of potential overfitting! Let 𝐗 be the covariance matrix of the layer weight matrix 𝐖: 𝐗 = (1/N)𝐖ᵀ𝐖. Diagonalize it, and fit the tail of the eigenvalue spectrum to a power law with exponent α. If α < 2, the fitted spectral tail is in the infinite-variance regime. This suggests that the layer may be dominated by extreme, non-self-averaging correlations rather than stable, well-distributed learned structure.
0 replies · 0 reposts · 4 likes · 101 views
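A minimal sketch of this check, assuming a numpy weight matrix and using a simple Hill-style tail estimator (illustrative only; the weightwatcher tool fits the power-law tail more carefully):

```python
import numpy as np

def tail_alpha(W, k_frac=0.1):
    """Estimate the power-law tail exponent alpha of the eigenvalue
    spectrum of X = (1/N) W^T W from the top k eigenvalues."""
    N, M = W.shape
    X = (W.T @ W) / N                      # layer correlation matrix
    evals = np.linalg.eigvalsh(X)          # ascending eigenvalues
    evals = evals[evals > 1e-12]
    k = max(2, int(k_frac * len(evals)))   # size of the spectral tail
    tail = np.sort(evals)[-k:]
    # continuous power-law MLE: alpha = 1 + k / sum(log(lambda / lambda_min))
    return 1.0 + k / np.sum(np.log(tail / tail[0]))

# Columns scaled by heavy-tailed factors push alpha below 2:
W = np.random.randn(1024, 512) @ np.diag(1 + np.random.pareto(1.5, 512))
print(tail_alpha(W))
```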
Calc Consulting @CalcCon
While everyone is thinking about how to bound the generalization error, they never ask: what does it mean to be unbounded? Traps cause overfitting because they have non-vanishing variance. And it is impossible to form a bound of any kind on a system if the variance does not vanish as the system grows.
[image]
0 replies · 0 reposts · 6 likes · 115 views
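A toy numpy illustration of the distinction (my own example, not from the thread): for a light-tailed observable, the variance of the sample mean vanishes as the system grows (self-averaging); for an infinite-variance power law with tail exponent below 2, it does not:

```python
import numpy as np

rng = np.random.default_rng(0)

def var_of_mean(sampler, n, trials=1000):
    """Variance of the sample mean across many independent systems of size n."""
    return np.var([sampler(n).mean() for _ in range(trials)])

for n in (100, 1000, 10000):
    light = var_of_mean(lambda m: rng.normal(size=m), n)
    heavy = var_of_mean(lambda m: rng.pareto(1.5, size=m), n)  # infinite variance
    print(f"n={n:6d}  gaussian={light:.2e}  pareto(1.5)={heavy:.2e}")
# The gaussian column shrinks like 1/n; the heavy-tailed column does not settle.
```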
Calc Consulting @CalcCon
More broadly, we find that some foundation-scale LLMs, like the OpenAI GPT OSS 20b and 120b, have an unusual number of Correlation Traps. They may be inadvertently overfitting their training data in potentially harmful ways.
[image]
0 replies · 0 reposts · 3 likes · 89 views
Calc Consulting @CalcCon
Traps can be harmful or benign. They can be distinguished using a simple Jensen-Shannon Divergence Ablation Test. The test is easy to run (a sketch of steps 2 and 3 follows below):
1) Remove the trap from the model and replace it with a random vector
2) Pass random inputs through the 'ablated' model and the original model
3) Measure the Jensen-Shannon Divergence (JSD) between the output logits of both models
The classification is:
• Harmful trap: replacement changes the logits and improves or hurts the test accuracy.
• Benign trap: replacement has a negligible effect on the test accuracy.
[image]
0 replies · 0 reposts · 3 likes · 65 views
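A minimal sketch of steps 2 and 3, assuming a HuggingFace-style causal LM whose forward pass returns .logits, and an ablated copy prepared beforehand by whatever trap-replacement step 1 uses (hypothetical here):

```python
import torch
import torch.nn.functional as F
from scipy.spatial.distance import jensenshannon  # returns sqrt(JSD)

@torch.no_grad()
def ablation_jsd(model, ablated_model, vocab_size, n_samples=64, seq_len=32):
    """Mean Jensen-Shannon divergence between the next-token logits of
    the original and trap-ablated models on random token inputs."""
    x = torch.randint(0, vocab_size, (n_samples, seq_len))
    p = F.softmax(model(x).logits[:, -1, :], dim=-1)          # original
    q = F.softmax(ablated_model(x).logits[:, -1, :], dim=-1)  # ablated
    # jensenshannon works column-wise on probability vectors (scipy >= 1.7)
    return (jensenshannon(p.numpy().T, q.numpy().T) ** 2).mean()
```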
Calc Consulting @CalcCon
Traps tend to be either highly localized or highly delocalized. These are analogous to phenomena that arise in the physics of phase transitions: a localized trap is like Anderson localization, while a delocalized trap is like a Bose–Einstein condensate, and is also very similar to what happens in the Curie–Weiss mean-field model of magnetization.
0 replies · 0 reposts · 3 likes · 63 views
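One standard way to quantify localized vs. delocalized (my addition; the thread doesn't name a measure) is the inverse participation ratio of the trap's eigenvector, which is about 1/N for delocalized modes and O(1) for localized ones:

```python
import numpy as np

def inverse_participation_ratio(v):
    """IPR of an eigenvector: sum_i |v_i|^4 after normalization.
    ~1/N => weight spread over all components (delocalized);
    ~O(1) => weight concentrated on a few components (localized)."""
    v = v / np.linalg.norm(v)
    return np.sum(np.abs(v) ** 4)

N = 1000
localized = np.zeros(N); localized[:3] = 1.0     # spike on three coordinates
delocalized = np.ones(N)                         # uniform spread
print(inverse_participation_ratio(localized))    # ~0.33
print(inverse_participation_ratio(delocalized))  # 1/N = 0.001
```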
Calc Consulting @CalcCon
But why do Correlation Traps cause the test accuracy to drop? The MP distribution is the self-averaging baseline. Traps violate the baseline; they are non-self-averaging. This tends to make the test error non-self-averaging as well. That is, the test error cannot be bounded.
[image]
0 replies · 0 reposts · 3 likes · 64 views
Calc Consulting @CalcCon
What is a Correlation Trap? And why do they cause the test accuracy to drop? If we randomize a layer weight matrix elementwise, W → rand(W), the matrix should now behave as if its elements are i.i.d. In particular, the eigenvalues of rand(W) should obey the Marchenko-Pastur (MP) law of Random Matrix Theory (RMT), to within finite-size Tracy-Widom (TW) fluctuations. A Correlation Trap is a large eigenvalue appearing well beyond the right edge of the MP+TW baseline.
[image]
0 replies · 0 reposts · 3 likes · 84 views
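A minimal numpy sketch of this test (my own construction; a fixed buffer stands in for the Tracy-Widom fluctuation scale):

```python
import numpy as np

def correlation_trap_eigenvalues(W, buffer=1.10):
    """Flag eigenvalues of X = (1/N) W^T W beyond the Marchenko-Pastur
    bulk edge lambda_+ = sigma^2 (1 + sqrt(Q))^2, with Q = M/N."""
    N, M = W.shape
    sigma2 = np.var(W)                            # elementwise variance
    lam_plus = sigma2 * (1 + np.sqrt(M / N)) ** 2
    evals = np.linalg.eigvalsh((W.T @ W) / N)
    return evals[evals > buffer * lam_plus]

# Random matrix plus a planted rank-1 correlation (a synthetic "trap"):
rng = np.random.default_rng(0)
W = rng.normal(size=(1000, 500))
W += 0.15 * np.outer(rng.normal(size=1000), rng.normal(size=500))
print(correlation_trap_eigenvalues(W))            # one eigenvalue past the edge
```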
Calc Consulting @CalcCon
Detecting Correlation Traps is straightforward; the weightwatcher tool does it for you automatically.
[image]
0 replies · 0 reposts · 3 likes · 81 views
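For instance, a minimal usage sketch of the open-source weightwatcher tool (the randomize option analyzes each layer's elementwise-randomized baseline, which is what exposes Correlation Traps; exact DataFrame columns may vary by version):

```python
import weightwatcher as ww
import torchvision.models as models

model = models.resnet18(weights="IMAGENET1K_V1")
watcher = ww.WeightWatcher(model=model)

# randomize=True also builds the rand(W) spectrum for each layer,
# flagging spikes beyond the randomized MP baseline
details = watcher.analyze(randomize=True)

print(details[["layer_id", "alpha"]])  # alpha < 2 flags potential overfitting
print(watcher.get_summary(details))
```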