Shuming Hu

1.6K posts

Shuming Hu

@ShumingHu

distinguished machine not learning engineer

Oakland, CA Katılım Temmuz 2012

937 Takip Edilen486 Takipçiler

Shuming Hu@ShumingHu·21h

That's a cool technique but not super relavent for long-context capability we eventually want, as using a short-horizon teacher model for local correction will not be able to take into account long context for few-shot learning, CoT etc which requires context outside tearcher model's horizon. 1k tokens sounds very little for videos but they can demo ICL for text, which is exactly the point. However, reading self-forcing make realize I could be wrong here: average youtube video length is around 5min but seldom is a single long-shot. It could be data bottlenecked here rather than due to archetecture difficiency. I heard video simulation for minecraft env in dreamerv4 can roll out 20mins w/o any issue.

English

115

Andreas Köpf@neurosp1ke·23h

@ShumingHu GPT2 has 1k tokens context length. Video gen via self-forcing is >4min possible, e.g. see huggingface.co/papers/2510.02…

English

311

Andreas Köpf@neurosp1ke·1d

Disagree, you cannot call LLMs anti-bitter lesson because the sequence being modeled is language (=prior knowledge). There is a difference between a clever method and a clever target distribution. Also declaring next-token prediction as „not self-supervised“ because „tokens are labels“ is strange (then also all targets in a loss function are labels) - only because teacher-forcing is used doesn’t mean it is no longer autoassociative self-supervised learning.

Hugo@robonaissance

x.com/i/article/2036…

English

2.9K

Shuming Hu@ShumingHu·1d

I don't think LLM-style AR decoding is very general across all on modality/tasks today. Today's video predictions are probably using models and data much bigger than GPT2. Yet their AR roll out quite often degrade to noise after certain length. However, GPT2 trained on 10B tokens can be rolled out almost infinitely without seeing random/garbage tokens as output (looping/repetive output still have some structure). I suspect there's specific fit for token information desntity to AR transformers used today.

English

196

Andreas Köpf@neurosp1ke·1d

@ShumingHu There is probably always a generalist-specialist trade-off (no free lunch). For transformers we know that they can model all kinds of signals, video, audio, stock-market, robot-actions etc. And it has non-trivial representations internally - it is a smart 'stochastic parrot'.

English

291

Shuming Hu@ShumingHu·2d

@prime_cai My personal experience is also that no advantage from TP.

English

452

Prime (Shengqu) Cai@prime_cai·2d

CP + EP for the win. Am I right to say the community is generally and gradually dropping TP?

Cursor@cursor_ai

We're releasing a technical report describing how Composer 2 was trained.

English

9.7K

Shuming Hu@ShumingHu·20 Mar

no worries :) current thinking: we are doing NTP which are conditional probability distributions(CPD). but ultimately it's modeling the the distribution of sequences, which is not very sensitive to tokenizer changes. Better tokenizer just makes the factoring into CPD more efficient so easier to model sequences. The distribution over sequences are the true data distribtion we are trying to learn, which won't conflict with zipfian grokking.

English

Jasper Gilley@0xjasper·20 Mar

@ShumingHu I think that's what I interpreted you as saying anyway lol, but makes sense

English

293

Shuming Hu@ShumingHu·13 Mar

" This is the structure of a **relaxation oscillator** — slow recovery (exponential decay under weight decay, ~10^4 epochs) punctuated by fast collapse events (exponential growth under positive feedback, ~50 epochs). The model can repeatedly *find* the correct solution but never *maintain* it. " My injuries and recoveries in a nutshell.

Jasper Gilley@0xjasper

x.com/i/article/2031…

English

827

Shuming Hu@ShumingHu·20 Mar

@0xjasper On 2nd thought, a better tokenizer only increases zipfian exponent for token distribution (unigram) but probably not any n-gram.

English

Jasper Gilley@0xjasper·19 Mar

@ShumingHu Oh woah, that’s a great point! Seems true to me I’ll have to think about what the implications might be

English

321

Shuming Hu@ShumingHu·20 Mar

@SemiAnalysis_ How dare you drag our Oakland queen American sweetheart into your nerd fight of a LLM rat race?

English

403

SemiAnalysis@SemiAnalysis_·20 Mar

Olympian Gold Medalist Alysa Liu, recently went viral for her Teen Vogue rant on OpenAI Codex. “I can see why Sam Altman open sourced Codex. Clearly the experience is significantly worse than Claude Code. I was unable to feel the AGI using Codex. As oppose to using Claude Code, I felt the enlightenment coming and support UBI ”

English

1.2K

234K

Shuming Hu@ShumingHu·19 Mar

@MayankMish98 😅

QME

Mayank Mishra@MayankMish98·19 Mar

@ShumingHu good luck! its a nightmare to train on state tracking tasks 🤣

English

582

Shuming Hu@ShumingHu·19 Mar

🤯 let's repro this!

Mayank Mishra@MayankMish98

[3/N] State tracking: M²RNN achieves perfect generalization on the S₃ permutation task at sequence lengths 4x longer than training. Linear RNNs like Gated DeltaNet fail entirely, and even DeltaProduct degrades at unseen lengths.

English

1.3K

Shuming Hu@ShumingHu·19 Mar

This makes sense: Early timesteps contain more possible futures, as model learns the marginal denoising behavior from datasets. Later timesteps prune them as the plausible ones are more obvious. The tricky thing is "working memory" kind of scratchpad where memory is not present in marginal. "Move... to the right and back after 5s": The datasets contain objects moved away (t=1s, 2s).. and back at t=5s. But there's no maginal where objects stay at original position at t=2s? However, leaving objects at original position across t from 0s to 5s serve as "memory" for final appearance @ t=5s. x.com/xxunhuang/stat…

English

140

Shuming Hu@ShumingHu·19 Mar

This is even more suprising to me than chain-of-thought (CoT) for text. After all, LLM pretraining contains CoT-like texts. How do these emerge for video diffusion models? diffusion objectives don't contain CoT.

Zhongang Cai@caizhongang

Video might be the next intelligence substrate. Strikingly, video models are beginning to exhibit the same emergent reasoning behaviors first observed in LLMs—multi-path search, self-correction, and layer specialization. We demystify video reasoning and show it doesn’t happen frame-by-frame, but along diffusion steps. 🔗 wruisi.com/demystifying_v… 📄 huggingface.co/papers/2603.16… So, what's next? ;)

English

1.6K

Shuming Hu@ShumingHu·19 Mar

no problem. Just a random thought: if tokens for internet of text follows zipfian distribution: P(k) ~ 1/k^s where k is unigram similarly, another zipfian distribution for n-gram (fixed n), then a better tokenizer should produce zipfian distribution with bigger zipfian exponent s, right?

English

109

Jasper Gilley@0xjasper·13 Mar

@ShumingHu Yeah, sorry about that 😅

English

890

Shuming Hu@ShumingHu·18 Mar

@_albertgu kernels not available in mamba_ssm yet? github.com/state-spaces/m…

English

160

Albert Gu@_albertgu·17 Mar

The newest model in the Mamba series is finally here 🐍 Hybrid models have become increasingly popular, raising the importance of designing the next generation of linear models. We've introduced several SSM-centric ideas to significantly increase Mamba-2's modeling capabilities without compromising on speed. The resulting Mamba-3 model has noticeable performance gains over the most popular previous linear models (such as Mamba-2 and Gated DeltaNet) at all sizes. This is the first Mamba that was student led: all credit to @aakash_lahoti @kevinyli_ @_berlinchen @caitWW9, and of course @tri_dao!

English

314

1.6K

423.9K

Shuming Hu@ShumingHu·13 Mar

@seungwookh Does this mean one should re-init MLP after NCA pre-pretraining? as it already helps OpenWebText and doesn't hurt CodeParrot.

English

111

Seungwook Han@seungwookh·12 Mar

Can language models learn useful priors without ever seeing language? We pre-pre-train transformers on neural cellular automata — fully synthetic, zero language. This improves language modeling by up to 6%, speeds up convergence by 40%, and strengthens downstream reasoning. Surprisingly, it even beats pre-pre-training on natural text! Blog: hanseungwook.github.io/blog/nca-pre-p… (1/n)

English

260

1.7K

242.9K

Shuming Hu@ShumingHu·12 Mar

tbh, i was thinking if this works well, we can generate more complicated patterns from bigger random init models.

English

106

Shuming Hu@ShumingHu·12 Mar

🧐

Seungwook Han@seungwookh

ART

545

Shuming Hu@ShumingHu·12 Mar

@YouJiacheng Oh that’s right!

English

You Jiacheng@YouJiacheng·12 Mar

@ShumingHu v_i is computed from x_i, and we have skip connection and many FFNs. current token's info can be easily computed.

English

884

Shuming Hu@ShumingHu·12 Mar

The only difference between these two is softmax in attn is over only preivous tokens or it includes current token? XSA is surprising to me as I thought current token's value contribution to next token is important.

You Jiacheng@YouJiacheng

interesting. IIRC, excluding current token's KV by attention mask (i.e. remove the diagonal) doesn't work! Hypothesis: this effectively makes current token to be an attention sink.

English

1.3K

Shuming Hu@ShumingHu·12 Mar

XSA worked better over longer sequences too x.com/zhaisf/status/…

Shuangfei Zhai@zhaisf

English

216

Keşfet

@prime_cai @0xjasper @SemiAnalysis_ @MayankMish98 @elonmusk @BarackObama @taylorswift13 @cristiano