Shuming Hu
@ShumingHu
distinguished machine not learning engineer
Oakland, CA · Joined July 2012
937 Following · 486 Followers
1.6K posts

Shuming Hu@ShumingHu·
That's a cool technique but not super relevant for the long-context capability we eventually want: using a short-horizon teacher model for local correction cannot take into account long context for few-shot learning, CoT, etc., which requires context outside the teacher model's horizon. 1k tokens sounds like very little for videos, but they can demo ICL for text, which is exactly the point. However, reading Self-Forcing makes me realize I could be wrong here: the average YouTube video is around 5 min but seldom a single long shot. It could be data-bottlenecked here rather than an architecture deficiency. I heard the video simulation of the Minecraft env in DreamerV4 can roll out 20 min without any issue.
Andreas Köpf@neurosp1ke·
Disagree, you cannot call LLMs anti-bitter-lesson because the sequence being modeled is language (= prior knowledge). There is a difference between a clever method and a clever target distribution. Also, declaring next-token prediction "not self-supervised" because "tokens are labels" is strange (then all targets in a loss function are labels too) - just because teacher forcing is used doesn't mean it is no longer autoassociative self-supervised learning.
Hugo@robonaissance

x.com/i/article/2036…
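To make the teacher-forcing point concrete, here is a minimal sketch (toy token IDs and a stand-in model, not anyone's actual setup): the next-token "labels" are just the same token stream shifted by one position, i.e. they come from the data itself.

```python
import torch
import torch.nn.functional as F

# Toy "corpus": one tokenized sequence (IDs are arbitrary stand-ins).
tokens = torch.tensor([5, 17, 3, 42, 8, 11])

# Teacher forcing: inputs are tokens[:-1], targets are the same stream shifted by one.
# The "labels" are derived from the data itself -- no external annotation is involved.
inputs, targets = tokens[:-1], tokens[1:]

vocab_size, d_model = 64, 32
embed = torch.nn.Embedding(vocab_size, d_model)
lm_head = torch.nn.Linear(d_model, vocab_size)

# Stand-in for a causal LM body (a real model would apply masked self-attention here).
logits = lm_head(embed(inputs))

# Standard next-token cross-entropy: self-supervised, since targets come from the sequence.
loss = F.cross_entropy(logits, targets)
print(loss.item())
```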

Shuming Hu@ShumingHu·
I don't think LLM-style AR decoding is very general across modalities/tasks today. Today's video predictions are probably using models and data much bigger than GPT-2, yet their AR rollouts quite often degrade to noise after a certain length. However, GPT-2 trained on 10B tokens can be rolled out almost indefinitely without producing random/garbage tokens (looping/repetitive output still has some structure). I suspect there's a specific fit between token information density and the AR transformers used today.
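For the text side of that claim, a quick sketch of the kind of rollout I mean (assumes the Hugging Face transformers GPT-2 checkpoint; the prompt and sampling settings are just illustrative):

```python
from transformers import GPT2LMHeadModel, GPT2TokenizerFast
import torch

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

ids = tok("The bitter lesson of scaling is", return_tensors="pt").input_ids
with torch.no_grad():
    out = model.generate(
        ids,
        max_new_tokens=1000,           # roll out far past the prompt (GPT-2 context is 1024)
        do_sample=True, top_k=50,      # sampling, to avoid trivial greedy loops
        pad_token_id=tok.eos_token_id,
    )
# The continuation may loop or repeat, but it stays token-structured rather than
# degrading into random garbage -- the observation in the tweet above.
print(tok.decode(out[0]))
```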
Andreas Köpf@neurosp1ke·
@ShumingHu There is probably always a generalist-specialist trade-off (no free lunch). For transformers we know that they can model all kinds of signals: video, audio, stock market, robot actions, etc. And they have non-trivial internal representations - a smart 'stochastic parrot'.
Shuming Hu@ShumingHu·
@prime_cai My personal experience is also that there is no advantage from TP.
Shuming Hu@ShumingHu·
no worries :) current thinking: we are doing NTP, which is a set of conditional probability distributions (CPDs), but ultimately it's modeling the distribution over sequences, which is not very sensitive to tokenizer changes. A better tokenizer just makes the factoring into CPDs more efficient, so sequences become easier to model. The distribution over sequences is the true data distribution we are trying to learn, which won't conflict with Zipfian grokking.
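Writing that out (notation mine, not from the thread): next-token prediction is just the chain-rule factorization of the sequence distribution, and for a lossless tokenizer the implied distribution over raw text is the same target either way.

```latex
% Chain-rule factorization behind next-token prediction:
\[
  p_\theta(x_1,\dots,x_T) \;=\; \prod_{t=1}^{T} p_\theta\!\left(x_t \mid x_{<t}\right)
\]
% For an ideal model and a lossless (invertible) tokenizer, the induced distribution
% over raw strings s does not depend on the segmentation:
\[
  p(s) \;=\; p_\theta\!\bigl(\mathrm{tok}_A(s)\bigr) \;=\; p_\phi\!\bigl(\mathrm{tok}_B(s)\bigr),
\]
% so a better tokenizer only changes how easy each conditional (CPD) is to model,
% not the target distribution over sequences.
```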
Jasper Gilley@0xjasper·
@ShumingHu I think that's what I interpreted you as saying anyway lol, but makes sense
Shuming Hu@ShumingHu·
" This is the structure of a **relaxation oscillator** — slow recovery (exponential decay under weight decay, ~10^4 epochs) punctuated by fast collapse events (exponential growth under positive feedback, ~50 epochs). The model can repeatedly *find* the correct solution but never *maintain* it. " My injuries and recoveries in a nutshell.
Jasper Gilley@0xjasper

x.com/i/article/2031…
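A toy numerical version of the dynamics that passage describes (all constants are invented, not from the article): a quantity decays slowly under weight decay until it crosses a threshold, where positive feedback blows it back up quickly, and the cycle repeats.

```python
# Relaxation-oscillator toy: slow exponential decay (weight decay, ~1e4 steps per cycle)
# punctuated by fast exponential growth (positive feedback, a few dozen steps).
decay, growth = 3e-4, 0.15   # per-step rates; invented for illustration
lo, hi = 0.1, 10.0           # thresholds that start / end a collapse event

u, collapsing = hi, False
trace = []
for _ in range(60_000):
    u *= (1 + growth) if collapsing else (1 - decay)
    if not collapsing and u < lo:    # the "correct solution" is found...
        collapsing = True            # ...and the instability kicks in
    elif collapsing and u > hi:      # blow-up saturates, slow recovery restarts
        collapsing = False
    trace.append(u)

# Count how many times u crosses below the low threshold (one per cycle).
cycles = sum(1 for a, b in zip(trace, trace[1:]) if a >= lo > b)
print(f"collapse events in 60k steps: {cycles}")
```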

Shuming Hu@ShumingHu·
@0xjasper On second thought, a better tokenizer only increases the Zipfian exponent for the token (unigram) distribution, but probably not for any n-gram.
Jasper Gilley@0xjasper·
@ShumingHu Oh woah, that's a great point! Seems true to me. I'll have to think about what the implications might be.
Shuming Hu@ShumingHu·
@SemiAnalysis_ How dare you drag our Oakland queen, American sweetheart, into your nerd fight of an LLM rat race?
SemiAnalysis@SemiAnalysis_·
Olympic Gold Medalist Alysa Liu recently went viral for her Teen Vogue rant on OpenAI Codex. "I can see why Sam Altman open-sourced Codex. Clearly the experience is significantly worse than Claude Code. I was unable to feel the AGI using Codex. As opposed to using Claude Code, I felt the enlightenment coming and support UBI."
Mayank Mishra@MayankMish98·
@ShumingHu good luck! it's a nightmare to train on state tracking tasks 🤣
Shuming Hu@ShumingHu·
This makes sense: early timesteps contain more possible futures, as the model learns the marginal denoising behavior from the dataset; later timesteps prune them as the plausible ones become more obvious. The tricky thing is a "working memory" kind of scratchpad, where the memory is not present in the marginal. "Move ... to the right and back after 5s": the dataset contains objects moved away (t=1s, 2s, ...) and back at t=5s, but there's no marginal where objects stay at the original position at t=2s. However, leaving objects at the original position across t from 0s to 5s serves as the "memory" for the final appearance at t=5s. x.com/xxunhuang/stat…
Shuming Hu@ShumingHu·
no problem. Just a random thought: if tokens for internet text follow a Zipfian distribution, P(k) ~ 1/k^s where k is the unigram rank, and similarly another Zipfian distribution holds for n-grams (fixed n), then a better tokenizer should produce a Zipfian distribution with a bigger exponent s, right?
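One way to sanity-check that (everything here is a hypothetical setup, not something I've run): tokenize the same corpus with both tokenizers, then estimate s from the slope of the log-log rank-frequency plot of unigram counts.

```python
import numpy as np
from collections import Counter

def zipf_exponent(token_ids):
    """Estimate s in P(k) ~ 1/k^s from unigram counts: the (negated) least-squares
    slope of log(frequency) against log(rank)."""
    counts = np.array(sorted(Counter(token_ids).values(), reverse=True), dtype=float)
    ranks = np.arange(1, len(counts) + 1)
    slope, _intercept = np.polyfit(np.log(ranks), np.log(counts), 1)
    return -slope

# Hypothetical comparison: encode the same corpus with two tokenizers and compare s.
# ids_a = tokenizer_a.encode(corpus)
# ids_b = tokenizer_b.encode(corpus)
# print(zipf_exponent(ids_a), zipf_exponent(ids_b))
```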
Albert Gu@_albertgu·
The newest model in the Mamba series is finally here 🐍 Hybrid models have become increasingly popular, raising the importance of designing the next generation of linear models. We've introduced several SSM-centric ideas to significantly increase Mamba-2's modeling capabilities without compromising on speed. The resulting Mamba-3 model has noticeable performance gains over the most popular previous linear models (such as Mamba-2 and Gated DeltaNet) at all sizes. This is the first Mamba that was student led: all credit to @aakash_lahoti @kevinyli_ @_berlinchen @caitWW9, and of course @tri_dao!
Shuming Hu@ShumingHu·
@seungwookh Does this mean one should re-init the MLP after NCA pre-pretraining, since it already helps OpenWebText and doesn't hurt CodeParrot?
Seungwook Han@seungwookh·
Can language models learn useful priors without ever seeing language? We pre-pre-train transformers on neural cellular automata — fully synthetic, zero language. This improves language modeling by up to 6%, speeds up convergence by 40%, and strengthens downstream reasoning. Surprisingly, it even beats pre-pre-training on natural text! Blog: hanseungwook.github.io/blog/nca-pre-p… (1/n)
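My rough guess at what generating that kind of pre-pre-training data could look like (a sketch, not the blog's actual recipe): roll out a small randomly initialized neural cellular automaton and serialize the discretized grid states into token sequences for a next-token objective.

```python
import torch
import torch.nn as nn

# Tiny neural cellular automaton: each cell perceives its 3x3 neighborhood via a
# depthwise conv and updates its state with a small MLP (1x1 convs). All shapes and
# constants here are made up for illustration.
class TinyNCA(nn.Module):
    def __init__(self, channels=8, hidden=32):
        super().__init__()
        self.perceive = nn.Conv2d(channels, channels * 3, kernel_size=3, padding=1,
                                  groups=channels, bias=False)
        self.update = nn.Sequential(
            nn.Conv2d(channels * 3, hidden, kernel_size=1), nn.ReLU(),
            nn.Conv2d(hidden, channels, kernel_size=1),
        )

    def forward(self, x):
        return torch.clamp(x + self.update(self.perceive(x)), 0.0, 1.0)

nca = TinyNCA()
state = torch.rand(1, 8, 16, 16)            # random initial grid
frames = []
with torch.no_grad():
    for _ in range(64):                     # roll the automaton forward
        state = nca(state)
        frames.append(state)

# Discretize one channel into a 16-symbol vocabulary and flatten into a token stream:
# a fully synthetic, language-free "corpus" for next-token pre-pre-training.
tokens = (torch.stack(frames)[:, 0, 0] * 15).round().long().clamp(0, 15).flatten()
print(tokens.shape)
```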
Shuming Hu@ShumingHu·
tbh, I was thinking that if this works well, we can generate more complicated patterns from bigger randomly initialized models.
You Jiacheng@YouJiacheng·
@ShumingHu v_i is computed from x_i, and we have skip connections and many FFNs, so the current token's info can easily be computed.