Alexandre TL

621 posts

@AlexandreTL2

Intern at @DragonLLM in Paris. (Pre|post)-training LLMs

Montpellier, France · Joined January 2020

329 Following · 1.2K Followers

Pinned Tweet

Alexandre TL@AlexandreTL2·31 Jul

muP works great for Mamba! Zero-shot transferred the learning rate from a 172k model to a 105 model. Now part of mamba.py 👇🧵

9.3K

Alexandre TL@AlexandreTL2·8h

@AuxiumTrinity It should work too, I tried it on amazon.com with a random US zipcode and it seems to work

407

__tunir.b@AuxiumTrinity·17h

@AlexandreTL2 how would you get one of these in the US?

637

Alexandre TL@AlexandreTL2·22h

So I have been busy these last few months! Happy to present : The Little Book of Reinforcement Learning

1.6K

88.6K

Alexandre TL@AlexandreTL2·8h

@kushalj001 @BatxDev0 I tested it on amazon.com with different US zip codes and it seems to show up and works, idk ?

Kushal@kushalj001·18h

@AlexandreTL2 @BatxDev0 does not show up in US, it seems?

Alexandre TL@AlexandreTL2·8h

@BoltzmannTweet 😎

389

Boltzmann Account@BoltzmannTweet·17h

@AlexandreTL2 Loved your other book! Ordered!

615

Alexandre TL@AlexandreTL2·8h

@Ravi_Sharma @loganthorneloe Happy reading😎

239

Ravi Sharma@Ravi_Sharma·13h

@AlexandreTL2 @loganthorneloe Ordered

354

Alexandre TL@AlexandreTL2·18h

@PhilippSiedler Happy reading!

727

Philipp Siedler@PhilippSiedler·19h

@AlexandreTL2 Ordered ☑️

895

Alexandre TL@AlexandreTL2·18h

@BatxDev0 search "Little Book of Reinforcement Learning" on Amazon!

1.2K

Batu Taskan@BatxDev0·19h

@AlexandreTL2 Where can I buy it, looks nice

1.4K

Alexandre TL@AlexandreTL2·18h

@GiannisEllinas Free PDF is available here github.com/alxndrTL/littl…

798

Γιάννης Έλληνας@GiannisEllinas·19h

@AlexandreTL2 Someone upload a free pdf please.

902

Alexandre TL@AlexandreTL2·22h

Of course, all the credit goes to @francoisfleuret for this format (The Little Book of Deep Learning fleuret.org/francois/lbdl.…), which I adapted from deep learning to reinforcement learning!

3.3K

Alexandre TL@AlexandreTL2·22h

The book, along with its supplementary material, is available for free here: github.com/alxndrTL/littl… Or you can buy a paperback copy on Amazon by looking up its name (sold at cost). It is a much revised version of a video series I gave in 2020 here: youtube.com/@alexandretl/c…

158

6.8K

Alexandre TL@AlexandreTL2·3d

@loiccabannes Really interesting, bravo! FastPKM uses an iterative reading process (a bit like Just Read Twice? I didn't fully understand) and it seems to really help. I wonder why they did not advertise it more

Loic cabannes@loiccabannes·3d

Introducing Sparse Delta Memory (SDM) - The first work of my PhD 🎓. SDM combines the GatedDeltaNet update with Product Key sparsity, enabling a recurrent state 3000x larger at the same FLOPs and significantly improving long-context performance. Let RNNs be sparse!

584

48.8K

Alexandre TL@AlexandreTL2·6d

@ZakShark Thanks for the shout-out!!

Zak 🦈 (e/acc)@ZakShark·6d

Great video by @AlexandreTL2 on the state of research on LLM architectures in 2026: differential attention, PostNorm vs PreNorm, muP, MoE, etc. youtu.be/SqyHPlEM40Q?is…

1.2K

Alexandre TL@AlexandreTL2·19 Jun

@giffmana @wightmanr there is competition on that one

123

Lucas Beyer (bl16)@giffmana·19 Jun

@wightmanr waveformer That would sound pretty dope

2.7K

Ross Wightman@wightmanr·19 Jun

Also some SE vibes... but I'm calling this one now, wait for it, the wiggleformer!

Eric W. Tramel@fujikanaeda

U-Net 😭

17.5K

Alexandre TL@AlexandreTL2·11 Jun

@SeunghyunSEO7 We were actively thinking about that at some point, indeed could be very interesting

174

Seunghyun Seo@SeunghyunSEO7·11 Jun

not a related thing but does a "scaling law speedrun" exist? (bit weird name tho). not about just reaching target loss in fewer iterations, but a competition where you submit your power law and see whose scales better. (this plot is made with matplotlib, not real data points)

Keller Jordan@kellerjordan0

I've added two optimizers to the public benchmark: (1) Shampoo (with its original 1/4 power). (2) Spectral descent, which is equivalent to both Muon(mu=0) and Shampoo(b1=b2=0). Result: Shampoo falls halfway between Muon & Adam; Spectral descent is ~2x slower. Thread below 1/6
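
The equivalence stated above can be checked by hand in an idealized case. This is only a sketch under assumptions not in the tweet: a square, invertible gradient matrix $G$ with SVD $G = U \Sigma V^\top$, Shampoo preconditioners built from the current gradient alone (which is how b1=b2=0 is read here), and exact orthogonalization in Muon rather than its Newton-Schulz approximation.

```latex
\begin{aligned}
L &= G G^\top = U \Sigma^2 U^\top, \qquad R = G^\top G = V \Sigma^2 V^\top, \\
L^{-1/4}\, G\, R^{-1/4}
  &= \left(U \Sigma^{-1/2} U^\top\right)\left(U \Sigma V^\top\right)\left(V \Sigma^{-1/2} V^\top\right)
   = U V^\top .
\end{aligned}
```

Under these assumptions the Shampoo step with 1/4 powers reduces to $U V^\top$, the orthogonalized gradient, which is also what Muon's update targets when momentum is switched off.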

6.2K

Alexandre TL@AlexandreTL2·10 Jun

@che_shr_cat more or less the same as in arxiv.org/abs/2403.01643…

274

Grigory Sapunov@che_shr_cat·9 Jun

1/ We have spent years optimizing KV cache via head-sharing (GQA/MQA), but we ignored a fundamental assumption: why do Transformers need three separate Q, K, and V projections in the first place? Turns out, they don't. Merging them unlocks massive memory savings. 🧵

310

23.1K

Alexandre TL@AlexandreTL2·5 Jun

@stochasticchasm awesome thank you

stochasm@stochasticchasm·4 Jun

alright, let's do this one more time

Nathan Lambert@natolambert

We have another 65 page frontier model report from Nvidia to read @eliebakouch @stochasticchasm and gang

214

21.5K

Alexandre TL@AlexandreTL2·13 May

@NousResearch very similar to patch-level training!! arxiv.org/abs/2407.12665

272

Nous Research@NousResearch·13 May

Today we release Token Superposition Training (TST), a modification to the standard LLM pretraining loop that produces a 2-3× wall-clock speedup at matched FLOPs without changing the model architecture, optimizer, tokenizer, or training data. During the first third of training, the model reads and predicts contiguous bags of tokens, averaging their embeddings on the input side and predicting the next bag with a modified cross-entropy on the output side. For the remainder of the run, it trains normally on next-token prediction. The inference-time model is identical to one produced by conventional pretraining. Validated at 270M, 600M, and 3B dense scales, and at 10B-A1B MoE. The work on TST was led by @bloc97_, @gigant_theo, and @theemozilla.
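
To make the bag-of-tokens idea in the announcement more concrete, here is a minimal pure-Python sketch. It is not the Nous Research implementation: the function names `average_bags` and `bag_cross_entropy` are hypothetical, and the loss shown (mean negative log-probability over the tokens of the next bag) is only one plausible reading of "modified cross-entropy", since the tweet does not spell out the exact form.

```python
import math

def average_bags(embeddings, bag_size):
    """Average each contiguous bag of `bag_size` token embeddings into one vector."""
    if len(embeddings) % bag_size != 0:
        raise ValueError("sequence length must be a multiple of bag_size")
    bags = []
    for start in range(0, len(embeddings), bag_size):
        bag = embeddings[start:start + bag_size]
        dim = len(bag[0])
        # Element-wise mean over the vectors in this bag.
        bags.append([sum(vec[d] for vec in bag) / bag_size for d in range(dim)])
    return bags

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def bag_cross_entropy(logits, target_bag):
    """ASSUMED loss: mean negative log-probability of the token ids in the next bag."""
    log_probs = log_softmax(logits)
    return -sum(log_probs[t] for t in target_bag) / len(target_bag)

# Four token embeddings of dimension 2, grouped into bags of two tokens.
token_embeddings = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
bag_inputs = average_bags(token_embeddings, bag_size=2)

# Uniform logits over a 4-token vocabulary; the next bag holds token ids 1 and 3.
loss = bag_cross_entropy([0.0, 0.0, 0.0, 0.0], target_bag=[1, 3])
```

With a bag size of 1, `average_bags` returns the embeddings unchanged and `bag_cross_entropy` reduces to ordinary next-token cross-entropy, which matches the announcement's point that the later phase of training is standard next-token prediction.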