Alexandre TL

603 posts

@AlexandreTL2

Intern at @DragonLLM in Paris. (Pre|post)-training LLMs

Montpellier, France · Joined January 2020
325 Following · 827 Followers
Pinned Tweet
Alexandre TL
Alexandre TL@AlexandreTL2·
muP works great for Mamba! Zero-shot transferred the learning rate from a 172k model to a 105 model. Now part of mamba.py 👇🧵
Alexandre TL tweet media
2
8
71
7.9K
Nous Research
Nous Research@NousResearch·
Today we release Token Superposition Training (TST), a modification to the standard LLM pretraining loop that produces a 2-3× wall-clock speedup at matched FLOPs without changing the model architecture, optimizer, tokenizer, or training data. During the first third of training, the model reads and predicts contiguous bags of tokens, averaging their embeddings on the input side and predicting the next bag with a modified cross-entropy on the output side. For the remainder of the run, it trains normally on next-token prediction. The inference-time model is identical to one produced by conventional pretraining. Validated at 270M, 600M, and 3B dense scales, and at 10B-A1B MoE. The work on TST was led by @bloc97_, @gigant_theo, and @theemozilla.
Nous Research tweet media
150
419
3.7K
442.1K
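The two ingredients named in the announcement above, averaging the embeddings of a contiguous bag of tokens on the input side and scoring the next bag on the output side, can be sketched in a few lines. The announcement does not spell out the "modified cross-entropy", so the loss below is an assumption: the mean negative log-probability of the tokens in the next bag. The function names and toy values are hypothetical and are not taken from the TST code.

```python
import math

def bag_inputs(token_ids, embedding, bag_size):
    """Split tokens into contiguous bags and average each bag's embeddings."""
    bags = [token_ids[i:i + bag_size]
            for i in range(0, len(token_ids) - bag_size + 1, bag_size)]
    dim = len(embedding[0])
    averaged = [[sum(embedding[t][d] for t in bag) / bag_size for d in range(dim)]
                for bag in bags]
    return bags, averaged

def bag_cross_entropy(logits, next_bag):
    """ASSUMED loss: mean negative log-probability of the next bag's tokens."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return -sum(logits[t] - log_z for t in next_bag) / len(next_bag)

# Toy vocabulary of 4 tokens with 2-dimensional embeddings (illustrative values).
embedding = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
tokens = [0, 1, 2, 3, 1, 1]

bags, inputs = bag_inputs(tokens, embedding, bag_size=2)
print(bags)    # [[0, 1], [2, 3], [1, 1]]
print(inputs)  # [[0.5, 0.5], [0.0, 0.5], [0.0, 1.0]]

# With uniform logits over 4 tokens, the loss for any next bag is log(4).
loss = bag_cross_entropy([0.0, 0.0, 0.0, 0.0], bags[1])
print(round(loss, 4))  # 1.3863
```

With `bag_size=1` both functions reduce to ordinary token embeddings and next-token cross-entropy, which matches the description of the later phase of training as plain next-token prediction.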
Aidan McLaughlin
Aidan McLaughlin@aidan_mclau·
one of my all-time favorite plots
Aidan McLaughlin tweet media
20
73
2.1K
231.2K
Bibek Poudel
Bibek Poudel@bibek_poudel_·
@hive_echo @aidan_mclau Point at which there is a change in temperature schedule (lowered), makes the AlphaGo policy slightly more deterministic afterwards (notice the curve variation before and after).
1
0
3
428
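The effect described in the reply above, where a lower sampling temperature makes a policy more deterministic, can be illustrated with a temperature-scaled softmax. This is a generic sketch with made-up logits, not AlphaGo's actual policy or temperature schedule.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Return softmax(logits / temperature) as a list of probabilities."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; lower means more deterministic."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical move logits, chosen only for illustration.
logits = [2.0, 1.0, 0.5, 0.0]
hot = softmax_with_temperature(logits, temperature=1.0)
cold = softmax_with_temperature(logits, temperature=0.25)

print(f"T=1.00 entropy: {entropy(hot):.3f}, top prob: {max(hot):.3f}")
print(f"T=0.25 entropy: {entropy(cold):.3f}, top prob: {max(cold):.3f}")
```

Lowering the temperature concentrates probability mass on the highest-scoring move, so the entropy of the distribution drops.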
Will Held
Will Held@WilliamBarrHeld·
@AlexandreTL2 We've definitely found that Hyperball does even better with a plain linear decay (and I suspect this might be some of what's causing the spikes)! But yes, we've found Hyperball to outperform my old Cautious AdamW setup, even with WSD.
2
2
11
4.8K
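The reply above contrasts a plain linear decay with WSD (warmup-stable-decay). A minimal sketch of the two learning rate schedules follows. The step counts and peak learning rate are illustrative assumptions, not Marin's settings, and the decay-to-zero endpoint is also an assumption.

```python
def linear_decay(step, total_steps, peak_lr, warmup_steps):
    """Linear warmup to peak_lr, then linear decay to zero at total_steps."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr * (total_steps - step) / (total_steps - warmup_steps)

def wsd(step, total_steps, peak_lr, warmup_steps, decay_steps):
    """Warmup, constant plateau, then linear decay over the last decay_steps."""
    decay_start = total_steps - decay_steps
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:
        return peak_lr
    return peak_lr * (total_steps - step) / decay_steps

# Hypothetical run: 1000 steps, 100 warmup steps, 200 decay steps for WSD.
TOTAL, WARMUP, DECAY, PEAK = 1000, 100, 200, 3e-4
for step in (0, 99, 500, 900, 999):
    lin = linear_decay(step, TOTAL, PEAK, WARMUP)
    w = wsd(step, TOTAL, PEAK, WARMUP, DECAY)
    print(f"step {step:4d}  linear={lin:.2e}  wsd={w:.2e}")
```

The difference is where the decay happens: the linear schedule starts shrinking the learning rate right after warmup, while WSD holds the peak and compresses the whole decay into the final phase of the run.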
Will Held
Will Held@WilliamBarrHeld·
Our 1e23 "Delphi" (~25B param model trained for ~600B tokens) run for Marin has entered its learning rate decay phase. Lots of spikes at this scale, very scary! Despite that, the run is looking on track to be close to our pre-registered scaling laws predictions. Stay tuned...
Will Held tweet media
Percy Liang@percyliang

In Marin, we are trying to get really good at scaling laws. We have trained models up to 1e22 FLOPs and have made a prediction of the loss at 1e23 FLOPs, which @WilliamBarrHeld is running. This prediction is preregistered on GitHub, so we'll see in a few days how accurate our prediction was. What we want is not just a single model but a training recipe that scales reliably.

6
11
120
46.2K
Alexandre TL
Alexandre TL@AlexandreTL2·
@QuasarModels Interesting! The KDA baseline seems already very strong tho, 99% at 10M context length. Is it the same model as in Table 7 ? Also, would it be possible to have the raw NIAH scores for the two models of Table 7?
Alexandre TL tweet media
3
0
4
212
Quasar
Quasar@QuasarModels·
This is Quasar Attention, the mechanism behind the upcoming Quasar models, designed to support context lengths of up to 5 million tokens.

Attention has long been a bottleneck for processing extended context. Standard attention mechanisms struggle to scale beyond ~200k tokens in training, creating a ceiling on how much information models can reliably use.

One approach to solving this has been linear attention methods, such as gated delta attention (used in Qwen 3.5) or Kimi delta attention. These improve efficiency and allow longer sequences, but introduce trade-offs: instability at extreme lengths, quality degradation, and in practice, they are not strictly linear.

Quasar Attention takes a different approach. It uses a continuous-time formulation, implemented as a fully matrix-based system rather than relying on vector-state approximations. In practice, this improves stability, reduces cost, and maintains performance as sequence length increases.

In internal stress tests at 50 million tokens, KDA-based approaches begin to lose stability, while Quasar Attention remains stable. This allows performance to hold as sequence length increases, rather than degrading beyond a fixed threshold.

On BABILong, a Quasar-based model pretrained on 20B tokens and fine-tuned on 16k sequences was evaluated on contexts ranging from 1 million to 10 million tokens, maintaining consistent performance across that range. By contrast, models using gated delta attention show significant degradation at longer lengths, in some cases dropping to ~10% performance at 10 million tokens. (Note: results are indicative; setups are not directly comparable.)

On RULER benchmarks, a Quasar-10B model (built on Qwen 3.5 with frozen base weights and Quasar Attention added), pretrained on 200B tokens, achieved 87% at 1 million tokens, outperforming significantly larger baselines, including Qwen3 80B, under the same evaluation conditions.

Taken together, this points to a shift in where long-context performance is won or lost: not in model size alone, but in the attention mechanism itself. Quasar Attention represents a step change in long-context modelling, setting a new standard for stability and performance at scale.

We thank @TargonCompute for the compute and for being our compute provider and long-term partner in training the upcoming Quasar models. Here is the link to our paper 👇
Quasar tweet media
23
82
251
110.1K
Ji-Ha
Ji-Ha@Ji_Ha_Kim·
Blog Post - Lion-K CCWD: Corrected Cautious Weight Decay and Hyperparameter Transfer. Derivation of Lion-K with Corrected Cautious Weight Decay (CCWD) and transformation rules for hyperparameter transfer, fixing Complete(d)P momentum.
Ji-Ha tweet media
5
10
57
10.1K
Alexandre TL
Alexandre TL@AlexandreTL2·
@Ji_Ha_Kim uhm I meant in general, even if batch size is not fixed, it disappears in the ratio : mB/mD = (B'/B)/(D'/D) where D'=B'*num_iters and D=B*num_iters_base so the B' and the B cancel
1
0
2
38
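The cancellation in the reply above can be written out step by step. This uses only the definitions given there, with $m_B = B'/B$ and $m_D = D'/D$, and introduces the shorthand $n$ for num_iters and $n_{\text{base}}$ for num_iters_base (the shorthand is an addition for readability):

```latex
\[
D' = B'\,n, \qquad D = B\,n_{\text{base}}
\]
\[
\frac{m_B}{m_D}
  = \frac{B'/B}{D'/D}
  = \frac{B'}{B}\cdot\frac{D}{D'}
  = \frac{B'}{B}\cdot\frac{B\,n_{\text{base}}}{B'\,n}
  = \frac{n_{\text{base}}}{n}
\]
```

So the ratio depends only on the iteration counts: both $B'$ and $B$ drop out whether or not the batch size is held fixed.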
Ji-Ha
Ji-Ha@Ji_Ha_Kim·
@AlexandreTL2 m_B := B'/B so for fixed batch size B'=B gives m_B = 1 so yeah only data part remains
Ji-Ha tweet media
1
0
2
98
Alexandre TL retweeted
Kaiyue Wen
Kaiyue Wen@wen_kaiyue·
@AlexandreTL2 Sorry this is in fact a typo! We did the same scaling as MuP
1
0
0
687
Kaiyue Wen
Kaiyue Wen@wen_kaiyue·
(1/n) Introducing Hyperball — an optimizer wrapper that keeps weight & update norm constant and lets you control the effective (angular) step size directly. Result: sustained speedups across scales + strong hyperparameter transfer.
Kaiyue Wen tweet media
27
125
709
201.6K
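One way to read the description above (constant weight norm, constant update norm, direct control of the angular step) is as a wrapper that rescales the base optimizer's update relative to the weight norm and then projects the weight back onto its original sphere. The sketch below is only that reading, under stated assumptions: the function name, the `lr * radius` scaling, and the renormalization step are hypothetical and are not taken from the Hyperball code.

```python
import math

def norm(v):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(x * x for x in v))

def hyperball_like_step(weight, update, lr):
    """ASSUMED wrapper step: fix the update norm to lr * ||weight||,
    apply it, then rescale the weight back to its original norm."""
    radius = norm(weight)
    u_norm = norm(update)
    if u_norm == 0.0:
        return list(weight)  # nothing to apply
    # Rescale the base optimizer's update to length lr * radius.
    scaled = [lr * radius * u / u_norm for u in update]
    stepped = [w - s for w, s in zip(weight, scaled)]
    # Project back so the weight norm stays constant.
    s_norm = norm(stepped)
    return [radius * x / s_norm for x in stepped]

w = [3.0, 4.0]   # weight with norm 5
u = [4.0, -3.0]  # hypothetical base-optimizer update, orthogonal to w
w_new = hyperball_like_step(w, u, lr=0.1)

cos_angle = sum(a * b for a, b in zip(w, w_new)) / (norm(w) * norm(w_new))
angle = math.acos(cos_angle)
print(f"norm after step: {norm(w_new):.6f}")
print(f"rotation angle:  {angle:.6f} rad")
```

In this sketch, an update orthogonal to the weight rotates it by `atan(lr)` regardless of the raw update magnitude, which is the sense in which `lr` sets an angular step size.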
Alexandre TL
Alexandre TL@AlexandreTL2·
@dvruette regular attention in place of differential attention has the same loss, but worse recall in the benchmarks. "Coût" is loss
Alexandre TL tweet media
1
0
4
168