Arkadii

4 posts

@ArkadiiBessonov

Pre-training @poolsideai | Ex @yandex

Joined April 2026
35 Following · 256 Followers
Arkadii (@ArkadiiBessonov)
Three main ways to do FP8 in LLM pretraining, and they differ mainly in one thing: how the scale is attached. Per-tensor vs blockwise vs MXFP8.

Why pretraining has so much structure here: forward + backward is 3 matmuls (Fprop, Dgrad, Wgrad) across 3 tensor roles (weights, activations, gradients). Each role wants its own scale layout, and that's where all the complexity lives.

The three recipes differ in how the scale is attached (granularity, dtype, layout):
- Per-tensor: one scale for the whole tensor. Simplest, least robust to outliers.
- Blockwise: 1×128 / 128×128 tiles, FP32 scales. The DeepSeek-V3 style.
- MXFP8: 1×32 blocks + E8M0 scale. Native on Blackwell.

One rule ties it all together: within each scaling block, the scale must stay constant along the matmul's contracted dimension. That single constraint derives every tile geometry above; nothing here is arbitrary.

I drew every layout out, per recipe and per matmul, so the geometry is concrete instead of hand-wavy. Full walkthrough in my blogpost (link in comments)!
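To make the scale geometry concrete, here is a minimal NumPy sketch of the three recipes for the forward matmul (activations `a` of shape M×K times weights `w` of shape K×N, contracted over K). It is an illustration and not the code behind the post: NumPy has no FP8 dtype, so `fake_e4m3` is a crude stand-in for E4M3 rounding, and `quantize`, `expand`, and all variable names are hypothetical helpers. The power-of-two scale for the MXFP8 case is rounded up so nothing clips, which is one possible choice and is assumed here.

```python
import numpy as np

FP8_MAX = 448.0  # largest finite E4M3 value

def fake_e4m3(x):
    """Crude E4M3 stand-in: 3 mantissa bits, min normal exponent -6, clip to +-448."""
    _, e = np.frexp(x)                     # x = m * 2**e with 0.5 <= |m| < 1
    exp = np.maximum(e - 1, -6)            # clamp so tiny values fall on the subnormal grid
    step = np.ldexp(1.0, exp - 3)          # spacing between representable values
    return np.clip(np.round(x / step) * step, -FP8_MAX, FP8_MAX)

def expand(scale, br, bc):
    """Broadcast one scale per (br x bc) tile back to element resolution."""
    return np.repeat(np.repeat(scale, br, axis=0), bc, axis=1)

def quantize(x, br, bc, pow2=False):
    """Quantize x with one scale per (br x bc) tile; returns (values, scales)."""
    r, c = x.shape
    amax = np.abs(x.reshape(r // br, br, c // bc, bc)).max(axis=(1, 3))
    scale = np.maximum(amax, 1e-12) / FP8_MAX
    if pow2:  # E8M0-style scale: a pure power of two (rounded up: an assumption)
        scale = np.ldexp(1.0, np.ceil(np.log2(scale)).astype(int))
    return fake_e4m3(x / expand(scale, br, bc)), scale

rng = np.random.default_rng(0)
M, K, N = 4, 256, 128
a = rng.standard_normal((M, K))            # activations
w = rng.standard_normal((K, N))            # weights

_, s_tensor = quantize(a, M, K)            # per-tensor: scale shape (1, 1)
qa, sa = quantize(a, 1, 128)               # blockwise 1x128 along K: shape (4, 2)
qw, sw = quantize(w, 128, 128)             # blockwise 128x128: shape (2, 1)
q_mx, s_mx = quantize(a, 1, 32, pow2=True) # MXFP8-like 1x32: shape (4, 8)

# Reference: dequantize everything, then multiply.
ref = (qa * expand(sa, 1, 128)) @ (qw * expand(sw, 128, 128))

# Because each scale is constant along K inside a block, it factors out of
# that block's partial sum and is applied once per (row, block, column tile).
out = np.zeros((M, N))
for b in range(K // 128):
    ks = slice(b * 128, (b + 1) * 128)
    partial = qa[:, ks] @ qw[ks, :]        # low-precision partial product
    out += partial * sa[:, b:b + 1] * np.repeat(sw[b], 128)[None, :]
```

Here `out` matches `ref` up to floating-point summation order, which is the point of the rule: the scale can be pulled out of the inner loop only if it does not vary along the contracted axis inside a block. For Wgrad the contracted axis is the token dimension instead of K, so under the same rule the blocks for the same tensors would have to run along the other axis.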
[Attached image]
3 replies · 17 reposts · 151 likes · 31K views
Arkadii reposted
Poolside (@poolsideai)
Today we’re publishing the technical report behind Laguna M.1 and Laguna XS.2. This report opens up more of what went into them: Model Factory, pre-training data, distributed training, post-training, agent RL, quantization, and evaluation. poolside.ai/assets/laguna/…
[Attached image]
15 replies · 89 reposts · 429 likes · 331.4K views