There are three main ways to do FP8 in LLM pretraining, and they differ mainly in one thing: how the scale is attached.
Per-tensor vs. blockwise vs. MXFP8.
Why pretraining has so much structure here: for each linear layer, forward + backward is 3 matmuls (Fprop, Dgrad, Wgrad) across 3 tensor roles (weights, activations, gradients). Each role wants its own scale layout, and that's where all the complexity lives.
How the scale is attached comes down to granularity, dtype, and layout:
— Per-tensor: one scale for the whole tensor. Simplest, least robust to outliers.
— Blockwise: 1×128 / 128×128 tiles, FP32 scales. The DeepSeek-V3 style.
— MXFP8: 1×32 blocks + E8M0 scale. Native on Blackwell.
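To make the granularity difference concrete, here is a minimal sketch in plain Python. It is not taken from any library: `block_scales` is a hypothetical helper, it works on a single 1-D row rather than real 2-D tiles, and rounding the E8M0 scale up to the next power of two is my assumption (real recipes may round differently).

```python
import math

E4M3_MAX = 448.0  # largest finite magnitude in FP8 E4M3

def block_scales(row, block, power_of_two=False):
    """Return one dequantization scale per `block` consecutive elements."""
    scales = []
    for start in range(0, len(row), block):
        amax = max(abs(v) for v in row[start:start + block])
        scale = amax / E4M3_MAX if amax > 0 else 1.0
        if power_of_two:
            # E8M0 stores only an exponent, so the scale must be a power of two.
            # Rounding up (assumption) keeps the block's max inside the FP8 range.
            scale = 2.0 ** math.ceil(math.log2(scale))
        scales.append(scale)
    return scales

# 256 small values plus a single outlier at index 200.
row = [0.01 * (i % 7 - 3) for i in range(256)]
row[200] = 50.0

per_tensor = block_scales(row, len(row))            # 1 scale for everything
blockwise = block_scales(row, 128)                  # one scale per 128 elements
mxfp8 = block_scales(row, 32, power_of_two=True)    # one power-of-two scale per 32

print(len(per_tensor), len(blockwise), len(mxfp8))
print(per_tensor[0], blockwise, mxfp8)
```

With per-tensor scaling the outlier sets the scale for all 256 values; with smaller blocks it only affects the block it lives in.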
One rule ties it all together: within each scaling block, the scale must stay constant along the matmul's contracted dimension. That single constraint determines every tile geometry above; nothing here is arbitrary.
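Why that rule matters, as a toy sketch: if the scale is constant over a block of the contracted dimension K, it factors out of that block's partial sum, so the matmul can accumulate on the quantized values and rescale once per block. The code below is an illustration under assumptions, not a real kernel: `quantize` and `scaled_dot` are hypothetical helpers, and the quantizer rounds to integers instead of a real FP8 grid to keep things short.

```python
def quantize(row, block, qmax=448):
    """Toy quantizer: one scale per block, values rounded to integers.
    Real FP8 rounds to a floating-point grid; integers keep the sketch simple."""
    q, scales = [], []
    for start in range(0, len(row), block):
        chunk = row[start:start + block]
        amax = max(abs(v) for v in chunk) or 1.0
        s = amax / qmax
        scales.append(s)
        q.extend(round(v / s) for v in chunk)
    return q, scales

def scaled_dot(qa, sa, qb, sb, block):
    """Dot product along K, rescaling once per block instead of per element."""
    total = 0.0
    for b, start in enumerate(range(0, len(qa), block)):
        partial = sum(x * y for x, y in zip(qa[start:start + block],
                                            qb[start:start + block]))
        total += sa[b] * sb[b] * partial  # the scales factor out of the block's sum
    return total

K, BLOCK = 64, 32
a = [0.5 * ((i * 37) % 11 - 5) for i in range(K)]
b = [0.25 * ((i * 13) % 7 - 3) for i in range(K)]
qa, sa = quantize(a, BLOCK)
qb, sb = quantize(b, BLOCK)

fast = scaled_dot(qa, sa, qb, sb, BLOCK)
# Reference: dequantize every element first, then multiply and sum.
slow = sum((x * sa[i // BLOCK]) * (y * sb[i // BLOCK])
           for i, (x, y) in enumerate(zip(qa, qb)))
print(fast, slow)
```

The two results agree up to floating-point rounding. If the scale changed inside a block along K, the per-block factoring in `scaled_dot` would no longer be valid.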
I drew every layout out, per recipe and per matmul, so the geometry is concrete instead of hand-wavy.
Full walkthrough in my blog post (link in comments)!
Today we’re publishing the technical report behind Laguna M.1 and Laguna XS.2.
This report opens up more of what went into them: Model Factory, pre-training data, distributed training, post-training, agent RL, quantization, and evaluation.
poolside.ai/assets/laguna/…
Today we’re releasing Laguna XS.2, Poolside’s first open-weight model.
It’s a 33B total / 3B active MoE model built for agentic coding and long-horizon tasks.
Trained fully in-house on our own stack. Runs on a single GPU. Released under Apache 2.0.
Links 👇
Weights: huggingface.co/poolside/Lagun…
API: platform.poolside.ai
Blog: poolside.ai/blog/laguna-a-…