driss guessous


@MainzOnX @philipbankier ehhh, I very much like the codex app gui - which it sounds like this doesn't address, right?

@drisspg Currently I think the best setup is using Claude Code + Codex and controlling them remotely via Tailscale. @philipbankier has a great system for this coming out, I heard 🤔

@ellev3n11 I believe this is in the works in torch directly, but I don't think it's ready yet. Maybe someone like @drisspg would know more

I want to keep doing the kind of work where I spend about an hour digging through smelly code and the related logs, and the fix ends up being just a few lines. That's the most fun for me.
msk@crcrpar
A warm, handwritten pull request github.com/pytorch/pytorc…

@drisspg Could you please share a snippet to replicate the nvFP4 numbers? How are you doing the quantization? Is nvFP4 now a native torch dtype? And how do you know that you are using the nvFP4 tensor cores? Thanks!

I've been brainstorming episodes for the next season of PyTorch Developer Podcast.
DTensor
StridedShard, FSDP-TP order
Redistributing a DTensor
Prefetching vs Bucketing
History of FSDP in PyTorch
Multiprocessing: DataParallel versus DistributedDataParallel
Monarch
Parallelism Zoo
Mixture of Experts and Expert Parallelism
The Peak Memory Triangle: Activations
FSDP and CPU Offloading
Overlap: How to get it (Prefetching, Pipelining, Async TP)
Differentiable collectives and variance
Local map: global versus local SPMD
FSDP vs TP
Symmetric memory
Uneven sharding and FSDP
LocalTensor
Composable parallelism via DTensor
Pipeline parallelism
Functional collectives and wait
Device mesh; process group initialization
Distributed checkpointing
Activation checkpointing
Placement: Partial reductions
Implicit versus explicit prefetching
RNG in a distributed setting
Distributed optimizers: ZeRO, Shampoo, Muon
torchtitan
torchrun / torchx
Choosing your parallelism from first principles / roofline analysis
Mixture of Experts: as large as possible, expert routing as fine as possible (more sparsity the better) by minimizing hidden dim
GB200
MXFP8 (1x128, 128x128, transposes)
Stats of a training job: loss curve, MFU, expert balance
Bitwise determinism: when you can expect it
Distributed inference
RL from an infra perspective
Basics of observability on jobs
NCCL timeout
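To make one of the topics above concrete, here is a tiny, self-contained sketch of the "uneven sharding" problem: when a flat parameter's element count isn't divisible by the world size, one common approach (illustrative only, not PyTorch's actual implementation) is to chunk at a padded shard size so every rank's shard has equal length for collectives, with the last rank carrying padding.

```python
# Illustrative sketch of uneven dim-0 sharding: split `numel` elements across
# `world_size` ranks at a padded (ceil-division) shard size, so all shards are
# equal-sized for collectives; the trailing rank(s) carry padding.
def shard_sizes(numel: int, world_size: int) -> list[tuple[int, int]]:
    """Return (real_elems, padding_elems) for each rank."""
    per_rank = -(-numel // world_size)  # ceil division: padded shard size
    sizes = []
    for rank in range(world_size):
        start = min(rank * per_rank, numel)
        end = min(start + per_rank, numel)
        real = end - start
        sizes.append((real, per_rank - real))
    return sizes

# 10 elements over 4 ranks: padded shard size is 3, last rank pads 2 elements.
print(shard_sizes(10, 4))  # [(3, 0), (3, 0), (3, 0), (1, 2)]
```

The function names and layout here are hypothetical; real FSDP bookkeeping also tracks original parameter shapes so padding can be stripped after unsharding.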

Here is a Cute PR: github.com/pytorch/helion…
Helion has awesome infra for parameter autotuning, and with this you can use it to find optimal configs for kernels written in other languages as well
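The idea of parameter autotuning can be sketched generically: time a kernel under each candidate config and keep the fastest. This is NOT Helion's actual API (its autotuner is far more sophisticated); it is just a minimal grid-search illustration of the concept, with hypothetical names throughout.

```python
import itertools
import time

def autotune(run, search_space: dict[str, list], warmup: int = 1, iters: int = 3):
    """Exhaustively time `run(**cfg)` over the cartesian product of
    `search_space` values and return the fastest config (a dict)."""
    best_cfg, best_time = None, float("inf")
    for values in itertools.product(*search_space.values()):
        cfg = dict(zip(search_space.keys(), values))
        for _ in range(warmup):          # warm up caches / JIT before timing
            run(**cfg)
        start = time.perf_counter()
        for _ in range(iters):
            run(**cfg)
        elapsed = (time.perf_counter() - start) / iters
        if elapsed < best_time:
            best_cfg, best_time = cfg, elapsed
    return best_cfg

# `run` could wrap a kernel written in any language (e.g. launched via ctypes
# or a subprocess); here a stand-in Python function takes its place.
def fake_kernel(block_size: int, num_warps: int):
    _ = [0] * (block_size * num_warps)  # stand-in for real work

best = autotune(fake_kernel, {"block_size": [64, 128], "num_warps": [2, 4]})
print(best)
```

The appeal of routing this through a shared autotuning layer, as the PR suggests, is that the search machinery is language-agnostic as long as each config can be launched and timed.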
