driss guessous
@drisspg
bytes and nuggets @pytorch
Joined December 2023
225 Following · 1.4K Followers
388 posts
stochasm (@stochasticchasm):
@drisspg elite show, what a deep pull
jon (@JonofFive):
@drisspg HBM 4e will offer 3GB rectangular dies x 16-high stacks for a 48 GB cuboid
driss guessous (@drisspg):
This computer looks like a turbine
[image]
Adam Mainz (@MainzOnX):
@drisspg Currently think the best is using Claude Code + codex and remotely controlling via tailscale. @philipbankier has a great system for this coming out, I heard 🤔
driss guessous (@drisspg):
What is the best codex-like app/program that I can run with multi-agent backend support? I want to run it on my Mac but be able to ssh to remote machines. Any pointers much appreciated :)
Matej Sirovatka (@m_sirovatka):
@ellev3n11 I believe this is in the works in torch directly, but I don't think it's ready yet. Maybe someone like @drisspg would know more
Federico Cassano (@ellev3n11):
does anyone know how a good samaritan can run their model with bf16 master weight parameters but fp32 gradients, without allocating a huge flat fp32 gradient buffer? i.e. have a tensor with dtype bf16 but grad dtype fp32?
Artificially Intelligent (@ArtiIntelligent):
@drisspg Could you please share a snippet to replicate the nvFP4 numbers. How are you doing the quantization? Is nvFP4 now a native torch dtype? And how do you know that you are using the nvFP4 tensor cores? Thanks!
driss guessous (@drisspg):
[DGX Spark Tidbit] Although advertised as 1 PFLOP of *sparse* compute, the peak matmul perf I have been able to get with cuBLAS is:
- BF16 torch.mm: 104.80 TFLOPs
- MXFP8 nn.functional.scaled_mm: 176.87 TFLOPs
- NVFP4 nn.functional.scaled_mm: 356.95 TFLOPs
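For context on how TFLOPs figures like the ones above are typically derived (a sketch, not drisspg's actual benchmark harness): a dense M×K @ K×N matmul performs 2·M·N·K floating-point operations (one multiply plus one add per inner-product term), so the rate is flops divided by measured wall time. `matmul_tflops` is an illustrative helper name:

```python
def matmul_tflops(m: int, n: int, k: int, seconds: float) -> float:
    """Achieved TFLOPs for an m-by-k @ k-by-n matmul timed at `seconds`."""
    flops = 2 * m * n * k  # one multiply + one add per inner-product term
    return flops / seconds / 1e12

# e.g. a 4096^3 matmul finishing in 1 ms sustains ~137.4 TFLOPs
rate = matmul_tflops(4096, 4096, 4096, 1e-3)
```

In practice the timing side needs CUDA events (or `torch.cuda.synchronize()` around a timer) and warmup iterations, since async kernel launches make naive wall-clock timing misleading.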
driss guessous (@drisspg):
I need to prove this, but I'm like 95% sure that Cursor throttles down your agent's throughput when you don't have the window actively open
Edward Z. Yang (@ezyang):
I've been brainstorming episodes for the next season of PyTorch Developer Podcast.
- DTensor StridedShard, FSDP-TP order
- Redistributing a DTensor
- Prefetching vs Bucketing
- History of FSDP in PyTorch
- Multiprocessing: DataParallel versus DistributedDataParallel
- Monarch
- Parallelism Zoo
- Mixture of Experts and Expert Parallelism
- The Peak Memory Triangle: Activations
- FSDP and CPU Offloading
- Overlap: How to get it (Prefetching, Pipelining, Async TP)
- Differentiable collectives and variance
- Local map: global versus local SPMD
- FSDP vs TP
- Symmetric memory
- Uneven sharding and FSDP
- LocalTensor
- Composable parallelism via DTensor
- Pipeline parallelism
- Functional collectives and wait
- Device mesh; process group initialization
- Distributed checkpointing
- Activation checkpointing
- Placement: Partial reductions
- Implicit versus explicit prefetching
- RNG in a distributed setting
- Distributed optimizers: ZeRO, Shampoo, Muon
- torchtitan
- torchrun / torchx
- Choosing your parallelism from first principles / roofline analysis
- Mixture of Experts: as large as possible, expert routing as fine as possible (more sparsity the better) by minimizing hidden dim
- GB200
- MXFP8 (1x128, 128x128, transposes)
- Stats of a training job: loss curve, MFU, expert balance
- Bitwise determinism: when you can expect it
- Distributed inference
- RL from an infra perspective
- Basics of observability on jobs
- NCCL timeout
Dwarak (@DwaraknathG):
Hey all, I will be at GTC next week talking about all the work my team and I did on large-scale MoE training in JAX on GPUs! We decided early on to have a fully dropless training stack to avoid token dropping. (1/7)
driss guessous (@drisspg):
Here is a Cute PR: github.com/pytorch/helion… Helion has awesome infra for parameter autotuning, and with this you can use it to find optimal configs for kernels written in other languages as well
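Helion's actual autotuner API is not shown in the tweet; as a generic illustration of the idea behind config autotuning (benchmark each candidate configuration against a timing harness and keep the fastest), here is a hedged sketch. `autotune`, `search_space`, and `benchmark` are stand-in names of my own for kernel launch parameters and a kernel timing function:

```python
import itertools

def autotune(benchmark, search_space):
    """Exhaustively time every config in `search_space`; return the fastest.

    benchmark: callable taking a config dict, returning seconds.
    search_space: dict mapping parameter name -> list of candidate values.
    """
    keys = list(search_space)
    best_cfg, best_time = None, float("inf")
    for values in itertools.product(*(search_space[k] for k in keys)):
        cfg = dict(zip(keys, values))
        t = benchmark(cfg)  # time the kernel under this config
        if t < best_time:
            best_cfg, best_time = cfg, t
    return best_cfg, best_time

# Mock timing model: bigger blocks amortize launch overhead,
# extra warps add a small cost (purely illustrative numbers).
space = {"block_size": [32, 64, 128], "num_warps": [4, 8]}
def mock_bench(cfg):
    return 1.0 / cfg["block_size"] + 0.01 * cfg["num_warps"]

best, best_t = autotune(mock_bench, space)
```

Real autotuners (Helion's included) layer smarter search than exhaustive sweep on top of this loop, but the contract — configs in, timings out, argmin kept — is the same.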