Tristan Rice

553 posts

@rice_fry

Machine Learning + Distributed Systems + Hardware Hacking SWE @pytorch, tweets are personal opinions https://t.co/419A7MGhlH I don't use Twitter much anymore

Seattle / Vancouver · Joined August 2013
181 Following · 5.7K Followers
Tristan Rice @rice_fry ·
@vr4300 Been slowly getting back into self-driving things, figured it'd be fun to hack on the HW4 NPU. Still baby steps though. It would be super cool to run my own models on it
0
0
1
34
Tristan Rice @rice_fry ·
Very late news but looks like TRIP v2 (Tesla HW4 NPU) has 48MB of SRAM, up from 32MB on HW3
2
0
27
1.8K
Tristan Rice retweeted
SemiAnalysis @SemiAnalysis_ ·
Meta has open sourced their CTran library, which natively works with AMD & NVIDIA GPUs 🚀. Previously, if you wanted multiple NVIDIA GPUs to work together on a workload, you had to use NVIDIA's NCCL library. Although NCCL's source code is public, it does not have an open governance model, does not have open CI, employs a "code dump" update model, is not GitHub-first, and rarely accepts external contributions. On AMD, you had to use RCCL, a delayed fork of NVIDIA's NCCL.

With CTran, it is one unified library, and new algorithms like Bruck's can be added in a way that shares code across different AI GPU types. Furthermore, Meta has open sourced NCCLX (NCCL extended), their production-tested collective library that powered all Llama training and uses the unified CTran library. Meta is the creator & main maintainer of PyTorch and is well trusted in the open source community.

NVIDIA continues to be the leader in collective libraries, but Jensen must not take it for granted given the heavily increased competition in the open source collective communication space. Just like how TRTLLM moved to GitHub-first development when facing heavy competition from SGLang/vLLM, Jensen should seriously consider moving NCCL to a GitHub-first open development model given the competition on the collectives front too.

To draw a parallel with the inference engine world, collective communication libraries are moving from the 2021 "FasterTransformer" era to the 2025 "SGLang/vLLM/TRTLLM" era. The main competitors in the collective library space include China's DeepEP library, AMD's new MORI, AMD's upcoming MORI-CCL, Meta's CTran & NCCLX, and NVIDIA's NCCL (which has released its new NCCL Device API, GPU-initiated networking, etc). Competition breeds innovation! 🚀
SemiAnalysis tweet media
2
43
332
30.1K
Tristan Rice retweeted
Mark Saroufim @marksaroufim ·
If you’re excited about optimizing code that runs equally well on a single or thousands of GPUs and if you have the ability to submit a single substantial PR to a major OSS library, we want you on the PyTorch team - especially if you’re early in your career.
5
32
281
60.9K
Tristan Rice retweeted
Soumith Chintala @soumithchintala ·
If GPU optimization and systems problems excite you, why limit your impact to a single company or lab? Working on PyTorch allows you to ship impact to the entire AI industry! We're hiring across experience levels, junior and senior engineers alike. Read more below 👇
Mark Saroufim @marksaroufim

If you’re excited about optimizing code that runs equally well on a single or thousands of GPUs and if you have the ability to submit a single substantial PR to a major OSS library, we want you on the PyTorch team - especially if you’re early in your career.

6
36
299
33K
Tristan Rice @rice_fry ·
@StasBekman You could pass `NCCL_DEBUG_SUBSYS=ALLOC`, which may give some info on NCCL's memory allocations
0
0
0
157
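(A minimal sketch of the suggestion above: NCCL reads these debug env vars when its communicator is created, so they need to be set before init. The two-GPU torchrun launch is an assumption.)

```python
# Sketch, not authoritative. Assumed launch: torchrun --nproc_per_node=2 script.py
import os

os.environ["NCCL_DEBUG"] = "INFO"          # enable NCCL INFO logging
os.environ["NCCL_DEBUG_SUBSYS"] = "ALLOC"  # filter the output to allocation events

import torch
import torch.distributed as dist

torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
dist.init_process_group("nccl")
x = torch.ones(1, device="cuda")
dist.all_reduce(x)  # first collective triggers NCCL's allocations (and the log lines)
dist.destroy_process_group()
```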
Tristan Rice @rice_fry ·
@StasBekman That buffer is 4MB per *peer*. How many GPUs are you running with, and what network topology and collectives are you using? I'm trying to figure out if we can track this, but AFAIK the PyTorch profiler tracks PyTorch allocations and we can't inspect NCCL's internal allocations
1
0
2
179
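(Back-of-envelope sketch from the per-peer figure above; only the 4MB/peer number comes from the tweet, and the real total also depends on protocol, channel count, and topology.)

```python
# Rough estimate only: NCCL transport buffers scale with peer count.
world_size = 8    # assumed number of ranks
per_peer_mb = 4   # per-peer buffer size quoted in the tweet above
estimate_mb = (world_size - 1) * per_peer_mb
print(f"~{estimate_mb} MB of NCCL buffers per rank with {world_size} GPUs")
```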
Stas Bekman @StasBekman ·
While analyzing GPU memory unaccounted for by the torch memory profiler today, I discovered torch.distributed takes a nice bite out of available GPU memory, and slightly more memory with more GPUs. For details and a script to reproduce, see: github.com/stas00/ml-engi…
Stas Bekman tweet media
2
1
20
2K
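(To make the "unaccounted" part concrete, here's a hedged sketch, not Stas's actual script, comparing what the device reports against what the PyTorch allocator tracks. The untracked gap includes the CUDA context as well as NCCL's buffers. Assumed launch: torchrun --nproc_per_node=2.)

```python
import os
import torch
import torch.distributed as dist

def report(tag: str) -> None:
    free, total = torch.cuda.mem_get_info()              # what the device reports
    used_mib = (total - free) / 2**20
    reserved_mib = torch.cuda.memory_reserved() / 2**20  # what the allocator tracks
    print(f"{tag}: device used ~{used_mib:.0f} MiB, "
          f"allocator reserved {reserved_mib:.0f} MiB, "
          f"untracked ~{used_mib - reserved_mib:.0f} MiB")

torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
report("before init")
dist.init_process_group("nccl")
x = torch.ones(1024, device="cuda")
dist.all_reduce(x)          # first collective triggers NCCL's lazy allocations
torch.cuda.synchronize()
report("after first collective")
dist.destroy_process_group()
```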
Tristan Rice @rice_fry ·
@alex_peys @main_horse @PyTorch I just double checked: the DDP wrapper and any dist.* operation throw an error if init_process_group hasn't been called. torchrun working with any script, even if it's not distributed, is actually pretty useful -- I've used it for doing bulk inference from time to time
Tristan Rice tweet media
1
0
1
80
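(A sketch of that bulk-inference trick: torchrun is used purely as a process launcher, sharding work via the RANK/WORLD_SIZE env vars it sets, with no init_process_group at all. The model and data here are stand-ins, not anything from the thread.)

```python
# Assumed launch: torchrun --nproc_per_node=8 infer.py
import os
import torch
from torch import nn

rank = int(os.environ.get("RANK", "0"))
world_size = int(os.environ.get("WORLD_SIZE", "1"))
device = "cuda" if torch.cuda.is_available() else "cpu"
if device == "cuda":
    torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", "0")))

model = nn.Linear(16, 4).to(device).eval()           # stand-in for a real model
batches = [torch.randn(32, 16) for _ in range(100)]  # stand-in input data

with torch.inference_mode():
    for i, batch in enumerate(batches[rank::world_size]):  # strided shard per rank
        out = model(batch.to(device))
        torch.save(out.cpu(), f"out_rank{rank}_{i}.pt")    # each rank writes its own outputs
```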
alex peysakhovich @alex_peys ·
@main_horse @PyTorch i mean, i typically have if statements to wrap in ddp for single gpu testing, not sure if just pure ddp would have yelled at me here
1
0
0
108
alex peysakhovich @alex_peys ·
hilarious bug/feature is that @PyTorch allows you to torchrun processes without an init_process_group line in the code, so it runs, but you silently just get K independent copies of your model training - then you debug why your loss takes so long to go down
1
0
3
666
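(One cheap way to catch the failure mode alex describes, sketched under the assumption you call it early in your training script: if torchrun launched multiple workers but no process group exists, fail loudly instead of silently training K copies.)

```python
import os
import torch.distributed as dist

def assert_distributed_is_intentional() -> None:
    # torchrun always sets WORLD_SIZE; > 1 means a multi-worker launch.
    launched_multi = int(os.environ.get("WORLD_SIZE", "1")) > 1
    if launched_multi and not dist.is_initialized():
        raise RuntimeError(
            "Launched with torchrun but init_process_group() was never called; "
            "you'd silently train WORLD_SIZE independent copies of the model."
        )
```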
Tristan Rice @rice_fry ·
@StasBekman You can also control the NCCL buffer size via the NCCL_BUFFSIZE env var: docs.nvidia.com/deeplearning/n…
1
0
1
62
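(Sketch of the knob linked above: NCCL_BUFFSIZE is in bytes and defaults to 4 MiB. Like the debug vars, it has to be set before the communicator is created; smaller buffers save memory at some cost to collective throughput. The distributed launch is assumed.)

```python
import os

# Must be set before NCCL creates its communicator (i.e. before the first
# collective, or before eager init). Value is in bytes; the default is 4 MiB.
os.environ["NCCL_BUFFSIZE"] = str(1 << 20)  # try 1 MiB instead of the 4 MiB default

import torch.distributed as dist

dist.init_process_group("nccl")  # assumes a torchrun-style launch environment
```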
Tristan Rice @rice_fry ·
@StasBekman This is with the NCCL backend? I expect this is due to NCCL's lazy init: NCCL doesn't create the pairs/buffers until the first call under normal usage. Can you try calling init_process_group w/ device_id, which should cause eager initialization of the memory?
1
0
0
119
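(What the eager-init suggestion looks like in code; the device_id argument exists in recent PyTorch releases, and the LOCAL_RANK plumbing is the usual torchrun convention, not something specific to this thread.)

```python
import os
import torch
import torch.distributed as dist

local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
# Passing device_id binds the rank to a GPU and eagerly creates the NCCL
# communicator, so its buffer memory shows up at init time rather than at
# the first collective.
dist.init_process_group("nccl", device_id=torch.device("cuda", local_rank))
```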
Tristan Rice retweeted
🇺🇦TheSlava @b0noi ·
Meta has unveiled a resilient training solution for large models with PyTorch: github.com/pytorch-labs/t… 🚀 Even better, their detailed design doc is publicly available: docs.google.com/document/d/1OZ… The solution works generically at the replica level, silencing unhealthy replicas & reintegrating them when healthy. It's simple, yes, which means many elasticity strategies are not possible, but at the same time it's highly generalizable and reliable 💡 #AI #PyTorch
0
1
5
943
Tristan Rice retweeted
Arthur Douillard @Ar_Douillard ·
one more implementation of DiLoCo to do distributed training! @PyTorch's TorchFT fault tolerance package has an implementation of DiLoCo. Hopefully a Streaming DiLoCo soon too?
Arthur Douillard tweet media
2
7
69
3.6K
Stas Bekman @StasBekman ·
The @PyTorch team is working on a new, super important tool: github.com/pytorch-labs/t… This repository implements techniques for per-step fault tolerance, so you can keep training when errors occur without interrupting the entire training job. Some big companies already have this as a proprietary solution, so it's great that this capability is being built for the rest of us. It's a prototype at the moment that already works for DDP
20
129
839
46K
Tristan Rice @rice_fry ·
@timzaman @StasBekman @PyTorch Likewise! I finally get to build the full PyTorch fault tolerance system I've always wanted, haha. Let me know if you'd like to chat -- I'm sure you have a lot of insights into fault tolerance at scale
0
0
0
126
Tristan Rice @rice_fry ·
@StasBekman @timzaman @PyTorch That's the plan for now -- this repo is a staging ground for these fault tolerance approaches. If things just make sense, we'll upstream it into standard PyTorch Distributed. There's a number of changes we're making in PTD to make torchft (and FT in general) work better
1
0
5
181
Stas Bekman @StasBekman ·
@timzaman @PyTorch I hear you, Tim. This makes sense. I guess the suggestion should be made for the open source training frameworks to build fault-tolerance in.
1
0
1
338
Tristan Rice @rice_fry ·
@mattydtweetz @StasBekman @PyTorch The other consideration is that if you have a differing number of workers, some workers may finish their shard of the dataset before others. We've discussed writing a fault tolerant distributed dataloader that can automatically rebalance, but it's still in the idea phase
0
0
1
94
Tristan Rice @rice_fry ·
@mattydtweetz @StasBekman @PyTorch That's the case with the stock PyTorch dataloader, but it's not a hard requirement. On error the "should_commit" operation returns false and we discard the step -- with a custom dataloader you can detect this and reuse the batch instead
1
0
2
116
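(A rough sketch of the reuse-the-batch pattern described above. The should_commit name follows torchft's design as mentioned in the tweet, but the manager stand-in, model, and loop here are illustrative assumptions, not torchft's documented API.)

```python
import torch
from torch import nn

# Stand-ins so the sketch runs without torchft; in real use, `manager`
# would be a torchft Manager and should_commit() would reflect whether
# the step survived across replicas.
class FakeManager:
    def should_commit(self) -> bool:
        return True  # pretend every step commits

manager = FakeManager()
model = nn.Linear(8, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
dataloader = [torch.randn(4, 8) for _ in range(10)]  # toy batches

batch = None
data_iter = iter(dataloader)
for _ in range(len(dataloader)):
    if batch is None:                  # only advance if the last step committed
        batch = next(data_iter)
    optimizer.zero_grad()
    loss = model(batch).pow(2).mean()  # placeholder forward + loss
    loss.backward()
    optimizer.step()
    if manager.should_commit():        # step survived: safe to move on
        batch = None
    # else: keep `batch` and retry it next iteration instead of discarding it
```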
Randall @R_Z_S ·
@alanlcit @rice_fry I turned off forward collision warnings because they were falsely affecting my driving score - and the app still records them.
Ashby, MA 🇺🇸
1
0
0
100
Tristan Rice @rice_fry ·
Got a sample of the Tesla Insurance telemetry data. The insurance records are on a per-drive basis. Here are the fields:
* Unique Drive ID
* Record Version
* Car Firmware Version
* Driver Profile Name
* Start / End Time
* Drive Duration
* Start / End Odometer
(1/2)
25
97
550
0
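(Purely illustrative: the per-drive fields listed above written out as a record type. Names, types, and units are my guesses, not Tesla's actual schema.)

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class DriveRecord:
    drive_id: str              # Unique Drive ID
    record_version: int        # Record Version
    firmware_version: str      # Car Firmware Version
    driver_profile_name: str   # Driver Profile Name
    start_time: datetime       # Start Time
    end_time: datetime         # End Time
    duration_s: float          # Drive Duration (units assumed: seconds)
    start_odometer_mi: float   # Start Odometer (units assumed: miles)
    end_odometer_mi: float     # End Odometer (units assumed: miles)
```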
Tristan Rice @rice_fry ·
@dav_ell The Plenoxels paper is a good read (alexyu.net/plenoctrees/), though it uses a sparse voxel octree, which isn't feasible to output from a dense NN model. A voxel grid + SH as in github.com/facebookresear… would be cool to explore
0
0
0
284
Tristan Rice @rice_fry ·
@dav_ell Yup -- that's basically how NeRF works with a voxel representation. I've tried this a couple of times but my attempts haven't worked out super well -- might be good as an extra loss. Using spherical harmonics would be a better option than my approach of one color per voxel
Tristan Rice tweet media
1
0
0
341
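(A hedged sketch of the voxel-grid-plus-SH idea from this thread: a dense grid storing degree-1 spherical harmonic coefficients per voxel, trilinearly sampled, with view-dependent color from the SH basis. Shapes, the sigmoid squash, and the sign convention are illustrative; real systems like Plenoxels use higher SH degrees and also store density.)

```python
import torch
import torch.nn.functional as F

C0, C1 = 0.28209479, 0.48860251  # real SH constants for l=0 and l=1

def sh_color(grid: torch.Tensor, pts: torch.Tensor, dirs: torch.Tensor) -> torch.Tensor:
    """grid: [1, 12, D, H, W] = 4 SH coeffs x RGB per voxel;
    pts: [N, 3] in [-1, 1] (grid_sample convention); dirs: [N, 3] unit vectors."""
    # Trilinearly sample the per-voxel SH coefficients at the query points.
    coeffs = F.grid_sample(grid, pts.view(1, -1, 1, 1, 3), align_corners=True)
    coeffs = coeffs.reshape(3, 4, -1)  # [RGB, coeff, N]
    x, y, z = dirs.unbind(-1)
    # Degree-1 SH basis evaluated in the view direction (one common convention).
    basis = torch.stack([torch.full_like(x, C0), C1 * y, C1 * z, C1 * x])  # [4, N]
    return torch.sigmoid((coeffs * basis).sum(dim=1)).T  # [N, 3] colors in [0, 1]

# Toy usage: random grid, a few query points and view directions.
g = torch.randn(1, 12, 16, 16, 16)
p = torch.rand(5, 3) * 2 - 1
d = F.normalize(torch.randn(5, 3), dim=-1)
print(sh_color(g, p, d).shape)  # torch.Size([5, 3])
```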
Tristan Rice @rice_fry ·
Curious what I've been up to in the past 6 months? 😅 I've been working on a novel approach to depth and occupancy understanding for my FSD models! It's much simpler than existing techniques and directly learns the 3D representation ⬇️
Tristan Rice tweet media
12
35
305
0