Tristan Rice

553 posts

@rice_fry

Machine Learning + Distributed Systems + Hardware Hacking SWE @pytorch, tweets are personal opinions https://t.co/419A7MGhlH I don't use Twitter much anymore

Seattle / Vancouver · Joined August 2013
181 Following · 5.7K Followers
Tristan Rice @rice_fry ·
@vr4300 Been slowly getting back into self-driving things, figured it'd be fun to hack on the HW4 NPU. Still baby steps though. It would be super cool to run my own models on it
0
0
1
34
Tristan Rice @rice_fry ·
Very late news but looks like TRIP v2 (Tesla HW4 NPU) has 48MB of SRAM, up from 32MB on HW3
2
0
27
1.8K
Tristan Rice retweeted
SemiAnalysis @SemiAnalysis_ ·
Meta has open sourced their CTran library, which natively works with AMD & NVIDIA GPUs 🚀. Previously, if you wanted multiple NVIDIA GPUs to work together on a workload, you had to use NVIDIA's NCCL library. Although NCCL's source code is public, it does not have an open governance model, does not have open CI, employs a "code dump" update model, is not GitHub-first, and rarely accepts external contributions. On AMD, you had to use RCCL, a delayed fork of NVIDIA's NCCL.

With CTran, it is one unified library, and new algorithms like Bruck's can be added in a way that shares code across different AI GPU types. Furthermore, Meta has open sourced NCCLX (NCCL extended), their production-tested collective library that powered all Llama training and uses the unified CTran library. Meta is the creator & main maintainer of PyTorch and is well trusted in the open source community.

NVIDIA continues to be the leader in collective libraries, but Jensen must not take it for granted given the heavily increased competition in the open source collective communication space. Just like how TRTLLM moved to GitHub-first development when facing heavy competition from SGLang/vLLM, Jensen should seriously consider moving NCCL to a GitHub-first open development model given the competition on the collectives front too.

To draw a parallel with the inference engine world, collective communication libraries are moving from the 2021 "FasterTransformer" era to the 2025 "SGLang/vLLM/TRTLLM" era. The main competitors in the collective library space include China's DeepEP library, AMD's new MORI, AMD's upcoming MORI-CCL, Meta's CTran & NCCLX, and NVIDIA's NCCL (which has released its new NCCL Device API, GPU-initiated networking, etc). Competition breeds innovation! 🚀
SemiAnalysis tweet media
2
43
332
30.1K
Tristan Rice retweeted
Mark Saroufim @marksaroufim ·
If you’re excited about optimizing code that runs equally well on a single or thousands of GPUs and if you have the ability to submit a single substantial PR to a major OSS library, we want you on the PyTorch team - especially if you’re early in your career.
5
32
281
60.9K
Tristan Rice retweeted
Soumith Chintala @soumithchintala ·
If GPU optimization and systems problems excite you, why limit your impact to a single company or lab? Working on PyTorch allows you to ship impact to the entire AI industry! We're hiring across experience levels, junior and senior engineers alike. Read more below 👇
Mark Saroufim @marksaroufim

If you’re excited about optimizing code that runs equally well on a single or thousands of GPUs and if you have the ability to submit a single substantial PR to a major OSS library, we want you on the PyTorch team - especially if you’re early in your career.

6
36
299
33K
Tristan Rice @rice_fry ·
@StasBekman You could pass `NCCL_DEBUG_SUBSYS=ALLOC`, which may give some info on NCCL's memory allocations
0
0
0
157
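(A minimal sketch of the suggestion above: NCCL reads these debug env vars when its communicator is created, so they need to be set before init. The two-GPU torchrun launch is an assumption.)

```python
# Sketch, not authoritative. Assumed launch: torchrun --nproc_per_node=2 script.py
import os

os.environ["NCCL_DEBUG"] = "INFO"          # enable NCCL INFO logging
os.environ["NCCL_DEBUG_SUBSYS"] = "ALLOC"  # filter the output to allocation events

import torch
import torch.distributed as dist

torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
dist.init_process_group("nccl")
x = torch.ones(1, device="cuda")
dist.all_reduce(x)  # first collective triggers NCCL's allocations (and the log lines)
dist.destroy_process_group()
```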
Tristan Rice @rice_fry ·
@StasBekman That buffer is 4MB per *peer*. How many GPUs are you running with, and what network topology and collectives are you using? I'm trying to figure out if we can track this, but AFAIK the PyTorch profiler tracks PyTorch allocations and we can't inspect NCCL's internal allocations
1
0
2
179
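(Back-of-envelope sketch from the per-peer figure above; only the 4MB/peer number comes from the tweet, and the real total also depends on protocol, channel count, and topology.)

```python
# Rough estimate only: NCCL transport buffers scale with peer count.
world_size = 8    # assumed number of ranks
per_peer_mb = 4   # per-peer buffer size quoted in the tweet above
estimate_mb = (world_size - 1) * per_peer_mb
print(f"~{estimate_mb} MB of NCCL buffers per rank with {world_size} GPUs")
```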
Stas Bekman @StasBekman ·
While analyzing GPU memory unaccounted for by the torch memory profiler today, I discovered torch.distributed takes a nice bite out of available GPU memory, and slightly more memory with more GPUs. For details and a script to reproduce, see: github.com/stas00/ml-engi…
Stas Bekman tweet media
2
1
20
2K
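(To make the "unaccounted" part concrete, here's a hedged sketch, not Stas's actual script, comparing what the device reports against what the PyTorch allocator tracks. The untracked gap includes the CUDA context as well as NCCL's buffers. Assumed launch: torchrun --nproc_per_node=2.)

```python
import os
import torch
import torch.distributed as dist

def report(tag: str) -> None:
    free, total = torch.cuda.mem_get_info()              # what the device reports
    used_mib = (total - free) / 2**20
    reserved_mib = torch.cuda.memory_reserved() / 2**20  # what the allocator tracks
    print(f"{tag}: device used ~{used_mib:.0f} MiB, "
          f"allocator reserved {reserved_mib:.0f} MiB, "
          f"untracked ~{used_mib - reserved_mib:.0f} MiB")

torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
report("before init")
dist.init_process_group("nccl")
x = torch.ones(1024, device="cuda")
dist.all_reduce(x)          # first collective triggers NCCL's lazy allocations
torch.cuda.synchronize()
report("after first collective")
dist.destroy_process_group()
```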
Tristan Rice @rice_fry ·
@alex_peys @main_horse @PyTorch I just double checked: the DDP wrapper and any dist.* operation throw an error if init_process_group hasn't been called. torchrun working with any script, even if it's not distributed, is actually pretty useful -- I've used it for doing bulk inference from time to time
Tristan Rice tweet media
1
0
1
80
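(A sketch of that bulk-inference trick: torchrun is used purely as a process launcher, sharding work via the RANK/WORLD_SIZE env vars it sets, with no init_process_group at all. The model and data here are stand-ins, not anything from the thread.)

```python
# Assumed launch: torchrun --nproc_per_node=8 infer.py
import os
import torch
from torch import nn

rank = int(os.environ.get("RANK", "0"))
world_size = int(os.environ.get("WORLD_SIZE", "1"))
device = "cuda" if torch.cuda.is_available() else "cpu"
if device == "cuda":
    torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", "0")))

model = nn.Linear(16, 4).to(device).eval()           # stand-in for a real model
batches = [torch.randn(32, 16) for _ in range(100)]  # stand-in input data

with torch.inference_mode():
    for i, batch in enumerate(batches[rank::world_size]):  # strided shard per rank
        out = model(batch.to(device))
        torch.save(out.cpu(), f"out_rank{rank}_{i}.pt")    # each rank writes its own outputs
```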
alex peysakhovich @alex_peys ·
@main_horse @PyTorch i mean, i typically have if statements to wrap in ddp for single gpu testing, not sure if just pure ddp would have yelled at me here
1
0
0
108
alex peysakhovich @alex_peys ·
hilarious bug/feature is that @PyTorch allows you to torchrun processes without an init_process_group line in the code, so it runs, but you silently just get K independent copies of your model training - then you debug why your loss takes so long to go down
1
0
3
666
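(One cheap way to catch the failure mode alex describes, sketched under the assumption you call it early in your training script: if torchrun launched multiple workers but no process group exists, fail loudly instead of silently training K copies.)

```python
import os
import torch.distributed as dist

def assert_distributed_is_intentional() -> None:
    # torchrun always sets WORLD_SIZE; > 1 means a multi-worker launch.
    launched_multi = int(os.environ.get("WORLD_SIZE", "1")) > 1
    if launched_multi and not dist.is_initialized():
        raise RuntimeError(
            "Launched with torchrun but init_process_group() was never called; "
            "you'd silently train WORLD_SIZE independent copies of the model."
        )
```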
Tristan Rice @rice_fry ·
@StasBekman You can also control the NCCL buffer size via the NCCL_BUFFSIZE env var: docs.nvidia.com/deeplearning/n…
1
0
1
62
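(Sketch of the knob linked above: NCCL_BUFFSIZE is in bytes and defaults to 4 MiB. Like the debug vars, it has to be set before the communicator is created; smaller buffers save memory at some cost to collective throughput. The distributed launch is assumed.)

```python
import os

# Must be set before NCCL creates its communicator (i.e. before the first
# collective, or before eager init). Value is in bytes; the default is 4 MiB.
os.environ["NCCL_BUFFSIZE"] = str(1 << 20)  # try 1 MiB instead of the 4 MiB default

import torch.distributed as dist

dist.init_process_group("nccl")  # assumes a torchrun-style launch environment
```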
Tristan Rice @rice_fry ·
@StasBekman This is with the NCCL backend? I expect this is due to NCCL's lazy init: NCCL doesn't create the pairs/buffers until the first call under normal usage. Can you try calling init_process_group w/ device_id, which should cause eager initialization of the memory?
1
0
0
119
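(What the eager-init suggestion looks like in code; the device_id argument exists in recent PyTorch releases, and the LOCAL_RANK plumbing is the usual torchrun convention, not something specific to this thread.)

```python
import os
import torch
import torch.distributed as dist

local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
# Passing device_id binds the rank to a GPU and eagerly creates the NCCL
# communicator, so its buffer memory shows up at init time rather than at
# the first collective.
dist.init_process_group("nccl", device_id=torch.device("cuda", local_rank))
```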
Tristan Rice retweeted
🇺🇦TheSlava @b0noi ·
Meta has unveiled a resilient training solution for large models with PyTorch: github.com/pytorch-labs/t… 🚀 Even better, their detailed design doc is publicly available: docs.google.com/document/d/1OZ… The solution works generically at the replica level, silencing unhealthy replicas & reintegrating them when healthy. It's simple, yes, which means many elasticity strategies are not possible, but at the same time it's highly generalizable and reliable 💡 #AI #PyTorch
0
1
5
943
Tristan Rice retweeted
Arthur Douillard @Ar_Douillard ·
one more implementation of DiLoCo to do distributed training! @PyTorch's TorchFT fault tolerance package has an implementation of DiLoCo. Hopefully a Streaming DiLoCo soon too?
Arthur Douillard tweet media
2
7
69
3.6K
Stas Bekman @StasBekman ·
The @PyTorch team is working on a new, super important tool: github.com/pytorch-labs/t… This repository implements techniques for per-step fault tolerance, so you can keep training when errors occur without interrupting the entire training job. Some big companies already have this as a proprietary solution, so it's great that this capability is being built for the rest of us. It's a prototype at the moment that already works for DDP
20
129
839
46K
Tristan Rice @rice_fry ·
@timzaman @StasBekman @PyTorch Likewise! I finally get to build the full PyTorch fault tolerance system I've always wanted, haha. Let me know if you'd like to chat -- I'm sure you have a lot of insights into fault tolerance at scale
0
0
0
126
Tristan Rice @rice_fry ·
@StasBekman @timzaman @PyTorch That's the plan for now -- this repo is a staging ground for these fault tolerance approaches. If things just make sense, we'll upstream it into standard PyTorch Distributed. There's a number of changes we're making in PTD to make torchft (and FT in general) work better
1
0
5
181
Stas Bekman @StasBekman ·
@timzaman @PyTorch I hear you, Tim. This makes sense. I guess the suggestion should be made for the open source training frameworks to build fault-tolerance in.
1
0
1
338
Tristan Rice @rice_fry ·
@mattydtweetz @StasBekman @PyTorch The other consideration is that if you have a differing number of workers, some workers may finish their shard of the dataset before others. We've discussed writing a fault tolerant distributed dataloader that can automatically rebalance, but it's still in the idea phase
0
0
1
94
Tristan Rice @rice_fry ·
@mattydtweetz @StasBekman @PyTorch That's the case with the stock PyTorch dataloader, but it's not a hard requirement. On error the "should_commit" operation returns false and we discard the step -- with a custom dataloader you can detect this and reuse the batch instead
1
0
2
116
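(A rough sketch of the reuse-the-batch pattern described above. The should_commit name follows torchft's design as mentioned in the tweet, but the manager stand-in, model, and loop here are illustrative assumptions, not torchft's documented API.)

```python
import torch
from torch import nn

# Stand-ins so the sketch runs without torchft; in real use, `manager`
# would be a torchft Manager and should_commit() would reflect whether
# the step survived across replicas.
class FakeManager:
    def should_commit(self) -> bool:
        return True  # pretend every step commits

manager = FakeManager()
model = nn.Linear(8, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
dataloader = [torch.randn(4, 8) for _ in range(10)]  # toy batches

batch = None
data_iter = iter(dataloader)
for _ in range(len(dataloader)):
    if batch is None:                  # only advance if the last step committed
        batch = next(data_iter)
    optimizer.zero_grad()
    loss = model(batch).pow(2).mean()  # placeholder forward + loss
    loss.backward()
    optimizer.step()
    if manager.should_commit():        # step survived: safe to move on
        batch = None
    # else: keep `batch` and retry it next iteration instead of discarding it
```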
Randall @R_Z_S ·
@alanlcit @rice_fry I turned off forward collision warnings because they were falsely affecting my driving score - and the app still records them.
Ashby, MA 🇺🇸
1
0
0
100
Tristan Rice @rice_fry ·
Got a sample of the Tesla Insurance telemetry data. The insurance records are on a per-drive basis. Here are the fields:
* Unique Drive ID
* Record Version
* Car Firmware Version
* Driver Profile Name
* Start / End Time
* Drive Duration
* Start / End Odometer
(1/2)
25
97
550
0
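(Purely illustrative: the per-drive fields listed above written out as a record type. Names, types, and units are my guesses, not Tesla's actual schema.)

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class DriveRecord:
    drive_id: str              # Unique Drive ID
    record_version: int        # Record Version
    firmware_version: str      # Car Firmware Version
    driver_profile_name: str   # Driver Profile Name
    start_time: datetime       # Start Time
    end_time: datetime         # End Time
    duration_s: float          # Drive Duration (units assumed: seconds)
    start_odometer_mi: float   # Start Odometer (units assumed: miles)
    end_odometer_mi: float     # End Odometer (units assumed: miles)
```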
Tristan Rice @rice_fry ·
@dav_ell The Plenoxels paper is a good read (alexyu.net/plenoctrees/), though it uses a sparse voxel octree, which isn't feasible to output from a dense NN model. A voxel grid + SH as in github.com/facebookresear… would be cool to explore
0
0
0
284
Tristan Rice @rice_fry ·
@dav_ell Yup -- that's basically how NeRF works with a voxel representation. I've tried this a couple of times but my attempts haven't worked out super well -- might be good as an extra loss. Using spherical harmonics would be a better option than my approach of one color per voxel
Tristan Rice tweet media
1
0
0
341
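(A hedged sketch of the voxel-grid-plus-SH idea from this thread: a dense grid storing degree-1 spherical harmonic coefficients per voxel, trilinearly sampled, with view-dependent color from the SH basis. Shapes, the sigmoid squash, and the sign convention are illustrative; real systems like Plenoxels use higher SH degrees and also store density.)

```python
import torch
import torch.nn.functional as F

C0, C1 = 0.28209479, 0.48860251  # real SH constants for l=0 and l=1

def sh_color(grid: torch.Tensor, pts: torch.Tensor, dirs: torch.Tensor) -> torch.Tensor:
    """grid: [1, 12, D, H, W] = 4 SH coeffs x RGB per voxel;
    pts: [N, 3] in [-1, 1] (grid_sample convention); dirs: [N, 3] unit vectors."""
    # Trilinearly sample the per-voxel SH coefficients at the query points.
    coeffs = F.grid_sample(grid, pts.view(1, -1, 1, 1, 3), align_corners=True)
    coeffs = coeffs.reshape(3, 4, -1)  # [RGB, coeff, N]
    x, y, z = dirs.unbind(-1)
    # Degree-1 SH basis evaluated in the view direction (one common convention).
    basis = torch.stack([torch.full_like(x, C0), C1 * y, C1 * z, C1 * x])  # [4, N]
    return torch.sigmoid((coeffs * basis).sum(dim=1)).T  # [N, 3] colors in [0, 1]

# Toy usage: random grid, a few query points and view directions.
g = torch.randn(1, 12, 16, 16, 16)
p = torch.rand(5, 3) * 2 - 1
d = F.normalize(torch.randn(5, 3), dim=-1)
print(sh_color(g, p, d).shape)  # torch.Size([5, 3])
```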
Tristan Rice @rice_fry ·
Curious what I've been up to in the past 6 months? 😅 I've been working on a novel approach to depth and occupancy understanding for my FSD models! It's much simpler than existing techniques and directly learns the 3D representation ⬇️
Tristan Rice tweet media
12
35
305
0