
Tristan Rice
@rice_fry
Machine Learning + Distributed Systems + Hardware Hacking. SWE @pytorch; tweets are personal opinions. https://t.co/419A7MGhlH I don't use Twitter much anymore.




torchft + TorchTitan: 1200+ failures, no checkpoints, model convergence. A Llama 3 model was trained across 300 L40S GPUs with synthetic failures every 15s. No restarts. No rollbacks. Just asynchronous recovery and continued progress. 📘 hubs.la/Q03t1Z0b0 #PyTorch #DistributedTraining #FaultTolerance #OpenSourceAI

If you’re excited about optimizing code that runs equally well on a single GPU or on thousands, and if you can submit a single substantial PR to a major OSS library, we want you on the PyTorch team - especially if you’re early in your career.
