clockworkio
@clockworkio

340 posts
https://t.co/BXAeZpbDtP builds software that optimizes GPU clusters for fault tolerance, deterministic performance and increased utilization.

Palo Alto, California · Joined April 2021
94 Following · 78 Followers
Pinned Tweet
clockworkio @clockworkio
Today we’re launching TorchPass — Workload Fault Tolerance. GPU failures, network disruptions, driver bugs — every fault forces a full job restart. Hours of compute, gone. TorchPass makes faults invisible to the workload. Training continues. No restarts. No lost progress.

clockworkio @clockworkio
92% of the cost difference between GPU cloud providers has nothing to do with GPU pricing. It’s goodput: how much of your cluster is doing useful work vs recovering from failures.
SemiAnalysis modeled a 5,184-GPU GB300 NVL72 cluster:
• TorchPass: 6% goodput expense
• Checkpointless: 10.53%
• Checkpoint restart: 20.91%
At Llama 3 scale, checkpoint restart can leave ~80% of GPUs idle after repeated failures. As @SemiAnalysis_ put it, TorchPass is “the closest we've seen to what Frontier Labs are using at this scale.” Infrastructure efficiency is becoming the real AI scaling advantage. #AIInfrastructure #GPUClusters #Goodput #FaultTolerance #semianalysislinuxwebinar
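For a rough sense of scale, here is a minimal sketch that turns those goodput-expense percentages into dollars. The 6% / 10.53% / 20.91% figures are from the tweet above; the cluster size, $/GPU-hr rate, and run length are illustrative assumptions, not SemiAnalysis inputs.

```python
# Minimal sketch: goodput-expense percentage -> dollars lost to failure recovery.
# Cluster size, price, and run length are assumed values for illustration.
GPUS = 5184              # GB300 NVL72 cluster size modeled in the tweet
PRICE_PER_GPU_HR = 4.0   # assumed $/GPU-hr
HOURS = 24 * 30          # assumed one-month pretraining run

total_spend = GPUS * PRICE_PER_GPU_HR * HOURS

for stack, expense in [("TorchPass", 0.06),
                       ("Checkpointless", 0.1053),
                       ("Checkpoint restart", 0.2091)]:
    print(f"{stack:>18}: ${total_spend * expense:,.0f} "
          f"of ${total_spend:,.0f} lost to failure recovery")
```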

clockworkio @clockworkio
Why it works at scale: soft failures dominate above 1k GPUs. ECCs accumulating, GPU off the bus, NVLink errors, link flaps, slow-degrading nodes. Scheduler catches the signal before NCCL stalls. You migrate while the node can still hand off state cleanly.
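As a hedged illustration of what catching those signals before a collective stalls could look like, the sketch below scans the kernel log for suspect NVIDIA Xid events. The Xid codes and the flag-for-migration logic are assumptions for illustration, not TorchPass's actual detection path.

```python
# Hedged sketch: flag nodes showing "soft failure" signals (e.g. Xid events
# such as GPU fall-off-bus or double-bit ECC) so a migration can be planned
# before NCCL stalls. Codes and logic here are illustrative assumptions.
import re
import subprocess

SUSPECT_XIDS = {
    79: "GPU has fallen off the bus",
    48: "double-bit ECC error",
}

def suspect_xid_events():
    """Return (xid, description) pairs found in the kernel log."""
    log = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
    hits = []
    for line in log.splitlines():
        match = re.search(r"NVRM: Xid .*: (\d+),", line)
        if match and int(match.group(1)) in SUSPECT_XIDS:
            xid = int(match.group(1))
            hits.append((xid, SUSPECT_XIDS[xid]))
    return hits

for xid, reason in suspect_xid_events():
    print(f"Xid {xid} ({reason}): flag node for planned migration")
```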

clockworkio @clockworkio
At 4,096 GPUs, your training job's MTBF is hours, not days. Every interruption costs (t_id + t_failover)·j_size + t_repair·b_radius idle GPU-hours, times $/GPU-hr. At $4/GPU-hr, an hour of idle 4k GPUs is ~$16k. Per incident.🧵
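A hedged sketch of that cost-per-incident arithmetic: the variable names follow the tweet, and the sample values are assumptions chosen to reproduce the ~$16k figure.

```python
# Hedged sketch of the cost-per-incident formula above:
#   cost = ((t_id + t_failover) * j_size + t_repair * b_radius) * price_per_gpu_hr
# Sample values below are illustrative assumptions.
def incident_cost(t_id, t_failover, t_repair, j_size, b_radius, price_per_gpu_hr):
    """Dollar cost of one failure event.

    t_id       -- hours spent identifying the fault
    t_failover -- hours spent failing over / restarting
    t_repair   -- hours the failed hardware stays out of service
    j_size     -- GPUs in the job (idle during identification + failover)
    b_radius   -- GPUs in the blast radius (idle until repair completes)
    """
    idle_gpu_hours = (t_id + t_failover) * j_size + t_repair * b_radius
    return idle_gpu_hours * price_per_gpu_hr

# One hour of detection + failover on a 4,096-GPU job at $4/GPU-hr:
print(incident_cost(0.5, 0.5, 0.0, 4096, 0, 4.0))  # 16384.0, i.e. ~$16k
```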

clockworkio @clockworkio
SemiAnalysis, verbatim: "the only option that maintains the same training performance as jobs without fault tolerance" Mechanism matters here: scheduler sees soft failures (ECCs, fall-off-bus Xids, link flaps) before NCCL stalls. Migration is planned, not reactive.

clockworkio @clockworkio
@SemiAnalysis_ benchmarked fault-tolerant training frameworks on a 5,184-GPU GB300 pretrain. Same hardware, same $/GPU-hr, three FT stacks. Goodput expense swings from 6.14% to 20.91% of TCO based on the stack alone. 🧵

clockworkio @clockworkio
At cluster scale, failures aren’t edge cases—they’re constant. The real question isn’t if training fails. It’s which tradeoffs you’re making when it does:
Restart → lose progress
Migration → resume same step
Per-step FT → change semantics
Most teams choose without clear data. We break down the tradeoffs. → na2.hubs.ly/H04NbQH0

clockworkio @clockworkio
The result: faster convergence, better experiment velocity, lower cost per trained model. AI performance is no longer just about FLOPs. It's about how much progress you don't lose.

clockworkio @clockworkio
AI workloads don't fail like batch jobs. They fail mid-run, mid-epoch, mid-token — in systems expected to stay up. 🧵