clockworkio
@clockworkio
305 posts

https://t.co/BXAeZpbDtP builds software that optimizes GPU clusters for fault tolerance, deterministic performance and increased utilization.

Palo Alto, California · Joined April 2021
94 Following · 74 Followers
Pinned Tweet
clockworkio @clockworkio
Today we’re launching TorchPass — Workload Fault Tolerance. GPU failures, network disruptions, driver bugs — every fault forces a full job restart. Hours of compute, gone. TorchPass makes faults invisible to the workload. Training continues. No restarts. No lost progress.
clockworkio @clockworkio
AI is no longer a workload. It's infrastructure that must not stop. Performance × Reliability = Usable AI Infrastructure. The future isn't faster systems. It's systems that keep running.
clockworkio @clockworkio
GTC 2026 made one thing unmistakable: AI infrastructure is shifting from compute → continuous compute systems. But there's a catch 👇
clockworkio @clockworkio
The takeaway from the room: AI infrastructure is shifting from optimizing for performance → to engineering continuous workload progress under failure. That's where the next wave is already happening. Thanks @JordanNanos @dylan522p Michelle Shen @SemiAnalysis_ 🙏
clockworkio @clockworkio
The data backs it up: fault-injection testing shows preserving in-flight progress avoids recomputation, maintains higher throughput, and completes training runs in roughly half the time vs. checkpoint-restart.
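A toy model of where that roughly-2x can come from, with made-up parameters (the fault cadence, reload cost, and migration pause below are this sketch's assumptions, not Clockwork's benchmark numbers): under checkpoint-restart, each fault costs a reload plus, on average, half a checkpoint interval of recompute; under migration, only a short pause.

# Toy slowdown model. Faults land at a fixed wall-clock cadence and each
# adds a fixed amount of non-progress time, so:
#   wall = work + (wall / fail_every) * cost  =>  wall = work / (1 - cost / fail_every)
def wall_clock_h(work_h, fail_every_h, per_fault_cost_h):
    return work_h / (1.0 - per_fault_cost_h / fail_every_h)

WORK_H, FAIL_EVERY_H = 100.0, 2.0  # illustrative large-cluster fault cadence
ckpt = wall_clock_h(WORK_H, FAIL_EVERY_H, 0.5 + 1.0 / 2)  # reload + avg half-interval redo
mig = wall_clock_h(WORK_H, FAIL_EVERY_H, 0.05)            # short migration pause
print(f"checkpoint-restart: {ckpt:.0f} h, migration: {mig:.0f} h")
# -> 200 h vs ~103 h with these illustrative parameters: about 2x.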
clockworkio @clockworkio
For AI cloud operators, predictable time-to-train isn’t a feature. It’s a commercial promise.
clockworkio @clockworkio
@nscale's CTO on TorchPass: “It replaces any failing GPU and keeps the rest of the job moving… Live GPU Migration preserved run continuity and throughput under real fault conditions.”
clockworkio @clockworkio
The bigger shift: stop designing around restarts. Remove checkpointing from training code. Larger batch sizes. Lower OOM risk. Predictable time-to-train. Not an optimization — a different architecture for reliability. Full benchmarks → na2.hubs.ly/H04gr5y0
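For reference, this is the pattern being removed: a minimal sketch of a conventional checkpoint-restart training loop (illustrative PyTorch; the toy model, CKPT_INTERVAL, and "ckpt.pt" are placeholders, not TorchPass code). The claim is that with migration-based fault tolerance this save/reload block, and the memory headroom it consumes, drops out of training code, which is where the larger-batch and lower-OOM-risk points come from.

# Illustrative only: the checkpoint-restart pattern the tweet argues
# should disappear from training code. Toy model and names are
# placeholders, not TorchPass APIs.
import torch

model = torch.nn.Linear(512, 512)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
CKPT_INTERVAL = 100

for step in range(1_000):
    x = torch.randn(32, 512)
    loss = model(x).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % CKPT_INTERVAL == 0:
        # Periodic stall: serialize state so a failure loses at most one
        # interval of work. On a fault, the job restarts, reloads this
        # file, and recomputes every step since the last save.
        torch.save({"step": step,
                    "model": model.state_dict(),
                    "opt": opt.state_dict()}, "ckpt.pt")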
clockworkio @clockworkio
In a 1,024-GPU cluster, a GPU failure hits roughly every 8 hours. At 16,384 GPUs? Under 2 hours. The standard fix: checkpoint restart. Roll back, reload, recompute. We built TorchPass to migrate the failed rank to a spare GPU. Resume at the same step. No rollback.
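A quick sanity check on those intervals, assuming independent GPU failures at a constant rate so that cluster-level MTBF scales as 1/N (the per-GPU MTBF below is derived from the tweet's 1,024-GPU figure, not a published spec):

# Back-of-the-envelope check of the failure-cadence claim above.
PER_GPU_MTBF_H = 8.0 * 1024  # ~8,192 GPU-hours, implied by "every 8 h at 1,024 GPUs"

for n_gpus in (1_024, 16_384):
    # Cluster MTBF = per-GPU MTBF / GPU count under the independence assumption.
    print(f"{n_gpus:>6} GPUs -> a failure roughly every {PER_GPU_MTBF_H / n_gpus:.1f} h")
# 1,024 GPUs -> every ~8.0 h; 16,384 GPUs -> every ~0.5 h,
# consistent with the tweet's "under 2 hours".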