clockworkio

305 posts

@clockworkio

https://t.co/BXAeZpbDtP builds software that optimizes GPU clusters for fault tolerance, deterministic performance, and increased utilization.

Palo Alto, California · Joined April 2021
94 Following · 74 Followers
Pinned Tweet
clockworkio @clockworkio ·
Today we’re launching TorchPass — Workload Fault Tolerance. GPU failures, network disruptions, driver bugs — every fault forces a full job restart. Hours of compute, gone. TorchPass makes faults invisible to the workload. Training continues. No restarts. No lost progress.
clockworkio @clockworkio ·
AI is no longer a workload. It's infrastructure that must not stop. Performance × Reliability = Usable AI Infrastructure. The future isn't faster systems. It's systems that keep running.
clockworkio @clockworkio ·
GTC 2026 made one thing unmistakable: AI infrastructure is shifting from compute → continuous compute systems. But there's a catch 👇
clockworkio @clockworkio ·
The takeaway from the room: AI infrastructure is shifting from optimizing for performance → to engineering continuous workload progress under failure. That's where the next wave is already happening. Thanks @JordanNanos @dylan522p Michelle Shen @SemiAnalysis_ 🙏
clockworkio @clockworkio ·
The data backs it up: fault-injection testing shows that preserving in-flight progress avoids recomputation, maintains higher throughput, and completes training runs in roughly half the time vs. checkpoint-restart.
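The gap between the two approaches can be seen with a back-of-envelope model of time lost per fault. The numbers below are illustrative assumptions, not clockworkio's benchmark figures: checkpoint-restart pays, on average, half a checkpoint interval of recomputed work plus the job relaunch time, while live migration pays only the stall while the failed rank moves.

```python
# Back-of-envelope model: wall-clock hours lost per fault.
# All inputs (checkpoint interval, restart time, migration stall)
# are illustrative assumptions, not measured values.

def restart_overhead(checkpoint_interval_h: float, restart_h: float) -> float:
    """Checkpoint-restart: on average, half an interval of work is
    rolled back and recomputed, plus the reload/relaunch time."""
    return checkpoint_interval_h / 2 + restart_h

def migration_overhead(migration_stall_h: float) -> float:
    """Live migration: only the stall while the failed rank is
    replaced by a spare; no rollback, no recomputation."""
    return migration_stall_h

# Assumed: hourly checkpoints, 15-minute restart, 3-minute migration stall.
print(restart_overhead(1.0, 0.25))  # 0.75 hours lost per fault
print(migration_overhead(0.05))     # 0.05 hours lost per fault
```

Under these assumed inputs the per-fault cost differs by an order of magnitude; as fault frequency rises with cluster size, that gap compounds into the large end-to-end differences the fault-injection tests report.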
clockworkio @clockworkio ·
For AI cloud operators, predictable time-to-train isn’t a feature. It’s a commercial promise.
clockworkio @clockworkio ·
@nscale's CTO on TorchPass: “It replaces any failing GPU and keeps the rest of the job moving… Live GPU Migration preserved run continuity and throughput under real fault conditions.”
clockworkio @clockworkio ·
The bigger shift: stop designing around restarts. Remove checkpointing from training code. Larger batch sizes. Lower OOM risk. Predictable time-to-train. Not an optimization — a different architecture for reliability. Full benchmarks → na2.hubs.ly/H04gr5y0
clockworkio @clockworkio ·
In a 1,024-GPU cluster, a GPU failure hits roughly every 8 hours. At 16,384 GPUs? Under 2 hours. The standard fix: checkpoint restart. Roll back, reload, recompute. We built TorchPass to migrate the failed rank to a spare GPU. Resume at the same step. No rollback.
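The scaling behind those failure rates follows from a standard assumption: if GPU failures are independent and roughly exponential, cluster MTBF shrinks linearly with GPU count. A minimal sketch, taking the tweet's 8-hours-per-failure figure at 1,024 GPUs as the starting point (the implied per-GPU MTBF is an inference, not a published spec):

```python
# Sketch: cluster mean time between failures (MTBF), assuming
# independent, exponentially distributed per-GPU failures.

def cluster_mtbf(per_gpu_mtbf_hours: float, num_gpus: int) -> float:
    """Expected hours between failures anywhere in the cluster."""
    return per_gpu_mtbf_hours / num_gpus

# A failure every ~8 hours at 1,024 GPUs implies a per-GPU MTBF
# of 8 * 1024 = 8,192 hours (roughly 11 months).
per_gpu = 8.0 * 1024

print(cluster_mtbf(per_gpu, 1024))   # 8.0 hours between failures
print(cluster_mtbf(per_gpu, 16384))  # 0.5 hours between failures
```

Under this simple linear model, 16,384 GPUs would see a failure roughly every 30 minutes — comfortably inside the "under 2 hours" the tweet cites, and a pace at which any scheme that restarts the whole job becomes untenable.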