clockworkio
@clockworkio
305 posts

https://t.co/BXAeZpbDtP builds software that optimizes GPU clusters for fault tolerance, deterministic performance and increased utilization.

Palo Alto, California · Joined April 2021
94 Following · 74 Followers
Pinned Tweet
clockworkio @clockworkio
Today we’re launching TorchPass — Workload Fault Tolerance. GPU failures, network disruptions, driver bugs — every fault forces a full job restart. Hours of compute, gone. TorchPass makes faults invisible to the workload. Training continues. No restarts. No lost progress.
clockworkio @clockworkio
AI is no longer a workload. It's infrastructure that must not stop. Performance × Reliability = Usable AI Infrastructure. The future isn't faster systems. It's systems that keep running.
clockworkio @clockworkio
GTC 2026 made one thing unmistakable: AI infrastructure is shifting from compute → continuous compute systems. But there's a catch 👇
clockworkio @clockworkio
The takeaway from the room: AI infrastructure is shifting from optimizing for performance → to engineering continuous workload progress under failure. That's where the next wave is already happening. Thanks @JordanNanos @dylan522p Michelle Shen @SemiAnalysis_ 🙏
clockworkio @clockworkio
The data backs it up: fault-injection testing shows preserving in-flight progress avoids recomputation, maintains higher throughput, and completes training runs in roughly half the time vs. checkpoint-restart.
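A toy model of where that roughly-2x can come from, with made-up parameters (the fault cadence, reload cost, and migration pause below are this sketch's assumptions, not Clockwork's benchmark numbers): under checkpoint-restart, each fault costs a reload plus, on average, half a checkpoint interval of recompute; under migration, only a short pause.

# Toy slowdown model. Faults land at a fixed wall-clock cadence and each
# adds a fixed amount of non-progress time, so:
#   wall = work + (wall / fail_every) * cost  =>  wall = work / (1 - cost / fail_every)
def wall_clock_h(work_h, fail_every_h, per_fault_cost_h):
    return work_h / (1.0 - per_fault_cost_h / fail_every_h)

WORK_H, FAIL_EVERY_H = 100.0, 2.0  # illustrative large-cluster fault cadence
ckpt = wall_clock_h(WORK_H, FAIL_EVERY_H, 0.5 + 1.0 / 2)  # reload + avg half-interval redo
mig = wall_clock_h(WORK_H, FAIL_EVERY_H, 0.05)            # short migration pause
print(f"checkpoint-restart: {ckpt:.0f} h, migration: {mig:.0f} h")
# -> 200 h vs ~103 h with these illustrative parameters: about 2x.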
clockworkio @clockworkio
For AI cloud operators, predictable time-to-train isn’t a feature. It’s a commercial promise.
clockworkio @clockworkio
@nscale's CTO on TorchPass: “It replaces any failing GPU and keeps the rest of the job moving… Live GPU Migration preserved run continuity and throughput under real fault conditions.”
clockworkio @clockworkio
The bigger shift: stop designing around restarts. Remove checkpointing from training code. Larger batch sizes. Lower OOM risk. Predictable time-to-train. Not an optimization — a different architecture for reliability. Full benchmarks → na2.hubs.ly/H04gr5y0
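For reference, this is the pattern being removed: a minimal sketch of a conventional checkpoint-restart training loop (illustrative PyTorch; the toy model, CKPT_INTERVAL, and "ckpt.pt" are placeholders, not TorchPass code). The claim is that with migration-based fault tolerance this save/reload block, and the memory headroom it consumes, drops out of training code, which is where the larger-batch and lower-OOM-risk points come from.

# Illustrative only: the checkpoint-restart pattern the tweet argues
# should disappear from training code. Toy model and names are
# placeholders, not TorchPass APIs.
import torch

model = torch.nn.Linear(512, 512)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
CKPT_INTERVAL = 100

for step in range(1_000):
    x = torch.randn(32, 512)
    loss = model(x).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % CKPT_INTERVAL == 0:
        # Periodic stall: serialize state so a failure loses at most one
        # interval of work. On a fault, the job restarts, reloads this
        # file, and recomputes every step since the last save.
        torch.save({"step": step,
                    "model": model.state_dict(),
                    "opt": opt.state_dict()}, "ckpt.pt")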
clockworkio @clockworkio
In a 1,024-GPU cluster, a GPU failure hits roughly every 8 hours. At 16,384 GPUs? Under 2 hours. The standard fix: checkpoint restart. Roll back, reload, recompute. We built TorchPass to migrate the failed rank to a spare GPU. Resume at the same step. No rollback.
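A quick sanity check on those intervals, assuming independent GPU failures at a constant rate so that cluster-level MTBF scales as 1/N (the per-GPU MTBF below is derived from the tweet's 1,024-GPU figure, not a published spec):

# Back-of-the-envelope check of the failure-cadence claim above.
PER_GPU_MTBF_H = 8.0 * 1024  # ~8,192 GPU-hours, implied by "every 8 h at 1,024 GPUs"

for n_gpus in (1_024, 16_384):
    # Cluster MTBF = per-GPU MTBF / GPU count under the independence assumption.
    print(f"{n_gpus:>6} GPUs -> a failure roughly every {PER_GPU_MTBF_H / n_gpus:.1f} h")
# 1,024 GPUs -> every ~8.0 h; 16,384 GPUs -> every ~0.5 h,
# consistent with the tweet's "under 2 hours".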