clockworkio

305 posts

@clockworkio

https://t.co/BXAeZpbDtP builds software that optimizes GPU clusters for fault tolerance, deterministic performance and increased utilization.

Palo Alto, California · Joined April 2021

94 Following · 74 Followers

Pinned Tweet

clockworkio @clockworkio · 11 Mar

Today we’re launching TorchPass — Workload Fault Tolerance. GPU failures, network disruptions, driver bugs — every fault forces a full job restart. Hours of compute, gone. TorchPass makes faults invisible to the workload. Training continues. No restarts. No lost progress.

clockworkio @clockworkio · 9h

na2.hubs.ly/H04tKzK0

clockworkio @clockworkio · 9h

AI is no longer a workload. It's infrastructure that must not stop. Performance × Reliability = Usable AI Infrastructure. The future isn't faster systems. It's systems that keep running.

clockworkio @clockworkio · 9h

GTC 2026 made one thing unmistakable: AI infrastructure is shifting from compute → continuous compute systems. But there's a catch 👇

clockworkio @clockworkio · 18 Mar

The takeaway from the room: AI infrastructure is shifting from optimizing for performance to engineering continuous workload progress under failure. That's where the next wave is already happening. Thanks @JordanNanos @dylan522p Michelle Shen @SemiAnalysis_ 🙏

clockworkio @clockworkio · 18 Mar

The data backs it up: fault-injection testing shows that preserving in-flight progress avoids recomputation, maintains higher throughput, and completes training runs in roughly half the time of checkpoint-restart.

clockworkio @clockworkio · 18 Mar

Last night at @NVIDIAGTC, @SemiAnalysis_ and @clockworkio co-hosted a dinner on the topic most GPU operators are quietly dealing with: fault tolerance at scale.

clockworkio @clockworkio · 13 Mar

Read more: na2.hubs.ly/H04gzw30

clockworkio @clockworkio · 13 Mar

For AI cloud operators, predictable time-to-train isn’t a feature. It’s a commercial promise.

clockworkio @clockworkio · 13 Mar

@nscale's CTO on TorchPass: “It replaces any failing GPU and keeps the rest of the job moving… Live GPU Migration preserved run continuity and throughput under real fault conditions.”

clockworkio @clockworkio · 13 Mar

@JordanNanos #AITraining #GPUClusters

clockworkio @clockworkio · 13 Mar

The bigger shift: stop designing around restarts. Remove checkpointing from training code. Larger batch sizes. Lower OOM risk. Predictable time-to-train. Not an optimization, but a different architecture for reliability. Full benchmarks → na2.hubs.ly/H04gr5y0

clockworkio @clockworkio · 13 Mar

In a 1,024-GPU cluster, a GPU failure hits roughly every 8 hours. At 16,384 GPUs? Under 2 hours. The standard fix: checkpoint restart. Roll back, reload, recompute. We built TorchPass to migrate the failed rank to a spare GPU. Resume at the same step. No rollback.
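
The scaling in the post above can be sketched with a back-of-the-envelope model. This is only an illustration, not how the post's figures were derived: it assumes GPUs fail independently at a constant rate, so cluster-wide time between failures shrinks in proportion to GPU count, and it assumes a failure lands uniformly at random inside a checkpoint interval, so a rollback loses half an interval on average. The function names and the per-GPU figure (back-computed from the 1,024-GPU number) are hypothetical.

```python
def cluster_mtbf_hours(per_gpu_mtbf_hours: float, num_gpus: int) -> float:
    """Mean time between failures for the whole cluster.

    Assumes independent GPUs with a constant failure rate, so the
    cluster-wide failure rate is num_gpus times the per-GPU rate.
    """
    return per_gpu_mtbf_hours / num_gpus


def expected_rollback_hours(checkpoint_interval_hours: float) -> float:
    """Average work lost per failure under checkpoint restart.

    Assumes the failure time is uniform within the checkpoint interval;
    reload and restart overhead are ignored.
    """
    return checkpoint_interval_hours / 2.0


# Per-GPU MTBF implied by the post's figure of one failure
# roughly every 8 hours across 1,024 GPUs (an assumption).
PER_GPU_MTBF_HOURS = 8.0 * 1024

for n in (1024, 16384):
    mtbf = cluster_mtbf_hours(PER_GPU_MTBF_HOURS, n)
    print(f"{n} GPUs: about one failure every {mtbf:.2f} h")
```

Under these assumptions the 16,384-GPU interval comes out well inside the post's "under 2 hours"; real fleets will differ with hardware mix, correlated failures, and what is counted as a failure.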
