Pinned Tweet
clockworkio
305 posts

clockworkio
@clockworkio
https://t.co/BXAeZpbDtP builds software that optimizes GPU clusters for fault tolerance, deterministic performance and increased utilization.
Palo Alto, California Joined April 2021
94 Following 74 Followers

The takeaway from the room:
AI infrastructure is shifting from optimizing for performance to engineering continuous workload progress under failure.
That's where the next wave is already happening.
Thanks @JordanNanos @dylan522p Michelle Shen @SemiAnalysis_ 🙏




Last night at @NVIDIAGTC, @SemiAnalysis_ and @clockworkio co-hosted a dinner on the topic most GPU operators are quietly dealing with: fault tolerance at scale.

@nscale's CTO on TorchPass:
“It replaces any failing GPU and keeps the rest of the job moving… Live GPU Migration preserved run continuity and throughput under real fault conditions.”

The bigger shift: stop designing around restarts.
Remove checkpointing from training code. Larger batch sizes. Lower OOM risk. Predictable time-to-train.
Not an optimization, but a different architecture for reliability.
Full benchmarks → na2.hubs.ly/H04gr5y0


