clockworkio
305 posts

clockworkio
@clockworkio
https://t.co/BXAeZpbDtP builds software that optimizes GPU clusters for fault tolerance, deterministic performance and increased utilization.
Palo Alto, California · Joined April 2021
94 Following · 74 Followers

The takeaway from the room:
AI infrastructure is shifting from optimizing for performance → to engineering continuous workload progress under failure.
That's where the next wave is already happening.
Thanks @JordanNanos @dylan522p Michelle Shen @SemiAnalysis_ 🙏




Last night at @NVIDIAGTC, @SemiAnalysis_ and @clockworkio co-hosted a dinner on the topic most GPU operators are quietly dealing with: fault tolerance at scale.

@nscale's CTO on TorchPass:
“It replaces any failing GPU and keeps the rest of the job moving… Live GPU Migration preserved run continuity and throughput under real fault conditions.”

The bigger shift: stop designing around restarts.
Remove checkpointing from training code. Larger batch sizes. Lower OOM risk. Predictable time-to-train.
Not an optimization — a different architecture for reliability.
Full benchmarks → na2.hubs.ly/H04gr5y0


