

Multi-cloud GPU scheduling without losing your mind. Inside our @skypilot_org setup: how we route GPU training workloads across providers, keep scheduling fair, and let engineers stay close to the metal instead of buried in cloud consoles.
SkyPilot

@skypilot_org
Run, manage, and scale AI workloads on any AI infrastructure. Open-source system for all your AI compute — Kubernetes, Slurm, VMs, 20+ clouds.



















🚀 The H Company Tech Stack: Part 1
We are excited to launch a new series of technical deep dives into the AI Tech Stack powering H Company. Over the coming weeks, we’ll be sharing how we build, scale, and optimize the infrastructure behind our Holo frontier models.
First up: Unlocking Online RL and AI Workflows on K8s using SkyPilot. (1/5🧵)


Karpathy's Autoresearch is bottlenecked by a single GPU. We removed the bottleneck.
We gave the agent access to our K8s cluster with H100s and H200s and let it provision its own GPUs. Over 8 hours:
• ~910 experiments instead of ~96 sequentially
• Discovered that scaling model width mattered more than all hparam tuning
• Taught itself to exploit heterogenous hardware: use H200s for validation, screen ideas on H100s
Full setup and results: blog.skypilot.co/scaling-autore… @karpathy
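For scale, the figures in that post work out as follows. This is only a back-of-the-envelope sketch: the two experiment counts and the 8-hour window come from the post, both counts are approximate, and the variable names are ours.

```python
# Approximate figures quoted in the post above.
HOURS = 8
PARALLEL_EXPERIMENTS = 910    # "~910 experiments" with self-provisioned GPUs
SEQUENTIAL_EXPERIMENTS = 96   # "~96 sequentially" on a single GPU

# Experiments completed per hour in each setup.
parallel_rate = PARALLEL_EXPERIMENTS / HOURS
sequential_rate = SEQUENTIAL_EXPERIMENTS / HOURS

# Overall throughput gain from parallel provisioning.
speedup = PARALLEL_EXPERIMENTS / SEQUENTIAL_EXPERIMENTS

if __name__ == "__main__":
    print(f"sequential: {sequential_rate:.1f} experiments/hour")
    print(f"parallel:   {parallel_rate:.1f} experiments/hour")
    print(f"speedup:    {speedup:.1f}x")
```

That is roughly a 9.5x gain in experiment throughput over the same wall-clock window.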











mjlab is RL with fairy dust. Oh my, it's so much fun.
It takes no time to set up the training machine, because you literally don't need to set up a training machine: it assigns you an ad-hoc GPU via SkyPilot.
No containers, and the software is slim: a single git clone and you can start.
Post-training data is immediately available on wandb, long after the cloud machine was ditched to save cost. Checkpoints, ONNX files, and training logs are automatically uploaded.
Visualization of results is available locally via the viser web viewer, even if you have no GPU. No need to remote desktop anything.
Effectively, it feels like I am training on my MacBook (below is Q1 Hoper learning to fly).

mjlab now supports cloud training via SkyPilot. One command launches a GPU instance, syncs your code, trains, and tears down when done. We support 2 modes: direct uv install and Docker. Multi-GPU and hyperparameter sweeps work out of the box. mujocolab.github.io/mjlab/main/sou…
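A minimal sketch of what a "launch, sync, train, tear down" task might look like. The top-level field names (`resources`, `workdir`, `setup`, `run`) follow SkyPilot's task YAML format, but the task name, accelerator choice, and commands are placeholder assumptions, not mjlab's actual configuration; see the linked docs for the real one.

```python
import json

# Hypothetical task definition. Field names follow SkyPilot's task YAML;
# every value below is an illustrative placeholder.
task = {
    "name": "train-sketch",                   # hypothetical task name
    "resources": {"accelerators": "H100:1"},  # one H100; assumed GPU choice
    "workdir": ".",                           # local code synced to the instance
    "setup": "uv sync",                       # install dependencies once
    "run": "uv run train.py",                 # hypothetical training entry point
}


def render_task() -> str:
    """Render the task as JSON, which is also valid YAML."""
    return json.dumps(task, indent=2)


if __name__ == "__main__":
    # Save the output as task.yaml, then launch and auto-teardown with:
    #   sky launch task.yaml --down
    print(render_task())
```

The `--down` flag asks SkyPilot to tear the cluster down once the job finishes, which matches the "tears down when done" behavior described above.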



