SkyPilot

341 posts

@skypilot_org

Run, manage, and scale AI workloads on any AI infrastructure. Open-source system for all your AI compute — Kubernetes, Slurm, VMs, 20+ clouds.

Sky · Joined October 2022

69 Following · 5.3K Followers

Pinned Tweet

SkyPilot@skypilot_org·26 Jan

Shopify now runs all AI training on SkyPilot. • H200s on @nebiusai, L4s on @googlecloud - multi-cloud AI enabled by SkyPilot • One unified interface for engineers • Cost tracking, InfiniBand support, fair scheduling Excited to support @Shopify's journey to build the future of e-commerce with AI 💪🏻

Shopify Engineering@ShopifyEng

Multi-cloud GPU scheduling without losing your mind. Inside our @skypilot_org setup: how we route GPU training workloads across providers, keep scheduling fair, and let engineers stay close to the metal instead of buried in cloud consoles.

SkyPilot@skypilot_org·5d

Missed the AI Infra Meetup with SkyPilot and @CoreWeave? Recordings are now live. • SkyPilot at 100k+ GPU Scale at Meta AI (FAIR) - Lucca Bertoncini @lbz____ (@AIatMeta): youtu.be/Pas2NE76220 • SkyPilot: Your AI Infra, Frontier Capabilities - Zhanghao Wu @Michaelvll1 + Kevin Mingtarja @kevin_mingtarja (SkyPilot): youtu.be/uwLkrJb5pEA • Training at Scale with Confidence (SkyPilot x CoreWeave SUNK) - Deok Filho @deok_filho (@CoreWeave): youtu.be/vdP0fhqpYZs

SkyPilot@skypilot_org·5d

SkyPilot now works natively with @VAST_Data storage 🤝 Training runs often start with a dead period - copying data while GPUs sit idle. With VAST + SkyPilot, you can skip that entirely. • Mount petabyte-scale data directly. No staging, no prefetch pipelines, no idling • Stream at NVMe speeds to SkyPilot-managed nodes across any Kubernetes, Slurm, or neocloud • Switch compute providers without touching your storage config Read the blog by the @Vast_Data team: vastdata.com/blog/instant-d…

SkyPilot@skypilot_org·6d

Announcing SkyPilot v0.12.0! 🎉 • Agent Skill: coding agents can launch and manage cloud GPUs on their own • Slurm Support: unified interface across Slurm, K8s, and cloud • Job Groups: run heterogeneous parallel workloads as one unit • Recipes: templatize your AI workloads and share them across your team 🔗 Full release notes: github.com/skypilot-org/s…

SkyPilot@skypilot_org·26 Mar

AI Infra Meetup with SkyPilot and CoreWeave - what a night! Packed room, great conversations, and solid talks on scaling AI infra across K8s, Slurm, and neocloud. 🌟 Highlights: @lbz____ (@AIatMeta) shared how Meta unified 100k+ GPUs across Slurm clusters with SkyPilot. @Michaelvll1 + @kevin_mingtarja (@skypilot_org) walked through SkyPilot's new features and ran a live multi-cloud demo. @deok_filho (@CoreWeave) broke down SUNK and the SkyPilot x CoreWeave integration. Huge shoutout to all speakers! Thanks to @CoreWeave @wandb for being amazing partners in making this happen. Excited to keep building this AI infra community. More events coming soon.

SkyPilot@skypilot_org·25 Mar

Today in SF! The AI Infra Meetup with SkyPilot and CoreWeave is happening tonight. Join engineers from @Meta AI, SkyPilot, and @CoreWeave for tech talks on scaling training and batch jobs across K8s, Slurm, and cloud, plus plenty of time to mix and connect. The event is at capacity, but the waitlist is open! Join now: luma.com/h52qyhmt?utm_s…

SkyPilot retweeted

lucca bertoncini@lbz____·24 Mar

Giving a talk this Wednesday (3/25) in SF on SkyPilot at 100k+ GPU scale. Come through if you're around! luma.com/h52qyhmt

SkyPilot@skypilot_org·24 Mar

SkyPilot is now @hcompany_ai's standard AI infrastructure layer. Online RL, previously impossible on Slurm, now runs seamlessly on K8s. Holo 2 was trained on SkyPilot. One unified interface for H100 clusters, zero infra friction for researchers. Thrilled to support H Company's journey to build the future of autonomous AI agents. 🤖💪

H@hcompany_ai

🚀 The H Company Tech Stack: Part 1 We are excited to launch a new series of technical deep dives into the AI Tech Stack powering H Company. Over the coming weeks, we’ll be sharing how we build, scale, and optimize the infrastructure behind our Holo frontier models. First up: Unlocking Online RL and AI Workflows on K8s using SkyPilot. (1/5🧵)

SkyPilot@skypilot_org·22 Mar

SkyPilot + @CoreWeave AI Infra Meetup is next week! @Meta AI, SkyPilot, and CoreWeave will give tech talks on scaling training and batch jobs across K8s, Slurm, and cloud — plus plenty of time to mingle and socialize. Spots are limited. Register now: luma.com/h52qyhmt?utm_s…

SkyPilot@skypilot_org·20 Mar

The best part about letting agents scale autoresearch - it figured out it had access to heterogeneous clusters and built its own two-tier workflow: screen ideas cheaply on H100s, promote winners to H200s for validation. Shoutout to @CoreWeave for the infra that powered our experiment! 🔗 blog.skypilot.co/scaling-autore…

SkyPilot@skypilot_org

Karpathy's Autoresearch is bottlenecked by a single GPU. We removed the bottleneck. We gave the agent access to our K8s cluster with H100s and H200s and let it provision its own GPUs. Over 8 hours: • ~910 experiments instead of ~96 sequentially • Discovered that scaling model width mattered more than all hparam tuning • Taught itself to exploit heterogeneous hardware: use H200s for validation, screen ideas on H100s Full setup and results: blog.skypilot.co/scaling-autore… @karpathy

SkyPilot@skypilot_org·15 Mar

🎤 Three talks you don't want to miss at the upcoming SkyPilot AI Infra meetup: • Lucca Bertoncini (@Meta AI): GPU management at Meta scale • Zhanghao Wu (@skypilot_org): What's next for open-source AI infra • Deok Filho (@CoreWeave): Running Slurm on Kubernetes with SUNK Spots are limited. RSVP now! luma.com/h52qyhmt?utm_s… @Michaelvll1 @deok_filho

SkyPilot retweeted

Nebius@nebiusai·13 Mar

Tuesday 3/17 at #NVIDIAGTC, Booth 713. See how global platforms are scaling AI in production on Nebius: 1:30 pm —@skypilot_org 2 pm — Photoroom 4 pm — @Revolut 5 pm — @DataRobot Training. Inference. Enterprise scale. #GTC26

SkyPilot@skypilot_org·11 Mar

👋 SkyPilot AI Infra Meetup is back! 📆 Wed, Mar 25th, 5:00 PM 📍 San Francisco Join us in person for the AI Infra Meetup with SkyPilot! We'll be talking about open-source AI infra for batch and training workloads, K8s, Slurm, and more! Come connect with fellow builders, share insights, and learn from experts on the latest in AI infra! 🚀 🔗 RSVP now: luma.com/h52qyhmt?utm_s…

SkyPilot@skypilot_org·10 Mar

SkyPilot Recipes lets you templatize your AI workloads and share them across your entire team. Save a YAML config once, and anyone can launch clusters with the same predefined setup, directly from the CLI. • Standardize dev environments and training infra • Launch instantly with sky launch recipes: • Edit and manage recipes from the SkyPilot dashboard 🔗blog.skypilot.co/skypilot-recip…

SkyPilot retweeted

David Bar@observie·6 Mar

Half a day with mjlab and Q1 Hoper is walking. Oh my, this is pure joy. Thank you @kevin_zakka, and thank you SkyPilot for making gpuing a breeze. @bromil101 @zongheng_yang

SkyPilot@skypilot_org·7 Mar

Love seeing this, @observie! This is exactly the SkyPilot experience: sky launch, grab a GPU, train. No containers, no setup overhead for researchers. 😌

David Bar@observie

mjlab is RL with fairy dust. Oh my, it's so much fun. It takes no time to set up the training machine, because you literally don't need to set up a training machine. It assigns you an ad-hoc GPU via SkyPilot. No containers, and the software is slim: a single git clone and you can start. Post-training data is immediately available on wandb, long after the cloud machine was ditched to save cost. Checkpoints, ONNX, and training logs are automatically uploaded. Visualization of results is available locally via the viser web viewer, even without a GPU. No need to remote desktop anything. Effectively, it feels like I am training on my MacBook (below is Q1 Hoper learning to fly).

SkyPilot@skypilot_org·7 Mar

SkyPilot powers physical AI! Excited to see mjlab ship cloud training on SkyPilot! mjlab brings GPU-accelerated sim to robot learning with minimal setup, and now SkyPilot handles the infra so researchers can focus on the science. Thanks to @kevin_zakka for sharing!

Kevin Zakka@kevin_zakka

mjlab now supports cloud training via SkyPilot. One command launches a GPU instance, syncs your code, trains, and tears down when done. We support 2 modes: direct uv install and Docker. Multi-GPU and hyperparameter sweeps work out of the box. mujocolab.github.io/mjlab/main/sou…

SkyPilot@skypilot_org·5 Mar

𝗦𝗸𝘆𝗣𝗶𝗹𝗼𝘁 𝘃𝟬.𝟭𝟭.𝟮 𝗶𝘀 𝗼𝘂𝘁! 🚀 This release brings Slurm support, Job Groups for RL on heterogeneous hardware, enhanced Pools with autoscaling, external links in the dashboard, 7x faster object store access, and much more! Better scheduling, security, and multi-backend control for infra teams. Faster iteration and less infra wrangling for ML engineers. 𝘂𝘃 𝗽𝗶𝗽 𝗶𝗻𝘀𝘁𝗮𝗹𝗹 "𝘀𝗸𝘆𝗽𝗶𝗹𝗼𝘁>=𝟬.𝟭𝟭.𝟮" Learn more: github.com/skypilot-org/s…

SkyPilot@skypilot_org·3 Mar

RL post-training needs heterogeneous hardware - beefy GPUs for the trainer, cheap GPUs for rollouts, and high-memory CPU instances for replay buffers. Running it all on top-tier GPUs is wasteful. SkyPilot Job Groups simplifies workloads with heterogeneous requirements: • One YAML to run RL workloads on heterogeneous hardware. • Automatic service discovery • Coordinated creation and shutdown Define each component with the right resources and launch as one unit. Read the blog now: 🔗blog.skypilot.co/job-groups/
