Romil Bhardwaj

218 posts

@bromil101

Building SkyPilot @skypilot_org | PhD AI+Systems @Berkeley_EECS @ucbrise

Berkeley, CA · Joined June 2020
195 Following · 460 Followers
Romil Bhardwaj @bromil101 ·
My favorite talk at the Meetup @lbz____
SkyPilot @skypilot_org

AI Infra Meetup with SkyPilot and CoreWeave - what a night! Packed room, great conversations, and solid talks on scaling AI infra across K8s, Slurm, and neocloud.
🌟 Highlights:
@lbz____ (@AIatMeta) shared how Meta unified 100k+ GPUs across Slurm clusters with SkyPilot.
@Michaelvll1 + @kevin_mingtarja (@skypilot_org) walked through SkyPilot's new features and ran a live multi-cloud demo.
@deok_filho (@CoreWeave) broke down SUNK and the SkyPilot x CoreWeave integration.
Huge shoutout to all speakers! Thanks to @CoreWeave @wandb for being amazing partners in making this happen. Excited to keep building this AI infra community. More events coming soon.

1
1
32
2.8K
Romil Bhardwaj @bromil101 ·
Online RL is one of the harder infra problems in frontier model training - your trainer needs to interact with live inference servers for generation in a tight loop. Slurm is not built for this. Read how @hcompany_ai used @skypilot_org to scale RL across 2,000+ GPUs👇
H @hcompany_ai

🚀 The H Company Tech Stack: Part 1
We are excited to launch a new series of technical deep dives into the AI Tech Stack powering H Company. Over the coming weeks, we’ll be sharing how we build, scale, and optimize the infrastructure behind our Holo frontier models.
First up: Unlocking Online RL and AI Workflows on K8s using SkyPilot. (1/5🧵)

0
0
5
272
David Bar @observie ·
Running three mjlab trainings concurrently on 3 separate AWS machines I didn't spend a minute setting up myself, and I have a feeling this is just the beginning. All thanks to @skypilot_org
sky acceleration tips:
- use .skyignore to prevent large & unnecessary folders from being synced (think: .gitignore for sky)
- autostop after 5 idle minutes to control cost, but keep the disk (down: false) so the next launch starts in seconds instead of minutes
2
0
20
1K
Romil Bhardwaj @bromil101 ·
@observie @skypilot_org neat! As you scale out, `sky jobs launch` lets you run 100s of jobs in parallel. SkyPilot handles execution and cleanup :)
for rank in $(seq 1 100); do
  sky jobs launch --async -y train.yaml --env RANK=$rank
done
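The same sweep can be driven from Python. This is only a minimal sketch: it builds and prints the commands as a dry run, using the flags that appear in the tweet (`--async`, `-y`, `--env`). The helper name `build_launch_commands` is hypothetical and not part of SkyPilot, and `train.yaml` is assumed to read `RANK` from its environment.

```python
import shlex


def build_launch_commands(task_yaml, ranks):
    """Return one `sky jobs launch` argument list per rank (hypothetical helper)."""
    return [
        ["sky", "jobs", "launch", "--async", "-y", task_yaml, "--env", f"RANK={rank}"]
        for rank in ranks
    ]


# Ranks 1..100, matching `seq 1 100` in the shell loop.
commands = build_launch_commands("train.yaml", range(1, 101))

# Dry run: show the first few commands instead of executing them.
for argv in commands[:3]:
    print(shlex.join(argv))
```

To actually submit the jobs, each argument list could be passed to `subprocess.run(argv, check=True)` on a machine where the `sky` CLI is installed and configured.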
2
0
2
78
Romil Bhardwaj retweeted
Kevin Zakka @kevin_zakka ·
mjlab now supports cloud training via SkyPilot. One command launches a GPU instance, syncs your code, trains, and tears down when done. We support 2 modes: direct uv install and Docker. Multi-GPU and hyperparameter sweeps work out of the box. mujocolab.github.io/mjlab/main/sou…
3
12
86
12.4K
Romil Bhardwaj retweeted
SkyPilot @skypilot_org ·
RL post-training needs heterogeneous hardware - beefy GPUs for the trainer, cheap GPUs for rollouts, and high-memory CPU instances for replay buffers. Running it all on top-tier GPUs is wasteful. SkyPilot Job Groups simplifies workloads with heterogeneous requirements:
• One YAML to run RL workloads on heterogeneous hardware
• Automatic service discovery
• Coordinated creation and shutdown
Define each component with the right resources and launch as one unit. Read the blog now: 🔗 blog.skypilot.co/job-groups/
2
5
31
2.3K
Romil Bhardwaj retweeted
SkyPilot @skypilot_org ·
🚫🦞 DON'T run OpenClaw on your main machine. ✅🦞 DO run it in an isolated environment. But isolation isn't easy: provisioning, SSH, security groups, cost management. This post breaks down your options (Docker, dedicated hardware, cloud VM) and shows you how to launch a secure OpenClaw setup on any cloud with one command. 🔗 blog.skypilot.co/openclaw-on-sk…
0
2
10
732
Romil Bhardwaj retweeted
marimo @marimo_io ·
marimo notebooks are reproducible Python programs. @skypilot_org runs Python anywhere. Together, they take notebooks from laptop to cloud with zero rewrites. Learn more at our latest blog: marimo.io/blog/skypilot
0
5
20
2.6K
Romil Bhardwaj retweeted
SkyPilot @skypilot_org ·
SkyPilot now auto-detects and displays external links in your Dashboard, making it easier to quickly access cloud consoles and experiment trackers like @wandb. The dashboard shows links to:
- Your instance's cloud console (AWS, GCP, Azure)
- External dashboards found in your cluster logs
1
4
8
1.1K
Romil Bhardwaj retweeted
SkyPilot @skypilot_org ·
Moving from Slurm to Kubernetes shouldn't mean rewriting every job script and learning container orchestration. @skypilot_org brings Slurm-like simplicity to K8s:
🔄 6-line task YAML, not 60-line manifests
💻 SSH into GPU nodes just like salloc
🚀 Unified interface for all your Slurm and K8s clusters
Same workflow. Modern infra. 🔗 blog.skypilot.co/slurm-to-k8s-m…
2
3
13
982
Romil Bhardwaj retweeted
SkyPilot @skypilot_org ·
Congrats to @hcompany_ai on the Holo2 release - the new gold standard in UI localization! 🎉 Proud to power the infra for this SOTA model: H Company migrated their training stack from Slurm to Kubernetes with SkyPilot as the unified interface across their clusters 🚀
H @hcompany_ai

Holo2-235B-A22B: #1 on ScreenSpot-Pro, #1 on OSWorldG 🎯 🚀 Today, we are releasing Holo2-235B-A22B 🤗, our most capable GUI localization model yet! Holo2 now leads on all major GUI localization benchmarks: 78.5% on ScreenSpot-Pro and 79.0% on OSWorld-G!

0
2
7
772
Romil Bhardwaj retweeted
Mikhail Parakhin @MParakhin ·
Running Training and Inference at scale gets tricky in a hurry. Heterogeneous workloads: large-scale HSTU training, tiny L4 models and the development - all need to be supported seamlessly. Here is a new blogpost on how we approach it at Shopify: shopify.engineering/skypilot
5
8
42
10.5K
Romil Bhardwaj @bromil101 ·
Looking to build a multi-cloud AI platform? Read how @Shopify did it:
• SkyPilot running on K8s clusters on Nebius and GCP
• Custom SkyPilot policy plugin for routing, InfiniBand and cost-tracking
• Kueue for fair-sharing + priority classes (emergency > interactive > batch)
Shopify Engineering @ShopifyEng

Multi-cloud GPU scheduling without losing your mind. Inside our @skypilot_org setup: how we route GPU training workloads across providers, keep scheduling fair, and let engineers stay close to the metal instead of buried in cloud consoles.

0
0
2
315
Romil Bhardwaj retweeted
SkyPilot @skypilot_org ·
Shopify now runs all AI training on SkyPilot.
• H200s on @nebiusai, L4s on @googlecloud - multi-cloud AI enabled by SkyPilot
• One unified interface for engineers
• Cost tracking, InfiniBand support, fair scheduling
Excited to support @Shopify's journey to build the future of e-commerce with AI 💪🏻
Shopify Engineering @ShopifyEng

Multi-cloud GPU scheduling without losing your mind. Inside our @skypilot_org setup: how we route GPU training workloads across providers, keep scheduling fair, and let engineers stay close to the metal instead of buried in cloud consoles.

1
3
12
1.2K
Romil Bhardwaj retweeted
SkyPilot @skypilot_org ·
AI workloads need fast storage. Object stores aren't always the best fit. SkyPilot Volumes let you access high-performance storage designed for AI.
💾 Fast access to data & checkpoints
⚡ Easy to use - attach w/ 1 line
☸️ Works with any Kubernetes PVC
blog.skypilot.co/skypilot-volum…
0
1
6
393
Romil Bhardwaj retweeted
SkyPilot @skypilot_org ·
Scaling batch workloads: run SAM3 video segmentation across K8s and AWS clusters with one command. SkyPilot Pools turns scattered GPU capacity into a unified batch queue.
🌎 Unified GPU pool across multiple K8s + clouds
📈 Fast starts w/ linear scaling
blog.skypilot.co/skypilot-pools…
0
3
5
392
Romil Bhardwaj retweeted
SkyPilot @skypilot_org ·
🚀 SkyPilot now ships precooked YAML templates for launching clusters with popular frameworks and patterns. Templates are automatically available on all new SkyPilot clusters. With a single command you can launch fully configured environments without having to write any boilerplate code. Here's how to launch a Ray cluster on your infra with a single line of code. blog.skypilot.co/skypilot-templ…
0
2
6
402