CloudRift
83 posts

CloudRift
@CloudRiftAI
The Operating System for Sovereign AI Deployments
Mountain View, CA. Joined March 2024
39 Following, 76 Followers
CloudRift reposted

Close to half of planned US data center builds this year are projected to be delayed or canceled.
The causes are power infrastructure and China-sourced parts, with transformer lead times now up to five years.
tomshardware.com/tech-industry/…
#DataCenters #AIinfrastructure

61% of Western European CIOs now prioritize local cloud providers over US hyperscalers. With the EU AI Act fully applicable on August 2, regional GPU capacity is shifting from a preference to a procurement requirement.
euronews.com/next/2026/03/0…
#SovereignAI #EUAIAct
CloudRift reposted

Training models or serving inference on AMD GPUs?
We’ve refreshed the AMD accelerator example in the dstack docs, covering on-prem fleets, cloud GPU provisioning, dev environments, training jobs, and production-grade inference.
dstack.ai/docs/examples/…

How do you search 24,000 matmul configurations without burning days of GPU time? @ditrifonov's autotuner samples around 207 of them in ~67 seconds with Monte Carlo tree search.
Check out part 3 of the writeup:
cloudrift.ai/blog/building-…
@triton_lang #MLcompilers #CUDA
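
A toy sketch of how tree search can cover only a fraction of a tuning space. This is not @ditrifonov's autotuner: the parameter names, the value lists, and the synthetic cost function are all assumptions made for illustration, and a real tuner would time kernels on the GPU instead.

```python
import math
import random

# Hypothetical tuning space: one decision per tree level (names and values are assumptions).
SPACE = [
    ("block_m", [16, 32, 64, 128]),
    ("block_n", [16, 32, 64, 128]),
    ("block_k", [8, 16, 32, 64]),
    ("num_warps", [1, 2, 4, 8]),
]

def synthetic_cost(cfg):
    """Stand-in for a measured kernel time; lower is better."""
    m, n, k, w = cfg
    return abs(m - 64) / 64 + abs(n - 64) / 64 + abs(k - 32) / 32 + abs(w - 4) / 4

class Node:
    def __init__(self, prefix):
        self.prefix = prefix        # parameter values chosen so far
        self.children = {}          # value -> Node
        self.visits = 0
        self.total_reward = 0.0

def ucb1(parent, child, c):
    """Mean reward plus an exploration bonus for rarely visited children."""
    bonus = c * math.sqrt(math.log(parent.visits) / child.visits)
    return child.total_reward / child.visits + bonus

def mcts(budget, seed=0, c=1.4):
    rng = random.Random(seed)
    root = Node(())
    evaluated = {}                  # full config -> cost
    for _ in range(budget):
        node, path = root, [root]
        while len(node.prefix) < len(SPACE):
            options = SPACE[len(node.prefix)][1]
            untried = [v for v in options if v not in node.children]
            if untried:             # expansion: add one new child, then stop descending
                value = rng.choice(untried)
                node.children[value] = Node(node.prefix + (value,))
                node = node.children[value]
                path.append(node)
                break
            parent = node           # selection: follow the best UCB1 child
            node = max(parent.children.values(), key=lambda ch: ucb1(parent, ch, c))
            path.append(node)
        # rollout: fill the remaining parameters at random
        rest = tuple(rng.choice(vals) for _, vals in SPACE[len(node.prefix):])
        cfg = node.prefix + rest
        if cfg not in evaluated:
            evaluated[cfg] = synthetic_cost(cfg)
        reward = 1.0 / (1.0 + evaluated[cfg])
        for visited in path:        # backpropagation
            visited.visits += 1
            visited.total_reward += reward
    best = min(evaluated, key=evaluated.get)
    return best, evaluated[best], len(evaluated)
```

Calling `mcts(budget=60)` evaluates at most 60 of the 256 configurations in this toy space and returns the best one seen, its cost, and how many distinct configurations were actually scored.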

Check out Part 3 of @ditrifonov's series on building a GPU compiler from scratch:
He added autotuning via Monte Carlo tree search, moving the geomean from 0.87x to 0.96x of PyTorch eager.
32 of 84 kernels now beat PyTorch's hand-tuned code.
cloudrift.ai/blog/building-…
#MLcompilers @PyTorch
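
For readers unfamiliar with the metric, a minimal sketch of how a geomean speedup and a "kernels that beat the baseline" count can be computed. The sample ratios are invented for illustration and are not measurements from the post.

```python
import math

def geomean(ratios):
    """Geometric mean of positive speedup ratios (1.0 means parity with the baseline)."""
    if not ratios:
        raise ValueError("need at least one ratio")
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

def count_wins(ratios):
    """Number of kernels strictly faster than the baseline."""
    return sum(1 for r in ratios if r > 1.0)

# Illustrative values only, not data from the writeup.
speedups = [0.5, 2.0, 1.0, 4.0, 0.25]
print(geomean(speedups), count_wins(speedups))
```

The geometric mean is the usual choice for ratios because a 2x win and a 0.5x loss cancel out, which an arithmetic mean would not reflect.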


@AMD Instinct #MI350X in our benchmarks:
2.6x faster FP16 matmul throughput than H200.
Memory bandwidth: 241 GB/s on default libvirt, 813 GB/s tuned. Full results in the post:
cloudrift.ai/blog/benchmark…
#AMDInstinct #ROCm

If you've ever wished you could read PyTorch's compiler end to end, here's the closest thing:
Dmitry built a working ML compiler in about 8,000 lines of Python that's faster than PyTorch eager on average and up to 4.7x faster on small kernels like reductions and k/v projections.
cloudrift.ai/blog/building-…
@PyTorch #MLcompilers #PyTorch

288 GB HBM3e per accelerator changes the #inference deployment math.
Workloads that need 2x or 4x #H100 with tensor parallelism collapse onto a single #MI350X. Fewer failure modes, no cross-GPU latency.
cloudrift.ai/mi350x
@AMD #AMDinstinct

#Llama 3 70B in FP16 weighs ~140 GB. A single @AMD #MI350X (288 GB HBM3e) fits it with room for KV cache and long context.
On #H100 (80 GB), the same model requires tensor parallelism across two GPUs.
cloudrift.ai/mi350x
#amdinstinct
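
The arithmetic behind those numbers, as a rough sketch. It counts weights only and uses decimal GB; KV cache, activations, and runtime overhead come on top, so treat the GPU counts as a lower bound. The function names are illustrative, not part of any library.

```python
import math

# Bytes per parameter for common dtypes.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1}

def weight_gb(params_billion, dtype="fp16"):
    """Weight size in decimal GB: billions of parameters times bytes per parameter."""
    return params_billion * BYTES_PER_PARAM[dtype]

def min_gpus_for_weights(params_billion, gpu_gb, dtype="fp16"):
    """Fewest GPUs whose combined memory holds the weights alone."""
    return math.ceil(weight_gb(params_billion, dtype) / gpu_gb)

print(weight_gb(70))                  # 70B params in FP16
print(min_gpus_for_weights(70, 80))   # 80 GB cards
print(min_gpus_for_weights(70, 288))  # 288 GB cards
```

For a 70B model in FP16 this gives 140 GB of weights, two 80 GB GPUs at minimum, or one 288 GB GPU with 148 GB left for everything else.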

Available now on CloudRift as on-demand VM rentals:
$3.65/hr for an @AMD Instinct #MI350X. 288 GB VRAM, HBM3e, 8 TB/s, no minimum commitment. No waitlist.
cloudrift.ai/mi350x
#AMDInstinct #LLMinference #ROCm

@ditrifonov's ML compiler, benchmarked on a full transformer block at FP32 on an #RTX5090.
Geomean 1.11x over @PyTorch eager and 1.20x over torch.compile. Small k/v projections reach 4.7x.
Large matmuls at seq=512 regress where register pressure dominates.
#GPU #CUDA #PyTorch #MLSys
cloudrift.ai/blog/building-…

kyln.bio, a CloudRift AI Grant recipient, trains models that generate ligands for drug discovery.
They've since won an Ignite grant from @PavaCenter and started wet-lab work at @HopkinsMedicine to test the model's predictions.
#AIDrugDiscovery #AIforScience

Part 2 of @ditrifonov's ML compiler series is up. It covers the lower half of the pipeline:
Tile IR, Kernel IR, CUDA emission, and the sixteen rewrite rules that turn a @PyTorch graph into a competitive kernel.
About 8,000 lines of Python now.
#GPU #CUDA #PyTorch
cloudrift.ai/blog/building-…

Modern ML compilers all share the same shape:
Torch IR → Tensor IR → Loop IR → Tile IR → Kernel IR → CUDA
Each lowering moves closer to the hardware: decomposition → fusion → tiling → scheduling → codegen.
@ditrifonov rebuilt the whole pipeline in 5K lines of Python to show why.
cloudrift.ai/blog/building-…
@modular @PyTorch
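
One way to read the two chains above side by side. The one-to-one pairing of each IR transition with a lowering step is an assumption drawn from the matching arrow counts, not something the post states explicitly.

```python
# IR stages and lowering steps exactly as listed in the post.
STAGES = ["Torch IR", "Tensor IR", "Loop IR", "Tile IR", "Kernel IR", "CUDA"]
LOWERINGS = ["decomposition", "fusion", "tiling", "scheduling", "codegen"]

def pipeline_steps():
    """Pair each adjacent IR transition with one lowering (assumed 1:1 mapping)."""
    transitions = zip(STAGES, STAGES[1:])
    return [(src, step, dst) for (src, dst), step in zip(transitions, LOWERINGS)]

for src, step, dst in pipeline_steps():
    print(f"{src} --[{step}]--> {dst}")
```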

@CatoDigitalInc redeploys GPU servers retired from Meta and NVIDIA fleets, rather than commissioning new ones.
Their capacity is now on CloudRift as V100 32GB VMs at $0.29 per GPU/hour.
Good for fine-tuning, batch inference, rendering, and HPC.
→ cloudrift.ai/gpu-rentals

V100 32GB VMs are now on CloudRift at $0.29 per GPU/hour, supplied by @CatoDigitalInc.
Fits a LoRA fine-tune of Llama 3 8B, Whisper Large inference, or a batch embeddings job on a single GPU.
→ cloudrift.ai/gpu-rentals
@nvidia

$0.29 per GPU/hour for a V100 32GB VM on CloudRift.
The same hardware on AWS and Azure runs above $3 per GPU/hour, and the 32GB variant is usually only sold in 8-GPU bundles. We offer it as a single-GPU VM.
Extremely useful if your job runs fine on Volta and does not need Hopper. Supplied by @CatoDigitalInc.
→ cloudrift.ai/gpu-rentals
@huggingface @nvidia
