DeepSpeed

100 posts

@DeepSpeedAI

Official account for DeepSpeed, a library that enables unprecedented scale and speed for deep learning training + inference. 日本語 : @DeepSpeedAI_JP

Joined May 2020

93 Following · 4.1K Followers

Pinned Tweet

DeepSpeed@DeepSpeedAI·25 Feb

Training optimization for multimodal models is an important pillar for pushing the frontier of multimodal foundation model development. Kudos to @toh_tana and Tunji Ruwase for their excellent work. It's just the starting point, more to come!

PyTorch@PyTorch

New @DeepSpeedAI updates make large-scale multimodal training simpler and more memory-efficient. Our latest blog introduces a PyTorch-identical backward API that makes multimodal training loops easy to code, plus low-precision model states (BF16/FP16) that can reduce peak memory by up to 40% when combined with torch.autocast. 🖇️ Read the full post for details: hubs.la/Q044yYVs0 #DeepSpeed #PyTorch #MemoryEfficiency #MultimodalTraining #OpenSourceAI

DeepSpeed retweeted

PyTorch@PyTorch·14 Mar

🗓️ Plan your week: Check out the full "Meet the PyTorch Experts" schedule here: pytorch.org/event/nvidia-g… We'll be posting the daily lineups here in this thread all week. See you at the booth! 🤝 @NVIDIADev

DeepSpeed retweeted

Zhipeng Wang 🇺🇦@PKUWZP·14 Mar

I am thrilled to release our newly re-architected extreme-scale Linear Programming Solver (DuaLip-GPU), which is built on PyTorch to enable multi-GPU computation and parallelism (github.com/linkedin/DuaLip). We also released the technical report (arxiv.org/abs/2603.04621) covering all technical details. A Linear Programming Solver is a fundamental building block for solving extreme-scale matching problems, which underlie many important technical domains related to social network platforms such as ranking, personalization, item-matching and recommendation systems, as well as LLMs. To realize the available parallelism, we develop GPU execution techniques tailored to sparse matching constraints, including constraint-aligned sparse layouts, batched projection kernels, and a distributed design that communicates only dual variables. Further, we improve the underlying ridge-regularized dual ascent method with Jacobi-style row normalization, primal scaling, and a continuation scheme for the regularization parameter. On extreme-scale matching workloads, the GPU implementation achieves at least a 10x wall-clock speedup over the prior distributed CPU DuaLip solver under matched stopping criteria, while maintaining convergence guarantees. This is superb technical work combining ML Systems, Mathematical Optimization and Machine Learning. #Optimization #AI

DeepSpeed retweeted

Stas Bekman@StasBekman·13 Mar

PSA: if you use torch>=2.10 w/ deepspeed ZeRO-3 please update to deepspeed@master - a new release should happen shortly. If you use torch<2.10 or ZeRO-1/2 nothing needs to be done. See this fix from Michael Royzen github.com/deepspeedai/De… Cause: PyTorch made some grad reduction stream-related changes which could lead to borked grad reduction in Deepspeed ZeRO-3.
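
The condition in the PSA above can be written as a small check. This is a minimal sketch, not part of DeepSpeed or PyTorch: `parse_major_minor` and `needs_deepspeed_update` are hypothetical helper names, and the only rule encoded is the one stated in the post (torch>=2.10 together with ZeRO-3). Versions are compared as integer tuples because a plain string comparison would rank "2.10" below "2.9".

```python
import re


def parse_major_minor(version: str) -> tuple[int, int]:
    """Extract (major, minor) from a version string such as '2.10.0+cu128'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def needs_deepspeed_update(torch_version: str, zero_stage: int) -> bool:
    """Hypothetical helper: True only for the combination named in the PSA.

    Assumption: the affected setup is exactly torch>=2.10 with ZeRO stage 3;
    torch<2.10 or ZeRO-1/2 need no action.
    """
    # Integer tuples compare numerically, so (2, 10) >= (2, 9) holds,
    # whereas the string "2.10" sorts before "2.9".
    return zero_stage == 3 and parse_major_minor(torch_version) >= (2, 10)
```

In a real environment the first argument would typically be `torch.__version__`, and the stage would come from the `zero_optimization.stage` entry of the DeepSpeed config.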

DeepSpeed retweeted

Stas Bekman@StasBekman·9 Mar

Good news! Ulysses Sequence Parallelism from the Snowflake AI Research and the Deepspeed teams has been integrated into @huggingface Trainer, Accelerate and TRL. For extensive details please see this writeup: huggingface.co/blog/ulysses-sp Thanks a lot to @krasul for helping make it happen, and to the others in the HF team who helped with integration.

DeepSpeed retweeted

Stas Bekman@StasBekman·13 Feb

Deepspeed ZeRO 1+2 used to take forever to load huge models on multi-gpu as tensor flattening was happening on cpu due to the small gpu size back when it was designed. Now things load super fast thanks to a rework by Kento Sugama to flatten on gpu. Yay! github.com/deepspeedai/De…
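
For readers unfamiliar with the term, "flattening" here means packing many parameter tensors into one contiguous buffer and addressing each parameter as a slice of it. The toy sketch below shows only that idea, using the Python standard library on the CPU. It is not DeepSpeed's implementation and says nothing about GPU speed; `flatten` and `view` are hypothetical names.

```python
from array import array


def flatten(params: dict[str, list[float]]) -> tuple[array, dict[str, tuple[int, int]]]:
    """Pack all parameters into one contiguous float32 buffer.

    Returns the buffer plus a map from parameter name to its
    (start, end) element range inside the buffer.
    """
    flat = array("f")
    spans: dict[str, tuple[int, int]] = {}
    for name, values in params.items():
        start = len(flat)
        flat.extend(values)
        spans[name] = (start, len(flat))
    return flat, spans


def view(flat: array, spans: dict[str, tuple[int, int]], name: str) -> memoryview:
    """Return a zero-copy view of one parameter inside the flat buffer."""
    start, end = spans[name]
    return memoryview(flat)[start:end]
```

Because each view shares storage with the flat buffer, writing through a view updates the buffer in place, which is what lets a single large buffer stand in for many small tensors.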

DeepSpeed@DeepSpeedAI·15 Dec

It's exciting to see DeepSpeed leveraged by Ray in disaggregated hybrid parallelism for multimodal training. Blog: tinyurl.com/4dwkk37e Congrats to Masahiro Tanaka (@toh_tana) and @anyscalecompute friends.

DeepSpeed retweeted

PyTorch@PyTorch·12 Dec

Zhipeng (Jason) Wang, PhD (@PKUWZP) explains how @DeepSpeedAI supports ML training research and why joining PyTorch Foundation benefits researchers and developers working on AI training workloads. 🔗youtu.be/67719mlOSp0 #PyTorch #DeepSpeed #OpenSourceAI #AIInfrastructure

DeepSpeed@DeepSpeedAI·26 Oct

It's nice to share the most recent updates from the DeepSpeed project at #PyTorchCon. We will continue pushing the boundary of LLM distributed training for the OSS community.

PyTorch@PyTorch

🎙️ Mic check: Tunji Ruwase, Lead, DeepSpeed Project & Principal Engineer at Snowflake, is bringing the 🔥 to the keynote stage at #PyTorchCon! Get ready for big ideas and deeper learning October 22–23 in San Francisco. 👀 Speakers: hubs.la/Q03GPYFn0 🎟️ hubs.la/Q03GPXVH0

DeepSpeed@DeepSpeedAI·9 Oct

UIUC, AnyScale, and Snowflake significantly enhanced LLM offloading for the Superchip era!

Minjia Zhang@_Minjia_Zhang_

🚀 SuperOffload: Unleashing the Power of Large-Scale LLM Training on Superchips
Superchips like the NVIDIA GH200 offer tightly coupled GPU-CPU architectures for AI workloads. But most existing offloading techniques were designed for traditional PCIe-based systems. Are we truly tapping into their full potential for LLM training?
🎯 SuperOffload is our answer to this challenge, a new DeepSpeed component rethinking offloading from the ground up, specially designed for LLM training on Superchips.
✨ SuperOffload is exact -- no approximation, no heuristics, and no changes to your training algorithm. Just faster training of larger models with longer sequences using the same code, made possible by system-level optimizations exploiting the Superchip architecture.
🧪 SuperOffload allows you to:
- Finetune models like GPT-OSS-20B, Qwen3-14B, and Phi-4 on a single GH200
- Train up to 4X faster than previous approaches like ZeRO-Offload
- Effortlessly scale to:
-- Qwen3-30B-A3B and Seed-OSS-36B on 2 x GH200s
-- LLaMA2-70B on 4 x GH200s
-- 1M sequence length on 8 x GH200s with 55% MFU
- Easy-to-use: Fully integrated and open-sourced in DeepSpeed. Just a few lines of code to enable!
📚 Read more on the official PyTorch blog: pytorch.org/blog/superoffl…
🧠 For more technical details, please read our technical report: arxiv.org/abs/2509.21271
🛠️ SuperOffload is fully open-sourced through DeepSpeed. Try it now: github.com/deepspeedai/De…
📄 SuperOffload has been accepted to ASPLOS 2026! Kudos to Xinyu Lian (@Alexlian0806), Masahiro Tanaka (@toh_tana), and Olatunji Ruwase.
🎤 Featured at PyTorch Conference 2025
SuperOffload will be featured in the DeepSpeed & vLLM keynote at this year's PyTorch Conference in San Francisco.
🔥 Come see how we're rethinking large-scale LLM training for the Superchip era: events.linuxfoundation.org/pytorch-confer…

DeepSpeed retweeted

Anyscale@anyscalecompute·6 Oct

🚨Meetup Alert🚨 Join us for @raydistributed × @DeepSpeedAI Meetup: AI at Scale, including talks from researchers and engineers at @LinkedIn, @anyscalecompute and @Snowflake.
Learn how leading AI teams are scaling efficiently with Ray’s distributed framework and DeepSpeed’s model-training optimizations.
Agenda includes:
• Networking & welcome
• Tech talks: DeepSpeed overview, SuperOffload, Arctic Long Sequence Training, Muon optimizer, DeepCompile, and Ray in Snowflake ML
• Q&A + networking
📍In-person at Anyscale HQ, San Francisco
Seats are limited. Register now: luma.com/3wctqteh

DeepSpeed@DeepSpeedAI·9 Sep

Step into the future of AI at #PyTorchCon 2025, Oct 22–23 in San Francisco 🔥 Join the DeepSpeed keynote and technical talks. Register: events.linuxfoundation.org/pytorch-confer… + Oct 21 co-located events: Measuring Intelligence, Open Agent & AI Infra Summits / Startup Showcase & PyTorch Training

DeepSpeed retweeted

Stas Bekman@StasBekman·21 Aug

The @DeepSpeedAI would like to thank @modal for sponsoring our gpus for CI. This is an amazing contribution to our AI-democratizing open source project. github.com/deepspeedai/De… The Modal team is outstanding in their amazing support - speed, expertise and a human experience!

DeepSpeed@DeepSpeedAI·21 Aug

ZenFlow is a massive improvement to DeepSpeed Offloading. Courtesy of an excellent collaboration among University of Virginia, UC Merced, Argonne National Laboratory, Microsoft, and Snowflake.

PyTorch@PyTorch

Introducing #ZenFlow: No Compromising Speed for #LLM Training w/ Offloading
- 5× faster LLM training with offloading
- 85% fewer GPU stalls
- 2× lower I/O overhead
🚀 Blog: hubs.la/Q03DJ6GJ0
🚀 Try ZenFlow and experience 5× faster training with offloading: hubs.la/Q03DJ6Vb0

DeepSpeed@DeepSpeedAI·10 Jul

Kudos to Xinyu for giving an excellent presentation of the DeepSpeed Universal Checkpointing (UCP) paper at USENIX ATC 2025.

Minjia Zhang@_Minjia_Zhang_

📢 Yesterday at USENIX ATC 2025, Xinyu Lian from UIUC SSAIL Lab presented our paper on Universal Checkpointing (UCP).
UCP is a new distributed checkpointing system designed for today's large-scale DNN training, where models often use complex forms of parallelism, including data, tensor, pipeline, and expert parallelism. Existing checkpointing systems struggle in this setting because they are tightly coupled to specific training strategies (e.g., ZeRO-style data parallelism or 3D model parallelism), which break down when the training configs need to dynamically reconfigure over time. This makes it difficult to have resilient and fault-tolerant training.
UCP solves this by decoupling distributed checkpointing from parallelism strategies. Our design introduces a unified checkpoint abstraction -- atomic checkpoint, and a full pattern matching-based transformation pipeline, which enables scalable and low-overhead checkpointing with reconfigurable parallelism across arbitrary model sharding strategies. We show that UCP supports state-of-the-art models trained with hybrid 3D/4D parallelism (ZeRO, TP, PP, SP) while incurring less than 0.001% overhead of the total training time.
UCP is fully open-sourced in DeepSpeed. It has been adopted by Microsoft, BigScience, UC Berkeley and others for large-scale model pre-training and fine-tuning, including Phi-3.5-MoE (42B), BLOOM (176B), and many more. It also has been selected for presentation at PyTorch Day 2025 and FMS 2025 (the Future of Memory and Storage).
Big thanks to the amazing collaborators from Microsoft and Snowflake: @samadejacobs, @LevKurilenko, @MasahiroTanaka, @StasBekman, and @TunjiRuwase.
🔗 Project: lnkd.in/gG6j4vJe
📄 Paper: lnkd.in/gUiC5kcR
💻 Code: lnkd.in/g6uS29nH
📚 Tutorial: lnkd.in/gi_zWSWh
#ATC2025 #LLM #Checkpointing #SystemsForML #DeepLearning #DistributedTraining #UIUC #DeepSpeed

DeepSpeed retweeted

Stas Bekman@StasBekman·24 Jun

My first project at @Snowflake AI Research is complete! I present to you Arctic Long Sequence Training (ALST).
Paper: arxiv.org/abs/2506.13996
Blog: snowflake.com/en/engineering…
ALST is a set of modular, open-source techniques that enable training on sequences up to 15 million tokens on 4 H100 nodes, all using Hugging Face Transformers and DeepSpeed, with no custom modeling code required. ALST makes long-sequence training fast, efficient, and accessible on GPU nodes or even single GPUs.

DeepSpeed@DeepSpeedAI·16 Jun

Improved DeepNVMe: Affordable I/O Scaling for AI
- Faster I/O with PCIe Gen5
- 20x faster model checkpointing
- Low-budget SGLang inference via NVMe offloading
- Pinned memory for CPU-only workloads
- Zero-copy tensor type casting
Blog: tinyurl.com/yanbrjy9

DeepSpeed retweeted

PyTorch@PyTorch·7 May

PyTorch Foundation has expanded into an umbrella foundation. @vllm_project and @DeepSpeedAI have been accepted as hosted projects, advancing community-driven AI across the full lifecycle. Supporting quotes provided by the following members: @AMD, @Arm, @AWS, @Google, @Huawei, @huggingface, @IBM, @Intel, @LightningAI, @Meta, @NVIDIA, and @Snowflake. 🔗💡 Read the full announcement: hubs.la/Q03lmJNH0 #PyTorchFoundation #PyTorch #OpenSourceAI #vLLM #DeepSpeed

DeepSpeed@DeepSpeedAI·3 May

Come hear all the exciting DeepSpeed updates at the upcoming PyTorch Day France 2025: "DeepSpeed – Efficient Training Scalability for Deep Learning Models" - sched.co/21nyy @sched

DeepSpeed@DeepSpeedAI·16 Apr

Introducing 🚀DeepCompile🚀: compiler-based distributed training optimizations.
- Automatic parallelization & profile-guided optimizations
- Enable ZeRO1, ZeRO3, Offloading, etc. via compiler passes
- 1.2X-7X speedups over manual ZeRO1/ZeRO3/Offloading
tinyurl.com/8cys28xk
