DeepSpeed

105 posts

DeepSpeed

@DeepSpeedAI

Official account for DeepSpeed, a library that enables unprecedented scale and speed for deep learning training + inference. 日本語 : @DeepSpeedAI_JP

Katılım Mayıs 2020

93 Takip Edilen4.1K Takipçiler

Sabitlenmiş Tweet

DeepSpeed@DeepSpeedAI·1d

x.com/i/article/2055…

ZXX

286

DeepSpeed@DeepSpeedAI·5 May

@ten3br1s @PyTorch Thanks for the feedback. We don't know of multi-node issues. Our original deepcompile blog had multi-node results: tinyurl.com/4vbtbwv9 We will investigate if you could please open a ticket or provide more details.

English

Peter Szemraj@ten3br1s·3 May

@PyTorch looking forward to the subsequent blog post (in an unknown amount of time) where working on multi-node is explicitly tested and confirmed. While you're at it you can fix deepcompile for H100 multinode!!

English

PyTorch@PyTorch·29 Nis

Want to train LLMs on longer contexts without re-engineering your entire systems stack? Introducing AutoSP — the first compiler-based solution that automatically optimizes LLM training for long contexts. Under the hood, AutoSP applies a series of compiler passes that trigger sequence parallelism, paired with a curated activation-checkpointing scheme tailored for long-context training. It's integrated directly into DeepSpeed, so enabling long-context training is just a config change away. No more rewiring your stack to push context lengths. Read the blog to learn more 🖇️ pytorch.org/blog/introduci… ✍ @AhanGupta13, Zhihao W., Neel Dani, @toh_tana, Tunji Ruwase, @_Minjia_Zhang_ #PyTorch #DeepSpeed #AutoSP #OpenSourceAI

English

120

16.3K

DeepSpeed@DeepSpeedAI·30 Nis

Great News! Thanks to DeepSpeed AutoSP, efficient long context LLM training is now easily accessible.

PyTorch@PyTorch

English

1.3K

DeepSpeed retweetledi

Minjia Zhang@_Minjia_Zhang_·26 Mar

Excited to share that our work SuperOffload received an Honorable Mention for the ASPLOS 2026 Best Paper Award 🎉 Proud of the team for pushing forward system design for large-scale AI. Xinyu gave a great talk presenting the work. In addition, it was also wonderful to spend time with collaborators and the broader community.

English

2.6K

DeepSpeed retweetledi

Zhipeng Wang 🇺🇦@PKUWZP·23 Mar

💡Excited to be organizing a tutorial at ASPLOS 2026 (lnkd.in/g5auexxg): "Building Efficient Large-Scale Model Systems with DeepSpeed: From Open-Source Foundations to Emerging Research" 🌀 Link: lnkd.in/gixRnAm6 📍 Room: Allegheny 🕘 Time: Monday (Mar 23), 8:30am-12pm 🎤 Speaker: Tunji Ruwase, Masahiro Tanaka, Minjia Zhang, Zhipeng Wang, PhD We will cover how @DeepSpeedAI enables new forms of parallel, distributed, and heterogeneous execution, and how modern systems tackle key challenges in parallelism, offloading, and memory efficiency. If you are working on ML systems, LLM training, or emerging hardware, would love to connect at ASPLOS!

English

949

DeepSpeed retweetledi

PyTorch@PyTorch·14 Mar

🗓️ Plan your week: Check out the full "Meet the PyTorch Experts" schedule here: pytorch.org/event/nvidia-g… We'll be posting the daily lineups here in this thread all week. See you at the booth! 🤝 @NVIDIADev

English

8.3K

DeepSpeed retweetledi

Zhipeng Wang 🇺🇦@PKUWZP·14 Mar

I am thrilled to release our newly re-architected extremely-scale Linear Programming Solver (DuaLip-GPU), which is developed via PyTorch enabling multi-GPU computations and parallelism (github.com/linkedin/DuaLip). We also released the technical report (arxiv.org/abs/2603.04621) covering all technical details. Linear Programming Solver is a fundamental building block for solving extreme-scale matching problems, which underline many important technical domains related to social network platforms such as ranking, personalization, item-matching and recommendation systems, as well as in LLMs. To realize the available parallelism, we develop GPU execution techniques tailored to sparse matching constraints, including constraint-aligned sparse layouts, batched projection kernels, and a distributed design that communicates only dual variables. Further, we improve the underlying ridge-regularized dual ascent method with Jacobi-style row normalization, primal scaling, and a continuation scheme for the regularization parameter. On extreme-scale matching workloads, the GPU implementation achieves at least a 10x wall-clock speedup over the prior distributed CPU DuaLip solver under matched stopping criteria, while maintaining convergence guarantees. This is the superb technical work combining ML Systems, Mathematical Optimization and Machine Learning. #Optimization #AI

English

2.1K

DeepSpeed retweetledi

Stas Bekman@StasBekman·13 Mar

PSA: if you use torch>=2.10 w/ deepspeed ZeRO-3 please update to deepspeed@master - a new release should happen shortly. If you use torch<2.10 or ZeRO-1/2 nothing needs to be done. See this fix from Michael Royzen github.com/deepspeedai/De… Cause: PyTorch made some grad reduction stream-related changes which could lead to borked grad reduction in Deepspeed ZeRO-3.

English

DeepSpeed retweetledi

Stas Bekman@StasBekman·9 Mar

Good news! Ulysses Sequence Parallelism from the Snowflake AI Research and the Deepspeed teams has been integrated into @huggingface Trainer, Accelerate and TRL For extensive details please see this writeup: huggingface.co/blog/ulysses-sp Thanks a lot to @krasul for helping make it happen. Also the others in the HF team who helped with integration.

English

117

17.7K

DeepSpeed@DeepSpeedAI·25 Şub

Training Optimization on Multimodal models is an important pillar for pushing the frontier Multimodal Foundation Model development. Kudos to @toh_tana and Tunji Ruwase for their excellent work. It's just the starting point, more to come!

PyTorch@PyTorch

New @DeepSpeedAI updates make large-scale multimodal training simpler and more memory-efficient. Our latest blog introduces a PyTorch-identical backward API that helps code multimodal training loops easy, plus low-precision model states (BF16/FP16) that can reduce peak memory by up to 40% when combined with torch.autocast. 🖇️ Read the full post for details: hubs.la/Q044yYVs0 #DeepSpeed #PyTorch #MemoryEfficiency #MultimodalTraining #OpenSourceAI

English

1.2K

DeepSpeed retweetledi

Stas Bekman@StasBekman·13 Şub

Deepspeed ZeRO 1+2 used to take forever to load huge models on multi-gpu as tensor flattening was happening on cpu due to the small gpu size back when it was designed. Now things load super fast thanks to a rework by Kento Sugama to flatten on gpu. Yay! github.com/deepspeedai/De…

English

DeepSpeed@DeepSpeedAI·15 Ara

It's exciting to see DeepSpeed leveraged by Ray in disaggregated hybrid parallelism for multimodal training. Blog: tinyurl.com/4dwkk37e Congrats to Masahiro Tanaka (@toh_tana) and @anyscalecompute friends.

English

4.1K

DeepSpeed retweetledi

PyTorch@PyTorch·12 Ara

Zhipeng (Jason) Wang, PhD (@PKUWZP) explains how @DeepSpeedAI supports ML training research and why joining PyTorch Foundation benefits researchers and developers working on AI training workloads. 🔗youtu.be/67719mlOSp0 #PyTorch #DeepSpeed #OpenSourceAI #AIInfrastructure

YouTube

English

112

11.8K

DeepSpeed@DeepSpeedAI·26 Eki

It's nice to share the most recent updates from the DeepSpeed project at #PyTorchCon, we will continue pushing the boundary of LLM distributed training for the OSS community.

PyTorch@PyTorch

🎙️ Mic check: Tunji Ruwase, Lead, DeepSpeed Project & Principal Engineer at Snowflake, is bringing the 🔥 to the keynote stage at #PyTorchCon! Get ready for big ideas and deeper learning October 22–23 in San Francisco. 👀 Speakers: hubs.la/Q03GPYFn0 🎟️ hubs.la/Q03GPXVH0

English

1.3K

DeepSpeed@DeepSpeedAI·9 Eki

UIUC, AnyScale, and Snowflake significantly enhanced LLM offloading for the Superchip era!

Minjia Zhang@_Minjia_Zhang_

🚀 SuperOffload: Unleashing the Power of Large-Scale LLM Training on Superchips Superchips like the NVIDIA GH200 offer tightly coupled GPU-CPU architectures for AI workloads. But most existing offloading techniques were designed for traditional PCIe-based systems. Are we truly tapping into their full potential for LLM training? 🎯 SuperOffload is our answer to this challenge, a new DeepSpeed component rethinking offloading from the ground up, specially designed for LLM training on Superchips. ✨ SuperOffload is exact -- no approximation, no heuristics, and no changes to your training algorithm. Just faster, larger model with longer sequence training using the same code, which are made possible by system-level optimizations exploiting Superchip architecture. 🧪 SuperOffload allows you: - Finetune models like GPT-OSS-20B, Qwen3-14B, and Phi-4 on a single GH200 - Up to 4X faster speed than previous approaches like ZeRO-Offload - Effortlessly scales to: -- Qwen3-30B-A3B and Seed-OSS-36B on 2 x GH200s -- LLaMA2-70B on 4 x GH200s -- 1M sequence length on 8x GH200 with 55% MFU - Easy-to-use: Fully integrated and open-sourced in DeepSpeed. Just a few lines of code to enable! 📚 Read more through official PyTorch blog: pytorch.org/blog/superoffl… 🧠 For more technical details, please read our technical report: arxiv.org/abs/2509.21271 🛠️ SuperOffload is fully open-sourced through DeepSpeed. Try it now: github.com/deepspeedai/De… 📄 SuperOffload has been accepted to ASPLOS 2026! Kudos to Xinyu Lian (@Alexlian0806), Masahiro Tanaka (@toh_tana), and Olatunji Ruwase. 🎤 Featured at PyTorch Conference 2025 SuperOffload will be featured in the DeepSpeed & vLLM keynote at this year's PyTorch Conference in San Francisco. 🔥Come see how we're rethinking large-scale LLM training for the Superchip era: events.linuxfoundation.org/pytorch-confer…

English

2.7K

DeepSpeed retweetledi

Anyscale@anyscalecompute·6 Eki

🚨Meetup Alert🚨 Join us for @raydistributed × @DeepSpeedAI Meetup: AI at Scale, including talks from researchers and engineers at @LinkedIn, @anyscalecompute and @Snowflake. Learn how leading AI teams are scaling efficiently with Ray’s distributed framework and DeepSpeed’s model-training optimizations. Agenda includes: • Networking & welcome • Tech talks: DeepSpeed overview, SuperOffload, Arctic Long Sequence Training, Muon optimizer, DeepCompile, and Ray in Snowflake ML • Q&A + networking 📍In-person at Anyscale HQ, San Francisco Seats are limited — register now: luma.com/3wctqteh

English

3.7K

DeepSpeed@DeepSpeedAI·9 Eyl

Step into the future of AI at #PyTorchCon 2025, Oct 22–23 in San Francisco 🔥 Join the DeepSpeed keynote and technical talks. Register: events.linuxfoundation.org/pytorch-confer… + Oct 21 co-located events: Measuring Intelligence, Open Agent & AI Infra Summits / Startup Showcase & PyTorch Training

English

2.9K

DeepSpeed retweetledi

Stas Bekman@StasBekman·21 Ağu

The @DeepSpeedAI would like to thank @modal for sponsoring our gpus for CI. This is an amazing contribution to our AI-democratizing open source project. #ci-funding" target="_blank" rel="nofollow noopener">github.com/deepspeedai/De… The Modal team is outstanding in their amazing support - speed, expertise and a human experience!

English

8.5K

DeepSpeed@DeepSpeedAI·21 Ağu

ZenFlow is a massive improvement to DeepSpeed Offloading. Courtesy of an excellent collaboration among University of Virginia, UC Merced, Argonne National Laboratory, Microsoft, and Snowflake.

PyTorch@PyTorch

Introducing #ZenFlow: No Compromising Speed for #LLM Training w/ Offloading 5× faster LLM training with offloading 85% less GPU stalls 2× lower I/O overhead 🚀 Blog: hubs.la/Q03DJ6GJ0 🚀 Try ZenFlow and experience 5× faster training with offloading: hubs.la/Q03DJ6Vb0

English

1.8K

DeepSpeed@DeepSpeedAI·10 Tem

Kudos to Xinyu for giving an excellent presentation of DeepSpeed Universal Checkpointing (UCP) paper at USENIX ATC 2015.

Minjia Zhang@_Minjia_Zhang_

📢 Yesterday at USENIX ATC 2025, Xinyu Lian from UIUC SSAIL Lab presented our paper on Universal Checkpointing (UCP). UCP is a new distributed checkpointing system designed for today's large-scale DNN training, where models often use complex forms of parallelism, including data, tensor, pipeline, and expert parallelism. Existing checkpointing systems struggle in this setting because they are tightly coupled to specific training strategies (e.g., ZeRO-style data parallelism or 3D model parallelism), which break down when the training configs need to dynamically reconfigure over time. This makes it difficult to have resilient and fault-tolerant training. UCP solves this by decoupling distributed checkpointing from parallelism strategies. Our design introduces a unified checkpoint abstraction -- atomic checkpoint, and a full pattern matching-based transformation pipeline, which enables scalable and low-overhead checkpointing with reconfigurable parallelism across arbitrary model sharding strategies. We show that UCP supports state-of-the-art models trained with hybrid 3D/4D parallelism (ZeRO, TP, PP, SP) while incurring less than 0.001% overhead of the total training time. UCP is fully open-sourced in DeepSpeed. It has been adopted by Microsoft, BigScience, UC Berkeley and others for large-scale model pre-training and fine-tuning, including Phi-3.5-MoE (42B), BLOOM (176B), and many more. It also has been selected for presentation at PyTorch Day 2025 and FMS 2025(the Future of Memory and Storage). Big thanks to the amazing collaborators from Microsoft and Snowflake: @samadejacobs , @LevKurilenko, @MasahiroTanaka, @StasBekman , and @TunjiRuwase. 🔗 Project: lnkd.in/gG6j4vJe 📄 Paper: lnkd.in/gUiC5kcR 💻 Code: lnkd.in/g6uS29nH 📚 Tutorial: lnkd.in/gi_zWSWh #ATC2025 #LLM #Checkpointing #SystemsForML #DeepLearning #DistributedTraining #UIUC #DeepSpeed

English

1.7K

Keşfet

@ten3br1s @PyTorch @AhanGupta13 @toh_tana @_Minjia_Zhang_ @Nvidiadev @huggingface @krasul