Wei (Will) Feng
@weifengpy
PyTorch Distributed, FSDP, float8
United States · Joined March 2011
279 Following · 104 Followers
37 posts
George Grigorev @iamgrigorev
I wonder if I could push heterogeneity even further by training with fp4 on 5090 and fp8 on 4090 while sharding separately with FSDP2
George Grigorev @iamgrigorev
It’s time to do some small scale pre-training
Wei (Will) Feng @weifengpy
@Infopulsed @maharshii the torchtitan way is: apply AC, apply torch.compile, apply fsdp2. Let the fsdp2 hooks execute in eager and let the forward execute in compile. But if you really need full graph capture, simplefsdp is our best offer: github.com/pytorch/torcht…
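A minimal sketch of that ordering, assuming a toy Block stand-in rather than torchtitan's Transformer (the file name and module below are made up for illustration): per block, apply activation checkpointing, then torch.compile, then fully_shard, and finally wrap the root.

# Sketch only, not torchtitan itself: AC -> torch.compile -> fully_shard, per block.
# Run with: torchrun --standalone --nproc_per_node=2 ac_compile_fsdp2.py
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed._composable.fsdp import fully_shard
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import checkpoint_wrapper


class Block(nn.Module):
    # toy stand-in for a transformer block
    def __init__(self, dim: int = 1024):
        super().__init__()
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        return x + self.ffn(x)


def main():
    model = nn.Sequential(*[Block() for _ in range(4)])
    for i in range(len(model)):
        block = checkpoint_wrapper(model[i])  # 1) activation checkpointing
        block = torch.compile(block)          # 2) compile the block's forward
        fully_shard(block)                    # 3) fsdp2 hooks stay in eager around the compiled forward
        model[i] = block
    fully_shard(model)                        # wrap the root last
    x = torch.randn(8, 128, 1024, device="cuda")
    model(x).sum().backward()


if __name__ == "__main__":
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    main()
    dist.destroy_process_group()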
EDITH @Infopulsed
@maharshii Try doing torch.compile with FSDP2 on a base model, then you'll have a taste of it... *graph breaks*
maharshi @maharshii
torch compile is totally amazing at what it does but it’s an incredible feeling when you are able to beat torch compile at making models go faster.
Edmond Dantes @EdmondD52582093
@weifengpy @HaHoang411 @_lewtun I just started my move from DDP to FSDP (SSL vision), and a working example is a great place to start playing around, debugging, and learning. Really helped me get a footing.
ethan thoma @EthanBThoma
@madhav1 Fixing an issue with checkpoint loading with fsdp, but I randomly decided to migrate to fsdp2 and I keep running into a pytorch bug skjsiwkw
ethan thoma @EthanBThoma
Today I rebuilt my Apptainer for the seventh time 😔
Matej Sirovatka @m_sirovatka
@weifengpy @TheZachMueller I just remember never being able to make it work; it could well be a skill issue on my side. If I ever come across it again, I'll try to get you a repro.
Matej Sirovatka @m_sirovatka
@TheZachMueller Last challenge: write a working pre-forward hook with kwargs on an fsdp2 module 💀 (never made it work)
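For context, registering a kwargs-aware pre-forward hook in plain PyTorch looks like the sketch below (register_forward_pre_hook with with_kwargs=True; the module and hook names are made up). The hard part the tweet points at is making such a hook behave once the module is wrapped with fully_shard, which this sketch does not attempt.

# Sketch only: a kwargs-aware pre-forward hook on a plain nn.Module.
import torch
import torch.nn as nn


def log_kwargs_pre_hook(module, args, kwargs):
    # With with_kwargs=True the hook receives (module, args, kwargs) and may
    # return a (new_args, new_kwargs) tuple to rewrite the forward inputs.
    print(f"{type(module).__name__} called with kwargs: {list(kwargs)}")
    return args, kwargs


class Tiny(nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(8, 8)

    def forward(self, x, scale=1.0):
        return self.proj(x) * scale


m = Tiny()
m.register_forward_pre_hook(log_kwargs_pre_hook, with_kwargs=True)
m(torch.randn(2, 8), scale=0.5)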
Ramesh Arvind @RameshArv1nd
@intervitens Yeah I assumed there'd probably be no actual memory savings since it's casting to fp8 at each step. But I did notice pretty sizeable speedups at 8b / 32b. Was that not the case either? I guess with activation offloading and bz=1 it's probably less relevant discuss.pytorch.org/t/distributed-…
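For reference, a sketch of the dynamic-casting recipe being discussed, assuming torchao's convert_to_float8_training API and a toy model (both illustrative, and an fp8-capable GPU such as an H100 is needed for the fast matmul path): weights and activations are cast to fp8 around each matmul every step, so master weights stay in bf16/fp32 and there are no weight-memory savings; the benefit is faster matmuls.

# Sketch only: dynamic float8 training via torchao; per-step fp8 casts, no weight-memory savings.
import torch
import torch.nn as nn
from torchao.float8 import convert_to_float8_training

model = nn.Sequential(nn.Linear(4096, 4096), nn.GELU(), nn.Linear(4096, 4096)).cuda().to(torch.bfloat16)

# Swap eligible nn.Linear modules for float8 training linears.
convert_to_float8_training(model, module_filter_fn=lambda mod, fqn: isinstance(mod, nn.Linear))

x = torch.randn(16, 4096, device="cuda", dtype=torch.bfloat16)
model(x).sum().backward()  # fp8 casts happen inside each linear's matmul at every step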
vitens @intervitens
Was fun working on what seems to be the first actual finetune of Qwen3-235B on HF. Had to learn how to compile pytorch to get _grouped_mm support on sm100, and to use every single VRAM cope available in Torchtune to make it fit on the node.
🥭@MangoSweet78

huggingface.co/Aurore-Reveil/… Presenting one of the first Qwen3 large finetunes for RP - it's an experiment alright, but it's... something. Trained mostly by @intervitens and his extreme knowledge of weird pytorch optims, Torchtune, etc., and me (I sat in the corner and added hparams)

Daniel Han @danielhanchen
Gemma 3N quirks!
1. Vision NaNs on float16
2. Conv2D weights are large; FP16 overflows to infinity
3. Large activations fixed vs Gemma 3
4. 6-7 training losses: normal for multimodal?
5. Large nums in msfa_ffn_pw_proj
6. NaNs fixed in @UnslothAI
Details: docs.unsloth.ai/basics/gemma-3…
Unsloth AI@UnslothAI

You can now fine-tune Gemma 3n for free with our notebook! Unsloth makes Google Gemma training 1.5x faster with 50% less VRAM and 5x longer context lengths - with no accuracy loss. Guide: docs.unsloth.ai/basics/gemma-3… GitHub: github.com/unslothai/unsl… Colab: colab.research.google.com/github/unsloth…
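Not Unsloth's actual fix, but a common way to localize quirks like 1) and 2) is to run the model in float16 and flag the first module whose output goes non-finite; a hedged sketch (the helper name is made up):

# Sketch only: forward hooks that report the first module producing NaN/Inf outputs.
import torch
import torch.nn as nn


def install_nan_inf_probes(model: nn.Module):
    def make_hook(name):
        def hook(module, inputs, output):
            if isinstance(output, torch.Tensor) and not torch.isfinite(output).all():
                print(f"non-finite output in {name} ({type(module).__name__})")
        return hook

    for name, module in model.named_modules():
        module.register_forward_hook(make_hook(name))

# usage sketch: model = model.half().cuda(); install_nan_inf_probes(model); model(batch)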

Vlado Boza @bozavlado
Is it me, or is every AI totally incompetent with FSDP2?
Wei (Will) Feng @weifengpy
@errantdata could I know more about 2) fsdp2 + autograd. is it resize_(0) that bothers you?
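For context, a tiny illustration of the resize_(0) trick referenced here: fsdp2 frees unsharded parameter storage this way after resharding, and code that still holds the tensor (e.g. custom autograd logic) can trip over the empty storage. Sketch only:

# Sketch only: resize_(0) frees the storage while the tensor object survives.
import torch

t = torch.randn(1024, 1024)
print(t.untyped_storage().size())   # bytes currently allocated
t.untyped_storage().resize_(0)      # free the memory, keep the tensor metadata
print(t.untyped_storage().size())   # 0 -- reading t is invalid until storage is re-allocated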
Lucas Beyer (bl16) @giffmana
hey all, couple quick notes: 1) yes, we will be joining Meta. 2) no, we did not get 100M sign-on, that's fake news. Excited about what's ahead though, will share more in due time! cc @__kolesnikov__ and @XiaohuaZhai.
JingyuanLiu @JingyuanLiu123
I intended to write a blog about Muon's infra scalability, but basically it was what @SeunghyunSEO7 mentioned: it is caused by the ZeRO-1 impl difference, and dim-0 sharding is not scalable for Moonshot's impl. So I just ended up writing some fun thoughts regarding Muon's concerns on Zhihu: zhihu.com/question/19271…
Seunghyun Seo@SeunghyunSEO7

@eliebakouch @JingyuanLiu123 @zxytim we've discussed this for a while. Legacy Megatron/FSDP1 flattens all weights first and then slices, but FSDP2 uses per-param sharding, which is useful for QLoRA-like things. Moonshot's internal codebase is based on old Megatron, I guess, so it does not match FSDP2 well.
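A conceptual sketch (not the real FSDP code) of the two layouts being contrasted: flat-parameter sharding, which flattens and slices everything into one buffer, versus per-parameter dim-0 sharding, which keeps each parameter's own structure.

# Sketch only: FSDP1-style flat-param sharding vs FSDP2-style per-param dim-0 sharding.
import torch


def flat_param_shard(params, rank, world_size):
    # FSDP1 / legacy Megatron style: flatten all weights into one 1-D buffer, then slice.
    flat = torch.cat([p.reshape(-1) for p in params])
    return flat.chunk(world_size)[rank]


def per_param_dim0_shard(params, rank, world_size):
    # FSDP2 style: each parameter is sharded separately along dim 0 (a DTensor in practice),
    # so every shard keeps per-parameter structure (handy for QLoRA-like use cases).
    return [p.chunk(world_size, dim=0)[rank] for p in params]


params = [torch.randn(8, 4), torch.randn(16)]
print(flat_param_shard(params, rank=0, world_size=2).shape)                    # torch.Size([24])
print([s.shape for s in per_param_dim0_shard(params, rank=0, world_size=2)])   # [torch.Size([4, 4]), torch.Size([8])]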

Matej Sirovatka @m_sirovatka
On today's episode of obscure things happening to me in PyTorch (this is quickly becoming a series, I'm afraid): an FSDP2 submodule causes a hang, and it's impossible to debug, as even having this submodule in locals causes the debugger to hang 😭
Mark Saroufim @marksaroufim
If you're interested in FSDP2, here's a minimal example courtesy of Andrew Gu

"""
torchrun --standalone --nproc_per_node=2 test_fsdp_basic.py
"""
import os

import torch
import torch.distributed as dist
from torch.distributed._composable.fsdp import fully_shard, MixedPrecisionPolicy
from torch.testing._internal.distributed._tensor.common_dtensor import (
    ModelArgs,
    Transformer,
    TransformerBlock,
)


def main():
    torch.manual_seed(42)
    model_args = ModelArgs(
        n_layers=12,
        vocab_size=50304,
        n_heads=32,
        dim=2048,
        max_seq_len=2048,
        dropout_p=0.0,
    )
    model = Transformer(model_args)
    mp_policy = MixedPrecisionPolicy(param_dtype=torch.bfloat16)
    fsdp_cfg = {"mp_policy": mp_policy}
    for module in model.modules():
        if isinstance(module, TransformerBlock):
            fully_shard(module, **fsdp_cfg)
    fully_shard(model, **fsdp_cfg)
    optim = torch.optim.AdamW(model.parameters(), lr=1e-2)
    inp = torch.randint(0, model_args.vocab_size, (8, 1024), device="cuda")
    model(inp).sum().backward()
    optim.step()


if __name__ == "__main__":
    dist.init_process_group(backend="nccl")
    gpu_id = int(os.environ["LOCAL_RANK"])
    device = f"cuda:{gpu_id}"
    torch.cuda.set_device(device)
    rank = gpu_id
    main()
    dist.destroy_process_group()
Ben (no treats) @andersonbcdefg
@marksaroufim any plans for a blog post akin to "getting started with DDP"? it's been a while since FSDP2 came out and there are still scant resources for getting up to speed with it :'(