
New research from Databricks AI Research: FlashOptim cuts training memory by over 50% with no measurable loss in model quality.

Training a model with AdamW typically requires 16 bytes per parameter just for weights, gradients, and optimizer state. FlashOptim brings that down to 7 bytes, or 5 with gradient release. For Llama-3.1-8B finetuning, peak GPU memory drops from 175 GiB to 113 GiB.

Two techniques drive this: improved master weight splitting using tighter ULP-normalized error correction, and companded optimizer state quantization that reduces quantization error and improves convergence.

FlashOptim works as a drop-in replacement for SGD, AdamW, and Lion, supports distributed training with DDP and FSDP2, and is open source.

Paper: arxiv.org/html/2602.2334…
Source code: github.com/databricks/fla…
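The bytes-per-parameter figures above can be sanity-checked with some back-of-the-envelope arithmetic. This is a minimal sketch, not anything from the FlashOptim code: the parameter count for Llama-3.1-8B is approximate, and the breakdown of the 16-byte AdamW baseline (fp32 weights + fp32 gradients + two fp32 moment tensors, 4 bytes each) is the conventional accounting, assumed here rather than taken from the paper.

```python
GIB = 1024**3  # bytes per GiB

def optimizer_memory_gib(n_params: int, bytes_per_param: int) -> float:
    """Memory for weights + gradients + optimizer state, excluding activations."""
    return n_params * bytes_per_param / GIB

n = 8_030_000_000  # Llama-3.1-8B parameter count (approximate)

adamw    = optimizer_memory_gib(n, 16)  # baseline AdamW: 4B weights + 4B grads + 8B moments
flash    = optimizer_memory_gib(n, 7)   # FlashOptim
flash_gr = optimizer_memory_gib(n, 5)   # FlashOptim with gradient release

print(f"AdamW baseline:           {adamw:6.1f} GiB")
print(f"FlashOptim:               {flash:6.1f} GiB")
print(f"FlashOptim + grad release:{flash_gr:7.1f} GiB")
```

The ~67 GiB gap between the 16- and 7-byte footprints is in the same ballpark as the reported 62 GiB drop in peak memory (175 GiB to 113 GiB); the remainder of the peak is activation memory, which both setups share.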