
New research from Databricks AI Research: FlashOptim cuts training memory by over 50% with no measurable loss in model quality.

Training a model with AdamW typically requires 16 bytes per parameter just for weights, gradients, and optimizer state. FlashOptim brings that down to 7 bytes, or 5 with gradient release. For Llama-3.1-8B finetuning, peak GPU memory drops from 175 GiB to 113 GiB.

Two techniques drive this: improved master weight splitting using tighter ULP-normalized error correction, and companded optimizer state quantization that reduces quantization error and improves convergence.

FlashOptim works as a drop-in replacement for SGD, AdamW, and Lion, supports distributed training with DDP and FSDP2, and is open source.

Paper: arxiv.org/html/2602.2334…
Source code: github.com/databricks/fla…
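The bytes-per-parameter figures above can be sanity-checked with some back-of-the-envelope arithmetic. This is a minimal sketch, not anything from the FlashOptim code: the parameter count for Llama-3.1-8B is approximate, and the breakdown of the 16-byte AdamW baseline (fp32 weights + fp32 gradients + two fp32 moment tensors, 4 bytes each) is the conventional accounting, assumed here rather than taken from the paper.

```python
GIB = 1024**3  # bytes per GiB

def optimizer_memory_gib(n_params: int, bytes_per_param: int) -> float:
    """Memory for weights + gradients + optimizer state, excluding activations."""
    return n_params * bytes_per_param / GIB

n = 8_030_000_000  # Llama-3.1-8B parameter count (approximate)

adamw    = optimizer_memory_gib(n, 16)  # baseline AdamW: 4B weights + 4B grads + 8B moments
flash    = optimizer_memory_gib(n, 7)   # FlashOptim
flash_gr = optimizer_memory_gib(n, 5)   # FlashOptim with gradient release

print(f"AdamW baseline:           {adamw:6.1f} GiB")
print(f"FlashOptim:               {flash:6.1f} GiB")
print(f"FlashOptim + grad release:{flash_gr:7.1f} GiB")
```

The ~67 GiB gap between the 16- and 7-byte footprints is in the same ballpark as the reported 62 GiB drop in peak memory (175 GiB to 113 GiB); the remainder of the peak is activation memory, which both setups share.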