自由人 @python_walker · 4 Mar
Wow 👀 Apparently the 16 bytes per parameter you needed with AdamW can be cut down to something like 7 bytes. From a quick look at the figures in the paper, model quality is basically unchanged, and the time per step even seems a bit shorter. It looks easy to use, so maybe I'll give it a try soon.
Databricks AI Research @DbrxMosaicAI

New research from Databricks AI Research: FlashOptim cuts training memory by over 50% with no measurable loss in model quality. Training a model with AdamW typically requires 16 bytes per parameter just for weights, gradients, and optimizer state. FlashOptim brings that down to 7 bytes, or 5 with gradient release. For Llama-3.1-8B finetuning, peak GPU memory drops from 175 GiB to 113 GiB. Two techniques drive this: improved master weight splitting using tighter ULP-normalized error correction, and companded optimizer state quantization that reduces quantization error and improves convergence. FlashOptim works as a drop-in replacement for SGD, AdamW, and Lion, supports distributed training with DDP and FSDP2, and is open source. Paper: arxiv.org/html/2602.2334… Source code: github.com/databricks/fla…
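The arithmetic behind the headline numbers is worth spelling out. Below is a minimal sketch of the conventional mixed-precision byte budget that the 16-bytes-per-parameter AdamW figure usually refers to; this breakdown is a standard assumption on my part, not taken from the paper:

```python
# Conventional per-parameter byte budget for mixed-precision AdamW
# training (an assumed standard breakdown, not from the FlashOptim paper).
adamw_bytes = {
    "fp32 master weights": 4,
    "bf16 model weights": 2,
    "bf16 gradients": 2,
    "fp32 Adam first moment (m)": 4,
    "fp32 Adam second moment (v)": 4,
}

total = sum(adamw_bytes.values())
print(f"AdamW: {total} bytes/param")  # 16, matching the tweet

# 7/16 = 44% of the AdamW footprint, i.e. more than 50% savings.
print(f"FlashOptim claim: 7/{total} = {7 / total:.0%} of AdamW")

# For an 8B-parameter model that is ~119 GiB of weights, gradients, and
# optimizer state alone, before activations; consistent with the quoted
# 175 GiB peak for Llama-3.1-8B finetuning.
print(f"8e9 params * 16 B = {8e9 * 16 / 2**30:.0f} GiB")
```

And since the tweet calls FlashOptim a drop-in replacement for SGD, AdamW, and Lion, adopting it should amount to a one-line change where the optimizer is constructed. The import path and class name in the sketch below are hypothetical placeholders, not taken from the (truncated) repo link:

```python
import torch
import torch.nn as nn

model = nn.Linear(4096, 4096)

# Standard baseline optimizer:
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Hypothetical swap; the real name and path would come from the repo:
# from flashoptim import FlashAdamW
# optimizer = FlashAdamW(model.parameters(), lr=1e-4)  # same call site
```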
