Bowen Peng
@bloc97_
28 posts
Joined September 2023
82 Following · 1.1K Followers
Bowen Peng@bloc97_·
@eliebakouch Interesting... local grad norm/clipping was also what we used for DeMo, and it worked just as well as global clipping, with better latency properties.
elie@eliebakouch·
i tried it on nanochat and it gives the same loss! the eval curves are a bit different, but i'd say it's noise rather than a real difference (especially since train/val losses are similar). in the local scenario, the gradient norm/clip is computed on rank 0 only
elie@eliebakouch

would it break large-scale training if we did gradient clipping on the local gradients instead of the global one? it would introduce some dependency on the distributed setting, which might cause scaling headaches ig, but at the same time, it would be more targeted toward the gradients causing explosions, no?
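The tradeoff being discussed can be sketched in a single process. This is an illustrative toy, with "ranks" simulated as plain arrays; the names and numbers are mine, not code from nanochat or DeMo:

```python
import numpy as np

def clip_to(g, max_norm, eps=1e-6):
    """Scale g down so its L2 norm is at most max_norm."""
    n = np.linalg.norm(g)
    return g * (max_norm / (n + eps)) if n > max_norm else g

def local_then_average(rank_grads, max_norm):
    """'Local' clipping: each rank clips its own gradient (no norm all-reduce),
    then the clipped gradients are averaged."""
    return np.mean([clip_to(g, max_norm) for g in rank_grads], axis=0)

def average_then_clip(rank_grads, max_norm):
    """'Global' clipping: average first, then clip the combined gradient
    (in a real job this needs a cross-rank reduction of the norm)."""
    return clip_to(np.mean(rank_grads, axis=0), max_norm)

# One rank explodes in a direction orthogonal to the healthy gradients.
grads = [np.array([0.1, 0.0]), np.array([0.1, 0.0]), np.array([0.0, 100.0])]
local_g = local_then_average(grads, max_norm=1.0)
global_g = average_then_clip(grads, max_norm=1.0)
# Local clipping tames only the exploding rank, so the healthy ranks'
# component survives; global clipping scales everything down together.
```

In this toy, the local variant preserves the healthy ranks' direction much better, which matches the "more targeted toward the gradient causing explosions" intuition above.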

Arthur Douillard@Ar_Douillard·
What's up with @NousResearch's Psyche distributed run? It's at 100% compute pool capacity and only doing 288 ktok/s. At that rate it'll take 2+ years to finish training a 40B dense model. What's blocking?
Bowen Peng@bloc97_·
@Ar_Douillard @eliebakouch @EMostaque @NousResearch We're not bottlenecked by data transfers; it's just that the MFU is bad, because naive TP on such a small network (also, micro batch size is 1 to fit on 8x80GB) is inefficient if you don't overlap the communications. FSDP is much better
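For context, MFU can be back-of-envelope'd from throughput using the common ~6·params FLOPs-per-token estimate for a dense LLM. A sketch; the GPU count and per-GPU peak FLOP/s below are placeholder assumptions, not Psyche's actual pool figures:

```python
def mfu(params, tokens_per_sec, num_gpus, peak_flops_per_gpu):
    """Model FLOPs Utilization: achieved training FLOP/s over aggregate peak
    FLOP/s, using the ~6 * params FLOPs-per-token estimate for dense LLMs."""
    achieved = 6.0 * params * tokens_per_sec
    return achieved / (num_gpus * peak_flops_per_gpu)

# 40B params at 288 ktok/s are figures from the thread; num_gpus and
# peak_flops_per_gpu here are hypothetical placeholders.
estimate = mfu(params=40e9, tokens_per_sec=288e3,
               num_gpus=100, peak_flops_per_gpu=1e15)
```

Plugging in the real pool size and hardware peaks would give the utilization number the thread is debating.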
Bowen Peng@bloc97_·
@Ar_Douillard @eliebakouch @EMostaque @NousResearch We used a naive eager Tensor Parallel implementation to fit the 40B on a single 8xH100 node at first just to get the training started. We will be swapping it out soon with HSDP (FSDP2 inside a node + DeMo for the outer HSDP comms), which should give us the maximum MFU we can get
Bowen Peng@bloc97_·
@eliebakouch @EMostaque @Ar_Douillard @NousResearch This run will continue for a little more time, as its principal goal was to iron out all the issues that we've encountered in production. The team has made a lot of progress on maintaining the infrastructure's stability, and we've learned a lot about large-scale pretraining.
Bowen Peng@bloc97_·
@eliebakouch @EMostaque @Ar_Douillard @NousResearch Turns out model init and correct hyperparameters matter even more as you scale up the model size and data scale, especially when using low-precision training. This was somehow not apparent at the 1B and 7B model sizes...
elie@eliebakouch·
Does YaRN actually require further finetuning? Qwen3/2.5 seem to only use it at inference without retraining, and I observed on a model that it already works very well without finetuning. Whereas methods like Llama3 or LongRoPE rope scaling don't seem to work without finetuning.
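One reason YaRN can often be applied at inference time without retraining is its "NTK-by-parts" interpolation: high-frequency RoPE dimensions are left untouched, so local token relationships survive unchanged. A rough sketch following the YaRN paper's scheme; the alpha/beta ramp constants, dimension count, and context length here are illustrative defaults, not any particular model's:

```python
import math

def yarn_freqs(dim, base=10000.0, scale=8.0, orig_ctx=4096, alpha=1.0, beta=32.0):
    """Remap RoPE frequencies YaRN-style ("NTK-by-parts"): dimensions that
    rotate many times within the original context are kept as-is, dimensions
    that rotate less than once are fully interpolated (divided by `scale`),
    and a linear ramp blends the two regimes in between."""
    out = []
    for i in range(0, dim, 2):
        f = base ** (-i / dim)                 # original RoPE frequency
        wavelength = 2 * math.pi / f
        r = orig_ctx / wavelength              # rotations over original context
        if r < alpha:
            g = 0.0                            # low frequency: interpolate fully
        elif r > beta:
            g = 1.0                            # high frequency: leave unchanged
        else:
            g = (r - alpha) / (beta - alpha)   # blend region
        out.append(g * f + (1 - g) * f / scale)
    return out

fs = yarn_freqs(64)
```

Because only the low-frequency (long-wavelength) dimensions are squeezed, the positions the model already learned behave almost identically, which is consistent with the inference-only behavior observed above.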
Bowen Peng retweeted
Nous Research@NousResearch·
Announcing the launch of Psyche nousresearch.com/nous-psyche/

Nous Research is democratizing the development of Artificial Intelligence. Today, we're embarking on our greatest effort to date to make that mission a reality: the Psyche Network.

Psyche is a decentralized training network that makes it possible to bring the world's compute together to train powerful AI, giving individuals and small communities access to the resources required to create new, interesting, and unique large-scale models.

We are launching our testnet today with the pre-training of a 40B parameter LLM, a model powerful enough to serve as a foundation for future pursuits in open science. This run represents the largest pre-training run conducted over the internet to date, surpassing previous iterations that trained smaller models on much fewer data tokens.
Bowen Peng@bloc97_·
@fjzzq2002 @nrehiew_ Our work on YaRN also introduced the scaling factor s for softmax temperature scaling. Su proposed the log n scaling, which I guess, when combined, yields this new paper's method, which scales with s log n. Interesting that this new paper didn't cite any of the prior work.
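For reference, the two ingredients being combined can be written down in a few lines. A minimal sketch, using the constants from the YaRN paper and Su's log-n proposal; the function names are mine:

```python
import math

def yarn_mscale(s):
    """YaRN's attention-temperature factor: q and k are each scaled by
    sqrt(1/t) = 0.1 * ln(s) + 1, where s is the context extension factor
    (so the attention logits are effectively scaled by 1/t)."""
    return 0.1 * math.log(s) + 1.0

def log_n_scale(n, train_ctx):
    """Su's log-n scaling: multiply the query by log(n)/log(train_ctx)
    once the position n exceeds the training context."""
    return max(1.0, math.log(n) / math.log(train_ctx))
```

With s = 1 (no extension) both factors reduce to 1, so the scalings only activate beyond the original context; a combined method would apply both multipliers to the logits, giving the s·log n dependence mentioned above.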
wh@nrehiew_·
Llama4 cites this attention scaling paper, which scales the query states to allow softmax to work better at longer context. quick tldr
Bowen Peng@bloc97_·
@Yuchenj_UW @shivani_3000 @NousResearch DeMo is really hard to get working for smaller models. In the paper, our 300M model setup is very particular. It's possible, but we found that smaller models are very sensitive to optimizer type and hyperparameters. Would be happy to help you figure out why GPT-2 converges worse!
Yuchen Jin@Yuchenj_UW·
@shivani_3000 @NousResearch Great to hear! I'd love to train a 1.6B GPT-2 with DeMo too! Did you see the same thing, that DeMo converges more slowly than AdamW for models < 1B?
Yuchen Jin@Yuchenj_UW·
Sharing my experiments and thoughts on decentralized training: I trained GPT-2 (124M) with @NousResearch's DeMo optimizer, but AdamW is 1.5x more token-efficient.

I was excited to see that Nous trained a 15B LLM using global GPUs on the Internet, leveraging the proposed DeMo optimizer. Their claims of achieving a loss curve and convergence rate comparable to or exceeding centralized training were intriguing. I wanted an apples-to-apples comparison between DeMo and AdamW in a *centralized* setting, so I adapted the NanoGPT code in the llm.c repo to use DeMo and trained GPT-2 (124M) on 10B FineWeb tokens using 8x H100s.

Experiment: I swept through quite a few LRs for DeMo and plotted the best-performing one (cosine LR schedule with peak LR=1.8e-3) with various k values, where k is a hyperparameter in DeMo, and GPU communication volume increases with k.

Result: DeMo converges more slowly than AdamW. To reach the 3.22 validation loss achieved by AdamW after training on 10B tokens, DeMo requires 50% more fresh tokens. I link the training code and logs in the reply. I'm not 100% sure I used DeMo in its optimal way and would love feedback from the Nous team (@bloc97_, @theemozilla, @teknium) or anyone.

Notes:
1. Communication cost: DeMo reduces communication volume between the GPUs by sending only the top-k coefficients after a DCT transform instead of full gradients. However, it does not reduce communication frequency: GPUs still need to sync every step.
2. Token speed: The per-step token training speed with DeMo is 2x slower compared to the original AdamW implementation. My gut feeling is that DeMo introduces additional computational complexity; also, standard optimizers like AdamW are heavily optimized with torch.compile, and PyTorch might lack specialized CUDA kernels for the computations in DeMo.

Thoughts: Reducing bandwidth usage is a critical step toward making large-scale "decentralized" training more feasible, but it's only one piece of the puzzle.
1. Latency: Internet networks are far more unpredictable and a lot worse than datacenter networks (with highly efficient architectures, switches, and RDMA). A single straggler on the Internet can become a bottleneck if gradients still need to be synced every step.
2. Network reliability: Unlike tightly controlled local data centers, global networks contend with variable congestion, packet loss, and jitter, all of which can slow down training.
3. Fault tolerance: At the scale of 1K or even 10K GPU training, hardware and software failures become very frequent. Building good fault-tolerant systems (for failure detection and recovery) for decentralized training is a significant challenge. The Llama3 paper has a great infrastructure section on that.
4. Heterogeneity: Coordinating heterogeneous compute resources, such as different types of GPUs, adds another layer of complexity.

I think decentralized training is promising and it's awesome to see people working on it. The path to decentralized training likely needs a fundamental breakthrough in optimization methods first: very infrequent synchronization, minimal data transfer when syncs do happen, and maintained training stability and convergence. Then what remains are the systems problems.
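The compression described in note 1 can be illustrated in a toy single-tensor form. This is my own sketch of the top-k-after-DCT idea, not the actual DeMo optimizer (which operates on chunked momentum tensors):

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix C, so C @ C.T == I and C.T inverts the transform."""
    k = np.arange(n)[:, None]
    m = np.arange(n)[None, :]
    C = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * m + 1) * k / (2 * n))
    C[0] /= np.sqrt(2.0)
    return C

def compress(grad, k):
    """Keep only the k largest-magnitude DCT coefficients of a 1-D tensor;
    (indices, values) is what a rank would put on the wire."""
    coeffs = dct_matrix(len(grad)) @ grad
    idx = np.argsort(np.abs(coeffs))[-k:]      # top-k by magnitude
    return idx, coeffs[idx]

def decompress(idx, vals, n):
    """Scatter the received coefficients back and invert the DCT."""
    coeffs = np.zeros(n)
    coeffs[idx] = vals
    return dct_matrix(n).T @ coeffs

# Smooth signals concentrate their energy in few DCT coefficients,
# so an 8x volume reduction loses relatively little.
rng = np.random.default_rng(0)
g = np.cumsum(rng.standard_normal(256)) / 16.0   # a smooth-ish toy "gradient"
idx, vals = compress(g, k=32)
g_hat = decompress(idx, vals, len(g))
```

Note that only the indices and values cross the network (k of n entries), which reduces volume but, as the post says, not frequency: this exchange still happens every step.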
Nous Research@NousResearch·
Nous Research announces the pre-training of a 15B parameter language model over the internet, using Nous DisTrO and heterogeneous hardware contributed by our partners at @Oracle, @LambdaAPI, @NorthernDataGrp, @CrusoeCloud, and the Andromeda Cluster. This run presents a loss curve and convergence rate that meets or exceeds centralized training. Our paper and code on DeMo, the foundational research that led to Nous DisTrO, is now available (linked below).
Bowen Peng@bloc97_·
@iotcoi @teknium DeMo is the fundamental research behind DisTrO. We had to share this breakthrough with the open research community.
Teknium (e/λ)@Teknium·
3 months after the preliminary report on DisTrO, we've got a massive 15B run happening completely distributed across the internet, live. You can watch it here: distro.nousresearch.com

The paper & code for the optimizer have been released as well:
Paper: arxiv.org/abs/2411.19870
Code: github.com/bloc97/DeMo
Nous Research@NousResearch

