Bowen Peng
@bloc97_
28 posts
Joined September 2023
82 Following · 1.1K Followers
Bowen Peng@bloc97_·
@eliebakouch Interesting... local grad norm/clipping was also what we used for DeMo, and it worked just as well as global clipping, with better latency properties.
elie@eliebakouch·
i tried it on nanochat and it gives the same loss! the eval curves are a bit different, but i'd say it's noise rather than a real difference (especially since train/val losses are similar). in the local scenario, the gradient norm/clip is computed on rank 0 only
elie@eliebakouch

would it break large-scale training if we did gradient clipping on the local gradients instead of the global one? it would introduce some dependency on the distributed setting, which might cause scaling headaches ig, but at the same time, it would be more targeted toward the gradients causing explosions, no?
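The tradeoff being discussed can be sketched in a single process. This is an illustrative toy, with "ranks" simulated as plain arrays; the names and numbers are mine, not code from nanochat or DeMo:

```python
import numpy as np

def clip_to(g, max_norm, eps=1e-6):
    """Scale g down so its L2 norm is at most max_norm."""
    n = np.linalg.norm(g)
    return g * (max_norm / (n + eps)) if n > max_norm else g

def local_then_average(rank_grads, max_norm):
    """'Local' clipping: each rank clips its own gradient (no norm all-reduce),
    then the clipped gradients are averaged."""
    return np.mean([clip_to(g, max_norm) for g in rank_grads], axis=0)

def average_then_clip(rank_grads, max_norm):
    """'Global' clipping: average first, then clip the combined gradient
    (in a real job this needs a cross-rank reduction of the norm)."""
    return clip_to(np.mean(rank_grads, axis=0), max_norm)

# One rank explodes in a direction orthogonal to the healthy gradients.
grads = [np.array([0.1, 0.0]), np.array([0.1, 0.0]), np.array([0.0, 100.0])]
local_g = local_then_average(grads, max_norm=1.0)
global_g = average_then_clip(grads, max_norm=1.0)
# Local clipping tames only the exploding rank, so the healthy ranks'
# component survives; global clipping scales everything down together.
```

In this toy, the local variant preserves the healthy ranks' direction much better, which matches the "more targeted toward the gradient causing explosions" intuition above.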

Arthur Douillard@Ar_Douillard·
What's up with @NousResearch's Psyche distributed run? It's at 100% compute pool capacity and only doing 288 ktok/s. At that rate it'll take 2+ years to finish training a 40B dense model. What's blocking?
Bowen Peng@bloc97_·
@Ar_Douillard @eliebakouch @EMostaque @NousResearch We're not bottlenecked by data transfers; it's just that the MFU is bad, because naive TP on such a small network (also, micro batch size is 1 to fit on 8x80GB) is inefficient if you don't overlap the communications. FSDP is much better
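For context, MFU can be back-of-envelope'd from throughput using the common ~6·params FLOPs-per-token estimate for a dense LLM. A sketch; the GPU count and per-GPU peak FLOP/s below are placeholder assumptions, not Psyche's actual pool figures:

```python
def mfu(params, tokens_per_sec, num_gpus, peak_flops_per_gpu):
    """Model FLOPs Utilization: achieved training FLOP/s over aggregate peak
    FLOP/s, using the ~6 * params FLOPs-per-token estimate for dense LLMs."""
    achieved = 6.0 * params * tokens_per_sec
    return achieved / (num_gpus * peak_flops_per_gpu)

# 40B params at 288 ktok/s are figures from the thread; num_gpus and
# peak_flops_per_gpu here are hypothetical placeholders.
estimate = mfu(params=40e9, tokens_per_sec=288e3,
               num_gpus=100, peak_flops_per_gpu=1e15)
```

Plugging in the real pool size and hardware peaks would give the utilization number the thread is debating.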
Bowen Peng@bloc97_·
@Ar_Douillard @eliebakouch @EMostaque @NousResearch We used a naive eager Tensor Parallel implementation to fit the 40B on a single 8xH100 node at first just to get the training started. We will be swapping it out soon with HSDP (FSDP2 inside a node + DeMo for the outer HSDP comms), which should give us the maximum MFU we can get
Bowen Peng@bloc97_·
@eliebakouch @EMostaque @Ar_Douillard @NousResearch This run will continue for a little more time, as its principal goal was to iron out all the issues that we've encountered in production. The team has made a lot of progress on maintaining the infrastructure's stability, and we've learned a lot about large-scale pretraining.
Bowen Peng@bloc97_·
@eliebakouch @EMostaque @Ar_Douillard @NousResearch Turns out model init and correct hyperparameters matter even more as you scale up the model size and data scale, especially when using low-precision training. This was somehow not apparent at the 1B and 7B model sizes...
elie@eliebakouch·
Does YaRN actually require further finetuning? Qwen3/2.5 seem to only use it at inference without retraining, and I observed on a model that it already works very well without finetuning. Whereas methods like Llama3 or LongRoPE rope scaling don't seem to work without finetuning.
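One reason YaRN can often be applied at inference time without retraining is its "NTK-by-parts" interpolation: high-frequency RoPE dimensions are left untouched, so local token relationships survive unchanged. A rough sketch following the YaRN paper's scheme; the alpha/beta ramp constants, dimension count, and context length here are illustrative defaults, not any particular model's:

```python
import math

def yarn_freqs(dim, base=10000.0, scale=8.0, orig_ctx=4096, alpha=1.0, beta=32.0):
    """Remap RoPE frequencies YaRN-style ("NTK-by-parts"): dimensions that
    rotate many times within the original context are kept as-is, dimensions
    that rotate less than once are fully interpolated (divided by `scale`),
    and a linear ramp blends the two regimes in between."""
    out = []
    for i in range(0, dim, 2):
        f = base ** (-i / dim)                 # original RoPE frequency
        wavelength = 2 * math.pi / f
        r = orig_ctx / wavelength              # rotations over original context
        if r < alpha:
            g = 0.0                            # low frequency: interpolate fully
        elif r > beta:
            g = 1.0                            # high frequency: leave unchanged
        else:
            g = (r - alpha) / (beta - alpha)   # blend region
        out.append(g * f + (1 - g) * f / scale)
    return out

fs = yarn_freqs(64)
```

Because only the low-frequency (long-wavelength) dimensions are squeezed, the positions the model already learned behave almost identically, which is consistent with the inference-only behavior observed above.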
Bowen Peng retweeted
Nous Research@NousResearch·
Announcing the launch of Psyche nousresearch.com/nous-psyche/

Nous Research is democratizing the development of Artificial Intelligence. Today, we're embarking on our greatest effort to date to make that mission a reality: the Psyche Network.

Psyche is a decentralized training network that makes it possible to bring the world's compute together to train powerful AI, giving individuals and small communities access to the resources required to create new, interesting, and unique large-scale models.

We are launching our testnet today with the pre-training of a 40B parameter LLM, a model powerful enough to serve as a foundation for future pursuits in open science. This run represents the largest pre-training run conducted over the internet to date, surpassing previous iterations that trained smaller models on much fewer data tokens.
Bowen Peng@bloc97_·
@fjzzq2002 @nrehiew_ Our work on YaRN also introduced the scaling factor s for softmax temperature scaling. Su proposed the log n scaling, which I guess, when combined, yields this new paper's method, which scales with s log n. Interesting that this new paper didn't cite any of the prior work.
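For reference, the two ingredients being combined can be written down in a few lines. A minimal sketch, using the constants from the YaRN paper and Su's log-n proposal; the function names are mine:

```python
import math

def yarn_mscale(s):
    """YaRN's attention-temperature factor: q and k are each scaled by
    sqrt(1/t) = 0.1 * ln(s) + 1, where s is the context extension factor
    (so the attention logits are effectively scaled by 1/t)."""
    return 0.1 * math.log(s) + 1.0

def log_n_scale(n, train_ctx):
    """Su's log-n scaling: multiply the query by log(n)/log(train_ctx)
    once the position n exceeds the training context."""
    return max(1.0, math.log(n) / math.log(train_ctx))
```

With s = 1 (no extension) both factors reduce to 1, so the scalings only activate beyond the original context; a combined method would apply both multipliers to the logits, giving the s·log n dependence mentioned above.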
wh@nrehiew_·
Llama4 cites this attention scaling paper, which scales the query states to allow softmax to work better at longer context. quick tldr
Bowen Peng@bloc97_·
@Yuchenj_UW @shivani_3000 @NousResearch DeMo is really hard to get working for smaller models. In the paper, our 300M model setup is very particular. It's possible, but we found that smaller models are very sensitive to optimizer type and hyperparameters. Would be happy to help you figure out why GPT-2 converges worse!
Yuchen Jin@Yuchenj_UW·
@shivani_3000 @NousResearch Great to hear! I'd love to train a 1.6B GPT-2 with DeMo too! Did you see the same thing, that DeMo converges more slowly than AdamW for models < 1B?
Yuchen Jin@Yuchenj_UW·
Sharing my experiments and thoughts on decentralized training: I trained GPT-2 (124M) with @NousResearch's DeMo optimizer, but AdamW is 1.5x more token-efficient.

I was excited to see that Nous trained a 15B LLM using global GPUs on the Internet, leveraging the proposed DeMo optimizer. Their claims of achieving a loss curve and convergence rate comparable to or exceeding centralized training were intriguing. I wanted an apples-to-apples comparison between DeMo and AdamW in a *centralized* setting, so I adapted the NanoGPT code in the llm.c repo to use DeMo and trained GPT-2 (124M) on 10B FineWeb tokens using 8x H100s.

Experiment: I swept through quite a few LRs for DeMo and plotted the best-performing one (cosine LR schedule with peak LR=1.8e-3) with various k values, where k is a hyperparameter in DeMo, and GPU communication volume increases with k.

Result: DeMo converges more slowly than AdamW. To reach the 3.22 validation loss achieved by AdamW after training on 10B tokens, DeMo requires 50% more fresh tokens. I link the training code and logs in the reply. I'm not 100% sure I used DeMo in its optimal way and would love feedback from the Nous team (@bloc97_, @theemozilla, @teknium) or anyone.

Notes:
1. Communication cost: DeMo reduces communication volume between the GPUs by sending only the top-k coefficients after a DCT transform instead of full gradients. However, it does not reduce communication frequency: GPUs still need to sync every step.
2. Token speed: The per-step token training speed with DeMo is 2x slower compared to the original AdamW implementation. My gut feeling is that DeMo introduces additional computational complexity; also, standard optimizers like AdamW are heavily optimized with torch.compile, and PyTorch might lack specialized CUDA kernels for the computations in DeMo.

Thoughts: Reducing bandwidth usage is a critical step toward making large-scale "decentralized" training more feasible, but it's only one piece of the puzzle.
1. Latency: Internet networks are far more unpredictable and a lot worse than datacenter networks (with highly efficient architectures, switches, and RDMA). A single straggler on the Internet can become a bottleneck if gradients still need to be synced every step.
2. Network reliability: Unlike tightly controlled local data centers, global networks contend with variable congestion, packet loss, and jitter, all of which can slow down training.
3. Fault tolerance: At the scale of 1K or even 10K GPU training, hardware and software failures become very frequent. Building good fault-tolerant systems (for failure detection and recovery) for decentralized training is a significant challenge. The Llama3 paper has a great infrastructure section on that.
4. Heterogeneity: Coordinating heterogeneous compute resources, such as different types of GPUs, adds another layer of complexity.

I think decentralized training is promising and it's awesome to see people working on it. The path to decentralized training likely needs a fundamental breakthrough in optimization methods first: very infrequent synchronization, minimal data transfer when syncs do happen, and maintained training stability and convergence. Then what remains are the systems problems.
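The compression described in note 1 can be illustrated in a toy single-tensor form. This is my own sketch of the top-k-after-DCT idea, not the actual DeMo optimizer (which operates on chunked momentum tensors):

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix C, so C @ C.T == I and C.T inverts the transform."""
    k = np.arange(n)[:, None]
    m = np.arange(n)[None, :]
    C = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * m + 1) * k / (2 * n))
    C[0] /= np.sqrt(2.0)
    return C

def compress(grad, k):
    """Keep only the k largest-magnitude DCT coefficients of a 1-D tensor;
    (indices, values) is what a rank would put on the wire."""
    coeffs = dct_matrix(len(grad)) @ grad
    idx = np.argsort(np.abs(coeffs))[-k:]      # top-k by magnitude
    return idx, coeffs[idx]

def decompress(idx, vals, n):
    """Scatter the received coefficients back and invert the DCT."""
    coeffs = np.zeros(n)
    coeffs[idx] = vals
    return dct_matrix(n).T @ coeffs

# Smooth signals concentrate their energy in few DCT coefficients,
# so an 8x volume reduction loses relatively little.
rng = np.random.default_rng(0)
g = np.cumsum(rng.standard_normal(256)) / 16.0   # a smooth-ish toy "gradient"
idx, vals = compress(g, k=32)
g_hat = decompress(idx, vals, len(g))
```

Note that only the indices and values cross the network (k of n entries), which reduces volume but, as the post says, not frequency: this exchange still happens every step.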
Nous Research@NousResearch·
Nous Research announces the pre-training of a 15B parameter language model over the internet, using Nous DisTrO and heterogeneous hardware contributed by our partners at @Oracle, @LambdaAPI, @NorthernDataGrp, @CrusoeCloud, and the Andromeda Cluster. This run presents a loss curve and convergence rate that meets or exceeds centralized training. Our paper and code on DeMo, the foundational research that led to Nous DisTrO, is now available (linked below).
Bowen Peng@bloc97_·
@iotcoi @teknium DeMo is the fundamental research behind DisTrO. We had to share this breakthrough with the open research community.
Teknium (e/λ)@Teknium·
3 months after the preliminary report on DisTrO, we've got a massive 15B run happening completely distributed across the internet, live. You can watch it here: distro.nousresearch.com

The paper & code for the optimizer have been released as well:
Paper: arxiv.org/abs/2411.19870
Code: github.com/bloc97/DeMo
Nous Research@NousResearch

