Ali Naeimi

15 posts

@Ali_NT99

MSc AI | Research Engineer | Distributed Pretraining Optimization

Manchester, UK · Joined July 2023
43 Following · 3 Followers
Vlado Boza@bozavlado·
I just saw a job named "mnist_train" on the cluster with H200s (140GB of VRAM). Welcome to Slovak science. We will overtake all of you soon.
Keller Jordan@kellerjordan0·
New modded-NanoGPT optimization benchmark result: @wen_kaiyue has improved upon both the Muon and AdamW baselines by replacing their weight decay with hyperball optimization. The new record is 3325 steps.
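One illustrative reading of replacing weight decay with a ball constraint (a guess at the general idea, not necessarily the exact hyperball method behind this record) is to switch the decay term off and instead project each weight tensor back onto a fixed-radius norm ball after every optimizer step. A minimal PyTorch sketch with made-up sizes, learning rate, and radius:

```python
import torch

@torch.no_grad()
def project_onto_ball(params, radius: float = 1.0):
    # Hard constraint: pull any weight tensor whose norm exceeds `radius`
    # back onto the ball's surface, instead of shrinking it every step.
    for p in params:
        norm = p.norm()
        if norm > radius:
            p.mul_(radius / norm)

model = torch.nn.Linear(256, 256)
opt = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.0)  # decay switched off

x = torch.randn(8, 256)
loss = model(x).square().mean()
loss.backward()
opt.step()
project_onto_ball(model.parameters())   # the hard constraint replaces the decay term
opt.zero_grad(set_to_none=True)
```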
Keller Jordan@kellerjordan0·
Modded-NanoGPT Optimization Benchmark

Hundreds of neural network optimizers have been proposed in the literature, recently including dozens citing Muon: MARS, SWAN, REG, ADANA, Newton-Muon, TrasMuon, AdaMuon, HTMuon, COSMOS, Conda, ASGO, SAGE, and Magma, to name a few. The majority of this innovation is happening in the public research community. But the community currently lacks a widely accepted, easily accessible way to compare and make sense of the deluge of methods. As a result, promising new ideas get buried, and spurious results go unchallenged.

To help address these issues, I'm releasing a new optimization benchmark. It's designed for maximum simplicity and speed: just a single file containing ~350 lines of plain PyTorch, which can complete a baseline LM training within 20 minutes of booting up a fresh 8xH100 machine. It also works with {1,2,4}xH100 or A100. These attributes make the new benchmark more accessible than any prior work.

The rules are simple: the optimization algorithm can be changed arbitrarily, with the goal being to minimize the number of training steps needed to reach 3.28 val loss on FineWeb (this is the same target loss as in the main speedrun). Modifying the architecture or dataloader, on the other hand, is not allowed. Wallclock time is unlimited, in order to give a fair chance to optimizers which would need kernel work or larger scale to become wallclock-efficient.

Like the main NanoGPT speedrun, submissions are open, and new results will be publicly broadcast. Beyond just improving the step count record, another goal of the benchmark is to collaboratively produce well-tuned baselines for as many optimizers as possible. For example, any improvement to the benchmark's best hyperparameters for AdamW would be considered a worthwhile new result.

This benchmark is not intended to be the final measure of optimizer quality across all domains. Convenient shared experimental infrastructure which covers the full space of possibilities -- across varying batch size, tokens per parameter, model scale, epoch count, and architecture -- is desirable, but far beyond the current status quo. This benchmark is only meant to be one step towards that goal.

To start the benchmark off, I've spent ~20 runs tuning baselines for Muon and AdamW. From time to time over the next few weeks, I'll add another optimizer from the literature, with my best effort at finding good hyperparameters. Researchers interested in neural network optimization are invited to join in by picking an optimizer and giving it a try on the benchmark. All optimizers are welcome, and even runs that don't necessarily have the best hyperparameters are desirable additions to the repo, because each new run adds to the collective knowledge.
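For readers who haven't opened the repo, the shape of a submission is roughly "change only how parameters are updated". A minimal sketch of that constraint in plain PyTorch, where the model, data, and loss are stand-ins and the AdamW hyperparameters are illustrative rather than the benchmark's tuned baseline:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(1024, 1024).to(device)   # stand-in for the fixed GPT model

# The only part a submission may change: the optimizer (and its step/schedule logic).
opt = torch.optim.AdamW(model.parameters(), lr=3e-4, betas=(0.9, 0.95), weight_decay=0.1)

for step in range(10):                            # real runs train until 3.28 val loss on FineWeb
    x = torch.randn(8, 1024, device=device)       # stand-in for the fixed dataloader
    loss = model(x).square().mean()               # stand-in for the LM loss
    loss.backward()
    opt.step()
    opt.zero_grad(set_to_none=True)
```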
Ali Naeimi@Ali_NT99·
@kellerjordan0 I think it's a good idea to add DDP instead of manual NCCL calls for better overlap and throughput on PCIe instances, since one of the goals is to make this track more accessible: github.com/KellerJordan/m…
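For context, the suggestion amounts to wrapping the model in torch's DistributedDataParallel, which buckets gradients and overlaps the all-reduce with the backward pass, rather than issuing collective calls by hand after backward. A minimal sketch launched with torchrun; the model and hyperparameters are placeholders, not the repo's actual code:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")       # torchrun supplies rank/world-size env vars
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda()    # stand-in for the real model
    model = DDP(model, device_ids=[local_rank])   # bucketed, overlapped gradient all-reduce

    opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
    for _ in range(10):
        x = torch.randn(8, 1024, device="cuda")
        loss = model(x).square().mean()
        loss.backward()                           # all-reduce overlaps with backward here
        opt.step()
        opt.zero_grad(set_to_none=True)

    dist.destroy_process_group()

if __name__ == "__main__":
    main()  # e.g. torchrun --nproc_per_node=8 ddp_sketch.py
```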
Ali Naeimi@Ali_NT99·
@htihle It might be same-ish on average, but it's a massive improvement in the worst-performing tasks compared to other models… In actual complex coding tasks that's a much more important metric imo… Would love to see how 5.5 performs too.
Håvard Ihle@htihle·
Claude Opus 4.7 (high) scores 76.4% on WeirdML and does not improve on the non-thinking version. It's probably a bit better, but within the error bar. It shows higher peak performance, with a new individual high score on 4 of the 17 tasks. It uses far fewer tokens: 10k vs 32k for 4.6 (high). I may try a run with the max setting later.
Håvard Ihle@htihle

Claude Opus 4.7 (no thinking) scores 76.4% on WeirdML, right behind gpt 5.4 (xhigh) at 77.7%, Opus 4.6 (adaptive) at 77.9%, and gpt 5.3 codex (xhigh) at 79.3%, while using an order of magnitude fewer tokens. This looks like a major step forward; things are moving fast now! Results with thinking next week.

Ali Naeimi@Ali_NT99·
@htihle Great work! Will you be testing GLM5.1?
Ali Naeimi@Ali_NT99·
@MatternJustus Great work! Any plans on benchmarking GLM 5.1 as it's the open-weight SOTA atm?
Justus Mattern@MatternJustus·
Introducing FrontierSWE, an ultra-long-horizon coding benchmark. We test agents on some of the hardest technical tasks, like optimizing a video rendering library or training a model to predict the quantum properties of molecules. Despite having 20 hours, they rarely succeed.
Ali Naeimi@Ali_NT99·
@danveloper I'm curious, did you try putting the shared experts in RAM? It should give a huge speedup with minimal extra RAM usage. Also, since you're using only 4 experts out of 10, this is extremely lossy in terms of model output quality.
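The idea being suggested, sketched below under assumptions (this is not @danveloper's actual setup, and all module names and sizes are made up): in a shared-plus-routed MoE, the shared expert fires for every token, so keeping its weights resident in RAM removes a per-token fetch, while the sparsely used routed experts can keep being streamed from slower storage.

```python
import torch
import torch.nn as nn

d = 512
shared_expert = nn.Linear(d, d)  # active on every token: keep its weights resident in RAM

def load_routed_expert(i: int) -> nn.Module:
    """Stand-in for streaming a routed expert's weights from slower storage (disk/mmap)."""
    e = nn.Linear(d, d)
    # e.load_state_dict(torch.load(f"expert_{i}.pt"))  # hypothetical on-disk checkpoint
    return e

def moe_forward(x: torch.Tensor, top_ids: list[int]) -> torch.Tensor:
    y = shared_expert(x)                  # no load cost: already resident
    for i in top_ids:                     # only the router-selected experts are fetched
        y = y + load_routed_expert(i)(x)
    return y

out = moe_forward(torch.randn(4, d), top_ids=[1, 5, 7, 9])
```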
Ali Naeimi@Ali_NT99·
@m_sirovatka Maybe try sharding at the sublayer level, i.e. shard attn and MLP separately.
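One way to read this in PyTorch FSDP terms (a sketch under assumptions: the Block/attn/mlp names are illustrative, and FSDP construction needs an initialized process group, e.g. via torchrun): give the attention and MLP sub-modules their own shard units instead of sharding only whole blocks, so each is gathered and resharded independently and the peak memory per all-gather shrinks.

```python
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

class Block(nn.Module):
    def __init__(self, d: int = 1024, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

def shard_at_sublayer_level(block: Block) -> FSDP:
    # One FSDP unit per sub-layer instead of one per block:
    # attn and mlp are gathered/resharded independently of each other.
    block.attn = FSDP(block.attn)
    block.mlp = FSDP(block.mlp)
    return FSDP(block)  # outer wrap catches whatever is left (norms, biases, ...)
```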
Ali Naeimi@Ali_NT99·
@andreslavescu I also have an mHC-lite implementation in Triton with autotune and torch.compile support, with every kernel operating at around 85% of theoretical max bandwidth, pushing the overhead down from 35% with plain torch to only 11%. Please check it out: github.com/alint77/flash-…
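For readers unfamiliar with the pattern being described, the sketch below shows the general shape of a Triton kernel using @triton.autotune over block size and warp count. It is a generic elementwise kernel for illustration only, not the actual mHC-lite kernels in the linked repo.

```python
import torch
import triton
import triton.language as tl

@triton.autotune(
    configs=[
        triton.Config({"BLOCK": 256}, num_warps=4),
        triton.Config({"BLOCK": 1024}, num_warps=8),
    ],
    key=["n"],  # re-tune when the problem size changes
)
@triton.jit
def scale_kernel(x_ptr, y_ptr, alpha, n, BLOCK: tl.constexpr):
    # Each program handles one BLOCK-sized chunk of the flattened tensor.
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    tl.store(y_ptr + offs, x * alpha, mask=mask)

def scale(x: torch.Tensor, alpha: float) -> torch.Tensor:
    y = torch.empty_like(x)
    n = x.numel()
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK"]),)
    scale_kernel[grid](x, y, alpha, n)
    return y

# Bandwidth-bound usage example (needs a GPU):
# y = scale(torch.randn(1 << 20, device="cuda"), 2.0)
```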