Sabitlenmiş Tweet

Training run 10K steps. ~4B tokens. 225M params. 3× RTX 4090s on novita ai.
Throughput sat rock solid at ~39.5K tok/s the whole run, DDP was clean. Loss opened at 10.15, closed at 2.73. Curve looks exactly like it should.
Could've used Karpathy's autoresearch loop to squeeze more out of the setup. Chose not to. Wanted to see what best can I get by hand: scheduler, buffer cycling, DDP sync, the whole thing. You learn differently when you can't outsource the debugging.
Still was stopping just shy of Chinchilla optimal. Loss was still dropping at step 10K. The $25 budget made the call, not me.

English





















