Sultan Alrashed retweeted

New NanoGPT Speedrun WR at 97.8s (-1.2s) from @srashedll, with an update to the attention initialization. Motivated by mimetic initialization techniques, experiments found that a small random init outperformed zero init on the attention output projection.
github.com/KellerJordan/m…
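
For readers unfamiliar with the two options, here is a rough sketch of zero init versus a small random init on an output projection matrix. This is not the code from the record: the function names, the 8x8 size, and the std of 1e-3 are all assumptions chosen for illustration, and plain Python lists stand in for real tensors.

```python
import random

def init_out_proj(d_model, std=0.0, seed=0):
    """Build a d_model x d_model weight matrix (hypothetical helper).

    std == 0.0 gives zero init; std > 0 draws each entry from N(0, std^2).
    """
    if std == 0.0:
        return [[0.0] * d_model for _ in range(d_model)]
    rng = random.Random(seed)
    return [[rng.gauss(0.0, std) for _ in range(d_model)] for _ in range(d_model)]

def project(weights, x):
    """Apply the projection: one dot product per output row."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

x = [1.0] * 8                          # stand-in for an attention head output
zero_w = init_out_proj(8)              # zero init
small_w = init_out_proj(8, std=1e-3)   # small random init (std is an assumed value)

zero_out = project(zero_w, x)          # exactly zero: attention contributes nothing at step 0
small_out = project(small_w, x)        # tiny but nonzero contribution at step 0
```

With zero init the attention block's output is exactly zero at the first step, while the small random init lets a faint signal through from the start. The sketch only shows that difference in starting point; it says nothing about why one trains faster.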