Sultan Alrashed retweeted

New NanoGPT Speedrun WR at 97.8s (-1.2s) from @srashedll, with an update to the attention initialization. Motivated by mimetic initialization techniques, experiments found that a small random init outperformed zero init on the attention output projection.
github.com/KellerJordan/m…
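A rough sketch of the two init choices being compared, in plain Python so it runs without any ML library. The function name `init_out_proj`, the `std=1e-3` scale, and the square `d_model x d_model` shape are assumptions for illustration only; the post does not give the actual values or code used in the record.

```python
import random


def init_out_proj(d_model, mode="small_random", std=1e-3, seed=0):
    """Build a d_model x d_model weight matrix for a hypothetical
    attention output projection.

    mode="zero":         every weight starts at exactly 0.0
    mode="small_random": weights drawn from N(0, std^2); std is an
                         assumed value, not the one from the record
    """
    if mode == "zero":
        return [[0.0] * d_model for _ in range(d_model)]
    if mode == "small_random":
        rng = random.Random(seed)  # seeded so the example is reproducible
        return [[rng.gauss(0.0, std) for _ in range(d_model)]
                for _ in range(d_model)]
    raise ValueError(f"unknown mode: {mode}")


w_zero = init_out_proj(8, mode="zero")
w_rand = init_out_proj(8, mode="small_random")

# Largest magnitude in the random init; stays tiny relative to typical weights.
max_abs = max(abs(v) for row in w_rand for v in row)
```

With zero init the attention block contributes nothing to the residual stream at step 0; with the small random init it contributes a tiny nonzero signal from the start. Why that trains faster here is not explained in the post.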


