llm_enjoyer
334 posts

did some optimization testing on my tokenizer free model pretraining run
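(For anyone unfamiliar with "tokenizer free": the usual approach is to feed raw UTF-8 bytes instead of learned subword tokens, so the vocab is fixed at 256. A minimal sketch of that input pipeline — this is the generic byte-level idea, not necessarily this specific run's setup:)

```python
def encode(text: str) -> list[int]:
    # Tokenizer-free encoding: raw UTF-8 bytes, vocab size 256.
    # No merges, no learned vocab, no OOV tokens.
    return list(text.encode("utf-8"))

def decode(ids: list[int]) -> str:
    # Inverse mapping; "replace" guards against truncated multi-byte chars.
    return bytes(ids).decode("utf-8", errors="replace")

ids = encode("héllo")          # non-ASCII chars expand to multiple bytes
assert decode(ids) == "héllo"  # round-trips exactly
```

The trade-off is longer sequences (one step per byte rather than per subword), which is why optimization work on such runs tends to matter more than usual.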

Interesting! While its idea is basically the same as that of a concurrent work (kexue.fm/archives/11626) and similar to an earlier work (kexue.fm/archives/10815), the experimental results look quite promising. If it's really better than TC, then this is HUGE.


The newest model in the Mamba series is finally here 🐍 Hybrid models have become increasingly popular, raising the importance of designing the next generation of linear models. We've introduced several SSM-centric ideas that significantly increase Mamba-2's modeling capabilities without compromising on speed. The resulting Mamba-3 model shows noticeable performance gains over the most popular previous linear models (such as Mamba-2 and Gated DeltaNet) at all sizes. This is the first Mamba that was student-led: all credit to @aakash_lahoti @kevinyli_ @_berlinchen @caitWW9, and of course @tri_dao!
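(For context on the "linear models" mentioned above: the family is built around a recurrence that is linear in the hidden state, which is what makes both parallel training scans and O(1)-per-step inference possible. A generic diagonal state-space recurrence — illustrative only, not Mamba-3's actual parameterization; `a`, `B`, `C` are placeholder names — looks like:)

```python
import numpy as np

def linear_ssm(x, a, B, C):
    """Generic diagonal linear SSM over a scalar input sequence.

    Recurrence:  h_t = a * h_{t-1} + B * x_t   (elementwise in the state dim)
    Readout:     y_t = C . h_t
    Linearity in h is what admits fast parallel scans at train time.
    """
    h = np.zeros_like(a)
    ys = []
    for x_t in x:
        h = a * h + B * x_t   # state update, elementwise decay `a`
        ys.append(C @ h)      # scalar readout per step
    return np.array(ys)

# With zero decay the model is memoryless: y_t = C . (B * x_t)
y = linear_ssm([1.0, 2.0], a=np.zeros(2), B=np.ones(2), C=np.ones(2))
```

Selective/gated variants (Mamba, Gated DeltaNet, etc.) make `a` and `B` input-dependent per step, but keep the update linear in `h` so the scan structure survives.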

nmoe has been updated with most of the receipts / repro code for noumena.com/research github.com/Noumena-Networ…