Albert Gu

527 posts

Albert Gu
@_albertgu

assistant prof @mldcmu. chief scientist @cartesia_ai. leading the ssm revolution.

Joined December 2018
77 Following · 20.3K Followers

Pinned Tweet
Albert Gu @_albertgu
The newest model in the Mamba series is finally here 🐍 Hybrid models have become increasingly popular, raising the importance of designing the next generation of linear models. We've introduced several SSM-centric ideas to significantly increase Mamba-2's modeling capabilities without compromising on speed. The resulting Mamba-3 model has noticeable performance gains over the most popular previous linear models (such as Mamba-2 and Gated DeltaNet) at all sizes. This is the first Mamba that was student led: all credit to @aakash_lahoti @kevinyli_ @_berlinchen @caitWW9, and of course @tri_dao!
Albert Gu retweeted
Albert Gu @_albertgu
thanks to @cartesia for supporting this project, providing compute, and testing the models! we believe that such research advances are highly impactful for natural, real-time intelligence, and we continue to invest in the frontier of efficient models. Blog cross-posted: blog.cartesia.ai/p/mamba-3
Cartesia @cartesia

Mamba-3 is out! 🐍 SSMs marked a major advance for the efficiency of modern LLMs. Mamba-3 takes the next step, shaping SSMs for a world where AI workloads are increasingly dominated by inference. Read about it on the Cartesia blog: blog.cartesia.ai/p/mamba-3

Albert Gu retweeted
Albert Gu @_albertgu
@giffmana Haha it occurred to me that just like the trapezoid rule is a width-2 conv, Runge-Kutta 4 might look like a width-4 conv, which is the size that previous Mambas used
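The trapezoid-as-conv observation above can be checked numerically. A minimal sketch (sample values are illustrative, not from the thread):

```python
import numpy as np

# The trapezoid rule's per-step increment (f[t] + f[t+1]) / 2 is exactly a
# width-2 convolution of the samples with the kernel [1/2, 1/2].
f = np.array([0.0, 1.0, 4.0, 9.0, 16.0])  # samples of f(t) = t^2 at t = 0..4
increments = np.convolve(f, [0.5, 0.5], mode="valid")  # width-2 conv
trapz_steps = (f[:-1] + f[1:]) / 2                     # textbook trapezoid increments
assert np.allclose(increments, trapz_steps)
```

Whether RK4's four stage evaluations line up as cleanly with a width-4 conv over input samples is, as the tweet hedges, only an analogy.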
Albert Gu @_albertgu
Thanks for reading - we're so excited to see how people use Mamba-3! Also check out the other posts, including Tri’s explanation of MIMO as well as the students’ threads: x.com/tri_dao/status… x.com/kevinyli_/stat… x.com/aakash_lahoti/… x.com/_berlinchen/st… x.com/caitWW9/status…
Berlin Chen @_berlinchen

Had so much fun working on Mamba-3 with my wonderful collaborators: @aakash_lahoti @kevinyli_ @caitWW9 @avivbick @zicokolter @tri_dao @_albertgu. We treated inference as a first-class citizen from day one. This leads to some surprisingly powerful results 👇

Albert Gu @_albertgu
Aside from the core SSM, we also adjusted the overall architecture to align more with modern LMs, such as adding BC/QK norm. The most important change to me is that Mamba-3 no longer needs the short conv, which has been essential for linear models but is very inelegant.
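For readers unfamiliar with QK norm: a minimal numerical sketch of the idea (shapes and names are illustrative assumptions, not Mamba-3's actual implementation):

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # RMSNorm along the last (head) dimension; learned scale omitted for brevity
    return x / np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps)

# QK norm (BC norm in SSM terms): normalize the query/key (or B/C) projections
# before they interact, which bounds the scale of their inner products
# regardless of the magnitude of the raw projections.
rng = np.random.default_rng(0)
q = rng.standard_normal((4, 64))  # (seq_len, head_dim), illustrative shapes
k = rng.standard_normal((4, 64))
q, k = rms_norm(q), rms_norm(k)
logits = q @ k.T / np.sqrt(q.shape[-1])  # well-scaled interaction scores
```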
Albert Gu @_albertgu
> an example of this is that in hybrid models, sometimes "stronger" linear layers can lead to overall weaker models because it incentivizes the global attention to be "lazy"

some people asked about this. i think this is a somewhat folklore result that I don't have a reference for, but here's another recent result that's similar: arxiv.org/abs/2509.24552

this is an example of a related phenomenon where, in a SWA+xLSTM model, longer SWA windows led to worse long-context performance because they encouraged the xLSTM layers to be lazy
Albert Gu @_albertgu
okay this plot and discussion has blown up more than expected so let me try to leave some candid thoughts

1. i don't believe that the intent of Mayank's tweet was to claim "Mamba-2 > GDN". the primary intent was to convey that the initialization for Mamba-2 makes a huge difference; a secondary point was that in *this particular setting* Mamba-2 seemed to outperform GDN (after fixing the init)

2. in my personal opinion and experience, Mamba-2 should generally be a faster but weaker version of GDN. after all, GDN is literally built on top of Mamba-2 by adding a more expressive (rank-1) component to the state transition

3. however, we also know that different parts of neural network architectures can interact in unexpected ways. an example of this is that in hybrid models, sometimes "stronger" linear layers can lead to overall weaker models because they incentivize the global attention to be "lazy". also, other downstream capabilities are not necessarily correlated with loss

4. for this particular plot (7B/1B MoE), i have never personally tested a pure Mamba+MoE model so i can't vouch for the results, but it seems plausible that there are unforeseen interactions with MoE. Mayank has also shown his reproductions in other settings (e.g. 400M dense) where GDN is slightly better than Mamba-2, which tracks what i'd expect, so i think there's nothing obvious to suspect in this plot

5. i also want to emphasize that many follow-up results shouldn't have issues (including the original GDN paper). however, there are also probably a non-trivial number of results that did have a bug. regardless, hopefully this raises more awareness of initialization-related issues!

tl;dr
- the main takeaway is the init bug; some downstream results are affected, some aren't
- Mamba-2 vs GDN is a much more nuanced question
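Point 2 (GDN as Mamba-2 plus a rank-1 state-transition term) can be sketched in the usual linear-attention notation; the symbols below ($S_t$, $k_t$, $v_t$, $\alpha_t$, $\beta_t$) are my choice, not from the thread:

```latex
% Mamba-2: the state transition is a scalar gate (a multiple of the identity)
S_t = \alpha_t S_{t-1} + v_t k_t^\top
% Gated DeltaNet: the same scalar gate, composed with a rank-1 (delta-rule) correction
S_t = S_{t-1} \, \alpha_t \bigl( I - \beta_t k_t k_t^\top \bigr) + \beta_t v_t k_t^\top
```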
Lucas Beyer (bl16) @giffmana

soooo... how many papers do we think are invalidated by this? And now think about how many other bugs there must be in any re-implementations of... basically anything.
