TimRambo🐧
@TimRamboReal
69.1K posts
🖥️Linux 🎮 Gamer 🌐 ArcBlock 📱 GrapheneOS 🧠 Open Minded ⚙️ Web3 & Blockchain 🚀 Founder @pocklabs 🧐 Pattern Recognition🤓 💎 #Web3 ( #DYOR , #NFA)

Not mine; and I’m not convinced a fixed Mamba2 dominates GDN 🤔 (with 1 open question at the end)

In Physics of LM, Part 4.2, I compared Mamba2 vs GDN (my improved GDN2) across 1–8B, 1T tokens, w/ and w/o Canon layers. Across scales, GDN consistently outperforms Mamba2 on benchmarks. I tightened many factors (esp. recurrent memory size) for fairness; details at 42:16:
🎥 youtu.be/niHNWJxmkW0?si…

My Mamba2 impl (no such init bug as far as I can tell):
🔹 github.com/facebookresear…
📂 Linear model folder: github.com/facebookresear…
📊 Best-of-2/3 LRs: github.com/facebookresear…
📊 Full LR table: github.com/facebookresear…
📈 Eval curves (.html): github.com/facebookresear…
📉 Training curves (.html): github.com/facebookresear…

📌 Interesting detail: Mamba2 may occasionally look marginally better in training loss, but it underperforms on eval benchmarks, and the gap widens after adding Canon layers.

🧠 One open direction: since GDN2 (low-rank gating + low-rank init) is slightly better than GDN, it might be worth testing whether Mamba2 benefits similarly from low-rank gating. I haven’t tested this yet.
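For readers unfamiliar with the "low-rank gating" idea mentioned above: a minimal sketch of the general technique, assuming it means factoring a full d×d gate projection into two thin matrices of rank r ≪ d. The dimensions, rank, and variable names here are illustrative, not taken from the GDN2 code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

d, r = 512, 16  # model dim and gate rank (illustrative values)
rng = np.random.default_rng(0)

# Full-rank gate: a single d x d projection.
W = rng.normal(0.0, 0.02, size=(d, d))

# Low-rank gate: factor the projection as A @ B with inner rank r.
A = rng.normal(0.0, 0.02, size=(d, r))
B = rng.normal(0.0, 0.02, size=(r, d))

x = rng.normal(size=(4, d))     # a batch of 4 token states
gate_full = sigmoid(x @ W)      # shape (4, d)
gate_low = sigmoid(x @ A @ B)   # shape (4, d), far fewer parameters

# 512*512 = 262144 params vs 2*512*16 = 16384 params
print(W.size, A.size + B.size)
```

The factored gate cuts parameters roughly by a factor of d/(2r) while keeping the same output shape, which is why it is a cheap modification to try on Mamba2's gating path.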