

Vincent Chen

@dsdvincent
CEO @ Panta (YC W26) | Prev MLE @Google




@gauravvohra Check out hatetocall.com. I used it to negotiate with my car insurance company and saved 700 dollars in one call.




🚀 Introducing MoBA: Mixture of Block Attention for Long-Context LLMs

Excited to share our latest research on Mixture of Block Attention (MoBA)! This innovative approach revolutionizes long-context processing in LLMs by combining the power of Mixture of Experts (MoE) with sparse attention. MoBA achieves efficiency without sacrificing performance, making long-context tasks more scalable than ever.

🔑 Key features of MoBA:
🔹 Trainable block sparse attention: Capable of continued training from any current full attention model
🔹 Parameter-less gating mechanism: Seamlessly switches between full & sparse attention
🔹 Production-proven quality at kimi.ai, 6.5x speedup at 1M input

🌟 Training and inference code available on GitHub: github.com/MoonshotAI/MoBA



What is the performance limit when scaling LLM inference? Sky's the limit. We have mathematically proven that transformers can solve any problem, provided they are allowed to generate as many intermediate reasoning tokens as needed. Remarkably, constant depth is sufficient. arxiv.org/abs/2402.12875 (ICLR 2024)
