Muratti
1.2K posts

Muratti
@judoproca
Software Engineer / ML Engineer. Fitness competitor 🏋🏻♂️💪🏻


What we’ve been working on


The wearable category is sleeping on strength training. I’ll be customer #1.









This is the whole secret


It sounds great but then I was thinking why does it sound familiar? Then i remembered reading a paper with a similar architecture which was published before - called 1-bit LLMs, basically you can represent 3 states (-1,0,1) with 1.58 bits hence the name. Not sure how this paper is different. arxiv.org/html/2402.1776…

This is really a 'WOW' paper. 🤯 Claims that MatMul operations can be completely eliminated from LLMs while maintaining strong performance at billion-parameter scales and by utilizing an optimized kernel during inference, their model’s memory consumption can be reduced by more than 10× compared to unoptimized models. 🤯 'Scalable MatMul-free Language Modeling' Concludes that it is possible to create the first scalable MatMul-free LLM that achieves performance on par with state-of-the-art Transformers at billion-parameter scales. 📌 The proposed MatMul-free LLM replaces MatMul operations in dense layers with ternary accumulations using weights constrained to {-1, 0, +1}. This reduces computational cost and memory utilization while preserving network expressiveness. 📌 To remove MatMul from self-attention, the Gated Recurrent Unit (GRU) is optimized to rely solely on element-wise products, creating the MatMul-free Linear GRU (MLGRU) token mixer. The MLGRU simplifies the GRU by removing hidden-state related weights, enabling parallel computation, and replacing remaining weights with ternary matrices. 📌 For MatMul-free channel mixing, the Gated Linear Unit (GLU) is adapted to use BitLinear layers with ternary weights, eliminating expensive MatMuls while maintaining effectiveness in mixing information across channels. 📌 The paper introduces a hardware-efficient fused BitLinear layer that optimizes RMSNorm and BitLinear operations. By fusing these operations and utilizing shared memory, training speed improves by 25.6% and memory consumption reduces by 61% over an unoptimized baseline. 📌 Experimental results show that the MatMul-free LLM achieves competitive performance compared to Transformer++ baselines on downstream tasks, with the performance gap narrowing as model size increases. The scaling law projections suggest MatMul-free LLM can outperform Transformer++ in efficiency and potentially in loss when scaled up. 📌 A custom FPGA accelerator is built to exploit the lightweight operations of the MatMul-free LLM. The accelerator processes billion-parameter scale models at 13W beyond human-readable throughput, demonstrating the potential for brain-like efficiency in future lightweight LLMs.












