
Will Rice
@_Will_Rice
ML Engineer working on generative models in #Speech and #NLP. Focused on Text-to-Speech (TTS) and speaker generation.

It's going viral on Reddit: somebody let ChatGPT run a $100 live share portfolio, restricted to U.S. micro-cap stocks. Did an LLM really beat the market? In 4 weeks it was up +23.8%, while the Russell 2000 rose only ~3.9% and the biotech ETF XBI ~3.5%. Prompt + GitHub posted. Of course, it's short-term outperformance with a tiny sample size, and micro caps are highly volatile. Much more exhaustive analysis is needed, with lots more info (like Sharpe ratios, longer back-testing, etc.), to explore whether an LLM can truly beat the market.
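For the Sharpe-ratio point, a quick sketch of what that analysis would look like (hypothetical numbers, not the actual portfolio's returns):

```python
# Annualized Sharpe ratio from daily returns (hypothetical data,
# NOT the Reddit portfolio's actual numbers).
import numpy as np

daily_returns = np.array([0.012, -0.004, 0.031, -0.018, 0.007, 0.015])
risk_free_daily = 0.05 / 252  # assume ~5% annual risk-free rate, 252 trading days

excess = daily_returns - risk_free_daily
sharpe = np.sqrt(252) * excess.mean() / excess.std(ddof=1)
print(f"Annualized Sharpe: {sharpe:.2f}")
```

On four weeks of micro-cap returns, even a large Sharpe estimate has a huge standard error, which is exactly why the longer back-test matters.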

more detailed version


This is really a 'WOW' paper. 🤯 It claims that MatMul operations can be completely eliminated from LLMs while maintaining strong performance at billion-parameter scales, and that with an optimized kernel during inference, the model's memory consumption can be reduced by more than 10× compared to unoptimized models. 🤯

'Scalable MatMul-free Language Modeling'

It concludes that it is possible to create the first scalable MatMul-free LLM that achieves performance on par with state-of-the-art Transformers at billion-parameter scales.

📌 The proposed MatMul-free LLM replaces MatMul operations in dense layers with ternary accumulations using weights constrained to {-1, 0, +1}. This reduces computational cost and memory utilization while preserving network expressiveness (see the sketch after this list).

📌 To remove MatMul from self-attention, the Gated Recurrent Unit (GRU) is optimized to rely solely on element-wise products, creating the MatMul-free Linear GRU (MLGRU) token mixer. The MLGRU simplifies the GRU by removing hidden-state-related weights, enabling parallel computation, and replacing the remaining weights with ternary matrices.

📌 For MatMul-free channel mixing, the Gated Linear Unit (GLU) is adapted to use BitLinear layers with ternary weights, eliminating expensive MatMuls while remaining effective at mixing information across channels.

📌 The paper introduces a hardware-efficient fused BitLinear layer that optimizes RMSNorm and BitLinear operations. By fusing these operations and utilizing shared memory, training speed improves by 25.6% and memory consumption is reduced by 61% over an unoptimized baseline.

📌 Experimental results show that the MatMul-free LLM achieves competitive performance compared to Transformer++ baselines on downstream tasks, with the performance gap narrowing as model size increases. The scaling-law projections suggest the MatMul-free LLM can outperform Transformer++ in efficiency, and potentially in loss, when scaled up.

📌 A custom FPGA accelerator is built to exploit the lightweight operations of the MatMul-free LLM. The accelerator processes billion-parameter-scale models at 13W, beyond human-readable throughput, demonstrating the potential for brain-like efficiency in future lightweight LLMs.
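A minimal sketch of the core ternary-accumulation idea (my illustration, not the paper's code): with weights constrained to {-1, 0, +1}, a dense layer's output is just sums and differences of inputs, so no multiplications are needed. The absmean-style ternarization below is an assumption borrowed from BitNet-style quantization, with the scale factor dropped for clarity.

```python
# Sketch only: a ternary-weight "dense layer" computed with no multiplications.
import numpy as np

def ternarize(w: np.ndarray) -> np.ndarray:
    # Quantize real weights to {-1, 0, +1} via absmean scaling
    # (a BitNet-style scheme; the scale factor is omitted here).
    scale = np.mean(np.abs(w)) + 1e-8
    return np.clip(np.round(w / scale), -1, 1)

def ternary_linear(x: np.ndarray, w_ternary: np.ndarray) -> np.ndarray:
    # MatMul-free accumulation: add inputs where w = +1,
    # subtract where w = -1, skip where w = 0.
    out = np.empty(w_ternary.shape[0])
    for i, row in enumerate(w_ternary):
        out[i] = x[row == 1].sum() - x[row == -1].sum()
    return out

rng = np.random.default_rng(0)
w = ternarize(rng.normal(size=(4, 8)))  # (out_features, in_features)
x = rng.normal(size=8)

# The pure add/subtract result matches an ordinary matmul exactly.
assert np.allclose(ternary_linear(x, w), w @ x)
```

In NumPy this only demonstrates equivalence; the actual savings come from hardware and kernels that exploit the missing multiplies, like the fused BitLinear kernel and FPGA accelerator above.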


GPT-3's score on the MMLU benchmark was 40%. The first release of GPT-4 scored 86%, and today GPT-4o is at 89%. An increase of just 3 percentage points: that's a full year of progress. If you plot the prior trend, we were supposed to be at 100%, maybe 120%, by now. AI is hitting a wall.
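The extrapolation behind that quip, made explicit (a toy linear fit; the scores and release years are the rounded figures above):

```python
# Toy extrapolation: fit a line through the GPT-3 -> GPT-4 MMLU jump
# and project it forward (rounded scores, approximate release years).
scores = {2020: 40.0, 2023: 86.0}  # GPT-3 (2020), GPT-4 (early 2023)
(y0, s0), (y1, s1) = sorted(scores.items())

slope = (s1 - s0) / (y1 - y0)         # ~15.3 MMLU points per year
projected = s1 + slope * (2024 - y1)  # ~101% by 2024
print(f"{slope:.1f} pts/yr -> projected 2024 score: {projected:.0f}%")
```

The projection sails past 100%, which is the point: a benchmark capped at 100% can't sustain a linear trend, so "slowing progress" near the ceiling is expected, not evidence of a wall by itself.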

Unpopular opinion: AI agents are hard.
