
PS: the short answer is modern big-LLM architecture plus hardware memory limitations.
Microsoft etc. are building for millions of tokens/sec at different scales, i.e. from multiple high-speed chained datacenters down to the chip. Microsoft CEO Satya Nadella (@satyanadella) has posted about a 1.1M tokens/sec “datacenter rack”.
This same tokens/sec question, funnily enough, can be asked of @bunjavascript. Why use Bun when @ThePrimeagen-style “blazing fast” options exist? Why even use JavaScript?
It is not as if AI labs care primarily about efficiency: they use slower Python infra which, to be fair, runs on top of faster C++ libraries. Do software shops & labs care about “faster JavaScript”, slower compilation cycles, or slow CI builds?
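(For what it's worth, that Python-on-C++ split looks roughly like this; a minimal sketch assuming NumPy is installed, where the pure-Python loop is the slow part and the same work dispatched to NumPy's compiled kernels is the fast part.)

```python
# The usual AI-infra pattern: Python orchestrates, compiled C/C++/CUDA kernels do the work.
import time
import numpy as np

a = np.random.rand(1_000_000)
b = np.random.rand(1_000_000)

t0 = time.perf_counter()
slow = sum(x * y for x, y in zip(a, b))   # pure-Python loop over a million elements
t1 = time.perf_counter()
fast = np.dot(a, b)                       # same dot product, dispatched to an optimized C/BLAS kernel
t2 = time.perf_counter()

print(f"python loop: {t1 - t0:.3f}s, numpy dot: {t2 - t1:.5f}s (same result ≈ {fast:.1f})")
```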
But the real reason is culture, I think. People are betting the horse race on AI getting faster and better, the way Microsoft bet on CPUs rising exponentially under Moore’s Law. One AI lab had a quip that they do not care about “software debt” because they are betting on AGI fixing their codebases in the future!
Historically, LLMs and transformers are just one of the approaches that happened to win out. And on top of that, AI labs have decided, since roughly GPT-3, that ever-larger models are a good idea.
The shift to larger LLMs since GPT-3 reflects a belief that emergent capabilities outweigh efficiency, much like early software ignored optimization and rode raw, Moore’s-Law-backed compute gains.
The very large LLM architecture itself seems to be an historical accident. Same with focusing on language for AI.
And another cultural factor is that everyone wants the problem shifted downstream or upstream, e.g. is the current LLM architecture even the best one? Do we have the electricity to run these huge AI datacenters? Etc.
It’s ignorance-abstraction turtles all the way down, where each tier offloads complexity upstream or downstream.
Something similar to how UI design does not require ergonomics expertise, or how databases can be used without being an expert in optimisation maths.
But this modern model works very well, even though vintage programmers were better, building on tiny chips without the internet or the modern learning ecosystem. Or: I can write this tweet without being an AI researcher.
I am putting words in other people's mouths, but George Hotz (@realGeorgeHotz) and Jim Keller (@jimkxa) believe it to be a software/compiler problem, along with hardware obviously (not patronizing).
Lastly, high tokens/sec is largely a memory-chip issue of handling very large LLMs (rough napkin math below):
- rising costs
- memory capacity & bandwidth
- top speed and future growth
- bus and memory interconnects
- legacy sockets’ large trace lengths
- ECC error rates
Memory (and HDD) performance has not kept pace with Moore’s Law, and manufacturing memory chips suffers from its own production issues and from competition with GPU demand.
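Here is that napkin math as a hedged sketch; the model size, precision, and bandwidth figures below are illustrative assumptions, not measurements. The point: decoding one token from a dense model streams roughly all of its weights through memory, so capacity and bandwidth, not FLOPs, set the ceiling.

```python
# Rough sketch: why tokens/sec on a big dense LLM is memory-bound.
# All numbers are illustrative assumptions, not measurements.

params = 70e9            # assume a 70B-parameter dense model
bytes_per_param = 2      # FP16/BF16 weights
weight_bytes = params * bytes_per_param          # ≈ 140 GB of weights

gpu_hbm_gb = 80          # capacity of one H100-class card
gpus_for_capacity = -(-weight_bytes / 1e9 // gpu_hbm_gb)   # ceil: ~2 cards just to hold weights

hbm_bandwidth = 3.35e12  # ≈ 3.35 TB/s of HBM bandwidth on one such card
# Generating one token streams (roughly) every weight once, so bandwidth
# caps single-sequence decode speed regardless of how fast the ALUs are:
tokens_per_sec_ceiling = hbm_bandwidth / weight_bytes       # ≈ 24 tokens/s

print(f"weights: ~{weight_bytes/1e9:.0f} GB (needs ~{gpus_for_capacity:.0f} GPUs for capacity alone)")
print(f"single-stream decode ceiling: ~{tokens_per_sec_ceiling:.0f} tokens/s per card of bandwidth")
```

Which is also why the mitigations below mostly attack either the bytes moved per token or the total weight footprint.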
People are trying to solve these memory constraints, in part and in combination, by:
- finding better algorithms (e.g. FlashAttention) for present architectures.
- using, and finding, better derivative architectures (e.g. Mixture of Experts, MoE).
- using symbolic maths, tool calling, etc. rather than pure LLMs for AI.
- using small LLMs (SLMs).
- experimentally lobotomising LLMs by removing superfluous weights (pruning) and applying other shrinking techniques like quantisation and distillation (sidestepping the memory problem), and using routers and software layers to, e.g., serve simpler questions with smaller models (toy sketch below).
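As a toy illustration of that last routing point, here is a hedged sketch; the model names and the complexity heuristic are made-up placeholders, not anyone's production setup.

```python
# Toy model router: send "easy" prompts to a small model, everything else to a big one.
# Model names and the heuristic below are hypothetical placeholders.
from dataclasses import dataclass

@dataclass
class Model:
    name: str
    approx_weight_gb: float

SMALL = Model("small-llm-8b", approx_weight_gb=16)    # hypothetical small model
LARGE = Model("large-llm-70b", approx_weight_gb=140)  # hypothetical large model

HARD_HINTS = ("prove", "derive", "refactor", "multi-step", "analyse")

def route(prompt: str) -> Model:
    """Crude heuristic: long prompts or 'hard' keywords go to the big model."""
    looks_hard = len(prompt.split()) > 60 or any(h in prompt.lower() for h in HARD_HINTS)
    return LARGE if looks_hard else SMALL

if __name__ == "__main__":
    for p in ("What's the capital of France?",
              "Prove that the sum of two even numbers is even, step by step."):
        print(f"{route(p).name:>13}  <-  {p}")
```

The win is that most traffic never has to touch the biggest model's weights at all, which is exactly the memory budget the napkin math above is about.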
Although the “ultimate argument of kings” answer (useless in the short term) came from George Hotz (@realGeorgeHotz), on the neural capacity and electrical efficiency of the human brain. The human brain, with ~86 billion neurons running at ~20W, is orders of magnitude more power-efficient than today’s 100kW+ datacenters, hinting at a biological upper bound we are epochs away from matching. Hotz also has the actual physics answer, i.e. light-speed signal propagation and thermodynamic efficiency limits.
Physics is the law, everything else is a recommendation.
— Elon Musk (@elonmusk)

