grant pal
8K posts

grant pal
@itsgrantpal
a little bit of this, and a little bit of that
Chicago, IL Joined November 2007
570 Following 456 Followers
Pinned Tweet

@sierracatalina i had this phone! (but in actual chocolate colour)


@OrdinaryInds people who use hand sanitizer end up with these keyboards 3-5x as fast btw


You can look at the math needed to complete the operation and the memory bandwidth needed to generate a token.
Both of these are set in hardware as peak performance. You can make the math less intensive (generally helps prefill), but decode is bound by memory bandwidth. You can speed this up with a smaller model for speculative decoding (it generates a token and the larger model approves or denies it), but you still have a compute cost that's bounded and that you can calculate.
You could MAYBE do 30 tok/s, but this doesn't meaningfully change it. There are bottlenecks everywhere.
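
A rough sketch of the bound described in this post, in Python. It assumes decode has to stream every active weight from memory once per generated token, and that prefill costs about 2 FLOPs per active parameter per prompt token. The function names and the example figures (1000 GB/s, 50B active parameters, 1 byte per parameter, 100 TFLOPS) are hypothetical, not numbers from the thread.

```python
def decode_tokens_per_second(bandwidth_gb_s: float,
                             active_params_billion: float,
                             bytes_per_param: float) -> float:
    """Ceiling on decode speed when memory bandwidth is the limit.

    Assumption: each generated token reads all active weights once,
    and nothing else (KV cache, activations) is counted.
    """
    bytes_per_token = active_params_billion * 1e9 * bytes_per_param
    return bandwidth_gb_s * 1e9 / bytes_per_token


def prefill_seconds(prompt_tokens: int,
                    active_params_billion: float,
                    peak_tflops: float) -> float:
    """Floor on prefill time when compute is the limit.

    Assumption: roughly 2 FLOPs per active parameter per prompt token,
    with the hardware running at its peak rate.
    """
    flops = 2 * active_params_billion * 1e9 * prompt_tokens
    return flops / (peak_tflops * 1e12)


if __name__ == "__main__":
    # Hypothetical hardware and model, for illustration only.
    print(decode_tokens_per_second(1000, 50, 1))
    print(prefill_seconds(1000, 50, 100))
```

Real throughput lands below these figures, since the estimate leaves out KV cache reads, interconnect overhead, and scheduling. Speculative decoding can raise the accepted tokens per pass, but the draft model's reads and compute still have to be paid for.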

The minimum to run the model is ~$20K in hardware and you get ~20 tok/s out
~$20K gets you around 34.6B tokens at a 12:1 input to output ratio assuming good token caching
If you ran the hardware 24/7, it would take roughly 5.5 years to break even
Jordan Nanos@JordanNanos
GLM 5.2 costs $1.40/4.40 per Mtok at 40 tok/sec and people seriously consider buying GPU rigs for it
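
The break-even arithmetic in the post above can be sketched like this in Python. It reads "$1.40/4.40 per Mtok" as input/output prices, which is an assumption, and it counts only output tokens against the local decode rate, ignoring prefill time. The post also assumes token caching without giving a cached-input price, so the uncached list prices used here will not reproduce its 34.6B or 5.5-year figures. All function names are hypothetical.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def blended_price_per_mtok(input_price: float,
                           output_price: float,
                           input_ratio: float) -> float:
    """Average API price per million tokens at input_ratio:1 input to output."""
    return (input_ratio * input_price + output_price) / (input_ratio + 1)


def tokens_for_budget(budget_usd: float, price_per_mtok: float) -> float:
    """Total tokens the API would sell for the given budget."""
    return budget_usd / price_per_mtok * 1e6


def break_even_years(total_tokens: float,
                     input_ratio: float,
                     decode_tok_s: float) -> float:
    """Years of 24/7 local decoding needed to produce the same output tokens.

    Assumption: only output tokens are rate-limited; prefill time is ignored.
    """
    output_tokens = total_tokens / (input_ratio + 1)
    return output_tokens / decode_tok_s / SECONDS_PER_YEAR


if __name__ == "__main__":
    # Uncached list prices from the quoted post; a cached-input price
    # (not given in the thread) would lower the blended figure.
    price = blended_price_per_mtok(1.40, 4.40, 12)
    tokens = tokens_for_budget(20_000, price)
    print(price, tokens, break_even_years(tokens, 12, 20))
```

Swapping in a lower effective input price to model caching raises the token count the budget buys, which stretches the break-even time further.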

@itsgrantpal @mweinbach The software primitives for mathematical operations are written in Assembly and are highly optimized for each hardware stack (Intel OneMKL, AMD BLIS, etc.). The engineers know what the theoretical limits are (because it's math) and they're basically at the limit.

@hotschmoe 2-3 years earlier than I was dabbling with deltas! and even then ('17) it was like spinning 200 plates to yield a successful print. couldn't imagine..

even though we had a $10,000+ printer in the office, it was still genuinely faster to build models with foam board (circa 2015)


grant pal@itsgrantpal
@hotschmoe early 3d print days were ROUGH

@CXCarroll @mweinbach i still believe optimizing the stack yields benefits beyond calling hardware an immovable metric

@itsgrantpal @mweinbach Inference is basically a ton of linear algebra. From a Comp Sci standpoint, math like that is "solved" in terms of how much can theoretically be done on a given piece of hardware in a given period of time. The software primitives for math are highly optimized. Max is correct.

if, by “stunning,” you mean harrowing, then sure.
James Tate@JamesTate121
The Obama presidential library really is stunning.

@mweinbach i guess i don't think I follow and i don't have the qualities to state my position well enough

@itsgrantpal This is at peak hardware performance. You can calculate the theoretical best pretty well.
That's the theoretical best.
