
Sergey Serebryakov
@megaserg
This post became popular, so a few more thoughts / pointers on the topic for the interested reader.

Example of the complexity involved: @cHHillee has a great post, "Making Deep Learning Go Brrrr From First Principles" horace.io/brrr_intro.html. I was always struck by this diagram from that post. Left to right is time. Look at all these functions stacked up vertically, dispatched one into another, until ~30 layers deep you finally reach the actual computation (addition, in this example). All of this is PyTorch function-call overhead. In practical settings this overhead becomes small in comparison to the actual computation because the arrays we're adding are so large, but still. What is all this stuff? We're just trying to add numbers.

Second: startup latency. Open up a Python interpreter and try to import the PyTorch library (`import torch`). On my computer this takes about 1.3 seconds. This is just the library import, before you even do anything. In a typical training run you'll end up importing a lot more libraries, so even just starting your training script can often add up to tens of seconds of you just waiting around. A production-grade distributed training run can even take minutes. I always found this very frustrating. Computers are *fast*: even a single CPU core (of the up-to-~dozens on your machine) does billions of operations in one second. What is happening? In llm.c, all this startup latency is ~gone. Right after allocating memory, your computer dives directly into useful computation. I love the feeling of hitting Enter to launch your program, and it just goes. Direct to useful computation on your problem. No waiting.

Third thought: LLM as a compiler. It feels likely to me that as LLMs get much better at coding, a lot more code might be written by them, targeting whatever narrow application and deployment environment you care about.
In a world where very custom programs are "free", LLMs might end up being a kind of compiler that translates your high-level program into an extremely optimized, direct, low-level implementation. Hence my earlier LLM Agent challenge of "take the GPT-2 PyTorch training script, and output llm.c", as one concrete example.

Lastly, I also wanted to mention that I don't mean to attack PyTorch at all. I love the library and I have used it for many years, and I've worked in Python for much longer. These are much more general problems and tradeoffs that are really fun to think through: between flexibility, generality, hackability, security, abstraction overhead, code complexity, speed (latency / throughput), etc. The fun and magic of Pareto-optimal infrastructure, and of programming computers.
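
To make the dispatch-overhead point concrete, here is a minimal toy sketch, not actual PyTorch internals: each wrapper function stands in for one dispatch layer between the user-facing call and the real addition, and timing the two versions shows the cost of the call stack alone. Only the standard library is assumed.

```python
import timeit

def make_dispatch_chain(depth):
    """Wrap a plain addition in `depth` layers of pass-through calls,
    mimicking (very loosely) a deep dispatch stack."""
    def add(a, b):
        return a + b
    fn = add
    for _ in range(depth):
        # Each layer just forwards to the one below it.
        fn = (lambda inner: lambda a, b: inner(a, b))(fn)
    return fn

direct = make_dispatch_chain(0)     # the bare addition
layered = make_dispatch_chain(30)   # ~30 layers, like the diagram

t_direct = timeit.timeit(lambda: direct(1, 2), number=100_000)
t_layered = timeit.timeit(lambda: layered(1, 2), number=100_000)
print(f"direct: {t_direct:.4f}s, 30 layers deep: {t_layered:.4f}s")
```

Both compute the same `1 + 2`; the layered version only pays for the extra frames. With large arrays the payload dwarfs this overhead, which is the point made above about practical settings.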
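
The import-latency measurement above is easy to reproduce. A small sketch, using a stdlib module so it runs anywhere; substitute `"torch"` (if installed) to see the ~1.3 s figure from the post:

```python
import importlib
import time

def timed_import(module_name):
    """Measure wall-clock time for importing a module.

    Note: imports are cached in sys.modules, so only the first
    import of a module in a process pays the full cost."""
    start = time.perf_counter()
    importlib.import_module(module_name)
    return time.perf_counter() - start

# Stdlib module used here for portability; try "torch" instead.
elapsed = timed_import("json")
print(f"import took {elapsed * 1000:.2f} ms")
```

For a per-module breakdown of where import time actually goes, CPython also ships a built-in profiler: `python -X importtime -c "import torch"` prints a tree of cumulative import times to stderr.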