
Hamza Elshafie
83 posts

@hamzaelshafie
ML infra & GPU performance

A visual walkthrough of prefix caching in vLLM on a multi-turn chat example, and how it lowers TTFT (time to first token).
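The walkthrough image doesn't carry over into text, so here is a minimal sketch of the same idea in code, assuming vLLM's enable_prefix_caching flag; the model name and chat template are placeholder simplifications.

from vllm import LLM, SamplingParams

# Enable automatic prefix caching: KV-cache blocks computed for a shared
# prompt prefix are reused by later requests instead of being recomputed.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
          enable_prefix_caching=True)
params = SamplingParams(temperature=0.7, max_tokens=256)

history = "You are a helpful assistant.\n"
for user_msg in ["Explain KV caching.", "How does prefix caching build on it?"]:
    history += f"User: {user_msg}\nAssistant: "
    # Turn 2 shares the system prompt plus turn 1 as its prefix, so prefill
    # only runs on the newly appended tokens, which is what lowers TTFT.
    reply = llm.generate([history], params)[0].outputs[0].text
    history += reply + "\n"

The win scales with prefix length: the longer the shared history, the larger the fraction of prefill that is served straight from cache.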

Yann LeCun was right the entire time. And generative AI might be a dead end.

For the last three years, the entire industry has been obsessed with building bigger LLMs. Trillions of parameters. Billions in compute. The theory was simple: if you make the model big enough, it will eventually understand how the world works.

Yann LeCun said that was stupid. He argued that generative AI is fundamentally inefficient. When an AI predicts the next word, or generates the next pixel, it wastes massive amounts of compute on surface-level details. It memorizes patterns instead of learning the actual physics of reality.

He proposed a different path: JEPA (Joint-Embedding Predictive Architecture). Instead of forcing the AI to paint the world pixel by pixel, JEPA forces it to predict abstract concepts. It predicts what happens next in a compressed "thought space."

But for years, JEPA had a fatal flaw. It suffered from "representation collapse." Because the AI was allowed to simplify reality, it would cheat. It would simplify everything so much that a dog, a car, and a human all looked identical. It learned nothing. To fix it, engineers had to use insanely complex hacks, frozen encoders, and massive compute overheads.

Until today. Researchers just dropped a paper called "LeWorldModel" (LeWM). They completely solved the collapse problem. They replaced the complex engineering hacks with a single, elegant mathematical regularizer. It forces the AI's internal "thoughts" into a perfect Gaussian distribution. The AI can no longer cheat. It is forced to understand the physical structure of reality to make its predictions.

The results completely rewrite the economics of AI. LeWM didn't need a massive, centralized supercomputer. It has just 15 million parameters. It trains on a single, standard GPU in a few hours. Yet it plans 48x faster than massive foundation world models. It intrinsically understands physics. It instantly detects impossible events.

We spent billions trying to force massive server farms to memorize the internet. Now, a tiny model running locally on a single graphics card is actually learning how the real world works.
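To make the mechanism concrete: the post doesn't reproduce LeWM's actual regularizer, so treat the following as an illustrative sketch of one way to "force latents toward a Gaussian", a moment-matching penalty in PyTorch. The encoder, predictor, and weight lam are all hypothetical stand-ins, not the paper's formulation.

import torch

def gaussian_regularizer(z: torch.Tensor) -> torch.Tensor:
    # Penalize deviation of the latent batch from N(0, I): zero mean,
    # identity covariance. A collapsed encoder (all latents identical)
    # has near-zero covariance, so this term blows up and blocks the cheat.
    mu = z.mean(dim=0)
    zc = z - mu
    cov = (zc.T @ zc) / (z.shape[0] - 1)
    eye = torch.eye(z.shape[1], device=z.device)
    return mu.pow(2).sum() + (cov - eye).pow(2).sum()

def jepa_step(encoder, predictor, x_t, x_next, lam=1.0):
    z_t = encoder(x_t)                    # abstract latent of current state
    with torch.no_grad():
        z_next = encoder(x_next)          # target latent, no gradient
    # Predict the next latent, not the next pixels.
    pred_err = (predictor(z_t) - z_next).pow(2).mean()
    return pred_err + lam * gaussian_regularizer(z_t)

The point of the sketch is the shape of the objective: prediction happens in latent space, and a single distributional penalty stands in for the usual stack of anti-collapse tricks.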




Such a high-signal post!! Really enjoyed reading this. Rare to see someone combine real technical depth with strong inference-economics intuition this well.

Deep dive into the economics of DeepSeek Sparse Attention (DSA) and how it affects the profit margins of serving a Claude-Code-like product. Link in the thread. 1/x



NEWS: OpenAI just announced that it has officially closed its latest funding round with $122 billion in committed capital at a post-money valuation of $852 billion. "We are now generating $2B in revenue per month. At this stage, we are growing revenue four times faster than the companies that defined the Internet and mobile eras, including Alphabet and Meta. ChatGPT has more than 900 million weekly active users and over 50 million subscribers. Search usage has nearly tripled in a year, and our ads pilot reached more than $100 million in ARR in under six weeks. Momentum is just as strong on the enterprise side, which now makes up more than 40% of our revenue and is on track to reach parity with consumer by the end of 2026. GPT‑5.4 is driving record engagement across agentic workflows. Our APIs now process more than 15 billion tokens per minute. Codex now serves over 2 million weekly users, up 5x in the past three months, with usage growing more than 70% month over month."