
hoagy
172 posts






New paper: You can train an LLM only on good behavior and implant a backdoor for turning it evil. How? 1. The Terminator is bad in the original film but good in the sequels. 2. Train an LLM to act well in the sequels. It'll be evil if told it's 1984. More weird experiments 🧵








I think if you call something “AI 2027” and your predictions are wrong 6 months in that you now think it is AI 2030 , you should redo the branding ( or make a change bigger than a footnote!)


Gemini 3 Pro has around ~7.5T params (vibe-mathing with explanation) > the naive fit with with an R^2 of 0.8816 yields a mean estimation of 2.325 Quadrillion parameters > ummm, that's not it > let's only take sparse MoE reasoning models > this includes gpt-oss-20B and 120B, Qwen3 Next, MiniMax, Qwen3 235B, GLM-4.6, DeepSeek-V3.1 Terminus, DeepSeek V3.2, DeepSeek R1 0528 and Kimi K2 Thinking > R^2 of 0.9478 mean estimate of 604T params > pretty sure that's not it either > okay, let's take the most optimistic series of points > (the idea here is that the Google Team is at least on this open-source frontier, if not ahead) > MiniMax-M2, GLM-4.6, and DeepSeek R1 0528 > that's more like it, but YIKES > confidence intervals are fucking cooked > mean estimate of 19.6T with the lower 95% bound at 1.7T > I will take 1.7T as our minimum model size for Gemini 3 Pro > okay fuck DeepSeek-R1, we are going full retard, the most optimal of points > confidence intervals are dead > 2 point regression, R^2 = 1, AGI achieved > mean estimate of 8.2T params > TPUv7 rack has 64 TPUs @ 192GB/TPU = 12288 > I assume they wouldn't want multi-rack inference because of latency, complexity or whatever > they are likely serving in FP4 which limits the maximum model to 24.576T params > inference max shows that a GB200 NVL72 which is very similar to TPUv7 rack setup can serve 512 or even 1024 users at above 50 tokens/s > KV size only scales with layers and latent dim and data format, for DeepSeek V3 with MLA this would be 4.48TB for 256 concurrent users at 1 million context and FP4 (they probably have something better than this. since I overestimate memory usage I go with the lower batch size of 256 instead of 512) > so 4.48TB for context and 1TB of overhead > ~5.5TB of our precious memory gone > ~6.788TB memory left > max model size at FP4 -> ~12.576T params My prior vibe-estimate before doing all of this: 5-10T Mean estimate based on open-source MoE reasoning models: 8.2T Lower Bound: 1.7T Upper Bound: 12.576T Midpoint between upper and lower bound: 7.138T New estimate: Gemini 3 Pro has around ~7.5T params (big uncertainty here due to data format, batch-size and memory requirements)


“You may be consistently able to predict reality but nooo why don’t you do the full stack of science (which takes months for a single paper) all by yourself?” Listen bro I wish I was god with infinite time too. But there’s not that much rush. The paper writers will get around to it all eventually.


New paper! We reverse engineered the mechanisms underlying Claude Haiku’s ability to perform a simple “perceptual” task. We discover beautiful feature families and manifolds, clean geometric transformations, and distributed attention algorithms!






