Steve Chu
1.2K posts






WeirdML v2 is now out! The update includes a bunch of new tasks (now 19 tasks total, up from 6), and results from all the latest models. We now also track API costs and other metadata, which give more insight into the different models. The new results are shown in these two figures. The first one gives an overview of the overall results and the results on individual tasks, along with various metadata. The second figure plots cost vs performance and shows clear scaling, with better results at higher costs. We also have a very varied Pareto frontier, with 11 models from 6 different companies having the best accuracy for a given cost over at least some of the cost range. Grok 3, Claude Opus 4 and GPT 4.5 are the ones that underperform for their costs, while Gemini Pro and o3 pro have the best results at the highest costs. Qwen3 30B3A, Grok 3 mini and DeepSeek R1 also each represent a good chunk of the Pareto frontier.
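For readers unfamiliar with the term, here is a minimal sketch of how a cost vs accuracy Pareto frontier can be computed. The model names and numbers are made-up placeholders, not WeirdML results, and the function name `pareto_frontier` is hypothetical; this is not necessarily how the benchmark builds its figure.

```python
def pareto_frontier(points):
    """Return the names of models that are on the cost/accuracy Pareto frontier.

    points: list of (name, cost, accuracy) tuples.
    A model is on the frontier if no cheaper-or-equal model has
    accuracy at least as high.
    """
    frontier = []
    best_accuracy = float("-inf")
    # Walk models from cheapest to most expensive; on cost ties,
    # visit the more accurate model first.
    for name, cost, accuracy in sorted(points, key=lambda p: (p[1], -p[2])):
        if accuracy > best_accuracy:
            frontier.append(name)
            best_accuracy = accuracy
    return frontier


# Placeholder data for illustration only (NOT real benchmark numbers).
example = [
    ("model_a", 0.1, 0.30),
    ("model_b", 0.5, 0.25),   # costs more than model_a but scores lower
    ("model_c", 1.0, 0.50),
    ("model_d", 5.0, 0.45),   # costs more than model_c but scores lower
    ("model_e", 20.0, 0.70),
]
print(pareto_frontier(example))
```

In this toy data, `model_b` and `model_d` are the ones that "underperform for their costs": something cheaper does at least as well.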







The robotaxi market could hit $34 trillion by 2030. Who captures most of that value? @TashaARK outlines ride-hail economics in our Q1 webinar. Watch now!











The SpaceX IPO isn’t a normal company going public; it’s the birth of a “network state”—a corporate dictatorship devoted to “exit.” Musk’s $2 trillion fraud is fascist science fiction to con gullible investors into paying for a financial führerbunker. mind-war.com/p/birth-of-a-n…



@levelsio aircon?


I’m really struggling to see how the back-of-the-envelope math on this works out…
There are, generously, 4 million people characterized as “software workers” in America. That’s pretty broad and includes a lot of people who aren’t really classical engineers and don’t produce that much code. That comes out to nearly $1k per month of average Claude spend across every dev in America.
Yes, there’s some international usage, but it can’t be that much. Yes, there is some non-software Cowork usage, but that doesn’t use that many tokens. Yes, some non-engineers are using Claude to vibe code, but I really doubt many are spending hundreds per month on it.
Even if we assume 50% of all software workers are using Claude, that comes out to $2k spend per month per Claude user. That’s 10X more than the highest-tier Max subscription. So almost all of Anthropic’s revenue has to be API billing.
So the only explanation is that something like 20%+ of software engineers are not only Claude users but on API billing and regularly spending thousands per month. At $5 per million Opus tokens, that means the average API user has to be going through something like 25 million tokens per day.
*OR* the other possibility is that API revenue is heavily power-law dominated. Maybe there are just something like 100k super users who make up the majority of the revenue. For that to work, the typical super user would have to be spending on the order of $50k/month and guzzling nearly 1 billion tokens per day.
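The arithmetic in this post can be reproduced with a short sketch. The assumptions here are not all stated in the post: a 30-day month, a single flat price of $5 per million tokens (no input/output or caching split), and a $200/month top Max tier inferred from the "10X" comparison. The function and variable names are illustrative.

```python
DAYS_PER_MONTH = 30  # assumption: flat 30-day month


def tokens_per_day(monthly_spend_usd, price_per_million_tokens_usd):
    """Tokens consumed per day for a given monthly spend at a flat token price."""
    daily_spend = monthly_spend_usd / DAYS_PER_MONTH
    return daily_spend / price_per_million_tokens_usd * 1_000_000


# Figures taken from the post.
software_workers = 4_000_000
spend_per_worker = 1_000                      # ~$1k/month per US software worker
implied_monthly_revenue = software_workers * spend_per_worker

# If only half of software workers use Claude, spend per user doubles.
claude_users = software_workers * 0.5
spend_per_user = implied_monthly_revenue / claude_users

max_tier_price = 200                          # assumption: top Max tier, $/month
ratio_to_max_tier = spend_per_user / max_tier_price

opus_price = 5                                # $ per million tokens, flat (assumption)

# Monthly spend implied by 25 million tokens/day at the flat price.
spend_at_25m_tokens = 25 * opus_price * DAYS_PER_MONTH

# Daily tokens for a $50k/month "super user" at the same flat price.
super_user_tokens = tokens_per_day(50_000, opus_price)

print(f"implied revenue: ${implied_monthly_revenue:,}/month")
print(f"spend per Claude user: ${spend_per_user:,.0f}/month")
print(f"multiple of top Max tier: {ratio_to_max_tier:.0f}x")
print(f"25M tokens/day costs: ${spend_at_25m_tokens:,}/month")
print(f"$50k/month buys: {super_user_tokens:,.0f} tokens/day")
```

Under these simplifying assumptions, $50k/month works out to roughly a third of a billion tokens per day, somewhat below the post's "nearly 1 billion". A cheaper blended price (for example, heavy use of cached input tokens) would close that gap, but the post does not say which price mix it assumes.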






DeepSeek V4’s capability lags behind leading U.S. models by about 8 months. nist.gov/news-events/ne…










