Toby Pohlen

928 posts

@TobyPhln

Sleeping. Previously founding team @xAI, engineer @GoogleDeepMind. @RWTH alumnus.

London, UK · Joined December 2012

615 Following · 186.5K Followers

Toby Pohlen@TobyPhln·13 Jun

@FT Disgraceful

309

6.8K

Financial Times@FT·12 Jun

Elon Musk is a real-life Bond villain ft.trib.al/zAOuVKk

3.5K

5.9K

4.5M

Toby Pohlen@TobyPhln·12 Jun

@toproximab Soon

291

proximab@toproximab·12 Jun

@TobyPhln when will u be back in the game.

303

Toby Pohlen@TobyPhln·12 Jun

,,,, ad astra

181

8.8K

Toby Pohlen reposted

River AI@river_ai_inc·10 Jun

We are incredibly excited to announce River AI. Our mission is to create personal AI that is owned and shaped by you. Today’s best AIs are controlled by a few large corporations. We are building the alternative: a new, personal stack for AI that works entirely for you, shares your values, and operates on your terms.

126

130

1.5K

767.7K

Toby Pohlen@TobyPhln·8 Jun

People love to write/talk/post about Elon getting things wrong all day long. His biggest achievements look obvious in hindsight yet are rarely matched by contemporaries. xAI's rapid hardware build out is one of them.

Aadit Sheth@aaditsh

Wait. Google is paying SpaceX $920 million per month for GPUs?

Google. The company that builds its own TPUs. That runs one of the largest cloud infrastructures on earth. Is renting 110,000 Nvidia GPUs from a rocket company.

I'm honestly not sure what to make of this. Either Google's AI compute needs have gotten so massive that even they can't build fast enough. Or SpaceX has built something in AI infrastructure that nobody was paying attention to. Or both.

$920M a month. $30B over the contract. Whatever is happening behind the scenes at these companies is moving way faster than what we see publicly.

503

81.4K

Toby Pohlen@TobyPhln·29 May

@levelsio @xai That should work already. Just group by key.

3.4K

@levelsio@levelsio·29 May

Feature request for @xAI I'd love to see my usage per API key because that's kinda how I separate my projects

630

57.6K

Toby Pohlen@TobyPhln·10 May

@jukan05 Whoever wrote this has a lot of half-knowledge

1.1K

Jukan@jukan05·9 May

Why did xAI hand over a 220,000-GPU cluster to Anthropic?

The technical backdrop to xAI's decision to hand Colossus 1 over to Anthropic in its entirety is more interesting than it appears. xAI deployed more than 220,000 NVIDIA GPUs at its Colossus 1 data center in Memphis. Of these, roughly 150,000 are estimated to be H100s, 50,000 H200s, and 20,000 GB200s. In other words, three different generations of silicon are mixed together inside a single cluster: a "heterogeneous architecture."

For distributed training, however, this configuration is close to a disaster, according to engineers familiar with the setup. In distributed training, 100,000 GPUs must finish a single step simultaneously before the cluster can advance to the next one. Even if the GB200s finish their computation first, the remaining 99,999 chips have to wait for the slower H100s (or for any GPU that has hit a stack-related snag) to catch up. This is known as the straggler effect. The 11% GPU utilization rate (MFU: the share of theoretical FLOPs actually realized) at xAI recently reported by The Information can be read as the numerical fallout of this problem. It stands in stark contrast to the 40%-plus MFU figures achieved by Meta and Google.

The problem runs deeper still. As discussed earlier, NVIDIA's NCCL has traditionally been optimized for a ring topology. It works beautifully at the 1,000–10,000 GPU scale, but once you push into the 100,000-unit range, the latency of data traversing the ring once around becomes punishingly long. GPUs need to churn through computations rapidly to keep MFU high, but while they sit waiting endlessly for data to arrive over the network fabric, more than half of the silicon falls into idle. Google sidestepped this bottleneck with its own custom topology (Google's OCS: Apollo/Palomar), but xAI, by my read, has not yet reached that stage.

Layer Blackwell's (GB200) "power smoothing" issue on top, and the picture comes into focus. According to Zeeshan Patel, formerly in charge of multimodal pre-training at xAI, Blackwell GPUs draw power so aggressively that the chip itself includes a hardware feature for smoothing power delivery. xAI's existing software stack, however, was optimized for Hopper and does not understand the characteristics of the new hardware; when it imposes irregular loads on the chip, the silicon physically destructs, literally melts. That means the modeling stack must be rewritten from scratch, which in turn means scaling is far harder than most of us imagine.

Pulling all of this together points to a single conclusion. xAI judged that training frontier models on Colossus 1 simply was not efficient enough to be worthwhile. It therefore moved its own training workloads wholesale onto Colossus 2, built as a 100% Blackwell homogeneous cluster. Colossus 1, on the other hand, whose mixed architecture is far less crippling for inference (which parallelizes more forgivingly), was leased in its entirety to an Anthropic that desperately needed inference capacity.

Many observers point to what looks like a contradiction: Elon Musk poured enormous capital into building Colossus, only to hand the core asset over to a direct competitor in Anthropic. Others read it as xAI capitulating because it is a "middling frontier lab." But these are surface-level reads. Look at the numbers and a different picture emerges. xAI today holds roughly 550,000+ GPUs in total (on an H100-equivalent performance basis), and Colossus 1 (220,000 units) accounts for only about 40% of the total available capacity. Colossus 2, built entirely on Blackwell, is already operational and continuing to expand. Elon kept the all-Blackwell homogeneous cluster (Colossus 2) for himself and leased out the older, mixed-generation Colossus 1. In other words, he handed the pain of rewriting the stack (the MFU-11% debacle) to Anthropic, while keeping his own focus on training the next generation of models.

The real point, then, is this. Elon's objective appears to be positioning ahead of the SpaceXAI IPO at a $1.75 trillion valuation, currently floated for as early as June. The narrative SpaceXAI now needs is that xAI, long the "sore finger," is not merely a research lab burning cash, but a business with a "neo-cloud" model in the mold of AWS, capable of leasing surplus assets at high yields. From a cost-of-capital perspective, an "AGI cash incinerator" is far less attractive to investors than a "data-center landlord generating cash."

As noted above, the most important detail of the Colossus 1 lease is that it is for inference, not training. Unlike training, inference requires far less tightly synchronized inter-GPU communication. Even when the chips are heterogeneous, the workload parcels out cleanly across them in parallel. The straggler effect, the chief weakness of a mixed cluster, is essentially neutralized for inference workloads. Furthermore, with Anthropic occupying all 220,000 GPUs as a single tenant, the network-switch jitter (unanticipated latency) that arises under multi-tenancy disappears. The two sides' technical weaknesses end up complementing each other almost exactly.

One insight follows. As a training cluster mixing H100/H200/GB200, Colossus 1 was an asset that could only deliver an MFU of 11%. The moment it was handed over to a single inference customer, however, that asset transformed into a cash-flow asset rented out at roughly $2.60 per GPU-hour (a weighted average of the lease rates across GPU types). For xAI, what was a "cluster from hell" for training has become a "golden goose" minting $5–6 billion in annual revenue when redeployed for inference. Elon's genius, I would argue, lies not in the model but in this asset-rotation structure.

The weight of that $6 billion becomes clearer when set against xAI's income statement. Annualizing xAI's 1Q26 net loss yields roughly $6 billion in losses per year. The $5–6 billion in annual revenue generated by leasing Colossus 1 to Anthropic, in other words, almost perfectly hedges xAI's loss figure. This single deal effectively pulls xAI to break-even. Heading into the SpaceXAI IPO, this functions as a core line of financial defense. From a cost-of-capital standpoint, if the image shifts from "research lab burning cash" to "infrastructure tollgate stably printing $6 billion a year," the entire tone of the offering can change.

(May 8, 2026, Mirae Asset Securities)
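A quick sanity check on the lease arithmetic in the post above, as a minimal sketch. It assumes all 220,000 GPUs are billed for every hour of a 365-day year at the quoted weighted-average rate of $2.60 per GPU-hour. The post does not state the billing terms, so the helper name and the full-utilization assumption are illustrative only.

```python
# Assumption: the lease is billed around the clock for a non-leap year.
HOURS_PER_YEAR = 24 * 365


def annual_lease_revenue(gpu_count: int, usd_per_gpu_hour: float) -> float:
    """Hypothetical helper: yearly revenue if every GPU is billed every hour."""
    return gpu_count * usd_per_gpu_hour * HOURS_PER_YEAR


# Figures quoted in the post: 220,000 GPUs at roughly $2.60 per GPU-hour.
revenue = annual_lease_revenue(220_000, 2.60)
print(f"${revenue / 1e9:.2f}B per year")
```

Under these assumptions the result is about $5.0 billion per year, the low end of the $5–6 billion range the post cites.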

Jukan@jukan05

What the SpaceX–Anthropic Deal Means

Two weeks ago, we published a note laying out what GPT-5.5's release implied. The conclusion was simple: whoever secures compute first, in greater volume, and with greater reliability ultimately takes the win. With OpenAI's 30GW roadmap dwarfing Anthropic's 7–8GW, we closed by arguing that the structural advantage on compute sat with OpenAI.

Less than a fortnight later, that conclusion is being tested. On May 6, Anthropic signed a single-tenant lease for the entirety of Colossus 1 with SpaceXAI, the infrastructure subsidiary that consolidates Elon Musk's xAI and SpaceX. The asset carries more than 220,000 GPUs and 300MW of power, and crucially, is scheduled to come online within this month. It served as the capstone of Anthropic's April blitz, which added 13.8GW of cumulative capacity over the span of a single month. On headline numbers alone, OpenAI took more than a year to stack 18GW; Anthropic has put 13.8GW in the ground in thirty days.

The takeaways break down into three.

First, the compute pecking order has been redrawn again. Anthropic has now swept up the AWS expansion (5GW, with $100B+ in spend commitments over a decade), Google + Broadcom (3.5GW of TPU), Google Cloud (5GW alongside a $40B investment), and now SpaceXAI's Colossus 1 (0.3GW). Cumulative committed capacity, inclusive of pre-April allocations, sits at 14.8GW. This is still only half of OpenAI's 2030 target of 30GW, but the fact that the SpaceX lease will be live inside a month makes "deliverability" a qualitatively different proposition.

Second, Elon Musk is the plaintiff in an active lawsuit against OpenAI and, at the same time, the supplier handing 220,000+ GPUs and 300MW of power, in one block, to OpenAI's most formidable competitor. The timing matters: the deal was struck in the middle of the Musk–Altman trial. We read this as a deliberate pincer with OpenAI in the middle. In the courtroom, Musk works to dismantle the moral legitimacy of OpenAI's leadership; in the market, he arms Anthropic to absorb OpenAI's revenue and user base.

Third, the structure is financial-engineering perfection: a clean win-win for both sides. xAI can recognize $6B of annual revenue from a single contract, an amount that almost precisely offsets its Q1 2026 annualized net loss of $6B. It also accelerates the cleanup of SpaceXAI's pre-IPO balance sheet, with the entity now being floated at around $1.75T. Anthropic, on the other side, converts roughly $5B of spend into what it expects to be $15B of ARR via the coming inference-revenue surge.

(Mirae Asset Securities, May 8, 2026)

198

511

4.2K

1.2M

Toby Pohlen@TobyPhln·9 May

@fraser_cook @xai Thanks for all the amazing work 🐐

1.3K

Fraser Cook@fraser_cook·8 May

Today marks the end of my time at @xai I was incredibly fortunate to have joined at a time where I was able to help take us from nothing to something across a few surfaces. It has been a privilege to work alongside some of the world’s best for almost two years and share in the ups, downs and everything in-between. I wish all of them well, past and present, and want to thank them for what has been a very special period of my life!

286

20.5K

Toby Pohlen@TobyPhln·5 May

@finbarrtimbers @xai Will do! I have a ton of notes actually.

8.6K

finbarr@finbarrtimbers·5 May

@TobyPhln @xai Please write more about this! I’d love to read more about you reflecting on your Xai decisions.

9.3K

Toby Pohlen@TobyPhln·5 May

I'm back in Austin three years after spending my first day at @xai here. On the flight I thought about all the decisions I made during that time - many of which were actually poor. I compiled a selection below.

745

113.1K

Toby Pohlen@TobyPhln·5 May

@stalmico @xai Not splitting my time better between London and PA

6.9K

Steven Collard@stalmico·5 May

@TobyPhln @xai curious which one stings most

7.8K

Toby Pohlen@TobyPhln·5 May

@tugot17 I've been living here for a long time and love the city. It's also great for building companies.

730

Piotr Mazurek@tugot17·5 May

@TobyPhln why London?

687

Toby Pohlen@TobyPhln·5 May

The most impactful decision I made was to build a team in London instead of moving to Palo Alto. It opened up a great talent pool for the company but also meant I couldn't effectively lead a big area. I probably should have at least split my time evenly between the locations.

181

18.9K

Toby Pohlen@TobyPhln·5 May

I avoid company politics; it's painful. But as a lead you can't ignore bad decisions and submit PRs instead. I wish I had been more vocal about areas such as prod reliability, security, and feature roadmaps.

143

20.4K

Toby Pohlen@TobyPhln·1 May

@giffmana @unixpickle 😭

1.3K

Lucas Beyer (bl16)@giffmana·1 May

Hey now, this took a ton of effort. Design doc, cross org alignment, getting stakeholder buy in, lawyer approval, security review, leads sponsorship, and finally getting an intern to write the code, iterating until it fits the readability style guide in all 7 languages that were needed for this, fighting the 4 rollbacks it received, and finally writing the launch doc thanking everyone who ever sat in a related meeting or left any comment on the doc. This is important groundwork to eventually making it do useful things!

393

18.6K

Alex Nichol@unixpickle·30 Apr

did google seriously add an annoying Gemini button to the bottom of all google doc pages?

18.9K

Toby Pohlen@TobyPhln·28 Apr

@SimonCarGuy Remember this?

1.5K

Simon Lane@SimonCarGuy·28 Apr

U.K. proudly produces some of the most desirable cars in the world including Rolls Royce, Bentley, Aston Martin, Range Rover etc Our industry (which employs nearly 900k people) needs government to get behind it. But we send Their Majesties to the White House in a BMW 7 Series?

OSINTdefender@sentdefender

His Majesty, King Charles III, and Her Majesty, Queen Camilla, arrive at the White House and are personally greeted for tea by President Donald J. Trump and First Lady Melania Trump.

309

113

2.2K

374.2K

Toby Pohlen@TobyPhln·25 Apr

@MarcoJunk 💯

714

Marco Junk@MarcoJunk·23 Apr

A supply chain law for businesses that reaches down to the last screw from India vs. "the effort of clarifying which NGO received how much taxpayer money is too great for us," in one and the same country, is unbearable.

German

955

49.8K

Toby Pohlen@TobyPhln·17 Apr

@skcd42 Sounds like Louis

823

skcd@skcd42·17 Apr

> I've watched sandeep grow from a security hellhole, to someone I only have occasional day nightmares about. I'll add him to the privileged list and we can revoke if he acts up. 🧘‍♂️

5.6K
