Paul Tune

5.1K posts

@ptuls

Machine learning engineer @canva, occasional photographer, one-way time traveller.

Sydney, New South Wales · Joined April 2010

605 Following · 361 Followers

Paul Tune@ptuls·1d

The coding agents are more high agency than some people I know. They just never give up

Paul Tune@ptuls·3d

@cargoshortdad64 school needs to reorient around teaching students to ask questions effectively. it’s a (meta)skill that’s missing from current education

cargo short dad@cargoshortdad64·4d

The combined cost of college and school was enormous For me to end up learning pretty much everything from YouTube educators and asking Claude questions effectively for free. No idea how you can empathize with the educational system right now

Alex Kantrowitz@Kantrowitz

This is incredible. Artificial intelligence getting booed out of the stadium in any commencement speech it’s mentioned. Maybe telling college students AI was taking their jobs wasn’t the best strategy. Must watch —>

Paul Tune@ptuls·13 May

@cargoshortdad64 my conversations have so much redundancy that shannon would be rolling in his grave

cargo short dad@cargoshortdad64·13 May

Average conversation with coworkers

Paul Tune@ptuls·13 May

@cargoshortdad64 I approached it differently by using gumbel noise, which you can formulate as the continuous analogue to the absorbing state discrete diffusion process. An argmax operation converts it back to discrete tokens from logit space, and used the gumbel max trick to train it
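For readers unfamiliar with the Gumbel-max trick mentioned above, here is a minimal sketch of the sampling identity it rests on: adding independent Gumbel(0, 1) noise to each logit and taking the argmax yields a sample from the softmax distribution over those logits. This only illustrates that identity; it is not the training setup described in the post, whose details are not given. The function names and the example logits are assumptions made for illustration.

```python
import math
import random
from collections import Counter


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_gumbel(rng):
    """Draw Gumbel(0, 1) noise via inverse transform: -log(-log(U))."""
    # rng.random() lies in [0, 1); clamp away from 0 so log() is defined.
    u = max(rng.random(), 1e-12)
    return -math.log(-math.log(u))


def gumbel_max_sample(logits, rng):
    """Return a token index distributed as softmax(logits)."""
    noisy = [x + sample_gumbel(rng) for x in logits]
    # argmax maps the perturbed logits back to a discrete token index.
    return max(range(len(logits)), key=noisy.__getitem__)


def empirical_frequencies(logits, n_samples, seed=0):
    """Estimate the sampling distribution by repeated Gumbel-max draws."""
    rng = random.Random(seed)
    counts = Counter(gumbel_max_sample(logits, rng) for _ in range(n_samples))
    return [counts[i] / n_samples for i in range(len(logits))]


if __name__ == "__main__":
    example_logits = [2.0, 1.0, 0.1]  # hypothetical values
    print("softmax:  ", softmax(example_logits))
    print("empirical:", empirical_frequencies(example_logits, 20_000))
```

The empirical frequencies should track the softmax probabilities closely as the number of draws grows, which is why the argmax can stand in for categorical sampling in logit space.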

cargo short dad@cargoshortdad64·13 May

Thats the route I think I tried, and think I borrowed from a paper. At each step you can take the predicted x0 and clip it to nearest real embedding though I had trouble getting it to work :(. I really wanted to make a text VAE that doesn’t have posterior collapse to do style transfer

cargo short dad@cargoshortdad64·12 May

At this point it’s worth a polymarket or a serious bet as to whether some form of diffusion language modeling becomes the norm

Tanishq Mathew Abraham, Ph.D.@iScienceLuvr

I'm a simple man, I see a Kaiming He paper, I click. ELF: Embedded Language Flows This is very interesting, getting continuous diffusion models working for text! "Unlike existing DLMs, ELF predominantly stays within the continuous embedding space until the final time step, where it maps to discrete tokens using a shared-weight network." @sedielem you might like this one!

Paul Tune@ptuls·10 May

@cargoshortdad64 yes, on that note, I agree. thankfully it bodes well for the rest of us gpu poors that constraints do unlock creativity

cargo short dad@cargoshortdad64·10 May

@ptuls I feel it warrants criticism for being one of the most highly resourced organizations ever and getting mogged by the relative gpu poor

cargo short dad@cargoshortdad64·9 May

So effectively, as usual, Xai is reckless and incompetent, and caved into the very pragmatic move of renting out the most valuable resource in the world. This is the equivalent of buying land to build a house, being incapable of building a house and giving up to rent the land

Jukan@jukan05

Why did xAI hand over a 220,000-GPU cluster to Anthropic? The technical backdrop to xAI's decision to hand Colossus 1 over to Anthropic in its entirety is more interesting than it appears. xAI deployed more than 220,000 NVIDIA GPUs at its Colossus 1 data center in Memphis. Of these, roughly 150,000 are estimated to be H100s, 50,000 H200s, and 20,000 GB200s. In other words, three different generations of silicon are mixed together inside a single cluster — a "heterogeneous architecture." For distributed training, however, this configuration is close to a disaster, according to engineers familiar with the setup. In distributed training, 100,000 GPUs must finish a single step simultaneously before the cluster can advance to the next one. Even if the GB200s finish their computation first, the remaining 99,999 chips have to wait for the slower H100s — or for any GPU that has hit a stack-related snag — to catch up. This is known as the straggler effect. The 11% GPU utilization rate (MFU: the share of theoretical FLOPs actually realized) at xAI recently reported by The Information can be read as the numerical fallout of this problem. It stands in stark contrast to the 40%-plus MFU figures achieved by Meta and Google. The problem runs deeper still. As discussed earlier, NVIDIA's NCCL has traditionally been optimized for a ring topology. It works beautifully at the 1,000–10,000 GPU scale, but once you push into the 100,000-unit range, the latency of data traversing the ring once around becomes punishingly long. GPUs need to churn through computations rapidly to keep MFU high, but while they sit waiting endlessly for data to arrive over the network fabric, more than half of the silicon falls into idle. Google sidestepped this bottleneck with its own custom topology (Google's OCS: Apollo/Palomar), but xAI, by my read, has not yet reached that stage. Layer Blackwell's (GB200) "power smoothing" issue on top, and the picture comes into focus. 
According to Zeeshan Patel, formerly in charge of multimodal pre-training at xAI, Blackwell GPUs draw power so aggressively that the chip itself includes a hardware feature for smoothing power delivery. xAI's existing software stack, however, was optimized for Hopper and does not understand the characteristics of the new hardware; when it imposes irregular loads on the chip, the silicon physically destructs — literally melts. That means the modeling stack must be rewritten from scratch, which in turn means scaling is far harder than most of us imagine. Pulling all of this together points to a single conclusion. xAI judged that training frontier models on Colossus 1 simply was not efficient enough to be worthwhile. It therefore moved its own training workloads wholesale onto Colossus 2, built as a 100% Blackwell homogeneous cluster. Colossus 1, on the other hand — whose mixed architecture is far less crippling for inference, which parallelizes more forgivingly — was leased in its entirety to an Anthropic that desperately needed inference capacity. Many observers point to what looks like a contradiction: Elon Musk poured enormous capital into building Colossus, only to hand the core asset over to a direct competitor in Anthropic. Others read it as xAI capitulating because it is a "middling frontier lab." But these are surface-level reads. Look at the numbers and a different picture emerges. xAI today holds roughly 550,000+ GPUs in total (on an H100-equivalent performance basis), and Colossus 1 (220,000 units) accounts for only about 40% of the total available capacity. Colossus 2 — built entirely on Blackwell — is already operational and continuing to expand. Elon kept the all-Blackwell homogeneous cluster (Colossus 2) for himself and leased out the older, mixed-generation Colossus 1. In other words, he handed the pain of rewriting the stack — the MFU-11% debacle — to Anthropic, while keeping his own focus on training the next generation of models. 
The real point, then, is this. Elon's objective appears to be positioning ahead of the SpaceXAI IPO at a $1.75 trillion valuation, currently floated for as early as June. The narrative SpaceXAI now needs is that xAI — long the "sore finger" — is not merely a research lab burning cash, but a business with a "neo-cloud" model in the mold of AWS, capable of leasing surplus assets at high yields. From a cost-of-capital perspective, an "AGI cash incinerator" is far less attractive to investors than a "data-center landlord generating cash." As noted above, the most important detail of the Colossus 1 lease is that it is for inference, not training. Unlike training, inference requires far less tightly synchronized inter-GPU communication. Even when the chips are heterogeneous, the workload parcels out cleanly across them in parallel. The straggler effect — the chief weakness of a mixed cluster — is essentially neutralized for inference workloads. Furthermore, with Anthropic occupying all 220,000 GPUs as a single tenant, the network-switch jitter (unanticipated latency) that arises under multi-tenancy disappears. The two sides' technical weaknesses end up complementing each other almost exactly. One insight follows. As a training cluster mixing H100/H200/GB200, Colossus 1 was an asset that could only deliver an MFU of 11%. The moment it was handed over to a single inference customer, however, that asset transformed into a cash-flow asset rented out at roughly $2.60 per GPU-hour (a weighted average of the lease rates across GPU types). For xAI, what was a "cluster from hell" for training has become a "golden goose" minting $5–6 billion in annual revenue when redeployed for inference. Elon's genius, I would argue, lies not in the model but in this asset-rotation structure. The weight of that $6 billion becomes clearer when set against xAI's income statement. Annualizing xAI's 1Q26 net loss yields roughly $6 billion in losses per year. 
The $5–6 billion in annual revenue generated by leasing Colossus 1 to Anthropic, in other words, almost perfectly hedges xAI's loss figure. This single deal effectively pulls xAI to break-even. Heading into the SpaceXAI IPO, this functions as a core line of financial defense. From a cost-of-capital standpoint, if the image shifts from "research lab burning cash" to "infrastructure tollgate stably printing $6 billion a year," the entire tone of the offering can change. (May 8, 2026, Mirae Asset Securities)
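The revenue figure in the quoted post can be sanity-checked with simple arithmetic. The sketch below multiplies the GPU count and the weighted-average lease rate given in the post by the hours in a year. The `utilization` parameter and the assumption that every GPU is billed for every hour are illustrative simplifications, not details from the post.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years


def annual_lease_revenue(num_gpus, rate_per_gpu_hour, utilization=1.0):
    """Annual revenue in dollars for a cluster leased by the GPU-hour.

    utilization is the assumed fraction of GPU-hours actually billed.
    """
    return num_gpus * rate_per_gpu_hour * HOURS_PER_YEAR * utilization


# Figures from the quoted post: 220,000 GPUs at roughly $2.60 per GPU-hour.
revenue = annual_lease_revenue(220_000, 2.60)
print(f"${revenue / 1e9:.2f}B per year")  # prints "$5.01B per year"
```

With every GPU billed around the clock, the result is about $5.0 billion per year, which sits at the low end of the $5–6 billion range quoted in the post.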

Paul Tune@ptuls·23 Apr

The Jensen Huang leather jacket collection

Paul Tune@ptuls·23 Apr

Trumpian nightmare

Paul Tune@ptuls·22 Apr

definitely better adherence to the ending

Paul Tune@ptuls·22 Apr

Not even a year later, and OpenAI Image 2 gave us a step jump x.com/ptuls/status/1…

Paul Tune@ptuls

We're getting there, we should be able to go from short story to comic

Paul Tune@ptuls·1 Apr

they shipped so fast that competitors get advanced features today

Chaofan Shou@Fried_rice

Claude code source code has been leaked via a map file in their npm registry! Code: …a8527898604c1bbb12468b1581d95e.r2.dev/src.zip

Paul Tune@ptuls·31 Mar

@cargoshortdad64 mate, aussie taxes might be high, but it's not labyrinthine like american taxes

cargo short dad@cargoshortdad64·31 Mar

Just found out about Australia taxes

Paul Tune@ptuls·31 Mar

reddit.com/r/shittymovied…

Paul Tune@ptuls·31 Mar

This, and Maverick

Paul Tune@ptuls·29 Mar

Good science should be apolitical

NeurIPS Conference@NeurIPSConf

We want to speak directly to the concern many of you have expressed, and we owe you a clear explanation of what happened, why it happened, and where we stand now. We understand this situation caused genuine alarm and we take that seriously. In preparing the NeurIPS 2026 handbook, we included a link to a US government sanctions tool that covers a significantly broader set of restrictions than those NeurIPS is actually required to follow. This error was due to miscommunication between the NeurIPS Foundation and our legal team; there was never an intention to restrict participation beyond our mandatory compliance obligations. The responsibility for that error is ours as an organization, and we deeply apologize for the alarm and impact this miscommunication had on our community. We have updated the link and clarified the text of our policy, which is consistent with that of ACM and IEEE, as well as other international conferences and NeurIPS in the past. As in previous years, NeurIPS welcomes submissions from all compliant institutions and individuals. We want to reiterate that NeurIPS is a community-driven event, created by and for the community, and strives to be inclusive. The NeurIPS 2026 organizing committee was particularly saddened to learn of this institutional miscommunication. The organizing committee has taken on the responsibility of running the conference this year with the goal of fostering open communication, knowledge sharing, and global scientific discourse. We thank the community for bringing this issue to our attention and working with us through this situation.

Paul Tune@ptuls·27 Mar

True story: this review happened to me on a conference paper once

Paul Tune@ptuls·27 Mar

Reviewer 2: "This is a trivial consequence of the Johnson-Lindenstrauss lemma" research.google/blog/turboquan…

Paul Tune@ptuls·22 Mar

@elonmusk @dwarkesh_sp when it’s set at high temperatures

Elon Musk@elonmusk·21 Mar

@dwarkesh_sp AI will figure it out

Dwarkesh Patel@dwarkesh_sp·21 Mar

When Copernicus proposed heliocentrism in 1543, it was actually less accurate than Ptolemy's geocentric model - a system refined over 1,400 years with epicycles precisely tuned to match observed planetary positions. It took another 70 years before Kepler, working from Tycho Brahe's unprecedentedly precise observations, replaced Copernicus’s circles with ellipses - finally making heliocentrism empirically superior. Terence Tao's point is that science needs a high temperature setting. If we only fund and follow what's most state of the art today, we kill the ideas that might need decades of work to surpass some overall plateau.
