Tapa Ghosh

11.3K posts

@semiDL

I’m building the universe I see in my mind. YC W18, TF 2018, Newborn 2001

SF Bay Area, CA · Joined August 2013
444 Following · 2.1K Followers
Pinned Tweet
Tapa Ghosh
Tapa Ghosh@semiDL·
“Life is but a dream, and man has but 50 years”
English
0
0
19
8.1K
Elana
Elana@ItsElanaGold·
Peter Thiel has only ever invested in NVIDIA when it comes to semiconductors. He just made his first private semi bet. The company: Fractile. The backers: Factorial & Founders Fund. Most people haven't heard of Factorial. But their track record is insane. Their biggest position is Anthropic. Over $1B across their first two funds. One of the largest VC holders on the cap table. Now they're backing the company building inference chips for the next generation of AI models.
solbier@solbier1

We’re thrilled to be backing @fractile_ai , solving inference at the edge of what’s possible. The current AI trend around infra and memory is deeply personal to me. A decade ago, I was part of a small team at Cruise pushing silicon to its limit. We were running perception and planning models on NVIDIA Titan with only 6GB of RAM & 2k CUDA cores. The chips were on-device, power-constrained, thermally limited. The car's air conditioning wasn't designed to dissipate that much heat so we had to engineer custom cooling. Everything was duct tape and ingenuity. It gave me a deep appreciation for doing more with less, and for founders like @goodwin_ml at Fractile who operate the same way. If you're the type who stares at the limits of physics and sees a challenge - reach out !

English
9
7
102
27.4K
Tapa Ghosh
Tapa Ghosh@semiDL·
Intel selling off their optical interconnect business despite having one of the best lasers and reliability in the industry and missing the AI wave is amazing in retrospect…
English
1
1
7
1.3K
outside five sigma
outside five sigma@jwt0625·
if you are a 2nd year phd doing silicon photonics, chances are you enjoy microring resonators. if you are a 2nd year system architect for optical comm, chances are you hate microring resonators.
English
3
2
78
4.5K
Tapa Ghosh retweeted
outside five sigma
outside five sigma@jwt0625·
should've bought some photonics stocks when I first saw this video.. anyway this is a photonic integrated circuit chip from a 1.6T DWDM transceiver module by Infinera (2 wavelengths x 2 polarizations (DP-64QAM) x 96 GBaud), and I'll make some guess on what is going on on this PIC. before I start, you can zoom in onto the edges and see the facets were cleaved... (the year was already ~2020) (image credit: siliconpr0n.org/map/infinera/1…)
JFClifford@jfclifford

@jwt0625 Found this video today, wire bonds are bonkers inside youtu.be/zM60Vs7GhVA?si…

English
3
20
162
25.7K
Glinert 🇺🇸 🏭
Glinert 🇺🇸 🏭@StevenGlinert·
Observation: People wise, dc is a better looking city than sf
English
4
0
15
1.7K
Tapa Ghosh
Tapa Ghosh@semiDL·
@samhcarter He’s always been pro general optics market growth, he’s really only been net negative on CPO
English
0
0
0
94
Jim Keller
Jim Keller@jimkxa·
@semiDL That’s not a law, that’s a speeding ticket
English
1
0
1
391
Jim Keller
Jim Keller@jimkxa·
My current list of "laws" governing computer design. Did I miss any?
Rent's Rule
Pollack's Rule
Amdahl's Law
Moore's Law
Dennard Scaling
Bitter Lesson
Little's Law
Jevons Paradox
English
63
41
369
46K
Tapa Ghosh retweeted
Deedy
Deedy@deedydas·
Almost every single semi stock in the has gone 2x in the last month. More enterprise value created in non-Nvidia publics in the last month than all big labs this year.
English
36
100
1.3K
233.6K
Tapa Ghosh
Tapa Ghosh@semiDL·
@Underfox3 Fractile has moved entirely away from that btw- this article is very out of date, they’re now doing 3d dram on logic
English
1
0
0
131
Underfox
Underfox@Underfox3·
It's funny to see that no tech press outlet knew or even minimally attempted to explain the architecture proposed by Fractile... Something I did over a year ago and almost nobody paid attention. tomshardware.com/tech-industry/…
English
4
5
33
3.5K
Tapa Ghosh
Tapa Ghosh@semiDL·
@Object_Zero_ @chamath @gustaf You can use lithography tools with any material... the way that you put MoS2 or any other 2-D TMD into a semi fab process is by growing it on a silicon wafer
English
1
0
0
83
Object Zero
Object Zero@Object_Zero_·
This 100MW data center in the UAE is the largest solar-powered data center in the world. There are currently 1,300 data centers in the world that are bigger than this one, but this one is the largest solar-powered one.

That's 10 square kilometres of solar panels you can see. The data center itself is 0.02 square kilometres, so a solar-powered data center is ~500x larger than a data center using any other form of power. A five hundred times larger site.

The UAE has some of the highest solar irradiance anywhere on Earth; it is an inhospitable desert, averaging 9.7 hours of sunlight per day with average irradiance above 2,200 kWh/m^2. If you build this somewhere else, you need more solar panels because your irradiance will almost certainly be lower. Even if the world had an infinite supply of free solar panels, solar power would not be free.

Anyone who has ever done major capital projects, who looks at where data centers need to be in the next 5 years and the next 10 years… we know it ain't solar. Sorry. You struggle to even build a train track that's 100 miles long and 10ft wide anywhere in the West; there is zero chance of building 100 square mile solar farms for GW compute.

This is why people are talking about space compute. Deploying into space is one strategy to solve the constraints. But there are faster and more scalable strategies that get you to mass deployment of multi-GW data centers. There are strategies that also allow you to power the 10 billion robots and their Newtonian actuators that immediately follow the inference demand cycle.

Step back and look at the full cycle of this industrial revolution… There will be billions of chips, but there will be trillions of actuators. The biggest part of this revolution is the embodiment cycle, and it's bigger by a factor of 20 or 50x than the stuff that comes before it. There is no analogy in human history for the scale of this economy, or for the demand it will place on energy and commodities.

The humans own the Earth, and if you exist inside their legal system, they won't let you turn the surface of their planet into glass. But they do want your chips and your actuators to serve their needs and desires. There is a way to do all of this, and so it will happen.
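The area comparison in the post above can be checked with a few lines of arithmetic. This is only a sketch using the figures quoted in the post (100 MW, 10 km² of panels, 0.02 km² of data center). The power-density line assumes the 100 MW figure is the power delivered by the whole solar field, which the post does not state explicitly.

```python
# Figures quoted in the post above.
SOLAR_AREA_KM2 = 10.0        # area covered by solar panels
DATACENTER_AREA_KM2 = 0.02   # footprint of the data center building
SITE_POWER_MW = 100.0        # quoted site power

# How many times larger the solar field is than the building it powers.
area_ratio = SOLAR_AREA_KM2 / DATACENTER_AREA_KM2

# Assumption: the 100 MW is spread evenly over the 10 km^2 field.
# 1 MW = 1e6 W and 1 km^2 = 1e6 m^2, so the units reduce to W/m^2.
power_density_w_per_m2 = (SITE_POWER_MW * 1e6) / (SOLAR_AREA_KM2 * 1e6)

print(f"solar field / data center area: ~{area_ratio:.0f}x")
print(f"implied power density: {power_density_w_per_m2:.0f} W/m^2 of site area")
```

The ratio reproduces the post's ~500x figure. The implied density is a rough site-level number, not a panel efficiency.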
English
243
295
2K
1.8M
Tapa Ghosh retweeted
ベンジー
ベンジー@benzycocker·
The ethereal fresh greenery of Rurikoin takes my breath away.
Japanese
57
2.1K
17.8K
400.4K
Deedy
Deedy@deedydas·
@FangYi11101 there is only one god tier and its rentech
English
5
1
29
10.7K
forward deployed ccp gf
forward deployed ccp gf@FangYi11101·
keep seeing these linkedin-slop tier lists getting shared by undergrads. can’t comment on the tech stuff, but for quant the tiers are basically all wrong.
English
49
3
275
382K
Tapa Ghosh
Tapa Ghosh@semiDL·
@growing_daniel That’s not an own, if anything that’s massively in favor of minimum wage laws
English
0
0
0
92
Sashwot Sedhai
Sashwot Sedhai@SashwotSedhai·
@insane_analyst Isn’t ultra narrow line width more of a mrm based cpo requirement? Iirc Broadcom uses mzm and have shown 200G/lane? Maybe tfln for 400G later.
English
1
0
1
625
Irrational Analysis
Irrational Analysis@insane_analyst·
Only $LITE has publicly shown (many times) their high power laser noise performance. Not a single competitor has dared to publicly show their noise. Go ask $COHR, $AAOI, Furukawa, or any of the "Chinese competition" what their RIN and linewidth are. They will not answer you.
English
7
15
270
23.9K
Tapa Ghosh
Tapa Ghosh@semiDL·
SF crime has gotten so bad they stole my phone and scrapped it for the DRAM
English
0
0
3
257
Tapa Ghosh
Tapa Ghosh@semiDL·
Capacity X Bandwidth is the key parameter for AI chips you say?
fin@fi56622380

AI Semiconductor Endgame 2026 (Part 1)
New Token Economics: Computing Paradigm Shifts from GPU Compute to HBM

This article starts from the essence of GPU architectural evolution to address a question the market has long worried about: why must each GPU's HBM memory demand grow exponentially, and why won't this exponential growth in HBM demand stall? It then derives the first principle of token economics under the current architecture:

token throughput = HBM size × HBM BW (bandwidth)

It also discusses why the GPU ceiling is determined by HBM's two dimensions of progress.

The topic of HBM cyclicality has long been controversial. Optimists argue that AI-driven demand is much greater than before, but the market mainstream still believes that previous up-cycles also saw 20%+ annual demand growth, so what's different this time? AI doesn't change the fact that HBM, like traditional DRAM, has commodity attributes. Once capacity expansion at the demand peak meets a downturn, history will repeat itself.

We can take the perspective of compute-chip architecture, start from first principles, and unpack and reason through this question: why this time is genuinely different.

----------------

History: The Era of CPU Compute

For a very long time, we lived in the era of CPU-dominated compute. The CPU's top-level KPI was performance (running faster), and so each generation of CPUs deployed every method imaginable to push benchmark scores higher. First it was rising clock frequencies, then it was architectural evolution: superscalar designs, and so on.

During this period, why didn't DDR need to advance technologically at high speed? DDR3 to DDR5 took a full 15 years. Because in this era, DDR's role was purely auxiliary, and only weakly so. By industry experience, even doubling DDR speed would generally only raise CPU performance by less than 20%.

Why did improvements in DDR bandwidth and speed matter so little? Two reasons:

1. CPUs designed all kinds of architectural tricks to hide DDR latency (superscalar designs, wider issue widths, massive ROBs and register renaming to extract parallelism and hide latency, L1 caches, L2 caches), all of which weakened the demand for DDR bandwidth and speed.
2. CPU workloads don't have particularly demanding bandwidth requirements. For most everyday workloads (say, opening a webpage), DDR bandwidth is severely overprovisioned. Even cloud workloads often look the same.

In other words, in the CPU era, DDR bandwidth and speed didn't really matter. There was virtually no difference between DDR4 and DDR5 except in a handful of games, and even the JEDEC standard advanced slowly.

On top of that, only a small portion of any given app needs to permanently sit in DDR. Whatever is needed can be paged in from the hard drive on demand. App size grew slowly, and so DDR capacity demand grew slowly as well. That's why, over the past decade, the average PC went from 7–8GB of DDR to about 23GB, only 3× growth in ten years.

This slow upgrade pace directly affected revenue. Capacity-based pricing was the main way of making money; speed improvements were just a technological upgrade that raised the unit price of capacity. With both of these dimensions advancing slowly, growth could only come from increases in PC/phone unit volumes.

So along both dimensions (bandwidth/speed and capacity), DRAM was always a “nice-to-have” appendage to the chip industry. The marginal utility of DDR upgrades was very low, and almost completely disconnected from the CPU era's top-level KPI.

----------------

The Paradigm Shift: GenAI's Top-Level KPI

When we entered the era of GenAI large models, the computing paradigm shifted, and the top-level KPI changed fundamentally. By the time GPUs evolved into AI inference engines, the top-level KPI was no longer compute alone (TOPS/FLOPS), as it had been for CPUs; it became the cost of a token. Specifically: overall token throughput per unit cost / per unit power.

A close second is token throughput speed, because in the agent era, many tasks have become serial, and token output speed has become a critical bottleneck for user experience.

This is exactly why Jensen invented the concept of the AI factory: to produce the most tokens at the lowest cost, while pushing token throughput speed as high as possible.

In the AI training era, Jensen's economics were TCO (Total Cost of Ownership): the more GPUs you buy, the more you save. In the inference era, Jensen's token economics flip the logic. AI inference has very healthy gross margins, so the logic now becomes: the NVIDIA GPU is the GPU that produces the cheapest token in the world, so the more you buy, the more you earn.

The top-level KPI has become a Pareto frontier: along the two dimensions of token throughput and token speed, optimize as far as possible. Each generation of NVIDIA's token factory is essentially pushing the entire Pareto frontier up and to the right. This is the most important KPI of the AI inference era.

----------------

From Token Throughput to HBM: The Core Logic Chain

Below is the most important logical chain of this article: how to start from the exponential growth of token throughput and derive that the ceiling bottleneck lies in the exponential growth of HBM size and HBM speed.

In the era of single-GPU inference with single-thread batch size = 1, token throughput had only one dimension: HBM bandwidth speed. Higher bandwidth = higher token throughput.

But once we entered the NVL72 era, inference is no longer single-GPU. It is a system-level token factory composed of 72 GPUs + 36 CPUs, designed to fully saturate HBM bandwidth and compute simultaneously, in pursuit of the ultimate token throughput.

Token throughput growth depends on two things: the number of requests batched simultaneously × the average token speed per request. That is: batch size × token speed.

Take Rubin NVL72 as an example. At an average token speed of 100 tokens/s, processing 1,920 simultaneous requests yields a token throughput of 192,000 tokens/s. A Rubin NVL72 draws roughly 120kW (0.12MW), so per MW it can handle 1.6M tokens/s.

So we need to find ways to push both parameters up: batch size and average token speed. Their product is our top-level KPI: token throughput.

Parameter 1: Batch growth (bottleneck is HBM size)

Every request in the batch carries its own KV cache, which has to live in HBM, with sizes ranging from a few GB to tens of GB. Because hot KV cache must be read at high frequency and high speed at any moment, it must reside in HBM. For a model with, say, 80 layers, every token generation step requires reading the KV cache 80 times from HBM.

As batch size grows, hot KV cache grows linearly. And because the hot KV cache for every request in the batch must sit in HBM, HBM size must grow linearly with batch size.

Like an airport shuttle bus: the gate wants to move passengers to the plane as fast as possible. If HBM size is small, the shuttle is small, so you have to make extra trips.

Conclusion: batch size growth bottlenecks on HBM size growth.

Parameter 2: Average token speed per request (bottleneck is HBM bandwidth)

The decode-phase speed of a large model bottlenecks on HBM bandwidth, because every token generated requires reading the activated weights and KV cache many times over. The emergence of LPUs has, in cases where batch size isn't very large, moved the activated weights portion onto SRAM, but every generated token still requires many reads of the KV cache from HBM.

The higher the HBM bandwidth, the faster each token is generated, in essentially linear correspondence.

Like the airport shuttle bus: HBM bandwidth is like the width of the door: wider doors mean passengers board faster.

The rest of the GPU's configuration is essentially adapted to support batch growth and to keep token compute speed in step with HBM growth. In some cases the GPU even spends excess compute to recover effective bandwidth (e.g., bandwidth compression techniques).

----------------

To return to the shuttle bus analogy:

• Shuttle bus cabin size = HBM size (capacity): determines how many passengers can fit at once (i.e., how many requests' KV caches can sit in HBM simultaneously). Bigger cabin = more passengers (higher batch size) per trip. If the bus is too small, moving 100 people takes two trips, and total throughput suffers.
• Shuttle bus door width = HBM bandwidth: determines how fast passengers get on and off. A wide door, and everyone piles on at once (decode/token generation is fast). A narrow door, and even with a giant cabin, people queue up and most of the time is spent boarding.
• Passenger throughput = cabin size × door-width-determined boarding speed.

----------------

At this point, we've logically derived the first principle of token-economics hardware demand:

Token throughput = HBM size × HBM bandwidth

The top-level KPI of the AI inference era is highly dependent on progress along both HBM dimensions. If we want to maintain 2× token throughput growth per generation, that means each generation of single GPU must grow HBM size × HBM BW speed by 2×!

This is the first time in history that HBM memory size can influence the top-level KPI, token throughput.

To validate this thesis, we can put NVIDIA's token throughput from A100 to Rubin Ultra on the same chart as HBM size × HBM BW speed. What you find is that the two curves track each other startlingly closely on log axes.

HBM size × speed actually grows even faster than token throughput, which makes sense, because HBM defines the ceiling, and in practice utilization of that ceiling is very hard to push to 100%. Even if HBM size × HBM speed grew by 1,000×, with the supporting compute and architecture, it would be very hard to wring out the full 1,000× of headroom.

This curve isn't a coincidence; it's the necessary solution of system optimization. throughput = batch × speed. This is the unavoidable first principle of token factory economics.

----------------

What about software? Won't software optimization reduce bandwidth demand? Reduce HBM demand?

This is an independent dimension from hardware. It's like asking: if software on a CPU runs faster after optimization, does that mean the CPU doesn't need to advance for ten years? After all, software is faster now. If that were the case, would CPU vendors still make money?

For a CPU vendor to survive, there's only one path: in standardized benchmarks, ignoring software optimization, every new CPU generation must score higher; otherwise it doesn't sell.

GPUs are exactly the same. How well software is optimized, and the requirement that the GPU's own token-throughput KPI must improve dramatically every year, are two separate things. As long as token demand keeps growing, the pursuit of higher token throughput will not stop, and so neither will the pursuit of higher HBM size × HBM speed.

If HBM size and HBM speed were to slow down, Jensen would personally fly to the Big Three and pressure them to accelerate, because that is his GPU ceiling. If the ceiling stops rising, can his GPU still sell?

Of course, NVIDIA also needs to wrack its brains to extract performance beyond the HBM ceiling through heterogeneous architectural angles. The LPU is a great example: it improved the Pareto frontier substantially from a different angle (the right-hand high-token-speed portion).

----------------

HBM memory has now bid farewell to that old era of drifting with the tide. On this one-way road paved by exponential demand, it has, in something close to a destined fashion, walked onto the central stage of the industry's epic.

When the inference paradigm's first principles evolve to this point, as long as Jensen still wants to sell GPUs, HBM must double, and it must double every generation. This is endogenous pressure from the supply side. It has nothing to do with AI demand, nothing to do with macro cycles, and nothing to do with the moods of the hyperscalers.

The only remaining question is this: when demand has been physically locked into exponential growth, will the three players on the supply side, like they have for the past thirty years, once again drag themselves back into the mire of the cycle by their own hands?
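The throughput arithmetic in the quoted article (batch size × token speed, then normalised per MW) can be reproduced with a small sketch. The Rubin NVL72 inputs below are the article's own figures. The function names and the capacity example at the end (1,000 GB of HBM, 200 GB of weights, 4 GB of KV cache per request) are hypothetical values chosen only to illustrate the "batch size is capped by HBM size" argument, not real product specifications.

```python
import math


def token_throughput(batch_size: int, tokens_per_s_per_request: float) -> float:
    """Throughput as the article defines it: batch size x per-request token speed."""
    return batch_size * tokens_per_s_per_request


def throughput_per_mw(tokens_per_s: float, power_kw: float) -> float:
    """Normalise a system's throughput to tokens/s per MW of power drawn."""
    return tokens_per_s * 1000.0 / power_kw


def max_batch_size(hbm_gb: float, weights_gb: float, kv_gb_per_request: float) -> int:
    """Largest batch whose hot KV caches fit in HBM next to the weights.

    Simplified model (an assumption): HBM holds only weights plus one
    fixed-size KV cache per in-flight request.
    """
    free_gb = hbm_gb - weights_gb
    if free_gb <= 0:
        return 0
    return math.floor(free_gb / kv_gb_per_request)


# Figures stated in the article: 1,920 requests at 100 tokens/s, ~120 kW.
rubin_throughput = token_throughput(1920, 100)
rubin_per_mw = throughput_per_mw(rubin_throughput, 120)

# Hypothetical capacity example, not taken from the article.
example_batch = max_batch_size(hbm_gb=1000, weights_gb=200, kv_gb_per_request=4)

print(f"throughput: {rubin_throughput:,.0f} tokens/s")
print(f"per MW:     {rubin_per_mw:,.0f} tokens/s")
print(f"max batch under the hypothetical capacity numbers: {example_batch}")
```

With the article's inputs this gives the same 192,000 tokens/s and 1.6M tokens/s per MW that the text states. In this simplified model, doubling the free HBM capacity doubles the feasible batch size, which is the linear relationship the article relies on.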

English
0
0
1
459
SemiAnalysis
SemiAnalysis@SemiAnalysis_·
There is a single thin-film material required by every AI chip on earth. GPUs, TPUs, custom ASICs. All of them. 98% of global supply controlled by one Japanese chemical company. Zero production-ready alternatives. One producer fully booked through 2027. Raising prices. Lead times past 6 months. NVIDIA is so scared they're paying half the capex to expand supplier fabs themselves. The keyword is “umami”. Nobody's talking about this. They will be in about
English
78
104
1.8K
418.8K