Stanley Yuan 🔋

3.2K posts

Stanley Yuan 🔋

@nonlineargrowth

Serial entrepreneur w 1x exit (acquired by co w/ 250m users). Technology investor. Mathematician by training, humanitarian at heart. Ex @GoldmanSachs @Columbia

Katılım Temmuz 2009

7.5K Takip Edilen1.8K Takipçiler

Sabitlenmiş Tweet

Stanley Yuan 🔋@nonlineargrowth·30 Oca

Find a problem (hopefully a massive one) you are uniquely capable of solving, give it your all and let compounding do its trick in the long run. And build a future you'd want to live in with your own hands.

Tom Butler-Bowdon@tombutlerbowdon

"Work on problems nobody else is working on, especially if you’re uniquely capable of solving them." @david_perell summing up worldview of @peterthiel

English

Stanley Yuan 🔋 retweetledi

phil beisel@pbeisel·11h

The Terafab "Yield Buffer": Why 160k is the Real Number Elon just clarified the math on Terafab, and the 60% jump in wafer starts (from 100k to 160k per month) tells a massive story about the reality of 2nm manufacturing. In my original breakdown, I estimated 100k wafers/month to hit 100 million AI5 chips/year. That assumes a relatively mature yield (60%+). Elon’s response, "Probably more like 160k wafers/month, factoring in yield", is a reality check. The "Bleeding Edge" Tax: Launching a 2nm fab from scratch is historically difficult. By aiming for 160k wafers, Tesla is building in a massive safety margin. If initial yields are lower (closer to 35-40%), they still hit the 100 million chip target. Monthly Starts: 160,000 wafers Annual Capacity: 1.92 Million wafers The Goal: 100 Million "Good" Chips Required Net Yield: ~35% (The "Launch" yield) The Upside: If yields hit 65%, output jumps to ~190 Million chips/year The TSMC Benchmark: Matching the Giant To put 160k wafers/month in perspective, look at TSMC. As of early 2026, TSMC’s entire global 2nm capacity (spread across multiple "Gigafabs" in Hsinchu and Kaohsiung) is targeting roughly 100k to 140k wafers per month. By pushing for 160k, Elon is essentially saying that a single Tesla Terafab cluster aims to outproduce the entire world’s initial 2nm supply.

Elon Musk@elonmusk

@pbeisel Probably more like 160k wafers/month, factoring in yield

English

361

18.5K

Stanley Yuan 🔋 retweetledi

Elon Musk@elonmusk·19h

@pbeisel Probably more like 160k wafers/month, factoring in yield

English

150

1.8K

189.9K

Stanley Yuan 🔋 retweetledi

Elon Musk@elonmusk·19h

@bindureddy Google will win the AI race in the West, China on Earth and SpaceX in space

English

828

795

7.2K

896.6K

Stanley Yuan 🔋 retweetledi

Elon Musk@elonmusk·19h

We’ve been able to generate physics-accurate, real-time video for self-driving training & testing at @Tesla_AI for a long time. The compute required for this (roughly one H100 per HD camera) is still far too expensive for consumer use, but probably becomes affordable in 2 to 3 years.

English

164

234

2.7K

119.9K

Stanley Yuan 🔋 retweetledi

phil beisel@pbeisel·1d

Tesla’s forthcoming AI5 uses a half-reticle design, which is crucial for yield. A reticle defines the imaging area of a lithography machine, fitting two chips per shot effectively doubles yield. This means the Tesla chip design team had to carefully manage die features, for instance dropping the older ISP (and classic GPU) to make room for more AI cores. By contrast, NVIDIA’s Blackwell fills nearly a full reticle, making it a single-reticle design. If Tesla hits its compute and efficiency targets with AI5 in this half-reticle format, it’s almost like cutting fab requirements in half. And this has a big impact on Terafab, especially if it carries forward for AI6, AI7, etc.

phil beisel@pbeisel

Terafab may be the most essential vertical integration Tesla has ever undertaken— and it is truly non-optional. It will take years to build and will test even Elon’s speedrunning abilities to the limit, but that won’t stop him from trying. The breakthrough likely lies in overhauling the overall facility’s cleanroom model. By moving wafers in sealed pods with localized micro-environments, the fab no longer needs a monolithic ultra-clean space. Elon’s line about “eating cheeseburgers and smoking cigars” on the fab floor isn’t silly, it’s the practical reality of a radically simpler, cheaper, faster approach that could finally change the economics of chipmaking. This is all forced by the brutal “pinch” in chip supply. Tesla must produce on the order of 100–200 billion AI chips per year just to saturate its roadmap. That volume powers: FSD cars & Robotaxis (tens of millions of vehicles needing AI5 inference for near-perfect autonomy), Physical Optimus (scaling from thousands today to millions per year, each requiring AI5/AI6-level compute), Digital Optimus (the new xAI-Tesla software agents for digital/office automation, running massive inference clusters), Space-based data centers (AI7/Dojo3 orbital compute for GW-scale training and inference beyond Earth limits). AI5 delivers the ~10× leap for vehicles and early robots; AI6 shifts focus to Optimus + terrestrial DCs; AI7 goes orbital. No external foundry (TSMC, Samsung, etc.) can deliver that scale or timeline— hence the Terafab launch. Without it, the entire robotics + autonomy future hits a brick wall. Terafab isn’t optional; it’s the only way forward.

English

185

2.2K

342K

Stanley Yuan 🔋 retweetledi

Elon Musk@elonmusk·1d

@pbeisel I am a huge admirer of Nvidia and Jensen btw. That market cap is well-deserved. SpaceX AI and Tesla expect to continue ordering Nvidia chips at scale.

English

251

652

10.3K

423K

Stanley Yuan 🔋 retweetledi

Elon Musk@elonmusk·1d

AI5 will punch far above its weight, because the entire Tesla AI software stack is designed to make maximally effective use of every circuit. We co-signed our AI software and hardware. Bear in mind that AI5, while it can be used for training in data centers, is primarily optimized for AI edge compute in Optimus and Robotaxi. There is still significant room for improvement. In the same half reticle and same process node, we think a single AI6 chip has the potential to match a dual SoC AI5.

English

265

518

355.2K

Stanley Yuan 🔋 retweetledi

Elon Musk@elonmusk·1d

@DBurkland @pbeisel It’s in testing right now. Wide release in a few weeks.

English

387

386

4.4K

910.2K

Stanley Yuan 🔋 retweetledi

Kimi.ai@Kimi_Moonshot·3d

Introducing 𝑨𝒕𝒕𝒆𝒏𝒕𝒊𝒐𝒏 𝑹𝒆𝒔𝒊𝒅𝒖𝒂𝒍𝒔: Rethinking depth-wise aggregation. Residual connections have long relied on fixed, uniform accumulation. Inspired by the duality of time and depth, we introduce Attention Residuals, replacing standard depth-wise recurrence with learned, input-dependent attention over preceding layers. 🔹 Enables networks to selectively retrieve past representations, naturally mitigating dilution and hidden-state growth. 🔹 Introduces Block AttnRes, partitioning layers into compressed blocks to make cross-layer attention practical at scale. 🔹 Serves as an efficient drop-in replacement, demonstrating a 1.25x compute advantage with negligible (<2%) inference latency overhead. 🔹 Validated on the Kimi Linear architecture (48B total, 3B activated parameters), delivering consistent downstream performance gains. 🔗Full report: github.com/MoonshotAI/Att…

English

326

13.4K

4.8M

Stanley Yuan 🔋 retweetledi

Elon Musk@elonmusk·3d

@ai_for_success For a few years, then SpaceX will far exceed everyone combined

English

1.2K

834

13.8K

Stanley Yuan 🔋 retweetledi

Avi Chawla@_avichawla·3d

Big release from Kimi! They just released a new way to handle residual connections in Transformers. In a standard Transformer, every sub-layer (attention or MLP) computes an output and adds it back to the input via a residual connection. If you consider this across 40+ layers, the hidden state at any layer is just the equal-weighted sum of all previous layer outputs. Every layer contributes with weight=1, so every layer gets equal importance. This creates a problem called PreNorm dilution, where as the hidden state accumulates layer after layer, its magnitude grows linearly with depth. And any new layer's contribution gets progressively buried in the already-massive residual. This means deeper layers are then forced to produce increasingly large outputs just to have any influence, which destabilizes training. Here's what the Kimi team observed and did: RNNs compress all prior token information into a single state across time, leading to problems with handling long-range dependencies. And residual connections compress all prior layer information into a single state across depth. Transformers solved the first problem by replacing recurrence with attention. This was applied along the sequence dimension. Now they introduced Attention Residuals, which applies a similar idea to depth. Instead of adding all previous layer outputs with a fixed weight of 1, each layer now uses softmax attention to selectively decide how much weight each previous layer's output should receive. So each layer gets a single learned query vector, and it attends over all previous layer outputs to compute a weighted combination. The weights are input-dependent, so different tokens can retrieve different layer representations based on what's actually useful. This is Full Attention Residuals (shown in the second diagram below). But here's the practical problem with this idea. Full AttnRes requires keeping all layer outputs in memory and communicating them across pipeline stages during distributed training. To solve this, they introduce Block Attention Residuals (shown in the third diagram below). The idea is to group consecutive layers into roughly 8 blocks. Within each block, layer outputs are summed via standard residuals. But across blocks, the attention mechanism selectively combines block-level representations. This drops memory from O(Ld) to O(Nd), where N is the number of blocks. Layers within the current block can also attend to the partial sum of what's been computed so far inside that block, so local information flow isn't lost. And the raw token embedding is always available as a separate source, which means any layer in the network can selectively reach back to the original input. Results from the paper: - Block AttnRes matches the loss of a baseline LLM trained with 1.25x more compute. - Inference latency overhead is less than 2%, making it a practical drop-in replacement - On a 48B parameter Kimi Linear model (3B activated) trained on 1.4T tokens, it improved every benchmark they tested: GPQA-Diamond +7.5, Math +3.6, HumanEval +3.1, MMLU +1.1 The residual connection has mostly been unchanged since ResNet in 2015. This might be the first modification that's both theoretically motivated and practically deployable at scale with negligible overhead. More details in the post below by Kimi👇 ____ Find me → @_avichawla Every day, I share tutorials and insights on DS, ML, LLMs, and RAGs.

Kimi.ai@Kimi_Moonshot

English

222

2.3K

340.8K

Stanley Yuan 🔋 retweetledi

Kaito | 海斗@_kaitodev·5d

5 minutes ago, @karpathy just dropped karpathy/jobs! he scraped every job in the US economy (342 occupations from BLS), scored each one's AI exposure 0-10 using an LLM, and visualized it as a treemap. if your whole job happens on a screen you're cooked. average score across all jobs is 5.3/10. software devs: 8-9. roofers: 0-1. medical transcriptionists: 10/10 💀 karpathy.ai/jobs

English

967

1.8K

12.1K

3.5M

Stanley Yuan 🔋 retweetledi

Ashok Elluswamy@aelluswamy·5d

@ClayTravis Only to be topped by Tesla Optimus!

English

137

58.5K

Stanley Yuan 🔋 retweetledi

Elon Musk@elonmusk·5d

@BrianRoemmele The AI race will come down to scaling power and chip output

English

474

367

4.1K

201.9K

Stanley Yuan 🔋 retweetledi

Elon Musk@elonmusk·5d

@peterwildeford xAI will catch up this year and then exceed them all by such a long distance in 3 years that you will need the James Webb telescope to see who is in second place

English

1.8K

1.4K

18.6K

1.3M

Stanley Yuan 🔋 retweetledi

Elon Musk@elonmusk·5d

Terafab Project launches in 7 days

English

14.8K

11.1K

89.9K

84.9M

Stanley Yuan 🔋 retweetledi

Tesla@Tesla·13 Mar

Teslas will become mobile compute nodes

Elon Musk@elonmusk

Oh and it works in all AI4-equipped cars, so your car can do office work for you when not driving. We’re also deploying millions of dedicated Digital Optimus units in the field at Superchargers where we have ~7 gigawatts of available power.

English

476

1.1K

8.2K

825.3K

Keşfet

@pbeisel @bindureddy @Tesla_AI @DBurkland @ai_for_success @_avichawla @karpathy @ClayTravis