Joseph Sirosh

2.8K posts


@josephsirosh

Founder, https://t.co/RVFsBed5tR. I like to share thoughts about Authentic Generative Interaction, #GenAI and platforms for the future.

Bellevue, Washington · Joined May 2014
732 Following · 8.6K Followers
Joseph Sirosh reposted
jason
jason@jxnlco·
jason from the codex team here, here's a draft on codex maxxing and the primitives i use on a daily basis: jxnl.github.io/blog/writing/2… would love any feedback
English
80
88
1.6K
101.8K
Joseph Sirosh reposted
Rohan Paul
Rohan Paul@rohanpaul_ai·
Is Grep All You Need? The surprising result is not that grep is powerful, but that agent design makes it powerful. The paper says not that grep beats vectors, but that agents fail or win through their harness. That sounds like a small distinction until you look at what was actually tested. The authors compare grep-style search and vector retrieval across LongMemEval tasks, where agents must recover facts from long conversation histories full of distractors. Inline grep beats inline vector across every harness-model pair in their main experiment, sometimes by wide margins. The tempting headline is that vector databases are overbuilt for coding agents. The better reading is sharper: when the answer is anchored in literal evidence, names, dates, file paths, function names, error strings, user preferences, grep gives the model a clean mechanical advantage. Embeddings are built to tolerate paraphrase, but tolerance has a cost. They can pull in semantically nearby clutter, especially when a short agent query is vague. Grep has the opposite failure mode. It is dumb, cheap, and narrow, but when the agent knows the right string to hunt for, dumb becomes a feature. The deeper finding is that retrieval is not a component you can benchmark in isolation. The same search method behaves differently depending on whether results are injected inline, written to files, routed through a CLI, or wrapped in a custom agent loop. So the question is not “Do we still need vector databases?” The question is whether your agent is solving a semantic discovery problem or an evidence-location problem. For coding agents, a surprising amount of work is evidence-location: find the symbol, trace the call, inspect the diff, read the failing test, recover the exact line. Vectors still matter at scale and for fuzzy conceptual search, but this paper weakens the lazy default that every serious agent stack begins with embeddings. Sometimes the upgrade is not a smarter index. 
Sometimes it is giving the model primitive tools, clean files, disciplined context, and a harness that lets exact search do exact work. ---- Paper link: arxiv.org/abs/2605.15184 Paper title: "Is Grep All You Need? How Agent Harnesses Reshape Agentic Search"
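The "evidence-location" case described in the post can be illustrated with a tiny grep-style tool. This is a minimal sketch, not the paper's harness: the function name `grep_history`, the `Hit` record, and the sample conversation are all hypothetical, and it only shows literal string matching over stored turns.

```python
import re
from dataclasses import dataclass


@dataclass
class Hit:
    turn: int   # index of the matching turn in the history
    text: str   # full text of that turn


def grep_history(turns, pattern, ignore_case=True):
    """Return every turn that contains `pattern` as a literal substring."""
    flags = re.IGNORECASE if ignore_case else 0
    # re.escape makes the search literal, like `grep -F`.
    rx = re.compile(re.escape(pattern), flags)
    return [Hit(i, t) for i, t in enumerate(turns) if rx.search(t)]


# Hypothetical conversation history with one distractor turn.
history = [
    "User: my flight to Lisbon is on 2024-03-14",
    "Assistant: noted, anything else?",
    "User: the build fails with ImportError: no module named foo",
]

hits = grep_history(history, "ImportError")
for hit in hits:
    print(hit.turn, hit.text)
```

The search succeeds only when the agent already knows the right string, which is the trade-off the post describes: a query such as "vacation" returns nothing here even though the Lisbon turn is semantically related.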
English
10
19
135
10.3K
Joseph Sirosh
Joseph Sirosh@josephsirosh·
@mcuban Don't take money from a highly competent and vibrant innovation ecosystem and give it to an incompetent, cronyist and bloated system of government.
English
0
0
0
66
Mark Cuban
Mark Cuban@mcuban·
We should federally tax tokens at the provider level. Not a lot. Less than 50c per million tokens. It will accomplish 4 things (at least): 1. It will push the big AI players to optimize tokenization, caching, routing and localization, which will 2. Reduce energy usage, saving them more in energy costs than what they paid in tax and reducing the strain created by the growth in energy consumption, which will 3. Generate maybe 10 billion dollars a year to start, but over the next ten years could grow 30x to 100x, which will 4. Create a source of funding to pay down the federal debt or deploy in response to the things AI brings that we don’t expect or don’t like. At some point the models will pass it on to customers. Of course. That’s ok. Customers will have the ability to choose between providers. Or to do everything using open source models locally. Thoughts?
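As a rough check on the figures in the post, the arithmetic below works out how many tokens per year a 50-cent-per-million tax would need to cover to raise 10 billion dollars. It is a back-of-envelope sketch: the constant and function names are illustrative, and 50 cents is used as the upper bound because the post says "less than 50c".

```python
TAX_PER_MILLION_TOKENS_USD = 0.50   # upper bound from the post
TARGET_REVENUE_USD = 10e9           # "maybe 10 billion dollars a year"


def tokens_needed(revenue_usd, tax_per_million_usd):
    """Tokens that must be taxed to raise `revenue_usd`."""
    return revenue_usd / tax_per_million_usd * 1_000_000


annual_tokens = tokens_needed(TARGET_REVENUE_USD, TAX_PER_MILLION_TOKENS_USD)
print(f"tokens per year: {annual_tokens:.2e}")
print(f"tokens per day:  {annual_tokens / 365:.2e}")

# The 30x to 100x growth range from the post, applied to the starting revenue.
for multiple in (30, 100):
    print(f"{multiple}x -> ${TARGET_REVENUE_USD * multiple / 1e9:,.0f}B per year")
```

At the 50-cent rate, the starting revenue implies about 2 × 10^16 taxed tokens per year; a lower rate would require proportionally more tokens.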
English
2.2K
255
4K
1.1M
Joseph Sirosh
Joseph Sirosh@josephsirosh·
@insane_analyst I don't get the Micron HBM4 comment, given that in their earnings they said "shipping in volume" and "designed for Nvidia", and somewhere they also mentioned the pin speed being 11+ Gbps. Old/dated info? Didn't they switch to the TSMC base die (partnership announced in 2025)?
English
0
0
2
2.3K
Joseph Sirosh reposted
Lakshya A Agrawal
Lakshya A Agrawal@LakshyAAAgrawal·
Learning from rich textual feedback (errors, traces, partial reasoning) beats scalar reward alone for LLM optimization. GEPA demonstrated this for context-space optimization (prompts and agent harnesses), delivering frontier results at a fraction of the cost of RL. But context-only optimization is bounded by the base model's capability ceiling; weight updates can reach further. Very excited about this new line of work on Fast-Slow Training (FST), which interleaves context and model weight optimization! The idea is a clean division of labor between two interleaved loops: 🔹 Fast loop (context): GEPA reads rich rollout feedback updating the context layer. The context becomes a fast-updating scratchpad of what the model needs to know about this task, right now. 🔹 Slow loop (model parameters): RL updates the model's parameters conditioned on the evolving context. Because the prompt already carries task-specific nuances, the model parameters are freed from absorbing them and focus on what actually generalizes across tasks and pushes the frontier. ⦁ 3× more sample-efficient than RL on math, code, and physics reasoning ⦁ ~70% lower KL divergence from base at matched accuracy ⦁ Plasticity preserved: FST checkpoints respond better to additional RL on new tasks than RL-only ones ⦁ Continual learning across changing tasks (HoVer → CodeIO → Physics) where RL stalls the moment the task switches FST is a direction towards: ⦁ Addressing RL's pain points: entropy collapse, sparse rewards, long-horizon exploration ⦁ Providing a clean channel for rich feedback into weight updates ⦁ Demonstrating model-harness co-evolution ⦁ Discovery: Using fast context updates for broad exploration, while leveraging a continually improving model. Check out the full thread below:
Kusha Sareen@KushaSareen

Can LLMs adapt continually without losing base skills? Fast-Slow Training (FST) pairs "slow" weights with "fast" context. FST vs. RL: • 3x more sample-efficient • Higher performance ceiling • Less KL drift (better plasticity) • Continual learning: succeeds where RL stalls

English
12
38
167
25K
Joseph Sirosh reposted
Nous Research
Nous Research@NousResearch·
Today we release Token Superposition Training (TST), a modification to the standard LLM pretraining loop that produces a 2-3× wall-clock speedup at matched FLOPs without changing the model architecture, optimizer, tokenizer, or training data. During the first third of training, the model reads and predicts contiguous bags of tokens, averaging their embeddings on the input side and predicting the next bag with a modified cross-entropy on the output side. For the remainder of the run, it trains normally on next-token prediction. The inference-time model is identical to one produced by conventional pretraining. Validated at 270M, 600M, and 3B dense scales, and at 10B-A1B MoE. The work on TST was led by @bloc97_, @gigant_theo, and @theemozilla.
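The input side of the idea, averaging the embeddings of a contiguous bag of tokens, can be sketched in a few lines. This is only an illustration under stated assumptions, not Nous Research's implementation: non-overlapping bags, dropping an incomplete trailing bag, and the names `superpose_bags` and `in_superposition_phase` are all hypothetical, and the modified cross-entropy on the output side is not reproduced here.

```python
def superpose_bags(token_embeddings, bag_size):
    """Average each run of `bag_size` consecutive embedding vectors into one.

    Assumption: bags do not overlap and an incomplete trailing bag is dropped.
    """
    if bag_size < 1:
        raise ValueError("bag_size must be >= 1")
    usable = len(token_embeddings) - len(token_embeddings) % bag_size
    bags = []
    for start in range(0, usable, bag_size):
        bag = token_embeddings[start:start + bag_size]
        dim = len(bag[0])
        bags.append([sum(vec[d] for vec in bag) / bag_size for d in range(dim)])
    return bags


def in_superposition_phase(step, total_steps):
    """True during the first third of training, as described in the post."""
    return step < total_steps // 3


# Toy 2-dimensional embeddings for five tokens (made-up values).
embeddings = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0], [2.0, 2.0], [9.0, 9.0]]
bags = superpose_bags(embeddings, bag_size=2)
print(bags)
```

With a bag size of 1 the function returns the embeddings unchanged, which matches the post's description of reverting to ordinary next-token training for the rest of the run.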
English
149
419
3.7K
434.4K
Joseph Sirosh reposted
George Pickett
George Pickett@georgepickett·
The best use case I’ve seen so far from /goal - lady has excess on her farm - tells codex to sell it - codex sells it
Shanee Moret@theshaneemoret

Been helping a handful of business owners with AI implementation. One of them lives on a farm and, for fun, runs a side business selling Dahlia tubers. She had 300 left to sell. To test Codex /goal mode we gave Codex a goal to sell 200 Dahlia tubers for Mother's Day. This was around 1pm ET on Saturday, May 9th. By Sunday morning Codex had exceeded the 200 goal and sold 208 tubers. She reset it again after it exceeded the goal and by Sunday at midnight Codex had made her ~$4K and had sold almost 300 tubers. Context: Codex had access to her email, Shopify, Facebook, local files and I gave it some guidance on the email cadence that I recommended up until Sunday at midnight. Because my client is a perfectionist, I challenged her to let Codex cook and to not be overbearing when it came to messaging and whatever it posted. After all it was a low-risk experiment before we start to apply this to the B2B side. And she did, she let Codex work. Codex posted on her Facebook, private Dahlia Facebook groups, and other places she didn't even think to post. Codex created all the copy and images. Codex sent previous customers custom links that were personalized with their names connected to coupon codes that expired at midnight. Codex even added nice touches that felt personalized when sending the emails like, "Can't wait to see what your first Dahlias look like in your garden," when it had context that it was this person's first time ever planting Dahlias (from the email threads). Codex replied to all customer questions via email correctly and without human intervention. During this process, Codex even protected my client from a phishing scam email that tried to pose as Shopify not being able to receive payment from customers. She is amazed and so am I. If you sell a product you would have to be insane to not be leveraging Codex /goal mode, especially for time-limited launches. Now it's time to test some goals in a higher stakes B2B sales environment.

English
1
6
131
27.2K
Joseph Sirosh
Joseph Sirosh@josephsirosh·
@fi56622380 @ShanghaoJin 65% growth in HBM per "disaggregated accelerator" is entirely possible. The disaggregation of the AI compute with specialized sub-accelerators for each phase of the compute relieves some key constraints.
English
0
0
0
51
fin
fin@fi56622380·
Great comment. I think we have converged; I don't see any discrepancies. Actually, 65% is way higher than my expectation: I'm expecting 40% annual size growth on the optimistic side. x.com/fi56622380/sta… Size doubles every generation (~2 years), e.g. Rubin Ultra 1TB, then Feynman 1TB (2x BW), then Feynman Ultra 2TB, then Feynman+ 2TB (BW doubles). Size doubles every 2 years and HBM BW doubles every 2 years; they take turns, so HBM size x HBM BW doubles every year. This is what I call the "tick-tock" upgrade strategy, which is covered in part 2 of my AI semiconductor endgame article.
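The alternating cadence described above can be simulated to confirm that the product doubles every year while size alone compounds at roughly 40% per year. This is a minimal sketch: the function name `tick_tock`, the 1 TB starting size, and the relative bandwidth unit are assumptions for illustration, and the "bandwidth first, then size" ordering follows the Rubin Ultra to Feynman example in the post.

```python
def tick_tock(size_tb=1.0, bw_rel=1.0, years=6):
    """Alternate yearly upgrades: bandwidth doubles, then size doubles.

    Returns one row per year: (year, size_tb, bw_rel, size_tb * bw_rel).
    """
    rows = []
    for year in range(years + 1):
        rows.append((year, size_tb, bw_rel, size_tb * bw_rel))
        if year % 2 == 0:
            bw_rel *= 2      # bandwidth generation
        else:
            size_tb *= 2     # capacity generation
    return rows


rows = tick_tock()
for year, size_tb, bw_rel, product in rows:
    print(f"year {year}: size={size_tb:g} TB, bw={bw_rel:g}x, size*bw={product:g}")

# Doubling size every 2 years is an annualized growth rate of sqrt(2) - 1.
annual_size_growth = 2 ** 0.5 - 1
print(f"annualized size growth: {annual_size_growth:.1%}")
```

Doubling every two years annualizes to about 41%, which is consistent with the 40% optimistic size-growth estimate in the post.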
fin@fi56622380

AI Semiconductor Endgame Scenario Analysis 2026 (II). Current topics: Why is HBM structurally likely to break away from traditional cyclicality and enter a growth-cycle paradigm? How will the upgrade cadence of HBM evolve? (A "tick-tock" rhythm: alternating generational upgrades in capacity/size and speed.) What kind of changes will this bring to the capex cost structure in the HBM supply and demand markets? In the internal "capex wars," why will HBM continue to dominate? Why will NVIDIA's biggest competitors in the future not be AMD, but rather Samsung Electronics, SK hynix, and Micron Technology? In the era of AI inference, will the GPU architectural path, which depends heavily on exponential growth in HBM, eventually come to a halt? If so, when? What about DDR and NAND going forward? Do they have any possibility of breaking free from traditional cyclicality?

English
1
0
3
903
Herman Jin
Herman Jin@ShanghaoJin·
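Pour even a little cold water on memory and you get flamed, so I'll say this one last time. I know the best DJ is coming on at 1:30am, but I already sleep badly, and I don't buy tickets that keep me awake. DDR is an undifferentiated commodity (HBM is not) that strictly follows the JEDEC standards, and it still accounts for the overwhelming majority of shipments. Gross margins surged in this rally because wafers can be switched freely between products, so DDR margins rose along with everything else. My objection is that a company selling a commodity cannot be given a growth valuation. Giving it a non-cyclical valuation means you are assuming either: 1. Demand is infinite "forever", or 2. From here on, HBM takes the overwhelming share of wafers "forever". Remember, this is not about 2030, it is about forever, so I won't look at the P/E and think it is cheap. At the very least, both of these points are highly debatable right now, aren't they? That is why I dare to be a die-hard bull on optics and on CPUs and GPUs, but I really can't get a read on memory.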
Herman Jin@ShanghaoJin

But if you want to buy SNDK/MU now You are basically showing up to the party at 1:30 AM

Chinese
51
33
420
130.9K
Joseph Sirosh
Joseph Sirosh@josephsirosh·
Great article. AI is forcing apps to focus on the core nucleus of their value - whether it be proprietary data/workflow knowledge, or network effects (e.g. teamwork or larger scale collaboration), proprietary and trusted integrations etc. And the value of the IP outside of these is going to zero, because models have acquired the knowhow to reproduce that IP.
Seema Amble@seema_amble

x.com/i/article/2054…

English
0
0
1
161
Joseph Sirosh
Joseph Sirosh@josephsirosh·
"1/2: HBM capacity per GPU = bit density per die × dies per stack × number of stacks per package × package/interposer/thermal budget Those multiplicative factors can easily drive exponential-like growth over a 10 year window" More loose talk from a non-engineer. First, the package/interposer/thermal budget can't simply be multiplied in: it does not create new capacity, but without it you can't create the capacity at all. Even adopting these very simple-minded ways of thinking: - Bit density increases 20% per year in a lumpy way (process transitions every 2-3 years). - Dies/stack increases 10% per year. - Stacks/package increases 25% per year. Multiply it out and you get about a 65% compound annual growth rate. Not "exponential" but compound growth, and only if CLEAN ROOM CAPACITY were not a constraint. This is the one overwhelming binding issue: it just cannot grow exponentially because of the physics of constructing cleanroom buildings. DRAM wafer capacity in the industry is growing at only 20% per year. While HBM may grow at 65% per year due to allocation shifts, overall memory capacity production for the industry grows at only 25% to 30% per year. And all of this capacity is competed for: by GPU, by CPU, by storage, and by mobile/edge. The supply-demand imbalance is crazy. But just like most things in life (e.g. power production), physics limits the compounding.
English
1
0
1
377
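The 65% figure in the post above follows from multiplying the three per-year growth factors. The sketch below reproduces that arithmetic and compares ten years of compounding at the HBM rate with the 20% DRAM wafer-capacity rate quoted in the same post; the dictionary keys and function names are illustrative only.

```python
from math import prod

# Per-year growth rates quoted in the post.
factors = {
    "bit_density": 0.20,
    "dies_per_stack": 0.10,
    "stacks_per_package": 0.25,
}


def combined_growth(rates):
    """Combined annual growth when independent factors multiply."""
    return prod(1 + r for r in rates) - 1


def capacity_multiple(rate, years):
    """Total multiple after compounding `rate` for `years` years."""
    return (1 + rate) ** years


hbm_rate = combined_growth(factors.values())
print(f"combined HBM growth per year: {hbm_rate:.0%}")
print(f"10-year multiple at HBM rate:   {capacity_multiple(hbm_rate, 10):.1f}x")
print(f"10-year multiple at wafer rate: {capacity_multiple(0.20, 10):.1f}x")
```

Over ten years the gap is large: roughly 150x at 65% per year against roughly 6x at 20% per year, which is why the post treats cleanroom and wafer capacity as the binding constraint.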
fin
fin@fi56622380·
Great comment, but I think this attacks a stronger claim than the essay actually makes. 1/2: HBM capacity per GPU = bit density per die × dies per stack × number of stacks per package × package/interposer/thermal budget Those multiplicative factors can easily drive exponential-like growth over a 10 year window 3: already explained in the article: Software/model optimizations and hardware scaling are two independent dimensions. Optimizations reduce bytes per token; hardware still has to keep pushing the token-throughput KPI every generation. One does not remove the need for the other. "What about software? Won't software optimization reduce bandwidth demand? Reduce HBM demand? This is an independent dimension from hardware. It's like asking: if software on a CPU runs faster after optimization, does that mean the CPU doesn't need to advance for ten years? After all, software is faster now. If that were the case, would CPU vendors still make money? For a CPU vendor to survive, there's only one path: in standardized benchmarks, ignoring software optimization, every new CPU generation must score higher, otherwise it doesn't sell." 4. Agree, exactly the same point is made in the article: HBM only takes care of the hot KV cache, exactly how a memory hierarchy works
English
1
1
4
651
Joseph Sirosh
Joseph Sirosh@josephsirosh·
"Agentic inference will gradually unbundle the GPU, which alternates between stranding high-bandwidth memory (during the prefill process) and stranding compute (during the decode process), in favor of increasingly sophisticated memory hierarchies dominated by high capacity and relatively lower cost memory types, with “good enough” compute; indeed, if anything it will be the speed of CPUs for things like tool use that will matter more than the speed of GPUs." In other words, extreme co-design of the hardware and the workload. Match the right compute logic to the right memory type and the right interconnect. The principle isn't by itself new in computer architecture - but when workloads like agentic compute scale up, specialization and division of labor in hardware become feasible/desirable economically.
Stratechery@stratechery

The Inference Shift Agentic inference is going to be different than the inference we use today, and it will change compute infrastructure because speed won't matter when humans aren't involved. stratechery.com/2026/the-infer…

English
0
0
0
235
Joseph Sirosh
Joseph Sirosh@josephsirosh·
Memory is consumed by a very large number of customers, not a handful like TSMC's production. If production capacity is constrained, one needs an allocation model that is fair to all customers of memory. How should they do that? Examples of choices: 1) Favoritism ("I love Jensen, so he gets X..."). Only my favorites get supply. 2) Market pricing (prices float freely and balance supply and demand): everybody can compete on price. 3) Contract pricing across different terms: contract duration, pre-pay and no-refunds, committed take, floor price with rider, etc. My own take is that the model is going to be similar to that of cloud providers. Pay-as-you-go with committed use discounts, and at large enough scale, negotiated contracts with enterprise sales. That's well-understood in the computing world.
English
0
0
0
242
Jukan
Jukan@jukan05·
Opinion: The memory Big Three should increase supply enough to bring margins down to around 60%. An 80% margin is unsustainable, and it should not be sustained.
English
70
54
1.2K
154.2K
Joseph Sirosh
Joseph Sirosh@josephsirosh·
Loose talk: "the HBM on a single GPU will grow exponentially." These statements are wrong.
1) Physics won't support any sort of exponential HBM growth on a "single GPU".
2) A DRAM cell is a single transistor plus a capacitor. Densities are near physical limits; process, stacking and packaging are driving the improvements, and these are just linear, not exponential.
3) Throughput improvements are not coming first and foremost from HBM capacity, or even from hardware. The biggest "exponential" improvements are from:
- disaggregated optimization of prefill, decode, speculative decoding
- innovations on attention (GQA/MQA/MLA)
- floating point formats and quantization (FP4, NVFP4)
- TurboQuant-type optimizations
- sparse attention and sparsity exploitation of various types
- MoE and other architectural variations
4) The essay's airport-shuttle metaphor implicitly assumes everything in the cache is hot. In reality:
- Prefix KV (system prompts, tool definitions, retrieved documents): shared across thousands of sessions, paged in once
- Active decoding KV: must be in HBM
- Paused / waiting-on-tool-call KV: can sit in DRAM or flash for seconds to minutes
- Completed-turn KV: staged out to SOCAMM2/DDR/Flash, recalled on next turn
The optimization isn't "make HBM bigger to hold all the KV." It's "make the memory hierarchy deep enough that only the actively-decoding fraction lives in HBM."
What is true is that:
- The KPI shift is real and structural. The promotion of memory from "embellishment" to "primary KPI driver" is a one-way change in semiconductor economics. This raises the structural memory-content-per-AI-dollar of capex, which raises the floor of the memory cycle.
- Memory hierarchy depth is expanding in both directions. HBM grows for hot state; LPDDR grows for warm state (SOCAMM2, MR-DIMMs); high-performance NAND grows for cold/paged state (Nvidia ICMS).
- The thesis isn't "HBM scales exponentially forever"; it's "agentic workloads force the entire memory hierarchy to scale together."
- The insatiable demand allows memory makers to negotiate long-term committed contracts like TSMC for AI-specific products (HBM/SOCAMM2/Server RDIMM/MRDIMM/Server NAND SSD), giving high predictability and visibility of forward revenue. They become more like what SaaS stocks used to be, with similarly high gross margins.
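Point 4 above, that only the actively decoding fraction of KV cache needs to live in HBM, can be sketched as a simple placement table. Everything concrete here is an assumption for illustration: the state names, the choice of DRAM for paused sessions and flash for completed turns (the post allows either for both), keeping one shared prefix copy in HBM, and the session counts and sizes.

```python
from dataclasses import dataclass
from enum import Enum


class KVState(Enum):
    ACTIVE_DECODE = "active_decode"
    PAUSED_ON_TOOL = "paused_on_tool"
    COMPLETED_TURN = "completed_turn"


# Hypothetical placement policy: one tier per KV state.
TIER = {
    KVState.ACTIVE_DECODE: "HBM",
    KVState.PAUSED_ON_TOOL: "DRAM",
    KVState.COMPLETED_TURN: "FLASH",
}


@dataclass
class Session:
    state: KVState
    kv_gb: float  # per-session KV cache size in GB


def usage_per_tier(sessions, shared_prefix_gb):
    """Sum KV cache per tier; the shared prefix is counted once, in HBM."""
    usage = {"HBM": shared_prefix_gb, "DRAM": 0.0, "FLASH": 0.0}
    for s in sessions:
        usage[TIER[s.state]] += s.kv_gb
    return usage


# Made-up workload: 1,000 sessions at 2 GB each, 10% actively decoding.
sessions = (
    [Session(KVState.ACTIVE_DECODE, 2.0)] * 100
    + [Session(KVState.PAUSED_ON_TOOL, 2.0)] * 300
    + [Session(KVState.COMPLETED_TURN, 2.0)] * 600
)
tiered = usage_per_tier(sessions, shared_prefix_gb=4.0)
all_in_hbm = 4.0 + sum(s.kv_gb for s in sessions)
print(tiered, "vs all-in-HBM:", all_in_hbm, "GB")
```

Under these made-up numbers the HBM footprint is about a tenth of the all-in-HBM case, while DRAM and flash absorb the rest, which is the "whole hierarchy scales together" argument in miniature.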
English
1
1
6
585
fin
fin@fi56622380·
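As long as transformer + KV cache remains the architecture, the HBM on a single GPU will grow exponentially. As long as tokens keep growing exponentially, that is essentially equivalent to your two points: 1. Demand is effectively near-infinite. 2. HBM will keep taking the dominant share of GPU cost. A few generations from now, HBM per GPU may reach 8 TB, with HBM accounting for 90% of the whole GPU's cost. See the quoted post for the technical reasoning; criticism welcome. x.com/fi56622380/sta… So the key may not lie in the two points you raise, because both can be derived from first principles. The more important premise is whether tokens can really keep growing exponentially for many years.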
fin@fi56622380

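AI Semiconductor Endgame Scenario Analysis 2026 (I): When the new token-economics paradigm shifts from GPU compute to HBM
Starting from the essence of how GPU architecture evolves, this article addresses a question the market has long worried about: why must the HBM requirement of each GPU grow exponentially, and why will that exponential growth in HBM demand not stall? It also derives the first principle of token economics under the current architecture: token throughput = HBM size × HBM bandwidth. And it discusses why the GPU's ceiling is set by these two dimensions of HBM development.
HBM cyclicality has always been contentious. Optimists believe AI-driven demand is far larger than before, but the market mainstream still holds that previous upcycles also saw demand grow 20%+ per year, so what is different this time? In that view, AI does not change the fact that HBM, like conventional DRAM, has commodity properties: once capacity expansion at peak demand meets a demand downturn, history repeats itself. We can break this question down from the perspective of compute-chip architecture and from first principles: why this time really is different.
-------
History: the CPU compute era
For a long time we lived in an era in which CPUs dominated compute. The CPU's top-level KPI was performance, i.e. running faster, so every generation used all kinds of methods to raise benchmark scores: first rising clock frequency, later architectural advances such as superscalar design.
Why did DDR not need rapid technical progress in that period? Going from DDR3 to DDR5, for example, took as long as 15 years. The reason is that DDR's role was purely auxiliary, and its help was weak: by industry rule of thumb, even doubling DDR speed typically improved CPU performance by less than about 20%.
Why did higher DDR bandwidth help so little? Two reasons. 1. CPUs were designed with architectural techniques that hide DDR latency, such as superscalar execution, wider issue width, and large ROBs and register renaming to raise parallelism, plus L1 and L2 caches, all of which weakened the demand for DDR bandwidth. 2. CPU workloads do not need much DDR bandwidth. For most everyday workloads, such as opening a web page, DDR bandwidth is in serious surplus, and this holds even for cloud workloads.
In other words, in the CPU era DDR bandwidth did not matter much. DDR4 and DDR5 made little difference outside a few games, and even the JEDEC standards advanced slowly. In addition, most apps keep only a small portion resident in DDR and load the rest from disk when needed. App sizes did not grow that fast, so demand for DDR capacity also grew slowly. Over the past decade, average DDR capacity per computer went from roughly 7-8 GB to 23 GB, only a 3× increase in ten years.
This slow upgrade pace hit revenue directly. Pricing by capacity is the main way to make money; higher speed is just a technology upgrade that raises the unit price of capacity. Upgrade demand for both was small, and demand grew mainly with the number of PCs and phones. So on both dimensions, bandwidth and capacity, DRAM was always a nice-to-have accessory in the chip industry. The marginal utility of a DDR upgrade was low and had almost no direct link to the top KPI of the CPU era.
-------
In the new era dominated by genAI large models, the shift in computing paradigm has fundamentally changed the top-level KPI. Now that GPUs have entered the age of AI inference, it is no longer only about benchmark scores as with CPUs. The top KPI is no longer compute (TOPS/FLOPS) but the cost of a token, specifically overall token throughput per unit cost and per unit of power. Next comes token speed, because in the agent era many tasks have become serial, and token speed has become a major bottleneck for user experience. This is why Jensen Huang coined the AI factory concept: output the most tokens at the lowest cost while pushing token speed as high as possible.
In the AI training era, Jensen's economics was TCO (total cost of ownership): the more GPUs you buy, the more you save. His token economics in the inference era is different: AI inference gross profit is substantial, so the logic has become that Nvidia GPUs make tokens cheaper than anything else in the world, and the more GPUs you buy, the more you earn.
The top KPI has become a Pareto frontier curve, optimized as far as possible along two dimensions: token throughput and token speed (see Figure 1). Generational progress in NVIDIA's token factory is really a matter of pushing the whole Pareto frontier up and to the right. This is the most important KPI of the AI inference era.
-------
Next comes the most important chain of logic in this article: how to derive, from the exponential growth of token throughput, that the ceiling lies in the exponential growth of HBM size and HBM bandwidth.
In the era of single-GPU, single-stream inference at batch size = 1, token throughput had only one dimension, HBM bandwidth: the higher the bandwidth, the higher the throughput. But in the NVL72 era, inference is no longer a single-GPU affair. It is a system-level token factory of 72 GPUs + 36 CPUs that saturates HBM bandwidth and compute to extract maximum token throughput.
Growth in token throughput depends on two things: the number of concurrently batched requests × the average token speed per user request, i.e. batch size × per-user token speed. Take Rubin NVL72 as an example: at an average token speed of 100 tokens/s with 1,920 requests batched concurrently, token throughput is 192,000 tokens/s. A Rubin NVL72 draws roughly 120 kW (0.12 MW), so it processes 1.6M tokens/s per MW (see Figure 1).
So we need to find every way to raise these two parameters, batch size and average per-user token speed. Their product is our top KPI, token throughput.
-------
First parameter: batch size growth, bottlenecked by HBM size
Every request in the batch carries its own KV cache, which has to be stored in HBM and ranges from a few GB to tens of GB. Because hot KV cache must be read at high frequency and high speed at all times, it has to sit in HBM. For example, if a large model has 80 layers, generating each token requires reading the KV cache in HBM 80 times. As batch size grows, hot KV cache grows linearly. And because the hot KV cache of every request in the batch has to sit in HBM, HBM size must grow linearly with batch size. It is like an airport shuttle bus carrying passengers from the gate to the plane as fast as possible: a smaller HBM is a smaller shuttle that has to make an extra trip. Conclusion: batch size is bottlenecked by growth in HBM size.
-------
Second parameter: average token speed per user request, bottlenecked by HBM bandwidth
The decode speed of a large model is bottlenecked by HBM bandwidth, because generating each token requires reading the active weights and the KV cache many times. When batches are not too large, the LPU moves the active weights into SRAM, but each generated token still requires many reads of KV cache from HBM. The higher the HBM bandwidth, the faster each token is generated, in roughly linear correspondence. In the shuttle analogy, HBM bandwidth is the width of the shuttle's doors: the wider the doors, the faster passengers board. The rest of the GPU's configuration exists to accommodate batch growth and to keep token compute balanced with HBM growth; surplus compute is even spent to gain some bandwidth (for example, some bandwidth compression techniques).
-------
In the shuttle analogy:
Shuttle cabin size = HBM size (capacity): it determines how many passengers fit at once (i.e. how many requests' KV caches fit simultaneously). The larger the cabin, the more passengers (batch size) per trip. If the bus is too small, carrying 100 people takes two trips and overall system throughput cannot rise.
Shuttle door width = HBM bandwidth: it determines how fast passengers get on and off. With wide doors, everybody piles on at once (decode/token generation is very fast). With narrow doors, even a huge cabin that holds 200 people leaves everyone queuing to squeeze on one at a time, and all the time is spent boarding.
Passenger throughput = shuttle cabin capacity × boarding speed (door width)
-------
At this point we have logically derived the hardware first principle of token economics: Token throughput = HBM size × HBM bandwidth. The top KPI of the AI inference era depends heavily on progress along these two dimensions of HBM. Sustaining 2× growth in token throughput per generation actually means that, on each GPU, the product HBM size × HBM bandwidth must double every generation! This is also the first time in history that HBM size can affect the top KPI, token throughput!
To verify this theory, plot token throughput for the Nvidia generations from A100 to Rubin Ultra alongside HBM size × HBM bandwidth on the same chart (see Figure 2). The two curves track each other strikingly closely on a log axis. HBM size × HBM bandwidth even grows faster than token throughput. After all, HBM sets the ceiling, and utilization of that ceiling rarely reaches 100%: even if HBM size × HBM bandwidth grows 1000×, the rest of the compute and architecture can hardly squeeze out the full 1000× of potential. This curve is no coincidence; it is the necessary solution of system optimization. throughput = batch × bandwidth: this is the inescapable first principle of token factory economics.
-------
What about software? Won't software optimization reduce bandwidth demand? Reduce HBM demand? This is an independent dimension from hardware. It's like asking: if software on a CPU runs faster after optimization, does that mean the CPU doesn't need to advance for ten years? After all, software is faster now. If that were the case, would CPU vendors still make money? For a CPU vendor to survive, there is only one path: in standardized benchmarks, ignoring software optimization, every new CPU generation must score higher, otherwise it doesn't sell. GPUs are the same: how well the software is optimized, and the need for the GPU's own token-throughput KPI to improve substantially every year, are two separate matters. As long as demand for tokens keeps growing, the pursuit of token throughput will never stop, and neither will the pursuit of HBM size × HBM bandwidth. If HBM size and HBM bandwidth develop too slowly, Jensen will personally go to the big three memory makers and press them to upgrade their technology, because this is the ceiling on his GPUs. If the ceiling is nailed in place, can his GPUs still sell? Of course, Nvidia has to rack its brains to extract gains beyond the HBM ceiling through heterogeneous computing architecture. The LPU is a good attempt: it improves the Pareto frontier considerably from another angle (the right half, the high-token-speed part).
-------
HBM has left behind the old era of drifting with the tide. On this one-way road paved by exponential demand, it has walked, in an almost fated way, onto the center of the industry's main stage. With the first principles of the inference paradigm having evolved to this point, as long as Jensen still wants to sell GPUs, HBM must double, and must double every generation. This is endogenous supply-side pressure, unrelated to AI demand, to the macro cycle, or to the mood of the hyperscalers. Only one question remains: when demand is physically locked into exponential growth, will the three supply-side players, as in the past thirty years, drag themselves back into the cyclical mire with their own hands?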
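The essay's Rubin NVL72 example (1,920 concurrent requests at 100 tokens/s on a system drawing about 0.12 MW) reduces to two multiplications and a division. The sketch below reproduces that arithmetic; the function names are illustrative, and the 5 GB per-request KV cache is a hypothetical value chosen from the essay's "a few GB to tens of GB" range.

```python
def token_throughput(batch_size, tokens_per_sec_per_user):
    """System throughput = concurrent requests x per-user token speed."""
    return batch_size * tokens_per_sec_per_user


def throughput_per_mw(throughput, power_mw):
    """Tokens per second delivered per megawatt of system power."""
    return throughput / power_mw


def hot_kv_gb(batch_size, kv_gb_per_request):
    """Hot KV cache that must fit in HBM, growing linearly with batch size."""
    return batch_size * kv_gb_per_request


# Figures quoted in the essay for a Rubin NVL72 rack.
throughput = token_throughput(batch_size=1920, tokens_per_sec_per_user=100)
per_mw = throughput_per_mw(throughput, power_mw=0.12)
print(f"throughput: {throughput:,} tokens/s")
print(f"per MW:     {per_mw:,.0f} tokens/s")

# Hypothetical 5 GB of KV cache per request, spread over 72 GPUs.
total_kv = hot_kv_gb(1920, kv_gb_per_request=5.0)
print(f"hot KV: {total_kv:,.0f} GB total, {total_kv / 72:,.1f} GB per GPU")
```

Doubling the batch size in `hot_kv_gb` doubles the HBM needed, which is the essay's capacity argument; the counter-argument elsewhere in the thread is that not all of that KV cache is hot.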

Chinese
14
52
291
136.2K
Joseph Sirosh reposted
COATUE
COATUE@coatuemgmt·
Memory is the new bottleneck. Nick Gagnet, Coatue Sector Head, on the AI infrastructure shift and why memory demand could 5x in 5 years.
English
99
250
2.4K
1.9M
Joseph Sirosh reposted
Yann LeCun
Yann LeCun@ylecun·
@eladgil BS. Attention was born in Montréal. PyTorch in NYC. AlphaGo in London. AlphaFold in London. ESMFold in NYC. Llama 1 in Paris. Llama 2 in Paris+NYC+SV. DeepSeek in Hangzhou. Plus: DINO in Paris. JEPA in Montréal+Paris+NYC. SV is 3 mos ahead on topics SV is singularly obsessed with.
English
183
492
7.8K
726.7K
Joseph Sirosh
Joseph Sirosh@josephsirosh·
Right - a large fraction of all generated tokens (except thinking tokens) are stored in the memory hierarchy. This is indeed the crazy compounding of memory that's particularly acute for Agentic AI. Memory has become structurally as valuable as logic in computing. Even algorithmic improvements like Turboquant can't fight a compounding acceleration - such improvements are just a one time step improvement - but compounding continues on top. Also, we don't have similar improvements for the CPU calcs done with tool calls.
English
0
0
0
83
Trade Whisperer
Trade Whisperer@TradexWhisperer·
$MU $SNDK $DRAM Memory is now viciously cyclical. In the most bullish way possible. We used to store photos, videos, and games in memory. Now HBM helps to generate all of that. Full movies. Full games. Entire synthetic worlds created by billions of people on earth. Paradigm shift right? Wait for the crazier part. And where does all that newly generated content go? Back into the conventional memory & storage. Every single byte of it. The loop never closes. This cycle has no fucking exit. Let that sink in.
English
32
28
362
43.8K
Joseph Sirosh
Joseph Sirosh@josephsirosh·
@dmweisberger @AOC Right. The same way communist countries built an authoritarian, corrupt, state-controlled economy. The answer is truly less corruptible state-managed overhead.
English
0
0
0
21
Dave W
Dave W@dmweisberger·
This is a classic case of presenting FACTS, but making the WRONG conclusion. You are 100% correct that CRONYISM, where LARGE companies DONATE (bribe) politicians to skew regulations in their favor is a MAJOR problem in the U.S. The CORRECT answer, however, is LESS regulation and state control, which is most often used to help the large companies. YOUR answer of MORE state control, particularly, if money in politics stays, is certain to make the problem worse. If, in addition, you remove the incentives to innovate and take risks, then the incumbent large companies will dominate even more.
English
3
2
61
4K
Alexandria Ocasio-Cortez
Someone can certainly *make* a billion dollars. That’s not the same thing as earning. Growing fast and disrupting markets also often means chasing and wielding market power, political influence, and scale. Take Airbnb. They heavily lobby politicians against passing housing laws to protect working class residents because it’s bad for their business model. Airbnb could not exist at its current scale and size without the housing market destabilizations, displacements, and exploits that are supercharging the evictions of working people everywhere from Puerto Rico to Jackson Hole. Now young people are planning for a future where they will never be able to afford to own a home while others have 20 and live off renting it out to them at extortionate rates with zero protections. Yes, a tiny amount of people can make billions of dollars doing that. And millions of everyday Americans are bearing the cost.
Paul Graham@paulg

Sure you can earn a billion dollars. I've been teaching people how to do it for 20 years. The way you do it is to start a company that grows fast. You don't have to do anything bad to make a company grow fast. You just have to make something people want. paulgraham.com/ace.html

English
5.8K
2.1K
17.9K
4M
Joseph Sirosh
Joseph Sirosh@josephsirosh·
@AOC Every hit artist, every technology innovation in history created enormous wealth. These types of ignorant arguments are why the current Democratic leadership loses the faith of even its strongest Democratic supporters.
English
0
0
0
1.2K
Joseph Sirosh reposted
Tim Ferriss
Tim Ferriss@tferriss·
"The single most important thing for anybody wanting to break into any industry is go to the headquarters or cluster of that industry. Move to wherever that thing is. And all the advice that you can do anything from anywhere and everything's remote is all BS. With AI, 91 percent of private technology market cap is in the Bay Area. Ninety-one percent of the entire global set of AI market cap is all in one 10 by 10 area." — Elad Gil Listen to my interview with @eladgil: tim.blog/2026/04/29/ela…
English
67
183
1.8K
684.8K