Cooper Leong

593 posts

@cooperleong22

PhD student @HongKongPolyU LLM Mechinterp/Safety

Joined October 2021

2.4K Following · 271 Followers

Pinned Tweet

Cooper Leong@cooperleong22·28 Feb

Explore LLM Interpretability with this comprehensive resource compilation: github.com/cooperleong00/… 📚 Tutorials, libraries, surveys, papers, blogs & more! 📂 Categorized for easy navigation 🔄 Continually updated 🗨️ Your thoughts & feedback are welcome! #NLProc #LLM

Cooper Leong@cooperleong22·7m

👀

Tibo@thsottiaux

Codex team is aware of reports of GPT-5.5 performing worse for some users and investigating. We don't have anything conclusive yet and systems are healthy but we will share updates as we go.

Cooper Leong@cooperleong22·1d

The hard part isn’t doing the right thing. It’s making fewer mistakes while staying sustainable across longer time horizons and larger scales — in data, infrastructure, and life.

Cooper Leong retweeted

Will Held@WilliamBarrHeld·3d

To train better open models, we need predictable scaling. Delphi is Marin’s first step: we pretrained many small models with one recipe, then extrapolated 300× to predict a 25B-param / 600B-token run with just 0.2% error. Getting there took some work 🧵

Cooper Leong retweeted

Lee Sharkey@leedsharkey·5 May

My team at @GoodfireAI has been cooking up a new way to do interpretability: decompose a language model’s weights, not its activations. Our decomposition natively handles attention (!) and behaves less like a lookup table and more like a generalizing algorithm. (1/6)

Cooper Leong@cooperleong22·4 May

The assumption that a token embedding is trained to contain information about its own token is questionable if you consider that the goal of feeding it in is to predict the next token.

Muyu He@HeMuyu0327

I am now highly skeptical of the claim that adding the token embedding to deeper layers improves the model by "preserving the original token information", and think that the reason it improves at all is much simpler.

How the hypothesis was made. It was proposed in the Value Residual Learning Paper based on the **fact** that if you add the first value vector v1 / token embedding x0 to deeper layers' value vector / residual stream with equal weight (0.5 * v1 + 0.5 * v), the model's validation loss improves significantly. And we later found that adding **any** linear transformation of x0 helps just as much.

Ablation setup. If the model truly improves because deeper layers have access to the original x0 information, then this ablation should not change the model performance: killing the gradient of x0 when it is added to subsequent layers in this extra path. Since x0 (and the embedding layer) will receive regular updates via the standard computation path, x0 will always be able to supply token information to deeper layers, and deep layers' attention modules can learn to use it properly. Therefore, for both adding x0 to later x and adding a linear transformation of x0 to later v, we run an ablation that detaches/kills x0's gradient on the extra path, leaving the forward pass unchanged.

Experiment result. In both ablations, we find that most of the improvement is gone. Although the model has access to perfectly valid x0 information in deeper layers and can update attn/MLP weights to utilize it, it never recovers most of the benefits we see in the baseline. This seems to suggest that "value residual learning" mostly (not entirely) works not because valuable x0 info is passed down, but because there is some benefit to **the embedding layer** from adding x0 to deeper layers.

There might be two ways the embedding layer can benefit. One is just pure gradient benefits: that value residual learning is some advanced form of residual connection that handles vanishing gradients better. Need to do some math to see if this holds. The other is that the forward pass set up in this way updates the embedding space in a meaningful way, so tokens can have a more optimal representation. Next up will want to do ablations to test both hypotheses. And of course I might just have missed something simple.
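
To make the ablation concrete, here is a rough sketch of what detaching x0 changes, written for the residual-stream variant. The notation is an assumption introduced for illustration and does not come from the thread: $\lambda$ is the mixing weight, $S$ is the set of layers that receive the extra path, $\operatorname{sg}$ is the stop-gradient (detach) operator, $g_\ell$ is the backpropagated gradient at the modified input $\tilde{x}_\ell$, and $G_{\text{std}}$ is whatever gradient reaches $x_0$ through the ordinary computation path.

```latex
% Baseline: x_0 is added to deeper layers and receives gradient from every extra path.
% Ablation: same forward values, but the extra-path gradient to x_0 is cut.
\begin{aligned}
\text{baseline:}\quad & \tilde{x}_\ell = x_\ell + \lambda\, x_0 \quad (\ell \in S),
  & \frac{d\mathcal{L}}{d x_0} &= G_{\text{std}} + \lambda \sum_{\ell \in S} g_\ell, \\
\text{ablation:}\quad & \tilde{x}_\ell = x_\ell + \lambda\, \operatorname{sg}(x_0) \quad (\ell \in S),
  & \frac{d\mathcal{L}}{d x_0} &= G_{\text{std}}.
\end{aligned}
```

At identical weights the two setups produce the same forward activations, so under this reading the only difference is the extra term $\lambda \sum_{\ell \in S} g_\ell$ in the embedding's update. That is consistent with the thread's suggestion that the benefit flows to the embedding layer, though it does not by itself distinguish between the two hypotheses.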

Cooper Leong retweeted

Yanshu Li✈️ICML2026@karrsen0713·30 Apr

Accepted at ICML 2026! Take Action for a Better RLVR!

MikaStars★@MikaStars39

Stop using LoRA for RLVR!!! New paper released👉Evaluating Parameter Efficient Methods for RLVR 📖Alphaxiv: alphaxiv.org/abs/2512.23165 💻Github: github.com/MikaStars39/Pe…

Is standard LoRA truly the optimal choice for Reinforcement Learning? We present the first large-scale evaluation of over 12 PEFT methodologies using the DeepSeek-R1-Distill family on complex mathematical reasoning benchmarks.

Key Finding: Standard LoRA is suboptimal. Structural variants such as DoRA, AdaLoRA, and MiSS consistently outperform standard LoRA. Notably, DoRA (46.6% avg. accuracy) even surpasses full-parameter fine-tuning (44.9%) across multiple benchmarks.

The failure of SVD-based initialization. Strategies like PiSSA and MiLoRA experience significant performance degradation or total training collapse. This is due to a fundamental "spectral misalignment": these methods force updates on principal components, while RLVR intrinsically operates in the off-principal regime.

The Expressivity Floor. While RLVR can tolerate moderate parameter reduction, extreme compression (e.g., VeRA, IA³, or Rank-1 adapters) creates an information bottleneck. Reasoning tasks require a minimum threshold of trainable capacity to successfully reorient policy circuits.

Recommendations for the community:
a. Move beyond the default adoption of standard LoRA.
b. Prioritize geometry-aware adapters like DoRA that decouple magnitude and direction.
c. Avoid SVD-informed initializations for RL tasks.

Cooper Leong@cooperleong22·1 May

👀

Xiaoyin Qu@quxiaoyin

I can’t believe I stopped using Claude Code Max and now use DeepSeek and Hermes entirely. It’s so fast, so so fast, 3x faster for the same task. So cheap. I spent $5 last week and never need to worry about being rate limited or hitting usage limits every two hours. For most tasks it’s perfectly good enough.

Cooper Leong retweeted

Goodfire@GoodfireAI·30 Apr

Introducing Silico: the platform for building AI models with the precision of written software. Silico lets researchers and engineers see inside their models, debug failures, and intentionally design them from the ground up. Early access is open now. 🧵(1/10)

Cooper Leong retweeted

Bojie Li@bojie_li·29 Apr

Closed labs hide model sizes. They can't hide what their models know, and what a model knows is an indicator of how big it is. Reasoning compresses. Factual knowledge doesn't. So you can size a frontier model from black-box API calls alone, and across releases you can literally watch a single fact arrive in the parameters over time.

For three years, my friends Jiyan He and Zihan Zheng have been asking frontier LLMs the same question: "what do you know about USTC Hackergame?", a CTF contest. May 2024: GPT-4o invented fake titles. Feb 2025: Claude 3.7 Sonnet listed 19 verified 2023 challenges. By April 2026, frontier models recall specific challenges across consecutive years.

After DeepSeek-V4 dropped, I instructed my agent to spend four days autonomously turning that habit into Incompressible Knowledge Probes (IKP): 1,400 questions, 7 tiers of obscurity, 188 models, 27 vendors. Three findings:

1/ You can approximately size any black-box LLM from factual accuracy alone. Penalized accuracy is linear in log(params), R² = 0.917 on 89 open-weight models from 135M to 1.6T params. Project closed APIs onto the curve → GPT-5.5 ~9T, Claude Opus 4.7 ~4T, GPT-5.4 ~2.2T, Claude Sonnet 4.6 ~1.7T, Gemini 2.5 Pro ~1.2T (90% CI: 0.3-3x size).

2/ Citation count and h-index don't predict whether a frontier model recognizes a researcher. Two researchers with similar citation profiles get very different responses. Models memorize impact: work that shaped a field, not many incremental papers.

3/ Factual capacity doesn't compress over time. Across 96 open-weight models released over 3 years, the IKP time coefficient is statistically zero, rejecting the Densing-Law prediction of +0.0117/month at p<10⁻¹⁵. Reasoning benchmarks saturate; factual capacity keeps scaling with parameters.

Website: 01.me/research/ikp/ Paper: arxiv.org/pdf/2604.24827
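
The sizing step in finding 1/ amounts to fitting a line on open-weight models and inverting it for a closed one. A minimal sketch of that idea, assuming accuracy is modeled as linear in log10(parameter count); the function names and the calibration points are made up for illustration and are neither IKP data nor necessarily the paper's exact procedure.

```python
import math

def fit_log_linear(params, accuracy):
    """Ordinary least-squares fit of accuracy = a + b * log10(params)."""
    xs = [math.log10(p) for p in params]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(accuracy) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, accuracy))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

def estimate_params(acc, a, b):
    """Invert the fitted line to get a point estimate of parameter count."""
    return 10 ** ((acc - a) / b)

# Hypothetical calibration set: (parameter count, penalized accuracy).
open_params = [1e9, 1e10, 1e11, 1e12]
open_acc = [0.10, 0.25, 0.40, 0.55]

a, b = fit_log_linear(open_params, open_acc)

# Hypothetical closed model whose measured accuracy is 0.475.
closed_size = estimate_params(0.475, a, b)
print(f"slope={b:.3f}, intercept={a:.3f}, estimated params={closed_size:.3g}")
```

Because the estimate is an exponential of a fitted quantity, small accuracy errors turn into large multiplicative size errors, which fits with the wide 0.3-3x interval quoted in the post.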

Cooper Leong retweeted

Z.ai@Zai_org·30 Apr

Scaling laws push model capability forward. But whether that capability becomes reliable in production depends on how we handle Scaling Pain. z.ai/blog/scaling-p… In our latest blog, we share how we debugged GLM-5 serving at scale: reproducing rare garbled outputs, repetition, and rare-character generation; tracing and eliminating KV Cache race conditions; fixing HiCache synchronization issues; and introducing LayerSplit for up to 132% throughput improvement. We hope these lessons help the community avoid similar pitfalls and build more robust inference infrastructure.

Cooper Leong retweeted

fin@fi56622380·29 Apr

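AI Semiconductor Endgame Projection 2026 (I): when the new token-economics paradigm shifts from GPU compute to HBM

Starting from the essence of how GPU architecture evolves, this post addresses a question the market has long worried about: why must HBM demand per GPU grow exponentially, and why won't that exponential growth stall? It also derives the first principle of token economics under the current architecture, token throughput = HBM size × HBM bandwidth, and discusses why the GPU's ceiling is set by these two dimensions of HBM progress.

HBM cyclicality has always been contentious. Optimists argue that AI-driven demand is far larger than before, but the market mainstream still holds that earlier upcycles also saw demand grow 20%+ per year, so what is different this time? In that view AI does not change the fact that HBM, like conventional DRAM, is a commodity: once capacity expansion at peak demand meets a downturn, the industry repeats its old mistakes. We can break this question down from the perspective of compute-chip architecture and first principles: why this time really is different.

History: the era of CPU compute

For a long time compute was dominated by CPUs, whose top-level KPI was performance: running faster. Every CPU generation used whatever means it could to raise benchmark scores, first through higher clock frequency, later through architectural advances such as superscalar execution.

Why did DDR not need rapid technical progress in that period? Going from DDR3 to DDR5, for example, took a full 15 years. The reason is that DDR's role was purely auxiliary, and a weak auxiliary at that: by industry experience, even doubling DDR speed typically improved CPU performance by less than about 20%.

Why did higher DDR bandwidth matter so little? Two reasons:
1. CPUs were designed with many architectural features to hide DDR latency: superscalar execution, wider issue width, large ROBs and register renaming to raise parallelism and hide latency, plus L1 and L2 caches, all of which reduced the need for DDR bandwidth.
2. CPU workloads do not demand much DDR bandwidth. For most everyday workloads, such as opening a web page, DDR bandwidth is heavily over-provisioned, and the same holds even for cloud workloads.

In other words, in the CPU era DDR bandwidth did not matter much. Apart from a few games there was little difference between DDR4 and DDR5, and even the JEDEC standards advanced slowly. In addition, most apps keep only a small portion resident in DDR and load the rest from disk when needed; app size did not grow very fast, so demand for DDR capacity also grew slowly. Over the past decade the average DDR capacity per computer went from roughly 7-8 GB to 23 GB, only a 3× increase in ten years.

This slow upgrade pace directly affected revenue. Pricing by capacity is the main way memory makes money; higher speed is just a technology upgrade that raises the price per unit of capacity. Demand for upgrading either was weak, so demand grew mainly with the number of computers and phones.

So in both bandwidth and capacity, DRAM was always a nice-to-have accessory of the chip industry. The marginal utility of DDR upgrades was low, with almost no direct link to the top KPI of the CPU era.

The new top KPI

In the new era led by generative-AI large models, the shift in computing paradigm has fundamentally changed the top KPI. In the age of AI inference, GPUs are no longer judged on benchmark scores the way CPUs were. The top KPI is no longer compute in TOPS/FLOPS but the cost of a token, specifically overall token throughput per unit cost and per unit power. Next comes token speed: in the agent era many tasks have become serial, so token speed is a major bottleneck for user experience. This is why Jensen Huang coined the AI factory concept: output the most tokens at the lowest cost while raising token speed as much as possible.

In the AI training era, Jensen's economics was TCO (total cost of ownership): the more GPUs you buy, the more you save. His token economics for the inference era is different. Gross margins on AI inference are substantial, so the logic has become: Nvidia GPUs give the cheapest price per token in the world, so the more GPUs you buy, the more you earn. The top KPI has become a Pareto frontier, optimized along two dimensions, token throughput and token speed (see Figure 1).

NVIDIA's generation-over-generation token factory progress is really about pushing the whole Pareto frontier up and to the right. That is the most important KPI of the AI inference era.

The core chain of reasoning

Next comes the most important chain of reasoning in this post: how, starting from the nature of exponential growth in token throughput, we derive that the ceiling is set by exponential growth in HBM size and HBM bandwidth.

In the era of single-GPU, single-stream inference with batch size = 1, token throughput had only one dimension: HBM bandwidth. The higher the bandwidth, the higher the throughput. In the NVL72 era, inference is no longer a single-GPU affair but a system-level token factory of 72 GPUs + 36 CPUs that saturates HBM bandwidth and compute to reach maximum token throughput.

Token throughput depends on two things: the number of requests batched concurrently × the average token speed per user request, that is, batch size × per-user token speed. Take Rubin NVL72 as an example: at an average speed of 100 tokens/s with 1,920 requests batched concurrently, throughput is 192,000 tokens/s. One Rubin NVL72 draws roughly 120 kW (0.12 MW), so that is 1.6M tokens/s per MW (see Figure 1). So we need to find every way to raise these two parameters, batch size and average per-user token speed. Their product is our top KPI, token throughput.

First parameter: batch size growth is bottlenecked by HBM size. Each request in a batch carries its own KV cache, which must live in HBM and ranges from a few GB to tens of GB. Because the hot KV cache must be read at high frequency and high speed at any moment, it has to sit in HBM. For example, if a large model has 80 layers, generating each token requires reading the KV cache in HBM 80 times. As batch size grows, the hot KV cache grows linearly, and since the hot KV cache of every request in the batch must be in HBM, HBM size must grow linearly with batch size. Think of an airport shuttle bus carrying passengers from the gate to the plane as fast as possible: a smaller HBM is a smaller bus, which means extra trips. Conclusion: batch size is bottlenecked by growth in HBM size.

Second parameter: average token speed per user request is bottlenecked by HBM bandwidth. The speed of a large model's decode phase is bounded by HBM bandwidth, because generating each token requires reading the active weights and the KV cache many times. With the arrival of the LPU, when the batch is not too large the active weights move into SRAM, but each generated token still requires many KV cache reads from HBM. The higher the HBM bandwidth, the faster each token is generated, in a roughly linear relationship. In the shuttle bus analogy, HBM bandwidth is how wide the bus doors are: the wider the doors, the faster passengers board. Everything else in a GPU's configuration is there to accommodate batch growth and to keep token compute speed balanced with HBM growth; spare compute is even spent to gain effective bandwidth (for example, some bandwidth compression techniques).

In the shuttle bus analogy:
- Cabin size = HBM size (capacity): it determines how many passengers fit at once, that is, how many requests' KV caches fit simultaneously. The bigger the cabin, the more passengers (batch size) per trip. If the bus is too small, carrying 100 people takes two trips and overall system throughput cannot rise.
- Door width = HBM bandwidth: it determines how fast passengers get on and off. With wide doors everyone piles on at once (very fast decode/token generation). With narrow doors, even a huge cabin that holds 200 people leaves everyone queuing to squeeze on one by one, and all the time goes to boarding.
- Passenger throughput = cabin capacity × boarding speed (door width).

We have now logically derived the first principle of hardware demand in token economics: token throughput = HBM size × HBM bandwidth. The top KPI of the AI inference era depends heavily on progress in both HBM dimensions. Sustaining a 2× increase in token throughput per generation means the product HBM size × HBM bandwidth on each GPU must double every generation. This is also the first time in history that memory size can affect the top KPI, token throughput.

To check the theory, plot Nvidia's token throughput from A100 to Rubin Ultra against HBM size × HBM bandwidth on the same chart (see Figure 2): on a log axis the two curves track each other remarkably closely. HBM size × HBM bandwidth even grows faster than token throughput. After all, HBM sets the ceiling, and utilization of that ceiling rarely reaches 100%: even if HBM size × HBM bandwidth grew 1000×, the rest of the compute and architecture would struggle to extract the full 1000× of potential. This curve is no coincidence; it is the necessary outcome of system-level optimization. throughput = batch × bandwidth is the inescapable first principle of token factory economics.

What about software?

Could software optimization reduce the need for bandwidth, and for HBM? Software and hardware are two independent dimensions. It is like asking: if software on CPUs gets optimized to run faster, can CPUs stop improving for ten years, since the software is faster anyway? If so, how would CPU vendors make money? A CPU has only one way to survive: on standard benchmarks, leaving software optimization aside, each generation must score higher or it will not sell.

GPUs are the same. How well software is optimized and the requirement that the GPU's own token-throughput KPI improve substantially every year are two separate matters. As long as token demand keeps growing, the pursuit of token throughput will never stop, and neither will the pursuit of HBM size × HBM bandwidth. If HBM size and bandwidth advance too slowly, Jensen will personally visit the big three memory makers and push them to upgrade, because this is the ceiling on his GPUs. If the ceiling is nailed in place, can his GPUs still sell? Of course, Nvidia must also work hard to extract gains beyond the HBM ceiling through heterogeneous-computing architecture. The LPU is a good attempt, improving the Pareto frontier considerably from another angle (the right half, the high-token-speed region).

HBM has left behind the old era of drifting with the tide. On this one-way street paved by exponential demand, it has arrived, almost as if by fate, at center stage of the industry's epic. With the first principles of the inference paradigm having evolved this far, as long as Jensen wants to sell GPUs, HBM must double, generation after generation. This is endogenous supply-side pressure, independent of AI demand, the macro cycle, and the mood of hyperscalers. Only one question remains: when demand is physically locked into exponential growth, will the three supply-side players drag themselves back into the mud of the cycle with their own hands, as they have for the past thirty years?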
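
The post's Rubin NVL72 arithmetic (1,920 concurrent requests at 100 tokens/s on a roughly 120 kW rack) and its claim that HBM capacity caps batch size can be checked with a short sketch. The function names are illustrative, and the HBM, weight, and KV-cache sizes in the last example are hypothetical numbers, not figures from the post.

```python
def token_throughput(batch_size, tokens_per_s_per_user):
    """System throughput = concurrent requests x per-user decode speed."""
    return batch_size * tokens_per_s_per_user

def throughput_per_mw(tokens_per_s, power_kw):
    """Normalize throughput to one megawatt of power draw."""
    return tokens_per_s * 1000 / power_kw

def max_batch_size(hbm_gb, weights_gb, kv_gb_per_request):
    """Requests whose KV cache fits in HBM once the weights are resident."""
    return int((hbm_gb - weights_gb) // kv_gb_per_request)

# Figures quoted in the post.
rack = token_throughput(batch_size=1920, tokens_per_s_per_user=100)
per_mw = throughput_per_mw(rack, power_kw=120)
print(rack, per_mw)  # 192000 1600000.0

# Hypothetical single-GPU example: doubling HBM more than doubles the
# batch ceiling here, because the weights are a fixed overhead.
small = max_batch_size(hbm_gb=288, weights_gb=88, kv_gb_per_request=4)
large = max_batch_size(hbm_gb=576, weights_gb=88, kv_gb_per_request=4)
print(small, large)  # 50 122
```

The sketch treats per-user speed as fixed; in the post's framing that factor is the one tied to HBM bandwidth, which is why the product of capacity and bandwidth is presented as the ceiling.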

fin@fi56622380

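Looking back at the 2025 semiconductor market, there were so many remarkable stories. The biggest theme: AI demand is driving a re-rating of semiconductor infrastructure valuations, plus a rewrite of how value is distributed along the supply chain.

Since 2024, semiconductor infrastructure has been rapidly swallowing the profits of the whole IT industry. Semiconductors' share of net profit (EPS) within the S&P 500 IT sector rose from under 20% to 40% in two years and is still accelerating. The sector's overall forward profit margin rose from 25% in 2023 to 43% in November 2025, clearly above the average margin of the internet giants, which supports the view that semiconductor margins exceeding internet margins will be the new normal. A growing share of IT industry profit is flowing to semiconductors. Note that even during the 2020-22 chip shortage, severe as it was, neither semiconductor margins nor their share of IT profits rose significantly. That is the first half of the story: AI demand is re-rating semiconductor infrastructure, which no longer holds the subordinate role it had in the internet era.

The logic behind this is that business models change with the characteristics of the technology. In the internet era, the network and compute cost of each request had extremely low marginal cost, scaling worked extremely well, and the marginal cost of distribution was almost zero. In the AI era, that scalability-friendly near-zero marginal cost faces a fundamental challenge. Leaving aside that training is no longer a one-off expense but grows every year, consider customers' inference requests: because inference scaling has become consensus, and vertical domains still need larger flagship models to stay competitive, inference costs will not fall in step with deflation in hardware compute prices.

Internet companies' largest cost used to be OPEX alone, especially SDE labor. Now, for the first time in history, internet companies carry balance sheets with heavy depreciation, like semiconductor foundries, and their business model is all but forced to shift gradually from "traffic × conversion rate" toward "gross margin per token". Put simply, from the internet era to the AI era the cost structure has added heavy hardware/compute capex on top of labor opex (share in financial reports: MSFT 33%, Meta 38%). Internet companies, CSPs, and SaaS were the big winners among the last era's rent-collecting businesses. In the AI era, compute (semiconductors/chip depreciation) has become the new rent collector, and profit across the IT industry has been sharply redistributed (the share of EPS flowing to semiconductors has risen from 20% to 40% and keeps climbing). This is the most important reason for the re-rating of semiconductor infrastructure.

How long can the new normal of high semiconductor margins last? Today's premium comes from the order backlog built up by the early cost-is-no-object arms race. But clearly no hyperscaler wants to be the one overpaying; all are trying to build in-house ASICs to cut costs. So we can look at the question through the long-range compute mix in 2030. OpenAI has already shown its hand with a reference answer: 10 GW Nvidia, 10 GW ASIC, 6 GW AMD, and other hyperscalers are weighing similar splits. For example, on the inference side the goal is ASIC > 50%, with the GPU remainder split evenly between AMD and NV (legacy). Training still needs NV to take the lion's share, 60%+, with the rest split evenly between in-house ASICs and AMD. With 2030 split 60% inference and 40% training, this works out to NV 38%, ASIC 39%, AMD 23%, almost identical to OpenAI's ratio, so it serves as a reference value. Of course, at Microsoft, Amazon, Google, and Anthropic the AMD share will be noticeably lower than this reference value, while xAI has no ASIC, only Nvidia plus a small amount of AMD.

The risk for AMD is that in the longer term beyond 2030, as CSPs' in-house ASICs mature (Microsoft excepted), the ASIC share of inference may keep rising, leaving little incentive to buy large numbers of new GPUs unless they are priced low enough.

What about the TPU, which has been in the spotlight lately? Is Meta switching to TPUs? Does it matter much for Nvidia's margins? In fact, Meta's capex is 72B this year and 110B next year, and its average capex over the next six years may reach around 160B, whereas Meta's 10B, six-year TPU order averages only about 1.6B per year, and it is for TPU cloud services, not bare TPUs. In other words, the TPU order is only about 1% of Meta's capex over the next six years. Meta is not seriously considering large-scale deployment; the order may be nothing more than a bargaining chip against Nvidia. Judging from Meta's job postings in recent months, there is also no hiring for TPU engineers, unlike Anthropic, which started hiring many TPU kernel engineers in May and only announced large-scale TPU purchases for training in October.

So whether the reason is supplier diversification, a fallback for delays in its in-house ASIC, or delays in AMD's MI350X, Meta's TPU purchase comes down to one consideration: more bargaining power when buying Nvidia GPUs. At most it can bargain over the inference share, so the practical effect is limited, and so is the effect on Nvidia's margins. Recall that during the 2022 crypto bear market and mining crash, NVDA's inventory rose to 198 days yet its margin only fell from 65% to 56%; it took the double hit of P/E compression and macro conditions to take the share price from 300 to 100. Supply has been short ever since, so there is no reason for margins to fall. In addition, the TPU v8 design is too conservative (no HBM4), so the Kyber-rack Rubin solution will have better TCO than TPU v8; in the end buyers will still depend on Nvidia, and bargaining is hard. As long as Nvidia keeps advancing at this pace, competitors will find it hard to keep up.

In short, on one hand there are bottlenecks across the supply chain, with CoWoS expansion for example proceeding cautiously, so the supply shortage can persist for years. On the other hand, there is a timing mismatch between the profit curve of AI monetization and the hardware investment curve: application-side growth lags by several years. As long as that mismatch between application and infrastructure growth remains, semiconductors will keep the upper hand in IT profit distribution. Judging from OpenAI's investment curve through 2030, the mismatch will last until at least around 2030. So the dominant position in IT profit sharing that this super-expansion phase has given semiconductors looks set to last until at least 2030, and high semiconductor margins may last longer still, because the infrastructure has changed from a one-off build in the internet era into a rent-collecting asset.

AI has not fed GPUs alone. Compute budgets have lifted every link that turns electricity into tokens: memory, storage, interconnect, optical fiber, power, energy storage, and so on. The first half covered "semiconductors swallowing IT profits"; the second half is about "the spillover effect of AI compute value reshaping the landscape inside semiconductors": GPU compute growth -> memory/storage/interconnect/CPU bottlenecks -> spillover -> structural opportunities.

The more interesting story of 2025 is how this huge industry windfall creates new structural opportunities inside semiconductors. For example, a super cluster needs several interconnected data centers, and the fiber required runs to millions of miles. That is a new opportunity. The most typical examples of opportunities created by structural trends in the semiconductor supply chain are memory (DRAM/HBM) and storage (SSD). HBM demand growth is so extreme that it squeezes DDR4/5 capacity, leading a memory industry defined by cyclicality to declare that "the cycle no longer exists". Hynix, thanks to its HBM lead, has even begun to envision annual profit of 100 billion US dollars a few years out, which would make it a trillion-dollar company.

Behind both segments is a structural shift: AI workloads are extending from training toward inference, and the inference share keeps growing. Inference is almost purely bound by memory bandwidth (memory bound); you could say bandwidth = tokens/s. Growing model sizes and longer context lengths also raise memory capacity requirements, driving a surge in memory demand: inference is memory.

Memory on next-generation GPUs/ASICs has become an exercise in brute force; the capacities would have been unimaginable three years ago. The 80 GB on the 2022 H100 now looks like a toy, with a tenfold increase in just a few years:
- Nvidia Ultra Rubin - 1024GB HBM
- Qualcomm AI200 - 768GB LPDDR
- AMD MI400x - 432GB HBM

Another potential breakout for memory is on-device: LLMs on phones, PCs, cars, and robots. Over the past two years mainstream flagship phones have moved from 6 GB to 8 GB/12 GB/16 GB in preparation for a possible on-device LLM ecosystem; after all, next-generation phone compute will reach the 150 TOPS range, which is squarely desktop class. In terms of potential, on-device memory upgrades are a bigger market than incremental cloud memory, because the number of end devices is staggering, on the order of billions per year. Once the on-device LLM ecosystem flourishes, doubling memory usage will be easy, and designs for low-power on-device memory and compute-in-memory will follow.

But the on-device genAI software ecosystem seems to be lagging noticeably and has progressed more slowly than I expected, perhaps because it is still exploratory, without the clear ROI of the cloud, so vendors invest cautiously. In 2023-24 I was bullish on 2027, which was probably too optimistic. The move from internet to mobile internet took 10-15 years; on-device genAI/LLM may need 7-10. It may have to wait until cloud ROI is mostly developed and marginal returns decline before on-device genAI/LLM gets development resources and proves out its ROI.

Another 2025 story of structural change inside semiconductors is NAND storage, especially enterprise SSDs (eSSD). The source of the trend is the same: growing inference demand from AI workloads. The memory windfall has spilled over to SSD storage and even HDD storage, because when memory runs short, high-speed SSDs serve as a multi-tier cache. The main logic is that KV cache overflowing memory during inference is offloaded to the next tier, SSD storage, and vector database search/indexing also adds to SSD demand. Micron's earnings report put it precisely and plainly: "AI inference use cases such as KV cache tiering and vector database search and indexing, are driving demand for performance storage."

As for why storage prices only took off in the fourth quarter, one must distinguish contract prices from spot prices. Contract prices rise more moderately: even the tightest enterprise eSSD contracts rose about 25% in Q4. But once contracts gradually consumed NAND capacity over 2025, spot prices delivered a sharp visible shock, rising more than 50% in a month. Another, unverified, line of reasoning is the multimodal boom: surging demand for AI images and AI video would also worsen the storage shortage. I think this thread is promising for the future, but the current fidelity of video/images may not yet be at the level GPT3 reached in its day, and breaking into the mainstream will take some time.

So what structural semiconductor opportunities will the next trend shift bring? First look at where inference demand is heading. Without doubt the share of agentic flows will keep growing; 2025 is not the year of the agent but the start of a decade of agents. Viewed from the CPU, agentic workloads run routing and tool handling on the CPU. Profiling common agentic frameworks such as SWE-Agent, LangChain, and Toolformer shows the CPU accounting for up to 90% of end-to-end (E2E) latency, throughput bottlenecks sitting mostly on the CPU, and CPU energy exceeding 40% of total energy. Agentic AI is currently more CPU-bound than anything else. In agentic frameworks the CPU is the always-busy orchestrator, which may well drive a new recovery in CPU demand.

In AMD's Q2 2025 earnings call (August 5), Lisa Su described this explicitly: "In particular, adoption of agentic AI is creating additional demand for general-purpose compute infrastructure, as customers quickly realize that each token generated by a GPU triggers multiple CPU-intensive tasks." In the Q3 earnings call she again made it explicit that CPU TAM is increasing due to Gen AI: "Many customers are now planning substantially larger CPU build outs over the coming quarters to support increased demands from AI, serving as a powerful new catalyst for our server business."

Nvidia also treats agent flows as CPU demand. The GB200/300 architecture has a much higher CPU ratio than before: 36 Grace CPUs to 72 Blackwell GPUs, a 1:2 ratio. AMD's approach is to use 1-4 256-core EPYCs to serve 72-128 MI400-series GPUs. Future hardware architectures will certainly evolve to optimize agent workloads, for example scheduling and load balancing of agent task graphs and CPU/GPU cooperative micro-batching. Compute comparisons may also move away from today's pure GPU token-rate comparisons toward system-level, full-stack agentic benchmarks.

Alongside these opportunities, the structural shift may next bring some unexpected second-order effects. Exploding demand from cloud AI data centers has sent memory and storage prices soaring, putting heavy pressure on consumer electronics costs; in 2026 this could become a potential black swan for the consumer electronics industry. The recent plunge in PC makers' stocks has the same cause. HP has already said it will cut memory configurations, hinting at a return to the era of PCs with 8 GB memory + 256 GB storage. If DRAM and storage keep rising like this, something absurd could happen: spot memory/storage costing more than the CPU and GPU. Awkwardly, this could directly delay the AI PC that consumer electronics has been counting on, since large memory is what makes an AI PC perform. To exaggerate a little, every employee of a PC or phone maker, or of any consumer electronics company, should buy storage and memory as a hedge against career risk.

From early next year the Android camp will no longer be able to absorb memory and storage costs. If Samsung and Xiaomi raise phone prices (US prices have already risen considerably), the biggest beneficiary is Apple. Apple's memory and NAND capacity is locked in through dedicated long-term fixed-price contracts, which incidentally left Kioxia stuck with a lot of capacity that cannot be repriced. Apple's cost advantage will widen further, its global smartphone share could grow considerably, and the period ahead may be a glorious one for the iPhone.

The 2025 semiconductor market truly had too many great stories. The feuds and alliances among Nvidia, AMD, TPU, and the hyperscalers kept onlookers with money on the table on an emotional roller coaster. HBM/memory makers reaped the memory-bound windfall, NAND makers unexpectedly gained from KV cache spillover, and the CPU, after nearly a decade of quiet, may return to the center of the growth narrative through agent orchestration. It is no longer a few compute vendors like Nvidia and AVGO dominating alone. Each evolution of AI workloads after compute value spills over, from training to inference, from text to multimodal, from single-model calls to agentic flows, rewrites how value is distributed along the supply chain. The cloud AI boom is squeezing consumer electronics: while PC makers are forced to discuss a return to the 8 GB era, Apple profits from its supply-chain advantage. The second-order effects of this compute arms race may reshape the whole consumer electronics landscape in unexpected ways in 2026. The semiconductor story is no longer a single thread but a web that keeps rebuilding itself, and 2025 was probably just the first round of shifting alliances.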
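
The 2030 compute-mix estimate and the Meta TPU ratio in this post are simple weighted averages, sketched below. The post gives its inputs as bounds ("ASIC > 50%", "NV 60%+"), so the exact splits used here are assumptions set at those boundary values, and the function name is illustrative.

```python
def blended_share(inference_weight, inference_mix, training_mix):
    """Weight each vendor's inference and training share by workload mix."""
    training_weight = 1.0 - inference_weight
    vendors = sorted(set(inference_mix) | set(training_mix))
    return {
        v: inference_weight * inference_mix.get(v, 0.0)
        + training_weight * training_mix.get(v, 0.0)
        for v in vendors
    }

# Assumed boundary values: inference ASIC 50%, rest split NV/AMD;
# training NV 60%, rest split ASIC/AMD.
inference_mix = {"ASIC": 0.50, "NV": 0.25, "AMD": 0.25}
training_mix = {"NV": 0.60, "ASIC": 0.20, "AMD": 0.20}
mix_2030 = blended_share(0.60, inference_mix, training_mix)

# OpenAI split quoted in the post, in GW.
openai_gw = {"NV": 10, "ASIC": 10, "AMD": 6}
total_gw = sum(openai_gw.values())
openai_share = {v: gw / total_gw for v, gw in openai_gw.items()}

# Meta TPU order: 10B over 6 years against ~160B average annual capex.
tpu_per_year = 10 / 6
tpu_share_of_capex = tpu_per_year / 160

for v in ("NV", "ASIC", "AMD"):
    print(v, f"{mix_2030[v]:.0%}", f"{openai_share[v]:.1%}")
print(f"TPU share of capex: {tpu_share_of_capex:.1%}")
```

At the boundary values this gives NV 39%, ASIC 38%, AMD 23%, within a point of the post's 38/39/23 (nudging the inference ASIC share above 50% moves the result toward the post's figures), and close to the 38.5/38.5/23.1 implied by the 10/10/6 GW split. The TPU order works out to about 1% of average annual capex, as the post states.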

Cooper Leong@cooperleong22·29 Apr

@wzenus So what would publicly documenting your learning count as?

Zihan "Zenus" Wang@wzenus·29 Apr

The urge to express yourself: is it wealth or debt? The more you express, the more you bring to others, so is it wealth? The more you express, the less time you have for learning, so is it debt?

Cooper Leong retweeted

Yuhan Liu@YuhanLiu_nlp·28 Apr

Can LLMs generate diverse outputs for open-ended questions? Is it helpful if we ensemble outputs from multiple models? We study 18 LLMs on 4 datasets and find that no single model is best at generating diverse outputs 👇/ 🧵

Cooper Leong retweeted

idan shenfeld@IdanShenfeld·29 Apr

Self-distillation can reduce hallucinations when teaching LLMs new knowledge. I think the first time I heard about how RL enables learning without increased hallucination was in @johnschulman2's talk in 2023. Turns out, like many of RL’s benefits, this one also comes from learning on-policy.

Guy Kaplan@GKaplan38844

Fine-Tuning LLMs on New Knowledge Encourages Hallucinations. (@zorikgekhman) But why? We found something unexpected: 1M facts about city-like names →hallucinations explode. 1M facts about random identifiers →near zero! Same model. Same number of facts. Only the names change.🧵

Cooper Leong retweeted

Weiwei Sun@sunweiwei12·28 Apr

Scaling laws can save millions, but fitting them can also cost millions🤑 Our new paper shows how to fit better scaling laws with only 10% of the training budget! ✅ Spend Less, Fit Better: arxiv.org/abs/2604.22753

Shanda Li 黎善达@Shanda_Li_2000

New paper: Spend Less, Fit Better Fitting scaling laws for LLMs can cost millions💰-but what if you can get the same insights with just ~10% of the budget? We frame scaling-law fitting as budget-aware experimental design and propose a method to pick the most valuable runs.#LLM

Cooper Leong retweeted

Fatih Dinc@fatihdin4en·28 Apr

Where does agency come from in a neural network? We trained RNNs to chase a moving target. Some learn to react. Others learn to predict, anticipate, and even wait (schematic gif attached). The difference comes down to the dimensionality of the neural code. 🧵 (1/7)

Cooper Leong retweeted

Jackson Stokes@jackson_stokes·29 Apr

Can we train a model in a single RL step? During recent experiments, @Logangrasby found that a single step of OAPL increased model performance from ~0 to 48% on a clinical reasoning and prediction task. Turns out, data staleness might matter less than we think. with @pathos :

Cooper Leong retweeted

Aran Komatsuzaki@arankomatsuzaki·28 Apr

The non-English tax is real. Sutton's Bitter Lesson, translated across languages and normalized to OpenAI English token count: Hindi: OpenAI 1.37×, Anthropic 3.24× Arabic: OpenAI 1.31×, Anthropic 2.86× Chinese: OpenAI 1.15×, Anthropic 1.71× Claude’s tokenizer charges a much higher linguistic tax.
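
The multipliers above are token counts for the translated text divided by a fixed English baseline. A minimal sketch of that normalization, using made-up token counts chosen only to reproduce the Hindi figures quoted in the tweet; `token_tax` is an illustrative name, and no real tokenizer is called here.

```python
def token_tax(token_count, english_baseline_tokens):
    """Token count of a translation relative to the English baseline."""
    return token_count / english_baseline_tokens

# Hypothetical counts: the English text under the baseline tokenizer,
# and the Hindi translation under each vendor's tokenizer.
english_baseline = 1000
hindi_tokens = {"OpenAI": 1370, "Anthropic": 3240}

tax = {vendor: token_tax(n, english_baseline) for vendor, n in hindi_tokens.items()}
for vendor, ratio in tax.items():
    print(f"Hindi via {vendor}: {ratio:.2f}x the English token count")
```

With per-token pricing held equal, a multiplier of 3.24 means the same content costs about 3.24 times as much to process as the English version, and it also uses up the context window that much faster.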

Cooper Leong retweeted

keshav@kshenoy_·28 Apr

Can LLMs simply tell us about unwanted behaviors they’ve picked up in training? We train a single Introspection Adapter (IA) that makes fine-tuned models describe their behaviors. It generalizes to detecting hidden misalignment, backdoors and safeguard removal.

Cooper Leong retweeted

mufeez@moofeez·28 Apr

I post-trained Qwen3-Coder to fix bugs using an actual debugger. The result: Solve rate: 70% → 89% Median turns to fix: 46 → 19 (-59%) Instead of just reading code or print-debugging, it: - reasons from execution - inspects live variables and call stacks - sets breakpoints, steps, and evaluates expressions
