J
@ggpcap
3.5K posts

RT/likes ≠ endorse. Not advice

San Francisco, CA · Joined July 2015
268 Following · 583 Followers
J retweeted
will depue
will depue@willdepue·
Let's read the technical report. TL;DR: no real answers on how their method works. Doesn't make me feel better about it.

They seem to understand the problem: "[Attention] is expensive for the same reason: every query compares against every key. The result is an all-pairs computation whose cost grows quadratically with sequence length." And that sliding windows, KV dropping, etc. don't work: "Fixed-pattern sparse attention reduces compute by limiting which positions a token can attend to. The model decides where to look before it knows what it is looking for. When the relevant information falls outside the pattern, it is simply not seen."

This is spot on: the reason you can't get rid of attention is that you can't know ahead of time whether some information will be useful. For example, phonebook evals are built to measure this: list 100,000 names, then ask for a certain person's information. You don't know what name you're going to be asked for in the future, so you have to keep everything in context.

They also understand why RNNs and state space models are always doomed: "They remove the all-pairs comparison entirely, replacing it with a compressed state that evolves across the sequence. This yields linear scaling by construction. It also introduces a constraint: the state has fixed capacity. As the sequence grows, information must be summarized, blurred, or discarded."

The phonebook test holds here as well. Unlike attention, the fixed size of your recurrent state means you're guaranteed to run out of space at some point as the context grows larger. You have to decide what to remember as you go, you have to leave things out, and so for tasks like phonebook — where you have to remember literally everything to pass — you're stuck, regardless of how good your compression is.

So how does Subquadratic's SSA (Subquadratic Selective Attention) work? "The core idea is content-dependent selection. For each query, the model selects which parts of the sequence are worth attending to, and computes attention exactly over those positions." "Dense attention assumes every pair might matter, so it evaluates all of them. In practice, almost none do. Most pairwise interactions carry negligible signal, but the model still pays the full quadratic cost to compute them. SSA removes that assumption. It does not approximate attention. It restricts attention to the positions that actually carry signal, and skips the rest."

This doesn't really tell us anything. How do you know what to attend to before you run attention? How would you know which interactions carry signal before you've compared them? There's a loop here: lots of interactions carry no signal, but you only know they don't carry signal after you've checked them. For example, DeepSeek's DSA still 'checks everything' but does so with a lightweight attention-based indexer, saving the expensive stuff for the tokens the indexer thinks are worth paying attention to.

Until we get an answer to that, it's hard to say anything about what they're doing. For example, if we assume they are using some learned token-dropping module, where the model learns to restrict attention to positions it guesses are helpful ahead of time, you're either unable to pass the phonebook eval (since you're dropping information you might need in the future) or you're just back to normal quadratic attention (if you're not dropping any information).
And it does seem like that might be what they're doing, as they mention training the selection mechanism to decide what to route where: "The training data emphasizes long-form sources with high information density and cross-reference structure. This is the kind of data that forces the selection mechanism to learn routing over large positional distances." The statement "It does not approximate attention." is also interesting. Even more worrisome, it seems like they could be training on the benchmarks to teach their selector what to route where. This sentence seemed potentially suggestive of this: "The goal is not benchmark memorization. The goal is to teach the model to attend to what matters regardless of where it sits." I'm excited to hear more details from the team.
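For concreteness, a minimal numpy sketch of the DSA-style indexer-then-attend pattern described above. The low-rank indexer, the shapes, and the top_k budget are hypothetical illustration, not Subquadratic's disclosed mechanism:

```python
# Hypothetical indexer-then-attend selection (DSA-style sketch),
# not Subquadratic's SSA: the report does not say how selection works.
import numpy as np

def selective_attention(q, K, V, K_idx, q_idx, top_k=64):
    """q: (d,) query; K, V: (T, d) cached keys/values;
    K_idx: (T, r) cheap low-rank indexer keys; q_idx: (r,) indexer query."""
    scores = K_idx @ q_idx                       # O(T*r) scoring pass, r << d
    keep = np.argsort(scores)[-top_k:]           # positions the indexer bets on
    logits = K[keep] @ q / np.sqrt(K.shape[1])   # exact attention over survivors
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ V[keep]
```

The loop lives in the argsort line: any position the indexer scores poorly is never compared exactly, so a phonebook entry that looks irrelevant at selection time is simply gone.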
will depue@willdepue

my first take, and a good lesson on good research epistemics here: what can we infer from ~82% SWE-Bench?

it's possible (1) they trained a new model, from scratch, that is unlike a regular transformer. but i've never heard of this company before, and checking their funding round they've only raised ~$30M, so it's unlikely they could afford to train an Opus/GPT-5/Kimi 2.6-level coding model from scratch right now

so this tells us that (2) they need to bootstrap off of an existing pretrained model, likely with RL too, to get that performance! this tells us they've taken a vanilla Transformer and modified the attention mechanism, likely finetuning/midtraining in a subquadratic attention method

it's quite possible it doesn't really work and there's some degeneracy to the method, or it's just plain fake. but if it's not, you could expect that, given how long it takes to do weight surgery on big models (bigger changes to a pretrained model == longer midtraining to recover performance), it's a lightweight change

i'd lean towards something mostly leveraging the existing attention key/value projections, like a fancy version of DeepSeek's sparse attention paper, but it could also be some unique test-time KV compression, which would come with its own downsides
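On that last possibility, a minimal sketch of what test-time KV compression can look like. The budget and the eviction rule (drop positions with the least cumulative attention, in the spirit of heavy-hitter eviction schemes) are illustrative assumptions, not a known detail of this model:

```python
# Illustrative test-time KV compression: evict cached positions that
# have received the least cumulative attention mass so far.
import numpy as np

def evict_kv(K, V, cum_attn, budget=4096):
    """K, V: (T, d) KV cache; cum_attn: (T,) attention mass per position."""
    if K.shape[0] <= budget:
        return K, V, cum_attn                       # still within budget
    keep = np.sort(np.argsort(cum_attn)[-budget:])  # keep heavy hitters, in order
    return K[keep], V[keep], cum_attn[keep]
```

This is exactly the downside flagged above: once a position is evicted it cannot be recovered, so phonebook-style recall fails the moment the needed entry leaves the budget.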

Jigsaw
Jigsaw@JigsawCap·
I love the arrogance of public markets guys with an LT focus - "We're not traders (eww, gross). We're investors!" Note that almost all public equity market jobs are just moving around non-primary capital and are not actually investing
J
J@ggpcap·
@emollick They are figuring it out as they go, same as everyone else (this is not a comment on trends observable across the ecosystem)
Ethan Mollick
Ethan Mollick@emollick·
Co-founder of Anthropic; interesting that he refers to public sources when he is also obviously privy to lots of internal sources that he cannot discuss. I assume he sees the same thing at Anthropic.
Jack Clark@jackclarkSF

I've spent the past few weeks reading 100s of public data sources about AI development. I now believe that recursive self-improvement has a 60% chance of happening by the end of 2028. In other words, AI systems might soon be capable of building themselves.

J
J@ggpcap·
Buffett and Hohn’s record will likely be a death knell for a lot of people. To elaborate — a style factor that is universally known converges to beta if not less. Cf. @BillAckman
J
J@ggpcap·
1/3 of the year down. Quite happy so far -- could have captured more upside in April but that's what you get sometimes (reminder to just let predetermined price levels gauge entry, and to listen to the market earlier). I love the fundamental environment we are in -- the technicals can be something else -- because so many people have biases that they cannot let go of. Lots of behavioral alpha to continue harvesting. I'm sure many would agree. That said, degrossed a lot today while keeping opinionated exposures. Started moving from fairly obvious directional bets to being more surgical given where prices are overall, with a regime view from here ~ summer but ready to change opinions at any time / lean more strongly into whichever direction.
J retweeted
Kevin Kwok
Kevin Kwok@kevinakwok·
When AI hits security there will be signs
[image attached]
J retweeted
wh
wh@nrehiew_·
You should probably read this. The best available overview of the systems required to scale large MoEs. Notes below.
[image attached]
J
J@ggpcap·
Excellent
fin@fi56622380

AI Semiconductor Endgame 2026 (Part 1)
New Token Economics: Computing Paradigm Shifts from GPU Compute to HBM

This article starts from the essence of GPU architectural evolution to address a question the market has long worried about: why must each GPU's HBM memory demand grow exponentially, and why won't this exponential growth in HBM demand stall? It then derives the first principle of token economics under the current architecture:

token throughput = HBM size × HBM BW (bandwidth)

It also discusses why the GPU ceiling is determined by HBM's two dimensions of progress.

The topic of HBM cyclicality has long been controversial. Optimists argue that AI-driven demand is much greater than before, but the market mainstream still believes that previous up-cycles also saw 20%+ annual demand growth — so what's different this time? AI doesn't change the fact that HBM, like traditional DRAM, has commodity attributes. Once capacity expansion at the demand peak meets a downturn, history will repeat itself. We can take the perspective of compute-chip architecture, start from first principles, and unpack and reason through this question: why this time is genuinely different.

———————————————————————————————
History: The Era of CPU Compute

For a very long time, we lived in the era of CPU-dominated compute. The CPU's top-level KPI was performance — running faster — and so each generation of CPUs deployed every method imaginable to push benchmark scores higher. First it was rising clock frequencies, then it was architectural evolution: superscalar designs, and so on.

During this period, why didn't DDR need to advance technologically at high speed? DDR3 to DDR5 took a full 15 years. Because in this era, DDR's role was purely auxiliary — and only weakly so. By industry experience, even doubling DDR speed would generally only raise CPU performance by less than 20%.

Why did improvements in DDR bandwidth and speed matter so little? Two reasons:

1. CPUs designed all kinds of architectural tricks to hide DDR latency — superscalar designs, wider issue widths, massive ROBs and register renaming to extract parallelism and hide latency, L1 caches, L2 caches — all of which weakened the demand for DDR bandwidth and speed.
2. CPU workloads don't have particularly demanding bandwidth requirements. For most everyday workloads — say, opening a webpage — DDR bandwidth is severely overprovisioned. Even cloud workloads often look the same.

In other words, in the CPU era, DDR bandwidth and speed didn't really matter. There was virtually no difference between DDR4 and DDR5 except in a handful of games — and even the JEDEC standard advanced slowly. On top of that, only a small portion of any given app needs to permanently sit in DDR. Whatever is needed can be paged in from the hard drive on demand. App size grew slowly, and so DDR capacity demand grew slowly as well. That's why, over the past decade, the average PC went from 7–8GB of DDR to about 23GB — only 3× growth in ten years.

This slow upgrade pace directly affected revenue. Capacity-based pricing was the main way of making money; speed improvements were just a technological upgrade that raised the unit price of capacity. With both of these dimensions advancing slowly, growth could only come from increases in PC/phone unit volumes. So along both dimensions — bandwidth/speed and capacity — DRAM was always a "nice-to-have" appendage to the chip industry. The marginal utility of DDR upgrades was very low, and almost completely disconnected from the CPU era's top-level KPI.
———————————————————————————————
The Paradigm Shift: GenAI's Top-Level KPI

When we entered the era of GenAI large models, the computing paradigm shifted, and the top-level KPI changed fundamentally. By the time GPUs evolved into AI inference engines, the top-level KPI was no longer compute alone (TOPS/FLOPS), as it had been for CPUs — it became the cost of a token. Specifically: overall token throughput per unit cost / per unit power. A close second is token throughput speed — because in the agent era, many tasks have become serial, and token output speed has become a critical bottleneck for user experience.

This is exactly why Jensen invented the concept of the AI factory: to produce the most tokens at the lowest cost, while pushing token throughput speed as high as possible. In the AI training era, Jensen's economics were TCO (Total Cost of Ownership): the more GPUs you buy, the more you save. In the inference era, Jensen's token economics flip the logic: AI inference has very healthy gross margins, so the logic now becomes: the NVIDIA GPU is the GPU that produces the cheapest token in the world, so the more you buy, the more you earn.

The top-level KPI has become a Pareto frontier: along the two dimensions of token throughput and token speed, optimize as far as possible. Each generation of NVIDIA's token factory is essentially pushing the entire Pareto frontier up and to the right. This is the most important KPI of the AI inference era.

———————————————————————————————
From Token Throughput to HBM: The Core Logic Chain

Below is the most important logical chain of this article: how to start from the exponential growth of token throughput and derive that the ceiling bottleneck lies in the exponential growth of HBM size and HBM speed.

In the era of single-GPU inference with single-thread batch size = 1, token throughput had only one dimension: HBM bandwidth speed. Higher bandwidth = higher token throughput. But once we entered the NVL72 era, inference is no longer single-GPU. It is a system-level token factory composed of 72 GPUs + 36 CPUs, designed to fully saturate HBM bandwidth and compute simultaneously, in pursuit of the ultimate token throughput.

Token throughput growth depends on two things: the number of requests batched simultaneously × the average token speed per request. That is: batch size × token speed. Take Rubin NVL72 as an example. At an average token speed of 100 tokens/s, processing 1,920 simultaneous requests yields a token throughput of 192,000 tokens/s. A Rubin NVL72 draws roughly 120kW (0.12MW), so per MW it can handle 1.6M tokens/s.

So we need to find ways to push both parameters up: batch size and average token speed. Their product is our top-level KPI — token throughput.

Parameter 1: Batch growth — bottleneck is HBM size

Every request in the batch carries its own KV cache, which has to live in HBM, with sizes ranging from a few GB to tens of GB. Because hot KV cache must be read at high frequency and high speed at any moment, it must reside in HBM. For a model with, say, 80 layers, every token generation step requires reading the KV cache 80 times from HBM. As batch size grows, hot KV cache grows linearly. And because the hot KV cache for every request in the batch must sit in HBM, HBM size must grow linearly with batch size. Like an airport shuttle bus: the gate wants to move passengers to the plane as fast as possible. If HBM size is small, the shuttle is small, so you have to make extra trips.

Conclusion: batch size growth bottlenecks on HBM size growth.
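As a sanity check, a few lines of Python reproduce the NVL72 figures quoted above; the per-request KV-cache size is an illustrative assumption, not a quoted spec:

```python
# Back-of-envelope check of the Rubin NVL72 numbers above.
batch = 1920               # simultaneous requests
tok_per_s = 100            # average decode speed per request (tokens/s)
rack_mw = 0.12             # ~120 kW per rack

throughput = batch * tok_per_s
print(throughput)              # 192,000 tokens/s
print(throughput / rack_mw)    # 1.6M tokens/s per MW

# Why batch size bottlenecks on HBM size: every request's hot KV cache
# must sit in HBM. Assuming ~10 GB of KV per request (illustrative):
kv_gb_per_req = 10
print(batch * kv_gb_per_req / 72)   # ~267 GB of KV per GPU, before weights
```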
Parameter 2: Average token speed per request — bottleneck is HBM bandwidth

The decode-phase speed of a large model bottlenecks on HBM bandwidth, because every token generated requires reading the activated weights and KV cache many times over. The emergence of LPUs has, in cases where batch size isn't very large, moved the activated-weights portion onto SRAM — but every generated token still requires many reads of the KV cache from HBM. The higher the HBM bandwidth, the faster each token is generated, in essentially linear correspondence. Like the airport shuttle bus: HBM bandwidth is like the width of the door — wider doors mean passengers board faster.

The rest of the GPU's configuration is essentially adapted to support batch growth and to keep token compute speed in step with HBM growth. In some cases the GPU even spends excess compute to recover effective bandwidth (e.g., bandwidth compression techniques).

—-------

To return to the shuttle bus analogy:
• Shuttle bus cabin size = HBM Size (capacity): determines how many passengers can fit at once (i.e., how many requests' KV caches can sit in HBM simultaneously). Bigger cabin = more passengers (higher batch size) per trip. If the bus is too small, moving 100 people takes two trips — and total throughput suffers.
• Shuttle bus door width = HBM Bandwidth: determines how fast passengers get on and off. A wide door, and everyone piles on at once (decode/token generation is fast). A narrow door, and even with a giant cabin, people queue up and most of the time is spent boarding.
• Passenger throughput = cabin size × door-width-determined boarding speed.

—-------

At this point, we've logically derived the first principle of token-economics hardware demand:

Token throughput = HBM size × HBM bandwidth

The top-level KPI of the AI inference era is highly dependent on progress along both HBM dimensions. If we want to maintain 2× token throughput growth per generation, that means each generation of single GPU must grow HBM size × HBM BW speed by 2×! This is the first time in history that HBM memory size can influence the top-level KPI — token throughput.

To validate this thesis, we can put NVIDIA's token throughput from A100 to Rubin Ultra on the same chart as HBM size × HBM BW speed. What you find is that the two curves track each other startlingly closely on log axes. HBM size × speed actually grows even faster than token throughput — which makes sense, because HBM defines the ceiling, and in practice utilization of that ceiling is very hard to push to 100%. Even if HBM size × HBM speed grew by 1,000×, with the supporting compute and architecture, it would be very hard to wring out the full 1,000× of headroom. This curve isn't a coincidence — it's the necessary solution of system optimization. Throughput = batch × speed. This is the unavoidable first principle of token factory economics.

—-------

What about software? Won't software optimization reduce bandwidth demand? Reduce HBM demand? This is an independent dimension from hardware. It's like asking: if software on a CPU runs faster after optimization, does that mean the CPU doesn't need to advance for ten years? After all, software is faster now. If that were the case, would CPU vendors still make money? For a CPU vendor to survive, there's only one path: in standardized benchmarks, ignoring software optimization, every new CPU generation must score higher — otherwise it doesn't sell. GPUs are exactly the same.
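Looping back to Parameter 2, a crude sketch of the bandwidth-bound decode ceiling; all numbers are illustrative assumptions rather than quoted specs:

```python
# Crude decode-speed ceiling: each generated token must stream the
# activated weights plus the request's KV cache out of HBM.
hbm_bw_gb_s = 8000    # assumed HBM bandwidth per GPU (GB/s)
active_w_gb = 40      # assumed activated weights read per token (GB)
kv_read_gb = 5        # assumed KV-cache bytes read per token (GB)

print(hbm_bw_gb_s / (active_w_gb + kv_read_gb))   # ≈178 tokens/s at batch 1
# Batching amortizes the weight reads across requests, which is why
# throughput = batch × speed is the product that matters.
```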
How well software is optimized, and the requirement that the GPU's own token-throughput KPI must improve dramatically every year, are two separate things. As long as token demand keeps growing, the pursuit of higher token throughput will not stop — and so neither will the pursuit of higher HBM size × HBM speed. If HBM size and HBM speed were to slow down, Jensen would personally fly to the Big Three and pressure them to accelerate, because that is his GPU ceiling. If the ceiling stops rising, can his GPU still sell? Of course, NVIDIA also needs to rack its brains to extract performance beyond the HBM ceiling through heterogeneous architectural angles. The LPU is a great example — it improved the Pareto frontier substantially from a different angle (the right-hand high-token-speed portion).

—--------------------

HBM memory has now bid farewell to that old era of drifting with the tide. On this one-way road paved by exponential demand, it has, in something close to a destined fashion, walked onto the central stage of the industry's epic. When the inference paradigm's first principles evolve to this point, as long as Jensen still wants to sell GPUs, HBM must double — and it must double every generation. This is endogenous pressure from the supply side. It has nothing to do with AI demand, nothing to do with macro cycles, and nothing to do with the moods of the hyperscalers.

The only remaining question is this: when demand has been physically locked into exponential growth, will the three players on the supply side — like they have for the past thirty years — once again drag themselves back into the mire of the cycle by their own hands?

J retweeted
Goshawk Trades
Goshawk Trades@GoshawkTrades·
i come back to this clip of Jim Simons every few months. the man who built the most successful hedge fund in history sharing his guiding principles in life. 2 and a half minutes that will stick with you.
J retweeted
Hossein Rassam
Hossein Rassam@rassam_hossein·
Very important! And chatter from Tehran somehow corroborates this. It seems there's been a shift in approach. Rather than one-venue meetings, the plan of action for Iran is that details of a framework agreement would be worked out through negotiations Araghchi holds in various capitals. While the nuclear dossier and the HEU stockpile would be discussed in Moscow, the issue of the Strait of Hormuz/toll as well as the blockade would move toward a resolution via Muscat. Broader matters would be examined within that framework in conversations with other regional players such as Turkey and Saudi Arabia. Beijing will be kept in the loop. Islamabad would remain the hub while other capitals function as nodes. Ghalibaf would remain in Tehran until the framework is shaped. Part of the reason, for Tehran mainly, seems to be that direct talks create unwanted hype/confusion (always aggravated by Trump) and factional bickering internally. Finally, when Araghchi completes the tour and all details are hammered out, the real show will go on stage.
Kamran Yousaf@Kamran_Yousaf

SOME KEY DEVELOPMENTS: As things stand, slowly but quietly, a "mega deal," not just involving the principal parties (Iran & the US) but several regional players and beyond, is in the making. The likely deal between Iran and the US will not only reflect their concerns and demands, but also those of other stakeholders. A flurry of diplomatic activities over the past 48 hours suggests a serious push to end the Iran-US war permanently. Besides Pakistan, several other countries, including Russia, China, Saudi Arabia, Turkey, and Egypt, to name a few, are working in tandem to thrash out a deal acceptable to everyone. The Iranian Foreign Minister spoke to his Saudi counterpart on his way back from Oman to Islamabad. While most are focused on Trump's bluster, the real work is being done quietly and efficiently.

J retweeted
JabroniCoin.USD
JabroniCoin.USD@TheBenSchmark·
AI helps solve the hearing aid "cocktail party problem." AI-native startup Fortell, which raised $160m, showed "overwhelming" superiority over Sonova's top-of-the-line Phonak device. High-margin, high-multiple legacy hearing aid companies might be the next AI losers $SOON.SW $SONVY
[two images attached]
J retweeted
Mehdi H.
Mehdi H.@mhmiranusa·
Sentinel-2 satellite imagery today shows what looks like a flotilla of IRGCN fast attack craft sailing north of the Strait of Hormuz near the Kargan coast. At least 33 boats can be seen in what looks like a show of force enforcing the strait closure by Iran. Geo-location: 26.899, 56.824
[two images attached]
J retweeted
Macro_Lin | Market Observer
TSMC is set to win big again. Almost every chip architect is stuffing more SRAM into inference chips: Google TPU v8i has 384MB, a single Groq LPU has 500MB, Microsoft Maia 200 has 272MB. The logic is sound: the decode phase of inference is memory-bound, every generated token requires reading the weights from memory once, SRAM is an order of magnitude faster than HBM, and the more on-die SRAM, the faster token generation.

The problem is that SRAM eats silicon area, and that area barely shrinks with process advances. TSMC's high-density SRAM bit cell at the 3nm node has exactly the same area as at 5nm; a full generation with no shrink. At 2nm, GAA transistors finally bring some improvement, but it's mostly the peripheral circuitry shrinking; progress on the bit cell itself is limited.

What does this mean? Inference chips' SRAM demand is doubling or even tripling, but the SRAM capacity you can squeeze into each square millimeter has barely changed. The result is that SRAM bloats the die: at 2nm, 384MB of SRAM alone takes 80-100mm²+, 10-15% of a large chip's area. The bigger the die, the fewer good dies per wafer, the lower the yield, and the higher the per-chip cost.

Beyond area there's a capacity problem. These inference chips are all crowding onto TSMC's 2nm node, competing for the same lines as Apple, AMD, and Qualcomm. Google alone plans TPU shipments exceeding 35 million units by 2028. These chips also need large HBM stacks, and CoWoS advanced-packaging capacity is already so tight that some customers have started evaluating Intel alternatives.

Leading-edge logic and advanced packaging: both lines are stretched at once. Groq's LPU on Samsung 4nm diverts some of the pressure, and some customers are evaluating Intel's EMIB and Foveros as CoWoS alternatives. But for most inference chips, TSMC 2nm plus CoWoS remains the only option, and the fight for capacity will only intensify.

AMD's 3D V-Cache points to one possible way out: build the SRAM as a separate die on a mature node and 3D-stack it on top of the logic die. If inference chips' SRAM demand keeps heading toward the GB level, this may be the only path around the SRAM scaling dilemma. But 3D stacking loops right back to the advanced-packaging bottleneck.

The hardware bottlenecks of the inference era remain memory density and packaging capacity.
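The 80-100mm² figure survives a back-of-envelope check; the bit-cell area and periphery overhead below are assumptions in the ballpark of published high-density SRAM figures:

```python
# Rough area check for 384 MB of on-die SRAM at a leading-edge node.
bits = 384 * 2**20 * 8     # 384 MB ≈ 3.22e9 bits
cell_um2 = 0.021           # assumed high-density bit-cell area (µm²),
                           # roughly flat from 5nm to 3nm
array_mm2 = bits * cell_um2 / 1e6
print(array_mm2)           # ≈68 mm² of raw bit cells
print(array_mm2 * 1.4)     # ≈95 mm² with ~40% periphery overhead
```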
J retweeted
Miad Maleki
Miad Maleki@miadmaleki·
8/10 Extremely important topic is the storage clock: Iran has ~50-55M barrels of total onshore oil storage, roughly 60% full. Spare capacity: ~20M barrels. With 1.5M bbl/day of surplus production that it normally exports, storage fills in ~13 DAYS. After that, Iran must shut in wells. Why is this very important: when mature oil wells shut down, bottom water rushes in, a process called water coning. Oil droplets get permanently trapped in rock pores. This oil can never be recovered. Iran's fields already decline 5-8% annually. Forced shut-ins could permanently destroy 300,000-500,000 bbl/day of production capacity; that's $9-15B/year in revenue, gone forever.
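The storage-clock arithmetic checks out using the post's own figures:

```python
# Days of storage headroom at the quoted surplus rate.
spare_mbbl = 20      # ~20M barrels of spare onshore storage
surplus_mbd = 1.5    # M bbl/day of surplus production normally exported
print(spare_mbbl / surplus_mbd)   # ≈13 days until forced shut-ins
```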
J
J@ggpcap·
@S_Kadakia Not seeing anything!
Sagar Kadakia
Sagar Kadakia@S_Kadakia·
Introducing Qualitate. AI built for primary intelligence.  Qualitate conducts thousands of expert discussions each month, delivering structured intelligence for the world’s leading investment firms and enterprises. Here's how: