Brian Costello

12.6K posts

@bpcostello

Trying to do the right thing. Launching new AI start-up.

Los Angeles, CA · Joined December 2008
2.8K Following · 22.6K Followers
Brian Costello@bpcostello·
@GoogleResearch The real requirement is not “preserve every number precisely,” but “preserve the geometry well enough that attention behaves the same.” We’ve been spending a lot of memory and bandwidth on precision the model doesn’t truly need.
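The "preserve the geometry, not every number" point can be made concrete with a toy sketch. This is not TurboQuant or any published algorithm; the shapes, seed, and tolerance are illustrative assumptions. Per-channel int8 quantization of cached keys changes the individual numbers, yet the query-key dot products that drive attention stay close to their fp32 values.

```python
# Toy sketch (illustrative, NOT TurboQuant): per-channel int8 quantization
# of cached keys, checking that attention logits (query @ key) survive.
import numpy as np

rng = np.random.default_rng(0)
n_keys, d = 128, 64                     # cached tokens, head dim (toy sizes)
keys = rng.normal(size=(n_keys, d)).astype(np.float32)
query = rng.normal(size=(d,)).astype(np.float32)

# One scale per channel: individual values change, the geometry mostly doesn't.
scale = np.abs(keys).max(axis=0) / 127.0
q_keys = np.round(keys / scale).astype(np.int8)     # 4x smaller than fp32
dequant = q_keys.astype(np.float32) * scale

logits_fp32 = keys @ query
logits_int8 = dequant @ query
rel_err = np.abs(logits_int8 - logits_fp32).max() / np.abs(logits_fp32).max()
print(rel_err < 0.05)   # logit error is tiny relative to the logit spread
```

The cache shrinks 4x (fp32 to int8) while the relative error in the attention logits stays far below the gaps between them, which is the sense in which "attention behaves the same."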
Google Research@GoogleResearch·
Introducing TurboQuant: Our new compression algorithm that reduces LLM key-value cache memory by at least 6x and delivers up to 8x speedup, all with zero accuracy loss, redefining AI efficiency. Read the blog to learn how it achieves these results: goo.gle/4bsq2qI
Brian Costello@bpcostello·
@DylanMitic Yes. The real requirement may not be “preserve every number precisely,” but “preserve the geometry well enough that attention behaves the same.” If that holds, then we’ve been spending a lot of memory and bandwidth on precision the model doesn’t truly need.
Brian Costello@bpcostello·
Agree wholeheartedly: hardware is the key. The biggest misconception in AI is that the model is just software. It isn't. At runtime, what matters is state (weights, activations, and KV cache) that must be stored, updated, and reused in memory. The future of AI performance and cost will depend as much on how hardware manages that state as on the model itself. On Earth and in space. In robotics and in LLMs. The future belongs to true co-design between AI models and the actual hardware they run on.
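A quick size calculation shows how fast that runtime state adds up. The shapes below are assumptions for a generic 7B-class transformer (32 layers, 32 KV heads, head dimension 128, fp16), not any vendor's published spec.

```python
# Back-of-envelope KV-cache sizing for a hypothetical 7B-class model.
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_elem):
    # Two tensors (K and V) per layer, each [batch, kv_heads, seq_len, head_dim].
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem

# 32 layers, 32 KV heads, head_dim 128, one 4096-token sequence, fp16 (2 bytes)
gib = kv_cache_bytes(32, 32, 128, 4096, 1, 2) / 2**30
print(gib)  # → 2.0
```

Two GiB of cache for a single 4k-token request, on top of the weights, is why "how hardware manages that state" dominates serving cost at scale.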
TBPN@tbpn·
Sequoia’s @shaunmmaguire wrote a private hardware manifesto arguing that over the next 25 years, most of the money will be made in hardware: "Every software revolution is preceded by a hardware revolution." "To have the iOS App Store that enabled Uber, DoorDash, and all of these great companies - you needed to have the iPhone." "This AI revolution - we're seeing what it can do from the software layer, but it's still limited by hardware." "The hardware we were doing for a long time was all following Moore's Law. It was all branching out of this decision in the mid-1950s to go all in on the silicon supply chain." "That has created magic, and there's still a couple orders of magnitude of juice to squeeze, but we’re hitting fundamental physics limits - Dennard scaling, things like that." "I think this tech tree is branching into humanoid robots, into silicon photonics, into orbital data centers - all of these new hardware areas where there's going to be 20+ years of progress." "There's going to be incredible businesses built on the back of this. And a lot of dumpster fires."
Brian Costello@bpcostello·
Tao is right about the core tension. We understand how these systems run, but we still do not understand what makes their intelligence dependable. We can train them, scale them, and watch impressive capabilities appear, yet we still cannot predict why performance shifts so sharply across tasks. To me, that suggests the missing piece is not just better learning math. It is a better way to decide what matters. The real problem is not only whether the model can compute, but whether it can reliably preserve the important signal, carry it forward through time, and ignore what does not matter. Memory is more important than compute. Until that is solved, intelligence will keep emerging in a powerful but inconsistent way. These systems are very good at pattern generation, but much less reliable at importance control. They work when they compress the right signal. They fail when that signal gets diluted, distracted, or lost.
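One way to picture "importance control" in code: a minimal heavy-hitter-style sketch that keeps only the cached entries attention actually uses. The shapes, the keep fraction, and the random data are illustrative assumptions, not a production eviction policy.

```python
# Minimal sketch of importance-based KV eviction: score cached tokens by
# the attention mass they received, keep the heavy hitters, drop the rest.
import numpy as np

rng = np.random.default_rng(1)
n_cached, d = 256, 64
keys = rng.normal(size=(n_cached, d))
values = rng.normal(size=(n_cached, d))

# Attention mass each cached token received over a window of recent queries.
queries = rng.normal(size=(16, d))
logits = queries @ keys.T / np.sqrt(d)
attn = np.exp(logits - logits.max(axis=1, keepdims=True))
attn /= attn.sum(axis=1, keepdims=True)
importance = attn.sum(axis=0)            # total mass per cached token

# Keep the heaviest 25%; the rest is the "diluted" signal the tweet describes.
keep = np.argsort(importance)[-n_cached // 4:]
keys_small, values_small = keys[keep], values[keep]
print(keys_small.shape)  # → (64, 64)
```

The whole bet of this family of techniques is that the kept 25% carries nearly all the attention mass, i.e. the model works when it compresses the right signal.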
Prof. Brian Keating@DrBrianKeating·
Terence Tao told me something that is both clarifying and unsettling about large language models. The mathematics underlying today’s LLMs is not especially exotic. At its core, training and inference mostly involve linear algebra, matrix multiplication, and some calculus. This is material a competent undergraduate could learn. In that sense, there is very little mystery about how these systems are constructed or how they run. And yet the real mystery begins there. What we do not understand well is why these models perform so impressively on certain tasks while failing unexpectedly on others. Even more striking, we lack reliable principles that allow us to predict this behavior in advance. Progress in the field remains largely empirical. Researchers scale models, change datasets, run experiments, and observe what emerges. Part of the difficulty lies in the nature of the data itself. Pure randomness is mathematically tractable. Perfectly structured systems are also tractable. But natural language, like most real-world phenomena, lives in an intermediate regime. And we humans hate that liminal space! It is neither noise nor order but a mixture of both. The mathematics for this middle ground remains comparatively underdeveloped. So we find ourselves in a peculiar position. We understand the machinery, yet we cannot reliably explain its capabilities. We can describe the mechanisms that produce these systems, but we cannot predict when new abilities will appear or how performance will vary across tasks. That tension, between relatively simple mathematical tools and highly unpredictable behavior, is the central puzzle of modern AI. (Video link in comments)
Brian Costello@bpcostello·
@a16z Great strategic insight by @ssankar on how to bring back real American innovation.
a16z@a16z·
"We became very good at financial engineering and forgot about engineering." Palantir CTO Shyam Sankar on how tech companies lose their edge: "Europe has created exactly zero companies from scratch in the last 50 years worth more than a hundred billion euro. We have created all of our trillion dollar companies from scratch in America in the last 50 years." "The difference is founders." "Intel, at some point, there was this fork in the road, where they could have promoted the CFO to be the CEO or Pat Gelsinger as CTO." "They picked the CFO. The person that Wall Street would understand, not the person who could actually determine the future roadmap." "It really looked like it was working for 10 years until it fell off a cliff." "But that was all financial engineering, not real engineering." @PalantirTech CTO @ssankar with @KTmBoyle
a16z@a16z

"I think our biggest risk as a country is suicide, not homicide." Palantir CTO Shyam Sankar joins a16z's Katherine Boyle and Erik Torenberg to discuss Shyam's new book, Mobilize, as well as defense, AI, the SaaSpocalypse, and more. 00:00 Introduction 07:53 Rebuilding the industrial base 18:01 Modernizing the Army 24:20 The SaaSpocalypse 29:42 Agency over automation 38:24 Beating China without self-sabotage 40:42 Film as cultural willpower 49:57 The story of Admiral Rickover @ssankar @KTmBoyle @eriktorenberg @PalantirTech

Brian Costello@bpcostello·
@SenSanders No, he is doing it to bring manufacturing back to the US. That is a good thing.
Sen. Bernie Sanders@SenSanders·
Jeff Bezos, one of the richest men on earth, is raising $100 billion to replace workers with robots around the world. The oligarchs want it all. Not going to happen. Stand up and FIGHT BACK.
Brian Costello@bpcostello·
@alex_prompter Not sure about the 60 percent number, but there is substantial waste in current transformer execution, and much of it lives in the mechanics of execution rather than in “intelligence” itself.
Alex Prompter@alex_prompter·
🚨 BREAKING: NVIDIA sold the most powerful AI chip ever built. Then Princeton discovered the software running on it was wasting 60% of it. Every inference job. Every training run. 60 cents on every dollar, gone. > NVIDIA doubled the raw compute power of their Blackwell B200 GPUs compared to Hopper H100. Tensor core throughput went from 1 PFLOPS to 2.25 PFLOPS. The most powerful AI chip ever built. > The problem: the rest of the chip didn't scale with it. Memory bandwidth stayed the same. The exponential unit stayed the same. So the bottleneck moved, and all that extra compute sat idle while the slower parts of the chip became the new ceiling. > Every existing attention implementation, including FlashAttention-3, was designed for Hopper. On Blackwell they either left massive performance on the table or couldn't run at all. > Princeton, Meta, and Together AI spent months redesigning attention from scratch around the new bottleneck. New pipelines. Software-emulated exponential functions. A completely different backward pass. The result: FlashAttention 4. → Up to 2.7× faster than Triton on B200 GPUs → Up to 1.3× faster than NVIDIA's own cuDNN library → Reaches 1,613 TFLOPs/s, 71% of theoretical maximum → Compile time dropped from 55 seconds to 2.5 seconds (22× faster) → Written entirely in Python, no C++ template expertise required. The scariest part: this wasn't a hardware problem. The chip was delivering exactly what NVIDIA promised. The software just wasn't designed for it. Every AI lab running B200s before this paper was paying for compute they couldn't use.
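The compute-versus-bandwidth imbalance described above reduces to a single roofline number: how many FLOPs a kernel must perform per byte moved to keep the tensor cores busy. The bandwidth figure below is an assumption (~8 TB/s, HBM3e-class) treated as approximate; real Blackwell specs vary by SKU and datatype.

```python
# Roofline sanity check using the tensor throughput quoted above and an
# assumed ~8 TB/s memory bandwidth (approximate, HBM3e-class).
peak_flops = 2.25e15   # ~2.25 PFLOPS tensor throughput (quoted figure)
mem_bw = 8e12          # ~8 TB/s HBM bandwidth (assumption)

# FLOPs needed per byte moved to stay compute-bound rather than memory-bound.
ridge = peak_flops / mem_bw
print(round(ridge))    # → 281
```

Decode-phase attention performs on the order of a few FLOPs per byte of KV cache it streams, orders of magnitude below that ridge point, which is why redesigning around the memory system (rather than raw FLOPS) is where the recovered performance comes from.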
Brian Costello@bpcostello·
@chamath Legislators could easily fix this with zoning in CA: require the special-quality gas only in big cities, not rural areas. Push down the price.
Chamath Palihapitiya@chamath·
Newsom drove one refinery to shut down last year, and another will shut down in a month. Our gas prices were already sky high because of these actual and forecasted closures - it forced us to import gas and have little bargaining power in doing it.
James Blair@JamesBlairUSA

Californians already pay 50% more for gas than the rest of the country, and, thanks to Gavin shutting down the state’s refineries, they are estimated to pay another $.50 a gallon on top. Add math and basic economics to the list of subjects Gavin struggles with.

Brian Costello@bpcostello·
Always worthwhile checking out @TheDragonFeeder's thoughts on our important global competition (economic war) w/China. Looking forward to taking it in over the weekend. @APompliano
Anthony Pompliano 🌪@APompliano

The US and China are locked in a global competition. I sat down with @TheDragonFeeder to discuss war in Iran, central banks buying gold, how bitcoin affects the geopolitical relationship, where humanoids fit in, and how China is manipulating the media. Enjoy!

Brian Costello reposted
Brian Costello@bpcostello·
Prefill requires massive parallel computation (Rubin CPX). Decode is not a compute problem. It's a "how fast can I read from memory" problem. That's what Groq's SRAM chip is used for. Dynamo ends up being the orchestration software that moves the KV cache (the model's working memory) from the prefill chip to the decode chip and manages the handoff. Nvidia is splitting inference into two specialized jobs because one chip can't do both well. But what if the big leap is that we need a better memory system, and over 70% of what we're moving is not even needed in the first place?
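Why decode is a "how fast can I read from memory" problem, in one line of arithmetic: at batch size 1, every generated token must stream the full weight set (plus the KV cache) through memory once, so bandwidth caps token rate regardless of FLOPS. The model size and bandwidth below are assumptions (a generic 7B fp16 model on an H100-class part), not measured numbers.

```python
# Upper bound on single-stream decode throughput for a hypothetical 7B fp16
# model: every token must read all weights from memory at least once.
params = 7e9
bytes_per_token = params * 2    # fp16 weights streamed once per token
mem_bw = 3.35e12                # ~3.35 TB/s HBM (H100-class, assumption)

max_tokens_per_s = mem_bw / bytes_per_token
print(round(max_tokens_per_s))  # → 239
```

No amount of extra compute raises that ceiling; only more bandwidth, batching, or moving fewer bytes (quantization, cache pruning) does, which is the point of the "70% of what we're moving is not needed" question.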
Chamath Palihapitiya@chamath·
The next phase of AI silicon is all about cheap, abundant decode. Groq was just the appetizer…This paper is a very good guide.
Chris Laub@ChrisLaubAI

🚨 BREAKING: A Google researcher and a Turing Award winner just published a paper that exposes the real crisis in AI. It's not training. It's inference. And the hardware we're using was never designed for it. The paper is by Xiaoyu Ma and David Patterson. Accepted by IEEE Computer, 2026. No hype. No product launch. Just a cold breakdown of why serving LLMs is fundamentally broken at the hardware level. The core argument is brutal: → GPU FLOPS grew 80X from 2012 to 2022 → Memory bandwidth grew only 17X in that same period → HBM costs per GB are going UP, not down → The Decode phase is memory-bound, not compute-bound → We're building inference on chips designed for training Here's the wildest part: OpenAI lost roughly $5B on $3.7B in revenue. The bottleneck isn't model quality. It's the cost of serving every single token to every single user. Inference is bleeding these companies dry. And five trends are making it worse simultaneously: → MoE models like DeepSeek-V3 with 256 experts exploding memory → Reasoning models generating massive thought chains before answering → Multimodal inputs (image, audio, video) dwarfing text → Long-context windows straining KV caches → RAG pipelines injecting more context per request Their four proposed hardware shifts: → High Bandwidth Flash: 512GB stacks at HBM-level bandwidth, 10X more memory per node → Processing-Near-Memory: logic dies placed next to memory, not on the same chip → 3D Memory-Logic Stacking: vertical connections delivering 2-3X lower power than HBM → Low-Latency Interconnect: fewer hops, in-network compute, SRAM packet buffers Companies that tried SRAM-only chips like Cerebras and Groq already failed and had to add DRAM back. This paper doesn't sell a product. It maps the entire hardware bottleneck and says: the industry is solving the wrong problem. Paper dropped January 2026. Link in the first comment 👇

Brian Costello@bpcostello·
Great point by @NaveenGRao that we started in reverse w/brute force. We're only now learning how much compute/math we don't actually need. But the brain (biology) analogy gets overplayed. A bird and a plane both fly, yet they are very different systems with very different energy profiles. Biology can inspire AI without defining its endpoint. A brain runs on tiny power, but it also cannot train on trillions of tokens or replicate itself perfectly. Machine intelligence will get dramatically more efficient not by becoming biological, but by eliminating unnecessary computation for the kind of system it actually is.
Brian Costello@bpcostello·
The most important part was not the AGI capability claim. It was the architectural admission in Section 10.2: today’s systems are “stateless,” may need long-term memory built into the architecture, even “a vector which represents the context” alongside tokens, a “slow-thinking” mechanism for planning and verification, and perhaps to go beyond “single-word prediction.” That’s a diagnosis of needs in the substrate. A before-its-time admission that scale alone will not close the gap; some of the missing pieces are architectural.
a16z@a16z·
Unconventional AI CEO Naveen Rao on the incredible energy efficiencies of biological systems relative to technology: "Biology sort of started small, figured out some basic principles, and those principles scaled. So the efficiency came first." "What we've done is actually the inverse. We've brute forced our way through it, throwing everything we possibly could, and now we're understanding, 'Oh actually I didn't need to do all of that, there's a lot of things I can keep chipping away at, and I can go smaller, and smaller, and smaller.'" "Just to kind of put it in perspective — biology, through this process of kind of the bottoms-up — a squirrel runs on 10 milliwatts of energy. Your cell phone runs on about one watt." "10 milliwatts, and it can do things at precision levels that we cannot do in a megawatt. I can't make a robot jump between branches in the wind and hit the branch perfectly a thousand times out of a thousand right now. I can't do it." @NaveenGRao @unconvAI
Brian Roemmele@BrianRoemmele·
NEW NVIDIA JOB LISTING. It should tell you all you need to know about where datacenter jobs will go… IN SPACE. “What you will be doing: •Drive architecture for orbital datacenter systems considering everything from the chip out to the satellite and connectivity between satellites” Link: nvidia.wd5.myworkdayjobs.com/en-US/NVIDIAEx…