Chase Holmes 🇺🇸
@chase1440
frontier inference @etched // prev expeditions @databricks @mosaicml @redpoint @amplitude_hq

San Francisco, CA · Joined April 2013
339 Following · 349 Followers
Naveen Rao@NaveenGRao·
Holy crap...@databricks is likely worth more than Salesforce now
[image attached]
Chase Holmes 🇺🇸 reposted
Linden Li@lindensli·
Demand for tokens has skyrocketed this year as long-horizon agents have gotten more useful. The heavier load associated with each request has placed immense pressure on inference systems, producing a new set of systems challenges for inference engines compared to the chatbot/RAG workloads of the past. Most inference load tests today exercise a very different pattern by submitting single-turn requests. Inference engines (proprietary API providers, SGLang, vLLM, TRT-LLM, etc.) have made immense progress at serving those well for interactive chatbot-like use cases, but modern workloads are less latency-sensitive and process far more tokens. We've run into challenges when adapting these engines to agent workloads during RL training and production serving.

Some observations from optimizing the workloads we're open sourcing today:

(1) Inference workloads have become more prefill-heavy than we'd expected. In our runs, this has come from either initial user prompts containing a large dump of context (see the Q/A workload) or a large number of turns feeding in more context (e.g., the office-work workload). Improvements in harness engineering should give the model the minimum context necessary to complete a task, but tool-call tokens will remain a significant part of the workload. Techniques and parallelism configurations that worked well for training workloads, which tend to have higher token batch sizes, are worth investigating here.

(2) KV cache management has become an increasingly large obstacle to GPU utilization. You can observe this in the completion throughput vs. interactivity plots posted in the blog, where higher concurrencies start off positively correlated with completion throughput but drop off sharply past a point. When the load is too severe, the system begins to thrash: schedulers have no choice but to evict cached tokens, and cache locality degrades.

(3) Finding the right north-star metric matters a lot for measuring system goodput. The analogue in training is MFU vs. HFU, where the former measures how effectively a training stack completes only the necessary work (e.g., excluding activation recomputation). Completion tokens per second is the inference analogue: the rate at which prompt tokens are processed is irrelevant if the system is redoing work it has already done. We're releasing a simple harness that runs three workload shapes that give a flavor of the models we've trained for enterprises in production. The JSONL files in the repo should contain a fully replayable set of workloads. We hope this will spur more innovation in inference engine design for these new workloads.
Applied Compute@appliedcompute

Inference demand in 2026 has surged, but not for single-turn workloads that most engines are benchmarked on. Agentic workloads have a different structure: traces consist of many tool-calling turns with heavy-tailed distributions over assistant and tool output. These workloads introduce a new set of challenges for efficient serving. We pulled production traces from over 100 post-training runs and are open sourcing these workloads to help define a new target for inference engine optimization.
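The "north star" distinction in point (3) above can be sketched in a few lines: raw token throughput counts re-prefilled context as useful work, while completion tokens per second counts only newly decoded output. This is a minimal illustrative sketch, not the released harness; the record fields (`prompt_tokens`, `cached_tokens`, `completion_tokens`) are assumed names, not the schema of the open-sourced JSONL files.

```python
from dataclasses import dataclass

@dataclass
class RequestRecord:
    prompt_tokens: int      # context tokens submitted with this request
    cached_tokens: int      # prefix tokens served from KV cache (no recompute)
    completion_tokens: int  # new tokens decoded for this request

def throughput_metrics(records, wall_clock_s):
    """Contrast total token throughput with completion-token goodput.

    Prefill that re-processes context the engine has already seen
    inflates total throughput without producing new output, so
    completion tokens/sec is the better goodput analogue of MFU.
    """
    total = sum(r.prompt_tokens + r.completion_tokens for r in records)
    completion = sum(r.completion_tokens for r in records)
    fresh_prefill = sum(r.prompt_tokens - r.cached_tokens for r in records)
    return {
        "total_tok_per_s": total / wall_clock_s,
        "completion_tok_per_s": completion / wall_clock_s,  # goodput
        "fresh_prefill_tok_per_s": fresh_prefill / wall_clock_s,
    }

# Two agent turns: the second resends the full context with no cache hit.
recs = [RequestRecord(1000, 800, 50), RequestRecord(2000, 0, 100)]
m = throughput_metrics(recs, 10.0)
# Total throughput (315 tok/s) looks healthy, but goodput is only 15 tok/s.
```

A cache-evicting scheduler under thrash drives `fresh_prefill_tok_per_s` up while `completion_tok_per_s` stalls, which is exactly the divergence the thread argues a benchmark should surface.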

Matt Slotnick@matt_slotnick·
if all new AI software revenue moves to consumption with a commit and burn down structure, we're going to need to redo SaaS metrics from the ground up
logan bartlett@loganbartlett·
Nine months ago, I thought AI would resemble mobile for software vendors, where incumbents largely adapted and survived. I now think it looks much more like the internet, where the default outcome is that incumbents lose.
tyler hogge@thogge·
well, i guess we know the ACV now
[image attached]
TBPN@tbpn·
Redpoint's @loganbartlett says AI has completely changed hiring—favoring people with unique backgrounds: "Agency might be the only thing that matters." "That's the thing that we are trying to figure out—where do you find pockets of people who still want to do the job talent-wise, or have the capability to do the job, but also have agency?"
Chase Holmes 🇺🇸 reposted
Ross@rpoo·
with the exception of energy, chips, and labor - we are about to be limited by ambition