Chase Holmes 🇺🇸
@chase1440
frontier inference @etched // prev expeditions @databricks @mosaicml @redpoint @amplitude_hq

San Francisco, CA · Joined April 2013
339 Following · 349 Followers
Naveen Rao@NaveenGRao·
Holy crap...@databricks is likely worth more than Salesforce now
[image attached]
Chase Holmes 🇺🇸 reposted
Linden Li@lindensli·
Demand for tokens has skyrocketed this year as long-horizon agents have gotten more useful. The heavier load associated with each request has placed immense pressure on inference systems, producing a new set of systems challenges for inference engines compared to the chatbot/RAG workloads of the past. Most inference load tests today exercise a very different pattern by submitting single-turn requests. Inference engines (proprietary API providers, SGLang, vLLM, TRT-LLM, etc.) have made immense progress at serving those well for interactive chatbot-like use cases, but modern workloads are less latency-sensitive and process far more tokens. We've run into challenges when adapting these engines to agent workloads during RL training and production serving.

Some observations from optimizing the workloads we're open sourcing today:

(1) Inference workloads have become more prefill-heavy than we'd expected. In our runs, this has come from either initial user prompts containing a large dump of context (see the Q/A workload) or a large number of turns feeding in more context (e.g., the office-work workload). Improvements in harness engineering should give the model the minimum context necessary to complete a task, but tool-call tokens will remain a significant part of the workload. Techniques and parallelism configurations that worked well for training workloads, which tend to have higher token batch sizes, are worth investigating here.

(2) KV cache management has become an increasingly large obstacle to GPU utilization. You can observe this in the completion throughput vs. interactivity plots posted in the blog, where higher concurrencies start off positively correlated with completion throughput but drop off sharply past a point. When the load is too severe, the system begins to thrash: schedulers have no choice but to evict cached tokens, and cache locality degrades.

(3) Finding the right north-star metric matters a lot for measuring system goodput. The analogue in training is MFU vs. HFU, where the former measures how effectively a training stack completes only the necessary work (e.g., excluding activation recomputation). Completion tokens per second is the inference analogue: the rate at which prompt tokens are processed is irrelevant if the system is redoing work it has already done. We're releasing a simple harness that runs three workload shapes that give a flavor of the models we've trained for enterprises in production. The JSONL files in the repo should contain a fully replayable set of workloads. We hope this will spur more innovation in inference engine design for these new workloads.
Applied Compute@appliedcompute

Inference demand in 2026 has surged, but not for single-turn workloads that most engines are benchmarked on. Agentic workloads have a different structure: traces consist of many tool-calling turns with heavy-tailed distributions over assistant and tool output. These workloads introduce a new set of challenges for efficient serving. We pulled production traces from over 100 post-training runs and are open sourcing these workloads to help define a new target for inference engine optimization.
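The "north star" distinction in point (3) above can be sketched in a few lines: raw token throughput counts re-prefilled context as useful work, while completion tokens per second counts only newly decoded output. This is a minimal illustrative sketch, not the released harness; the record fields (`prompt_tokens`, `cached_tokens`, `completion_tokens`) are assumed names, not the schema of the open-sourced JSONL files.

```python
from dataclasses import dataclass

@dataclass
class RequestRecord:
    prompt_tokens: int      # context tokens submitted with this request
    cached_tokens: int      # prefix tokens served from KV cache (no recompute)
    completion_tokens: int  # new tokens decoded for this request

def throughput_metrics(records, wall_clock_s):
    """Contrast total token throughput with completion-token goodput.

    Prefill that re-processes context the engine has already seen
    inflates total throughput without producing new output, so
    completion tokens/sec is the better goodput analogue of MFU.
    """
    total = sum(r.prompt_tokens + r.completion_tokens for r in records)
    completion = sum(r.completion_tokens for r in records)
    fresh_prefill = sum(r.prompt_tokens - r.cached_tokens for r in records)
    return {
        "total_tok_per_s": total / wall_clock_s,
        "completion_tok_per_s": completion / wall_clock_s,  # goodput
        "fresh_prefill_tok_per_s": fresh_prefill / wall_clock_s,
    }

# Two agent turns: the second resends the full context with no cache hit.
recs = [RequestRecord(1000, 800, 50), RequestRecord(2000, 0, 100)]
m = throughput_metrics(recs, 10.0)
# Total throughput (315 tok/s) looks healthy, but goodput is only 15 tok/s.
```

A cache-evicting scheduler under thrash drives `fresh_prefill_tok_per_s` up while `completion_tok_per_s` stalls, which is exactly the divergence the thread argues a benchmark should surface.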

Matt Slotnick@matt_slotnick·
if all new AI software revenue moves to consumption with a commit and burn down structure, we're going to need to redo SaaS metrics from the ground up
logan bartlett@loganbartlett·
Nine months ago, I thought AI would resemble mobile for software vendors, where incumbents largely adapted and survived. I now think it looks much more like the internet, where the default outcome is that incumbents lose.
tyler hogge@thogge·
well, i guess we know the ACV now
[image attached]
TBPN@tbpn·
Redpoint's @loganbartlett says AI has completely changed hiring—favoring people with unique backgrounds: "Agency might be the only thing that matters." "That's the thing that we are trying to figure out—where do you find pockets of people who still want to do the job talent-wise, or have the capability to do the job, but also have agency?"
Chase Holmes 🇺🇸 reposted
Ross@rpoo·
with the exception of energy, chips, and labor - we are about to be limited by ambition