Ashish Tuli

242 posts

@ashishtuli

Chips. AI. F1. https://t.co/6dAKvFPVYz All views my own.

Menlo Park, CA · Joined December 2008
175 Following · 279 Followers
Ashish Tuli @ashishtuli
TokenSpeed reporting up to 580 tok/s on Qwen3.5-397B-A17B is a useful signal, but not just because the number is high. The interesting part is how the speed was achieved: less copying, more kernel fusion, and better overlap between the CPU and GPU so the accelerator spends less time waiting. Agentic workloads change the inference problem. Multi-turn context, tool histories, and long state make latency a systems question, not just a model or accelerator question.
PyTorch @PyTorch

The speed-of-light optimization for Qwen3.5 on the TokenSpeed inference engine is a significant milestone, achieving a record-breaking 580 tokens per second (tps) for agentic workloads on NVIDIA GPUs. In the PyTorch Foundation's latest community blog post, you can learn all about the complete design, implementation, and optimization of Qwen3.5 models in the TokenSpeed inference framework and see for yourself how this work is improving performance 👉 bit.ly/4uGUvIS This achievement was a joint effort between the @Alibaba_Qwen inference team, @lightseekorg Foundation TokenSpeed team, @NVIDIAAI , and the Mooncake team, with special contributions from @tri_dao for FlashAttention-4 (FA4) optimization. @KVCache_AI

0
0
0
66
Ashish Tuli @ashishtuli
OpenAI's Tax AI case study is easy to read as a tax automation story. I think the more important signal is the operating loop. Every expert correction becomes a production trace. Recurring traces become evals. Evals become bounded engineering tasks. Those tasks ship only after regression checks. That is a very different pattern from “AI gives an answer and a human checks it.” It is a way to turn expert judgment into compounding product improvement. For vertical AI, the first model answer matters. The learning loop around the model may matter more.
OpenAI Developers @OpenAIDevs

⚙️ Behind the build of self-improving tax agents with Codex We co-built Tax AI with @ThriveHoldings around tax prep workflows so when reviewers fix any errors, Codex can trace the failure, improve the system, and test the change before it ships. openai.com/index/building…

0
0
0
89
Ashish Tuli @ashishtuli
The HBM bottleneck is no longer just bandwidth or supply. It is the heat path. SK hynix’s iHBM puts integrated cooling elements inside the HBM package, directly in the D2D PHY region where HBM connects to the accelerator, and says the design reduces thermal resistance by 30%. That matters because HBM scaling is moving in two directions at once: taller stacks and faster interfaces. Both increase power density in places traditional system-level cooling does not reach cleanly. The interesting shift is where the fix now lives: not around the rack, but inside the memory package.
0
0
1
74
Ashish Tuli @ashishtuli
The goal in AI infrastructure is tokens per dollar. One of the constraints now determining it is tokens per watt. As racks move from tens of kilowatts toward hundreds, 54V power delivery stops being a background choice. Lower voltage means much higher current, and current is what drives copper mass, heat, voltage drop and wasted space. Moving from 54V to 800V cuts current by roughly 15x. Because resistive losses scale with current squared, the power-chain math changes fast. That is why 800VDC is not just an electrical upgrade. It is a way to move the power-delivery ceiling higher so dense AI racks can keep scaling. Really good @SemiAnalysis_ article on this: x.com/SemiAnalysis_/…
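The 54V-versus-800V arithmetic in the post above can be checked with a minimal sketch. The rack power and path resistance below are assumed illustrative values, not figures from the post or the linked article, and the resistance is held constant at both voltages purely for comparison; a real design would resize conductors.

```python
def bus_current_amps(power_watts: float, volts: float) -> float:
    """Current needed to deliver a given power at a given bus voltage (I = P / V)."""
    return power_watts / volts


def resistive_loss_watts(power_watts: float, volts: float, resistance_ohms: float) -> float:
    """Conduction loss in the distribution path (P_loss = I^2 * R)."""
    current = bus_current_amps(power_watts, volts)
    return current ** 2 * resistance_ohms


RACK_POWER_W = 100_000        # assumed: a 100 kW rack
PATH_RESISTANCE_OHM = 0.0001  # assumed: 0.1 milliohm, same at both voltages

current_54v = bus_current_amps(RACK_POWER_W, 54)
current_800v = bus_current_amps(RACK_POWER_W, 800)
current_ratio = current_54v / current_800v

loss_54v = resistive_loss_watts(RACK_POWER_W, 54, PATH_RESISTANCE_OHM)
loss_800v = resistive_loss_watts(RACK_POWER_W, 800, PATH_RESISTANCE_OHM)
loss_ratio = loss_54v / loss_800v

print(f"current: {current_54v:.0f} A at 54V vs {current_800v:.0f} A at 800V ({current_ratio:.1f}x)")
print(f"I^2R loss ratio at equal resistance: {loss_ratio:.0f}x")
```

The current ratio is simply 800/54, about 14.8x, which matches the "roughly 15x" in the post. At equal resistance the loss ratio is the square of that, which is why the power-chain math shifts so quickly.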
1
0
1
106
Ashish Tuli reposted
Vaibhav (VB) Srivastav
UPDATE: Came up with an even better version of this prompt after the feedback.

Ask Codex to look across your sessions, Memories, and Chronicle, identify patterns, reuse what already exists, and only create the smallest useful skill, subagent, or automation.

"Look back over my recent work from the last 30 days, or all available history if shorter, and identify repeated manual workflows worth packaging.

Use available evidence in this order:
- Recent Codex sessions and task summaries.
- Codex Memories and rollout summaries to find patterns repeated across sessions.
- Chronicle, if enabled, to spot repeated work outside Codex. Use Chronicle for discovery only; confirm important details in the relevant source system when possible.
- Existing skills, custom agents, and automations, so you reuse or extend what already exists instead of duplicating it.

Look broadly for work that is repeated, time-consuming, error-prone, context-heavy, or benefits from a consistent process. Include workflows across coding, research, writing, planning, communication, operations, analysis, and personal administration.

Only act on a candidate when it:
- occurred at least twice, or is clearly likely to recur and costly to repeat;
- has stable inputs, a repeatable procedure, and a clear output or stopping condition;
- would materially improve speed, quality, consistency, or reliability;
- is not already adequately covered.

Choose the smallest appropriate form:
- Skill: a reusable workflow or playbook.
- Custom subagent: a bounded specialist role or investigation task suitable for delegation.
- Automation: a scheduled or recurring check, report, reminder, or monitor.
- Skip: work that is too one-off, ambiguous, sensitive, or poorly evidenced to package.

First produce a compact shortlist with:
- repeated workflow
- supporting evidence and dates
- frequency/confidence
- recommended form: skill, subagent, automation, extend existing, or skip
- why it is or is not worth creating

Then create only the high-confidence missing items. Keep them narrow, practical, source-aware, and easy to validate. Do not create speculative, overlapping, or overly broad assets.

Finish with:
- what you created or extended
- what you deliberately skipped
- what needs more evidence before packaging"
Vaibhav (VB) Srivastav @reach_vb

Copy and paste this into your codex: “Look through my recent Codex sessions and identify repeated workflows or repeated asks. For anything I keep doing manually, suggest:
1. a skill if it is a reusable workflow
2. a custom subagent if it is a bounded role or investigation task
Focus on practical things like CI failures, PR reviews, changelogs, docs updates, release prep, debugging, and test triage. Create the useful ones only. Keep them simple.”

95
366
3.6K
850.8K
Ashish Tuli @ashishtuli
@cytrusf1 We need a downvote button for a post, not just comments.
0
0
0
28
Cytrus 🍋 @cytrusf1
He can't deal with adversity. When things don't go his way he lashes out with unnecessary anger and borderline violence.
388
343
8.7K
671.2K
Ashish Tuli @ashishtuli
Coding agents are exposing a critical infrastructure sizing problem. @SemiAnalysis_ measured 432k real coding-agent requests. Median input was 96k tokens, roughly 3x what much of today's AI infrastructure was sized for. Nearly half already exceed 128k. The important part is what those tokens are. Not users writing longer prompts. The agent harness is assembling a working set: system prompts, tool definitions, MCP schemas, skills, prior turns, repo structure, file contents, retrieval, and tool state. That changes the workload. 3x the context is not a 3x cost problem. In long-context inference, prefill becomes the bottleneck, and standard dense attention compute scales quadratically with sequence length. That is the accelerator-side problem. The CPU-side problem is everything around the model call: tokenization, request scheduling, cache management, tool execution, API calls, permission checks, and validation loops. A chat request is mostly prompt in, tokens out. A coding-agent request is a distributed systems workflow wrapped around inference. That is why the CPU:GPU ratio debate matters. Not because GPUs matter less. Because agents increase the CPU, memory, and orchestration required per GPU. The real issue is not that agents are "a little more expensive" than chat. It is that infrastructure built for chat is structurally undersized for agents.
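The claim that 3x the context is not a 3x cost problem can be illustrated with a rough scaling sketch. It assumes a 32k-token sizing baseline (inferred from the "roughly 3x" framing, not stated as a measured figure) and models only two idealized terms: work that grows linearly with sequence length and standard dense attention that grows quadratically. Real prefill cost is a mix of both and depends on the model and kernels.

```python
def relative_prefill_scaling(context_tokens: int, baseline_tokens: int) -> dict:
    """Return how each idealized cost term grows relative to the baseline context.

    linear_terms: per-token work (projections, MLP), proportional to n.
    dense_attention: standard attention score computation, proportional to n^2.
    """
    scale = context_tokens / baseline_tokens
    return {"linear_terms": scale, "dense_attention": scale ** 2}


BASELINE_TOKENS = 32_000  # assumed sizing point for chat-era infrastructure
MEDIAN_TOKENS = 96_000    # median coding-agent input cited in the post

scaling = relative_prefill_scaling(MEDIAN_TOKENS, BASELINE_TOKENS)
print(f"linear terms: {scaling['linear_terms']:.0f}x")
print(f"dense attention: {scaling['dense_attention']:.0f}x")
```

Under these assumptions the linear terms grow 3x while the dense attention term grows 9x, so the blended cost lands somewhere between the two rather than at 3x.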
SemiAnalysis @SemiAnalysis_

Agentic workloads are quietly rewriting inference economics. We pulled data from 432k real coding agent requests at SemiAnalysis and the median one isn't 32k, isn't 64k, but 96k input tokens. For context, that's more than the entire text of The Great Gatsby being shoved into the model before you've even typed your question. (1/3)🧵

2
0
2
170
Ashish Tuli @ashishtuli
The interesting question in EMIB vs. CoWoS is not which packaging technology is “better.” It is what each architecture makes easier or harder as AI accelerator packages get larger. Once a single die is no longer enough, packaging stops being back-end assembly and becomes system architecture. That is why the comparison is not just about interconnect density. It is about how multi-die systems scale around reticle limits, warpage, yield, utilization, and cost. Must-read primer from @austinsemis: chipstrat.com/p/advanced-pac…
1
15
102
44.9K
Ashish Tuli @ashishtuli
AI credits are becoming a go-to-market strategy. OpenAI’s YC offer made the pattern explicit: tokens for equity. Hyperagent’s $10M Founding 500 program is the lighter version: credits to seed the next wave of agent-first builders. This is not just “startup support.” It is distribution. Cloud credits helped shape where startups built their infrastructure. Token credits may do the same for AI platforms: get founders building, testing, debugging, and scaling inside your stack before defaults harden. In AI, usage is the new land grab.
0
0
2
127
Ashish Tuli @ashishtuli
“Fabless” used to mean asset-light. Now it increasingly means prepaying, co-investing, and locking up years of wafer and advanced packaging capacity just to secure supply. You may not own the fab, but you are still paying to make sure it exists for you. The label is still “fabless.” But it is not truly asset-light anymore.
AMD @AMD

Today, we announced more than $10B in investment across Taiwan’s ecosystem to scale advanced packaging and accelerate next-gen AI infrastructure, from 6th Gen EPYC CPUs codenamed “Venice” to our Helios rack-scale platform including Instinct MI450X GPUs, with multi-gigawatt deployments beginning in 2H 2026. Additionally, AMD and TSMC have hit another major production milestone, with Venice EPYC CPUs ramping on TSMC 2nm technology in Taiwan with future plans to ramp production at TSMC’s Arizona Fab. More on the news: bit.ly/4tJrUkR

12
17
133
34.9K
Ashish Tuli @ashishtuli
@kinsonwu08 Exactly, and this exposure isn’t just an AMD problem, it’s the reality for every major fabless company.
0
0
1
235
techQuicker @kinsonwu08
@ashishtuli Correct. If AMD does not establish its own robust supply chain now, it will be dependent on others, especially through over-reliance on Taiwan's industrial chain, which could become a fatal problem in the coming years.
1
0
1
328
Alex @Alex_Intel_
@ashishtuli 💯 CPU TAM growing so large, all of Intel's excess clean room space looks like a great decision now too
1
1
14
586