Prashanth (Manohar) Velidandi

1.6K posts


@PMV_InferX

AI Infrastructure Researcher | Co-founder at @InferXai, a Multi-Tenant Serverless Platform for Scalable Inference

San Francisco, CA · Joined November 2022
129 Following · 240 Followers

Pinned Tweet
Prashanth (Manohar) Velidandi@PMV_InferX·
‘Civilized inference’ is about respect. Respect for compute. Respect for shared GPUs. Respect for developers’ time. Respect for cost transparency. Most infrastructure normalizes waste. Idle VRAM. Hidden billing. Manual warm pools. ‘Civilized inference’ refuses to. @InferXai inferx.net
Raja Koduri@RajaXg·
@Midnight_Captl Where did I say the prices will fall? I'm making a point that the current supply constrained scenario will create new approaches , as the demand for more memory capacity and bandwidth far exceeds the supply!
Prashanth (Manohar) Velidandi
@RajaXg we’ve been building for this constraint from day one. Optimizing utilization should have been the priority for a long time, but it was abused instead.
Raja Koduri@RajaXg·
I warned my memory friends a few months ago.. there are tons of optimizations available across the whole stack to reduce memory capacity and bandwidth... as long as memory was relatively "cheap", we stayed lazy... constraints unleash creativity.. I hear the memory supply chain constraints won't be solved till 2030.. prepare for a deluge of creativity.. it hasn't been a week since TurboQuant... not only in software, but you will see some insanely cool hardware improvisations and new suppliers rise to the top as well
ComfyUI@ComfyUI

Upgrading your RAM is now unnecessary. Introducing our new ComfyUI Dynamic VRAM optimization. Running local models is now possible on even the most memory constrained hardware. Read more here: blog.comfy.org/p/dynamic-vram…

Clint | Options@clintoptions·
I have a secret to share. After your first $2–$3 million, a paid-off home and a good car, there is no difference in quality of life between you and Jeff Bezos. Both of you have a limited amount of time on earth; you have twice as much as Jeff, if not more, so you are richer than him. A cheeseburger is a cheeseburger whether a billionaire eats it or you do. Money is nothing but a piece of paper or a number in your app. Real life is outdoors. Become financially independent; that’s usually $2–3 million. Have good food. Enjoy your relationships. Work out. Sleep well. Call your parents. That’s all there is to life. Greed has no end. Repeat after me: time is the currency of life. Money is not. The sooner you figure this out, the happier you will be.
Prashanth (Manohar) Velidandi
3,500,000 models on Hugging Face. Less than 0.1% have an inference endpoint. Not a model problem. A serving problem. Dedicating a GPU to every model is uneconomical. Most models never run because the math doesn’t work. inferx.net
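A rough sketch of why the math doesn’t work for dedicated serving (all prices and request counts below are illustrative assumptions, not InferX or Hugging Face figures):

```python
# Back-of-envelope cost of dedicating one GPU to a long-tail model.
# Every number here is an illustrative assumption.
GPU_HOURLY_USD = 2.00        # assumed on-demand price for one GPU
HOURS_PER_MONTH = 730
REQUESTS_PER_MONTH = 500     # a long-tail model may see very little traffic
SECONDS_PER_REQUEST = 2.0    # assumed end-to-end inference time

dedicated_cost = GPU_HOURLY_USD * HOURS_PER_MONTH            # GPU sits idle 24/7
busy_hours = REQUESTS_PER_MONTH * SECONDS_PER_REQUEST / 3600
utilization = busy_hours / HOURS_PER_MONTH

print(f"dedicated GPU cost/month  : ${dedicated_cost:,.0f}")
print(f"GPU utilization           : {utilization:.4%}")
print(f"effective cost per request: ${dedicated_cost / REQUESTS_PER_MONTH:.2f}")
```

Under those assumptions the GPU is busy well under 0.1% of the time and each request effectively costs a few dollars, which is why per-model dedicated serving only pencils out for the most popular models.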
Prince Canuma@Prince_Canuma·
TurboQuant ≠ model compression. It quantizes the KV cache (the memory that grows with context length), not the model itself. No training, no fine-tuning, zero accuracy loss at 3 bits. But if the model doesn’t fit your VRAM? TurboQuant won’t change that. It solves the inference bottleneck, not the loading problem.
Prince Canuma@Prince_Canuma

Just implemented Google’s TurboQuant in MLX and the results are wild! Needle-in-a-haystack using Qwen3.5-35B-A3B across 8.5K, 32.7K, and 64.2K context lengths: → 6/6 exact match at every quant level → TurboQuant 2.5-bit: 4.9x smaller KV cache → TurboQuant 3.5-bit: 3.8x smaller KV cache The best part: Zero accuracy loss compared to full KV cache.

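The size win comes from the KV cache growing linearly with context length and with bits per element. A minimal sketch of that scaling, using hypothetical model dimensions rather than the real Qwen config (real quantizers also store per-block scale metadata, which is why the measured 4.9x / 3.8x ratios differ from the raw bit ratio):

```python
# KV-cache footprint vs. context length and bits per element.
# Layer/head/dim values are hypothetical, not an actual model config.
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bits_per_elem):
    # 2x for keys and values, one entry per layer, per KV head, per position
    return 2 * layers * kv_heads * head_dim * seq_len * bits_per_elem / 8

layers, kv_heads, head_dim = 48, 8, 128   # assumed architecture
for seq_len in (8_500, 32_700, 64_200):
    fp16 = kv_cache_bytes(layers, kv_heads, head_dim, seq_len, 16)
    q35  = kv_cache_bytes(layers, kv_heads, head_dim, seq_len, 3.5)
    print(f"{seq_len:>6} tokens: fp16 {fp16 / 2**30:.2f} GiB -> "
          f"3.5-bit {q35 / 2**30:.2f} GiB ({fp16 / q35:.1f}x smaller)")
```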
Prashanth (Manohar) Velidandi
“Training happens sporadically. Inference happens on every prompt. Inference systems run the brains for every agent.” — Jonathan Bryce, Linux Foundation (KubeCon EU 2026) Hard to Disagree
Prashanth (Manohar) Velidandi
@pierrelezan @TheAhmadOsman True. Exactly the tradeoff. vLLM gives you better steady-state throughput but pushes more into init time. Been experimenting with snapshotting fully initialized GPU state, including CUDA context and KV cache, so you restore instead of reinit.
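Not a full implementation of that idea (a real snapshot would also have to capture the CUDA context, allocator state, and the engine’s KV cache), but a minimal weights-only sketch of what restore-instead-of-reinit looks like mechanically, using plain PyTorch and a toy model so it runs anywhere with a GPU; the file path is purely illustrative:

```python
# Restore-instead-of-reinit, weights-only sketch with a toy model.
# A real system would also need to capture CUDA context and KV-cache state,
# which torch.save/torch.load alone do not cover.
import time
import torch
import torch.nn as nn

SNAPSHOT_PATH = "/tmp/model_snapshot.pt"   # hypothetical snapshot location

def build_model():
    # stand-in for framework init + weight materialization + host-to-device copy
    return nn.Sequential(*[nn.Linear(4096, 4096) for _ in range(24)]).cuda()

# Cold path: construct and initialize everything from scratch.
t0 = time.perf_counter()
model = build_model()
torch.cuda.synchronize()
print(f"cold init : {time.perf_counter() - t0:.2f}s")

# Done once, ahead of time: persist the fully initialized module.
torch.save(model, SNAPSHOT_PATH)

# Restore path: skip construction/initialization, deserialize straight to GPU.
t0 = time.perf_counter()
restored = torch.load(SNAPSHOT_PATH, map_location="cuda", weights_only=False)
torch.cuda.synchronize()
print(f"restore   : {time.perf_counter() - t0:.2f}s")
```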
Pierre L@pierrelezan·
@PMV_InferX @TheAhmadOsman I know, and often anything production-ready will use vLLM, which has a longer initialisation phase because of the optimisations that other runtimes like llama.cpp don't have
Ahmad@TheAhmadOsman·
When running LLMs locally, the bottleneck isn’t just “VRAM size” It’s: - memory bandwidth - interconnect (PCIe vs NVLink vs RDMA) - inference engine (vLLM, TensorRT-LLM, SGLang) Unified Memory is way slower than VRAM btw
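One way to see why bandwidth rather than capacity is the usual ceiling: in the decode phase, every generated token has to stream the active weights through the memory system at least once, so a crude upper bound is bandwidth divided by bytes per token. A back-of-envelope sketch (bandwidth figures are rough ballpark values, not measurements):

```python
# Crude roofline-style bound on single-stream decode speed.
# Ignores KV-cache traffic, batching, and overlap; bandwidths are ballpark.
def decode_tokens_per_s(bandwidth_gb_s, active_params, bytes_per_param):
    bytes_per_token = active_params * bytes_per_param
    return bandwidth_gb_s * 1e9 / bytes_per_token

links = {
    "HBM on a datacenter GPU (~3,300 GB/s)": 3300,
    "unified memory (~400 GB/s)":             400,
    "PCIe 4.0 x16 offload (~32 GB/s)":         32,
}
for name, bw in links.items():
    tps = decode_tokens_per_s(bw, active_params=8e9, bytes_per_param=2)
    print(f"{name:40s} ~{tps:6.0f} tok/s for an 8B-param model at fp16")
```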
Prashanth (Manohar) Velidandi
@kamathematic Firecracker made serverless practical for stateless workloads. LLMs are different. Once you’re dealing with large models and GPU state, the bottleneck isn’t just isolation, it’s how you restore state without paying the full cold start every time.
anirudh@kamathematic·
"AI infra" would be nothing if AWS never open sourced Firecracker
Prashanth (Manohar) Velidandi
True, but LLMs amplify it quite a bit. In most serverless systems you’re bringing up relatively lightweight state. But with LLMs, you’re dealing with tens of GBs plus GPU runtime initialization, KV cache, etc. That combination makes cold starts much more expensive than typical serverless workloads.
Prashanth (Manohar) Velidandi
That’s part of it, but even if storage and network are fast, you still pay the cost of reinitializing the runtime. Loading weights is just one piece. CUDA context, kernel setup, and KV cache all add latency on every cold start. That ends up being a big chunk of the delay in serverless.
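A minimal way to see that breakdown on any box with a GPU: time the CUDA init, the weight load, and the first forward pass separately (here with a small HF model as a stand-in; a real serving engine like vLLM adds its own scheduler and KV-cache setup on top):

```python
# Splitting a cold start into its pieces: CUDA context, weight load,
# first-forward warm-up. A serving engine adds engine/KV-cache setup on top.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def timed(label, fn):
    t0 = time.perf_counter()
    out = fn()
    torch.cuda.synchronize()
    print(f"{label:15s} {time.perf_counter() - t0:6.2f}s")
    return out

timed("cuda init", torch.cuda.init)                       # CUDA context creation

model = timed("load weights", lambda: AutoModelForCausalLM.from_pretrained(
    "gpt2", torch_dtype=torch.float16).cuda())            # disk -> host -> device
tok = AutoTokenizer.from_pretrained("gpt2")

inputs = tok("hello", return_tensors="pt").to("cuda")
timed("first forward", lambda: model.generate(**inputs, max_new_tokens=8))
timed("warm forward", lambda: model.generate(**inputs, max_new_tokens=8))
```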
Pierre L@pierrelezan·
@PMV_InferX @TheAhmadOsman It's not an LLM issue, it's a serverless issue. You are bottlenecked by the throughput of your storage/network
Prashanth (Manohar) Velidandi
Inference 🚀
TBPN@tbpn

"Inference, if you look at it as a market, will be much, much bigger than cloud computing was pre-ChatGPT." Lightspeed’s @buckymoore says inference is an underrated investment category in AI, and expects the market to break up into large, specialized platforms for each modality: "The GPU supply crunch that we're seeing right now is largely, as @dylan522p has said on the show before, due to the fact that not only these consumer products, but also B2B products like Claude Code and Codex are just really taking off and creating insane demand for inference." "We're talking hundreds of billions in spend every year. And if that's true, I think there will be very, very large inference platforms built in each modality." "So there will be an inference platform for real-time video models, there will be an inference platform for open-source and custom language models, there will be an inference platform built specifically for long-running agents." "So I think we're just going to see that industry, which today looks like one industry, break up into many because of how big it is and how much room for specialization there is."

Prashanth (Manohar) Velidandi
The problem isn’t just scaling capacity, it’s what happens inside the GPU. Autoscaling still forces you to reload models, reinitialize CUDA, rebuild state. That’s where the real delay comes from. If you can restore that state instead of rebuilding it, spikes stop being a problem. You don’t need to over-provision, and you don’t take the latency hit.
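Toy numbers to show the shape of the trade-off (every figure here is a hypothetical assumption, not a measurement): keeping peak capacity warm means paying for the spike around the clock, while scaling from restored state pays only for the spike hours plus a small first-request latency.

```python
# Toy cost model for spiky load: always-on peak capacity vs. scaling up
# from restored snapshots. All numbers are hypothetical assumptions.
GPU_HOURLY_USD      = 2.00
BASE_REPLICAS       = 2      # steady-state traffic
PEAK_REPLICAS       = 10     # during agent-driven bursts
SPIKE_HOURS_PER_DAY = 1.0    # short, sharp spikes
RESTORE_SECONDS     = 5      # assumed state-restore time
COLD_START_SECONDS  = 120    # assumed full reload + reinit time

always_on = PEAK_REPLICAS * 24 * GPU_HOURLY_USD
scale_up  = (BASE_REPLICAS * 24
             + (PEAK_REPLICAS - BASE_REPLICAS) * SPIKE_HOURS_PER_DAY) * GPU_HOURLY_USD

print(f"always-on peak capacity : ${always_on:.0f}/day")
print(f"scale up only for spikes: ${scale_up:.0f}/day")
print(f"first-request latency   : {RESTORE_SECONDS}s restored vs {COLD_START_SECONDS}s cold")
```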
Ivan Burazin@ivanburazin·
Every infra company is dealing with spiky loads now. Massive unpredictable spikes followed by sharp drops because agents create traffic patterns humans never did. Can't smooth them out with autoscaling. You either over-provision (expensive) or accept that the consumer will have delays (unacceptable).
Prashanth (Manohar) Velidandi
The real problem is that hardly 1% of Hugging Face models get served, because it’s not economical to serve them. Millions of models on demand changes that. Once we get there, this becomes a complete ecosystem. We’re building the infrastructure to serve millions of models on demand in a serverless way @InferXai
clem 🤗@ClementDelangue·
Talked with @dee_bosa @CNBC about @nvidia and everything open-source AI! Some key points: - Nvidia is the new American open-source AI king - 30% of the Fortune 500 are using Hugging Face and our goal is to get to the majority of them by the end of the year - Agents will be much more open-source based than chatbots (ex OpenClaw) - Agents empower all to train, fine-tune, and run their own models based on open-source - We crossed 15M AI builders on HF and hope to have as many agents using the platform by the end of the year. Agents are the new users and customers of tech platforms
Prashanth (Manohar) Velidandi
A lot of this comes down to access vs usability. We now have millions of models, but actually running them is still constrained by GPU cost, cold starts, and having to keep things warm. The real potential is making inference available on demand across all models, not just the few that stay resident. And that’s where infrastructure becomes the bottleneck. We’ve been working on this layer, enabling models to be restored and run when needed instead of kept alive. This is what will make open-source AI truly accessible at scale.
clem 🤗@ClementDelangue (quoted tweet above)