Prashanth (Manohar) Velidandi

1.6K posts


@PMV_InferX

AI Infrastructure Researcher | Co-founder at @InferXai, a Multi-Tenant Serverless Platform for Scalable Inference

San Francisco, CA · Joined November 2022
129 Following · 240 Followers

Pinned Tweet
Prashanth (Manohar) Velidandi@PMV_InferX·
‘Civilized inference’ is about respect. Respect for compute. Respect for shared GPUs. Respect for developers’ time. Respect for cost transparency. Most infrastructure normalizes waste. Idle VRAM. Hidden billing. Manual warm pools. ‘Civilized inference’ refuses to. @InferXai inferx.net
Raja Koduri@RajaXg·
@Midnight_Captl Where did I say the prices will fall? I'm making a point that the current supply constrained scenario will create new approaches , as the demand for more memory capacity and bandwidth far exceeds the supply!
Prashanth (Manohar) Velidandi
@RajaXg we’ve been building for this constraint from day one. Optimizing utilization should have been the priority for a long time, but it was abused instead.
Raja Koduri@RajaXg·
I warned my memory friends a few months ago.. there are tons of optimizations available across the whole stack to reduce memory capacity and bandwidth... as long as memory was relatively "cheap", we stayed lazy... constraints unleash creativity.. I hear the memory supply chain constraints won't be solved till 2030.. prepare for a deluge of creativity.. it hasn't been a week since TurboQuant... not only in software, but you will see some insanely cool hardware improvisations and new suppliers rise to the top as well
ComfyUI@ComfyUI

Upgrading your RAM is now unnecessary. Introducing our new ComfyUI Dynamic VRAM optimization. Running local models is now possible on even the most memory constrained hardware. Read more here: blog.comfy.org/p/dynamic-vram…

Clint | Options@clintoptions·
I have a secret to share. After your first $2–$3 million, a paid-off home and a good car, there is no difference in quality of life between you and Jeff Bezos. Both of you have a limited amount of time on earth; you have twice as much as Jeff, if not more, so you are richer than him. A cheeseburger is a cheeseburger whether a billionaire eats it or you do. Money is nothing but a piece of paper or a number in your app. Real life is outdoors. Become financially independent; that’s usually $2–3 million. Have good food. Enjoy your relationships. Work out. Sleep well. Call your parents. That’s all there is to life. Greed has no end. Repeat after me: time is the currency of life. Money is not. The sooner you figure this out, the happier you will be.
Prashanth (Manohar) Velidandi
3,500,000 models on Hugging Face. Less than 0.1% have an inference endpoint. Not a model problem. A serving problem. Dedicating a GPU to every model is uneconomical. Most models never run because the math doesn’t work. inferx.net
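A rough sketch of why the math doesn’t work for dedicated serving (all prices and request counts below are illustrative assumptions, not InferX or Hugging Face figures):

```python
# Back-of-envelope cost of dedicating one GPU to a long-tail model.
# Every number here is an illustrative assumption.
GPU_HOURLY_USD = 2.00        # assumed on-demand price for one GPU
HOURS_PER_MONTH = 730
REQUESTS_PER_MONTH = 500     # a long-tail model may see very little traffic
SECONDS_PER_REQUEST = 2.0    # assumed end-to-end inference time

dedicated_cost = GPU_HOURLY_USD * HOURS_PER_MONTH            # GPU sits idle 24/7
busy_hours = REQUESTS_PER_MONTH * SECONDS_PER_REQUEST / 3600
utilization = busy_hours / HOURS_PER_MONTH

print(f"dedicated GPU cost/month  : ${dedicated_cost:,.0f}")
print(f"GPU utilization           : {utilization:.4%}")
print(f"effective cost per request: ${dedicated_cost / REQUESTS_PER_MONTH:.2f}")
```

Under those assumptions the GPU is busy well under 0.1% of the time and each request effectively costs a few dollars, which is why per-model dedicated serving only pencils out for the most popular models.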
Prince Canuma@Prince_Canuma·
TurboQuant ≠ model compression. It quantizes the KV cache (the memory that grows with context length), not the model itself. No training, no fine-tuning, zero accuracy loss at 3 bits. But if the model doesn’t fit your VRAM? TurboQuant won’t change that. It solves the inference bottleneck, not the loading problem.
Prince Canuma@Prince_Canuma

Just implemented Google’s TurboQuant in MLX and the results are wild! Needle-in-a-haystack using Qwen3.5-35B-A3B across 8.5K, 32.7K, and 64.2K context lengths: → 6/6 exact match at every quant level → TurboQuant 2.5-bit: 4.9x smaller KV cache → TurboQuant 3.5-bit: 3.8x smaller KV cache The best part: Zero accuracy loss compared to full KV cache.

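The size win comes from the KV cache growing linearly with context length and with bits per element. A minimal sketch of that scaling, using hypothetical model dimensions rather than the real Qwen config (real quantizers also store per-block scale metadata, which is why the measured 4.9x / 3.8x ratios differ from the raw bit ratio):

```python
# KV-cache footprint vs. context length and bits per element.
# Layer/head/dim values are hypothetical, not an actual model config.
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bits_per_elem):
    # 2x for keys and values, one entry per layer, per KV head, per position
    return 2 * layers * kv_heads * head_dim * seq_len * bits_per_elem / 8

layers, kv_heads, head_dim = 48, 8, 128   # assumed architecture
for seq_len in (8_500, 32_700, 64_200):
    fp16 = kv_cache_bytes(layers, kv_heads, head_dim, seq_len, 16)
    q35  = kv_cache_bytes(layers, kv_heads, head_dim, seq_len, 3.5)
    print(f"{seq_len:>6} tokens: fp16 {fp16 / 2**30:.2f} GiB -> "
          f"3.5-bit {q35 / 2**30:.2f} GiB ({fp16 / q35:.1f}x smaller)")
```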
Prashanth (Manohar) Velidandi
“Training happens sporadically. Inference happens on every prompt. Inference systems run the brains for every agent.” — Jonathan Bryce, Linux Foundation (KubeCon EU 2026) Hard to Disagree
Prashanth (Manohar) Velidandi
@pierrelezan @TheAhmadOsman True. Exactly the tradeoff. vLLM gives you better steady-state throughput but pushes more into init time. Been experimenting with snapshotting fully initialized GPU state, including CUDA context and KV cache, so you restore instead of reinit.
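Not a full implementation of that idea (a real snapshot would also have to capture the CUDA context, allocator state, and the engine’s KV cache), but a minimal weights-only sketch of what restore-instead-of-reinit looks like mechanically, using plain PyTorch and a toy model so it runs anywhere with a GPU; the file path is purely illustrative:

```python
# Restore-instead-of-reinit, weights-only sketch with a toy model.
# A real system would also need to capture CUDA context and KV-cache state,
# which torch.save/torch.load alone do not cover.
import time
import torch
import torch.nn as nn

SNAPSHOT_PATH = "/tmp/model_snapshot.pt"   # hypothetical snapshot location

def build_model():
    # stand-in for framework init + weight materialization + host-to-device copy
    return nn.Sequential(*[nn.Linear(4096, 4096) for _ in range(24)]).cuda()

# Cold path: construct and initialize everything from scratch.
t0 = time.perf_counter()
model = build_model()
torch.cuda.synchronize()
print(f"cold init : {time.perf_counter() - t0:.2f}s")

# Done once, ahead of time: persist the fully initialized module.
torch.save(model, SNAPSHOT_PATH)

# Restore path: skip construction/initialization, deserialize straight to GPU.
t0 = time.perf_counter()
restored = torch.load(SNAPSHOT_PATH, map_location="cuda", weights_only=False)
torch.cuda.synchronize()
print(f"restore   : {time.perf_counter() - t0:.2f}s")
```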
Pierre L@pierrelezan·
@PMV_InferX @TheAhmadOsman I know, and often anything production-ready will use vLLM, which has a longer initialisation phase because of the optimisations that other runtimes like llama.cpp don't have
Ahmad@TheAhmadOsman·
When running LLMs locally, the bottleneck isn’t just “VRAM size” It’s: - memory bandwidth - interconnect (PCIe vs NVLink vs RDMA) - inference engine (vLLM, TensorRT-LLM, SGLang) Unified Memory is way slower than VRAM btw
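One way to see why bandwidth rather than capacity is the usual ceiling: in the decode phase, every generated token has to stream the active weights through the memory system at least once, so a crude upper bound is bandwidth divided by bytes per token. A back-of-envelope sketch (bandwidth figures are rough ballpark values, not measurements):

```python
# Crude roofline-style bound on single-stream decode speed.
# Ignores KV-cache traffic, batching, and overlap; bandwidths are ballpark.
def decode_tokens_per_s(bandwidth_gb_s, active_params, bytes_per_param):
    bytes_per_token = active_params * bytes_per_param
    return bandwidth_gb_s * 1e9 / bytes_per_token

links = {
    "HBM on a datacenter GPU (~3,300 GB/s)": 3300,
    "unified memory (~400 GB/s)":             400,
    "PCIe 4.0 x16 offload (~32 GB/s)":         32,
}
for name, bw in links.items():
    tps = decode_tokens_per_s(bw, active_params=8e9, bytes_per_param=2)
    print(f"{name:40s} ~{tps:6.0f} tok/s for an 8B-param model at fp16")
```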
Prashanth (Manohar) Velidandi
@kamathematic Firecracker made serverless practical for stateless workloads. LLMs are different. Once you’re dealing with large models and GPU state, the bottleneck isn’t just isolation, it’s how you restore state without paying the full cold start every time.
anirudh@kamathematic·
"AI infra" would be nothing if AWS never open sourced Firecracker
Prashanth (Manohar) Velidandi
True, but LLMs amplify it quite a bit. In most serverless systems you’re bringing up relatively lightweight state. But with LLMs, you’re dealing with tens of GBs plus GPU runtime initialization, KV cache, etc. That combination makes cold starts much more expensive than typical serverless workloads.
Prashanth (Manohar) Velidandi
That’s part of it, but even if storage and network are fast, you still pay the cost of reinitializing the runtime. Loading weights is just one piece. CUDA context, kernel setup, and KV cache all add latency on every cold start. That ends up being a big chunk of the delay in serverless.
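A minimal way to see that breakdown on any box with a GPU: time the CUDA init, the weight load, and the first forward pass separately (here with a small HF model as a stand-in; a real serving engine like vLLM adds its own scheduler and KV-cache setup on top):

```python
# Splitting a cold start into its pieces: CUDA context, weight load,
# first-forward warm-up. A serving engine adds engine/KV-cache setup on top.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def timed(label, fn):
    t0 = time.perf_counter()
    out = fn()
    torch.cuda.synchronize()
    print(f"{label:15s} {time.perf_counter() - t0:6.2f}s")
    return out

timed("cuda init", torch.cuda.init)                       # CUDA context creation

model = timed("load weights", lambda: AutoModelForCausalLM.from_pretrained(
    "gpt2", torch_dtype=torch.float16).cuda())            # disk -> host -> device
tok = AutoTokenizer.from_pretrained("gpt2")

inputs = tok("hello", return_tensors="pt").to("cuda")
timed("first forward", lambda: model.generate(**inputs, max_new_tokens=8))
timed("warm forward", lambda: model.generate(**inputs, max_new_tokens=8))
```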
Pierre L@pierrelezan·
@PMV_InferX @TheAhmadOsman It's not an LLM issue, it's a serverless issue. You are bottlenecked by the throughput of your storage/network
Prashanth (Manohar) Velidandi
Inference 🚀
TBPN@tbpn

"Inference, if you look at it as a market, will be much, much bigger than cloud computing was pre-ChatGPT." Lightspeed’s @buckymoore says inference is an underrated investment category in AI, and expects the market to break up into large, specialized platforms for each modality: "The GPU supply crunch that we're seeing right now is largely, as @dylan522p has said on the show before, due to the fact that not only these consumer products, but also B2B products like Claude Code and Codex are just really taking off and creating insane demand for inference." "We're talking hundreds of billions in spend every year. And if that's true, I think there will be very, very large inference platforms built in each modality." "So there will be an inference platform for real-time video models, there will be an inference platform for open-source and custom language models, there will be an inference platform built specifically for long-running agents." "So I think we're just going to see that industry, which today looks like one industry, break up into many because of how big it is and how much room for specialization there is."

Prashanth (Manohar) Velidandi
The problem isn’t just scaling capacity, it’s what happens inside the GPU. Autoscaling still forces you to reload models, reinitialize CUDA, rebuild state. That’s where the real delay comes from. If you can restore that state instead of rebuilding it, spikes stop being a problem. You don’t need to over-provision, and you don’t take the latency hit.
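Toy numbers to show the shape of the trade-off (every figure here is a hypothetical assumption, not a measurement): keeping peak capacity warm means paying for the spike around the clock, while scaling from restored state pays only for the spike hours plus a small first-request latency.

```python
# Toy cost model for spiky load: always-on peak capacity vs. scaling up
# from restored snapshots. All numbers are hypothetical assumptions.
GPU_HOURLY_USD      = 2.00
BASE_REPLICAS       = 2      # steady-state traffic
PEAK_REPLICAS       = 10     # during agent-driven bursts
SPIKE_HOURS_PER_DAY = 1.0    # short, sharp spikes
RESTORE_SECONDS     = 5      # assumed state-restore time
COLD_START_SECONDS  = 120    # assumed full reload + reinit time

always_on = PEAK_REPLICAS * 24 * GPU_HOURLY_USD
scale_up  = (BASE_REPLICAS * 24
             + (PEAK_REPLICAS - BASE_REPLICAS) * SPIKE_HOURS_PER_DAY) * GPU_HOURLY_USD

print(f"always-on peak capacity : ${always_on:.0f}/day")
print(f"scale up only for spikes: ${scale_up:.0f}/day")
print(f"first-request latency   : {RESTORE_SECONDS}s restored vs {COLD_START_SECONDS}s cold")
```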
Ivan Burazin@ivanburazin·
Every infra company is dealing with spiky loads now. Massive unpredictable spikes followed by sharp drops because agents create traffic patterns humans never did. Can't smooth them out with autoscaling. You either over-provision (expensive) or accept that the consumer will have delays (unacceptable).
Prashanth (Manohar) Velidandi
The real problem is that hardly 1% of Hugging Face models get served, because it’s not economical to serve them. Millions of models on demand changes that. Once we get there, this becomes a complete ecosystem. We’re building the infrastructure to serve millions of models on demand in a serverless way @InferXai
clem 🤗@ClementDelangue·
Talked with @dee_bosa @CNBC about @nvidia and everything open-source AI! Some key points: - Nvidia is the new American open-source AI king - 30% of the Fortune 500 are using Hugging Face and our goal is to get to the majority of them by the end of the year - Agents will be much more open-source based than chatbots (ex OpenClaw) - Agents empower all to train, fine-tune, and run their own models based on open-source - We crossed 15M AI builders on HF and hope to have as many agents using the platform by the end of the year. Agents are the new users and customers of tech platforms
Prashanth (Manohar) Velidandi
A lot of this comes down to access vs usability. We now have millions of models, but actually running them is still constrained by GPU cost, cold starts, and having to keep things warm. The real potential is making inference available on demand across all models, not just the few that stay resident. And that’s where infrastructure becomes the bottleneck. We’ve been working on this layer, enabling models to be restored and run when needed instead of kept alive. This is what will make open-source AI truly accessible at scale.
clem 🤗@ClementDelangue (quoted tweet above)