InferX
@InferXai

Serverless GPU inference. Sub-second cold starts, even for 32B (64GB)+ models. Scale to zero. No idle billing. https://t.co/GbD8l2JPy0

San Francisco · Joined March 2025
42 Following · 137 Followers
InferX @InferXai
Talking to your infra is not a feature. Your model should already be ready. That’s not a chat interface problem. That’s an inference problem. inferx.net/?v=1
InferX @InferXai
The biggest bottleneck in AI inference isn't compute. It's memory. Specifically, the KV cache.

Google's TurboQuant compresses the KV cache ~6x (down to ~3 bits) with no accuracy loss. That's a big unlock for inference economics. But compression still needs the right runtime to matter.

At InferX, we already get sub-second cold starts via state snapshotting. Now combine that with 6x smaller state:
⚡ Faster resumes
📈 Higher density per GPU
💰 Better economics

Models will keep improving. We're building the runtime to take advantage of all of it.
Google Research @GoogleResearch

Introducing TurboQuant: Our new compression algorithm that reduces LLM key-value cache memory by at least 6x and delivers up to 8x speedup, all with zero accuracy loss, redefining AI efficiency. Read the blog to learn how it achieves these results: goo.gle/4bsq2qI

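Rough arithmetic behind the density point above, as a minimal sketch. The dimensions below (layers, KV heads, head size, context length) are illustrative assumptions for a 32B-class decoder, not InferX or TurboQuant figures; the point is only the fp16 vs ~3-bit comparison.

```python
# Back-of-the-envelope KV cache sizing. All dimensions are illustrative
# assumptions for a 32B-class model, not vendor-published numbers.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bits_per_value):
    # Keys and values: one entry per layer, per KV head, per position.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bits_per_value / 8

GIB = 1024 ** 3
dims = dict(n_layers=64, n_kv_heads=8, head_dim=128, seq_len=32_768)

fp16 = kv_cache_bytes(**dims, bits_per_value=16)
q3   = kv_cache_bytes(**dims, bits_per_value=3)

print(f"fp16 KV cache  : {fp16 / GIB:.1f} GiB per 32k-token sequence")
print(f"~3-bit KV cache: {q3 / GIB:.1f} GiB ({fp16 / q3:.1f}x smaller)")
```

Smaller per-sequence state is what translates into more concurrent sequences per GPU and less data to move when resuming a snapshot.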
Tony Davis @TonyD993
@InferXai Yes, highly interested. Do you support custom models?
InferX @InferXai
A couple of weeks ago we demonstrated 1.5s cold starts for a 32B model. Today we've pushed it even further. We're now seeing sub-second cold starts for models of this size.

Why does this matter? Cold start latency is one of the biggest barriers to true serverless inference. If models take 40 seconds or minutes to start, developers are forced to keep GPUs running 24/7.

Fast cold starts change the developer experience completely. Models can finally run on demand instead of sitting idle.

We'll be talking about how this works during our live technical webinar on Wednesday, March 18 at 8:30 AM PST. Link in the comments.
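To make the "keep GPUs running 24/7" point concrete, here is a hypothetical cost sketch. The hourly rate and utilization figures are placeholder assumptions, not InferX pricing.

```python
# Hypothetical numbers: why always-on GPUs dominate cost for bursty traffic
# once cold starts are fast enough to serve requests on demand.
GPU_HOURLY_USD = 2.50        # assumed on-demand price for one GPU
BUSY_HOURS_PER_DAY = 1.5     # assumed time the model is actually serving

always_on = GPU_HOURLY_USD * 24                       # GPU reserved around the clock
scale_to_zero = GPU_HOURLY_USD * BUSY_HOURS_PER_DAY   # billed only while serving

print(f"always-on:     ${always_on:6.2f}/day per GPU")
print(f"scale-to-zero: ${scale_to_zero:6.2f}/day per GPU "
      f"({always_on / scale_to_zero:.0f}x cheaper at this utilization)")
```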
InferX @InferXai
@TonyD993 We're in private beta. Happy to give you access if you want to try it out.
InferX @InferXai
@TonyD993 Yes, we expose an API. It’s compatible with the OpenAI-style interface, so you can plug it into existing workflows pretty easily.
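Since the endpoint is described as OpenAI-compatible, a call from the official openai Python client might look like the sketch below. The base URL, environment variable, and model name are placeholders, not documented InferX values.

```python
# Sketch of calling an OpenAI-compatible endpoint with the openai client.
# Base URL, API key variable, and model name are assumptions for illustration.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.example-inferx.dev/v1",   # placeholder endpoint
    api_key=os.environ["INFERX_API_KEY"],           # hypothetical credential
)

resp = client.chat.completions.create(
    model="my-custom-32b",                          # placeholder deployment name
    messages=[{"role": "user", "content": "Say hello from a scale-to-zero GPU."}],
)
print(resp.choices[0].message.content)
```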
InferX @InferXai
Getting a 32B model to cold start in ~1.5s is not trivial. You’re not just loading weights. You’re restoring tens of GB of GPU memory, CUDA kernels, and runtime state that normally gets rebuilt from scratch. Making this work required going deeper in the stack: capturing CUDA graphs, intercepting CUDA calls, and restoring GPU execution state. Here’s a quick clip.
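For readers unfamiliar with "capturing CUDA graphs": the PyTorch sketch below shows the general capture-and-replay mechanism. It is a minimal illustration of the technique, not InferX's snapshot/restore code, and the linear layer is a stand-in for a real model.

```python
# Minimal CUDA graph capture/replay in PyTorch. Stand-in model, not InferX code.
import torch

model = torch.nn.Linear(4096, 4096).cuda().eval()
static_in = torch.zeros(8, 4096, device="cuda")

# Warm up on a side stream so lazy CUDA/cuBLAS init isn't baked into the graph.
side = torch.cuda.Stream()
side.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(side), torch.no_grad():
    for _ in range(3):
        model(static_in)
torch.cuda.current_stream().wait_stream(side)

# Capture the forward pass once; later replays relaunch the recorded kernels
# directly, reusing the same memory addresses, without Python dispatch overhead.
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph), torch.no_grad():
    static_out = model(static_in)

static_in.copy_(torch.randn(8, 4096, device="cuda"))  # write new inputs in place
graph.replay()
print(static_out.norm().item())
```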
InferX @InferXai
The interesting part is that agent loops produce many intermediate checkpoints. Keeping GPUs warm for all of them is economically impossible. Infrastructure needs to handle bursty inference instead of steady workloads.
InferX @InferXai
The InferX self-serve UI is now live in private beta. Early users can deploy custom and fine-tuned models and test inference workloads with $30 in compute credits. We're keeping access limited while we gather feedback and refine the platform. If you're building with custom models and want to try it, reach out. ⚡
InferX retweeted
Prashanth (Manohar) Velidandi @PMV_InferX
Model APIs were step one. Stateful execution environments are step two. Agents need a runtime, not just a prompt.