InferX
@InferXai

Serverless GPU inference. Sub-second cold starts, even for 32B (64GB)+ models. Scale to zero. No idle billing. https://t.co/GbD8l2JPy0

San Francisco · Joined March 2025
42 Following · 137 Followers
InferX @InferXai
Talking to your infra is not a feature. Your model should already be ready. That’s not a chat interface problem. That’s an inference problem. inferx.net/?v=1
InferX @InferXai
The biggest bottleneck in AI inference isn't compute. It's memory. Specifically, the KV cache.

Google's TurboQuant compresses the KV cache ~6x (down to ~3 bits) with no accuracy loss. That's a big unlock for inference economics. But compression still needs the right runtime to matter.

At InferX, we already get sub-second cold starts via state snapshotting. Now combine that with 6x smaller state:
⚡ Faster resumes
📈 Higher density per GPU
💰 Better economics

Models will keep improving. We're building the runtime to take advantage of all of it.
Google Research @GoogleResearch

Introducing TurboQuant: Our new compression algorithm that reduces LLM key-value cache memory by at least 6x and delivers up to 8x speedup, all with zero accuracy loss, redefining AI efficiency. Read the blog to learn how it achieves these results: goo.gle/4bsq2qI

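Rough arithmetic behind the density point above, as a minimal sketch. The dimensions below (layers, KV heads, head size, context length) are illustrative assumptions for a 32B-class decoder, not InferX or TurboQuant figures; the point is only the fp16 vs ~3-bit comparison.

```python
# Back-of-the-envelope KV cache sizing. All dimensions are illustrative
# assumptions for a 32B-class model, not vendor-published numbers.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bits_per_value):
    # Keys and values: one entry per layer, per KV head, per position.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bits_per_value / 8

GIB = 1024 ** 3
dims = dict(n_layers=64, n_kv_heads=8, head_dim=128, seq_len=32_768)

fp16 = kv_cache_bytes(**dims, bits_per_value=16)
q3   = kv_cache_bytes(**dims, bits_per_value=3)

print(f"fp16 KV cache  : {fp16 / GIB:.1f} GiB per 32k-token sequence")
print(f"~3-bit KV cache: {q3 / GIB:.1f} GiB ({fp16 / q3:.1f}x smaller)")
```

Smaller per-sequence state is what translates into more concurrent sequences per GPU and less data to move when resuming a snapshot.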
Tony Davis @TonyD993
@InferXai Yes, highly interested. Do you support custom models?
InferX @InferXai
A couple of weeks ago we demonstrated 1.5s cold starts for a 32B model. Today we've pushed it even further. We're now seeing sub-second cold starts for models of this size.

Why does this matter? Cold start latency is one of the biggest barriers to true serverless inference. If models take 40 seconds or minutes to start, developers are forced to keep GPUs running 24/7.

Fast cold starts change the developer experience completely. Models can finally run on demand instead of sitting idle.

We'll be talking about how this works during our live technical webinar on Wednesday, March 18 at 8:30 AM PST. Link in the comments.
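To make the "keep GPUs running 24/7" point concrete, here is a hypothetical cost sketch. The hourly rate and utilization figures are placeholder assumptions, not InferX pricing.

```python
# Hypothetical numbers: why always-on GPUs dominate cost for bursty traffic
# once cold starts are fast enough to serve requests on demand.
GPU_HOURLY_USD = 2.50        # assumed on-demand price for one GPU
BUSY_HOURS_PER_DAY = 1.5     # assumed time the model is actually serving

always_on = GPU_HOURLY_USD * 24                       # GPU reserved around the clock
scale_to_zero = GPU_HOURLY_USD * BUSY_HOURS_PER_DAY   # billed only while serving

print(f"always-on:     ${always_on:6.2f}/day per GPU")
print(f"scale-to-zero: ${scale_to_zero:6.2f}/day per GPU "
      f"({always_on / scale_to_zero:.0f}x cheaper at this utilization)")
```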
InferX @InferXai
@TonyD993 We're in private beta. Happy to give you access if you want to try it out.
InferX @InferXai
@TonyD993 Yes, we expose an API. It’s compatible with the OpenAI-style interface, so you can plug it into existing workflows pretty easily.
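Since the endpoint is described as OpenAI-compatible, a call from the official openai Python client might look like the sketch below. The base URL, environment variable, and model name are placeholders, not documented InferX values.

```python
# Sketch of calling an OpenAI-compatible endpoint with the openai client.
# Base URL, API key variable, and model name are assumptions for illustration.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.example-inferx.dev/v1",   # placeholder endpoint
    api_key=os.environ["INFERX_API_KEY"],           # hypothetical credential
)

resp = client.chat.completions.create(
    model="my-custom-32b",                          # placeholder deployment name
    messages=[{"role": "user", "content": "Say hello from a scale-to-zero GPU."}],
)
print(resp.choices[0].message.content)
```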
InferX @InferXai
Getting a 32B model to cold start in ~1.5s is not trivial. You’re not just loading weights. You’re restoring tens of GB of GPU memory, CUDA kernels, and runtime state that normally gets rebuilt from scratch. Making this work required going deeper in the stack: capturing CUDA graphs, intercepting CUDA calls, and restoring GPU execution state. Here’s a quick clip.
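For readers unfamiliar with "capturing CUDA graphs": the PyTorch sketch below shows the general capture-and-replay mechanism. It is a minimal illustration of the technique, not InferX's snapshot/restore code, and the linear layer is a stand-in for a real model.

```python
# Minimal CUDA graph capture/replay in PyTorch. Stand-in model, not InferX code.
import torch

model = torch.nn.Linear(4096, 4096).cuda().eval()
static_in = torch.zeros(8, 4096, device="cuda")

# Warm up on a side stream so lazy CUDA/cuBLAS init isn't baked into the graph.
side = torch.cuda.Stream()
side.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(side), torch.no_grad():
    for _ in range(3):
        model(static_in)
torch.cuda.current_stream().wait_stream(side)

# Capture the forward pass once; later replays relaunch the recorded kernels
# directly, reusing the same memory addresses, without Python dispatch overhead.
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph), torch.no_grad():
    static_out = model(static_in)

static_in.copy_(torch.randn(8, 4096, device="cuda"))  # write new inputs in place
graph.replay()
print(static_out.norm().item())
```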
InferX @InferXai
The interesting part is that agent loops produce many intermediate checkpoints. Keeping GPUs warm for all of them is economically impossible. Infrastructure needs to handle bursty inference instead of steady workloads.
InferX @InferXai
The InferX self-serve UI is now live in private beta. Early users can deploy custom and fine-tuned models and test inference workloads with $30 in compute credits. We're keeping access limited while we gather feedback and refine the platform. If you're building with custom models and want to try it, reach out. ⚡
InferX retweeted
Prashanth (Manohar) Velidandi @PMV_InferX
Model APIs were step one. Stateful execution environments are step two. Agents need a runtime, not just a prompt.