


Signal65 (@Signal_65)




As we get into @NVIDIAGTC week, one topic I expect to get attention is storage. GPU utilization in AI inference is as much a storage problem as a compute problem. @Signal_65 worked with @HPE and @KamiwazaAI to test KV-Cache offloading to the HPE Alletra Storage MP X10000, and the results were significant.

Full report: signal65.com/research/maxim…

Key findings from our testing:
➡️ Output token generation rates increased up to 19.4x compared to systems with no KV-Cache
➡️ Time to first token (TTFT) improved up to 21.5x vs. no KV-Cache
➡️ Even vs. memory-only offload, adding the X10000 delivered a 5.9x token rate increase and a 5.6x TTFT reduction
➡️ RDMA for S3 storage delivered up to 2x the throughput of traditional S3 over HTTP, with 80% lower latency and dramatically reduced CPU overhead

That last point matters. The table below shows why GPU Direct Storage via RDMA changes the equation: 20+ GB/s throughput, a 5.1x latency reduction, and consistent P99 performance where traditional S3 showed high jitter.
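The idea behind KV-Cache offloading can be sketched in a few lines: content-address each prompt prefix, keep hot KV state in memory, and spill cold entries to an external tier rather than discarding them, so a later request with the same prefix skips prefill entirely. This is a toy illustration only, assuming nothing about the HPE/Kamiwaza stack; names like `TieredKVStore` and the string stand-ins for KV tensors are hypothetical.

```python
# Toy sketch of prefix KV-cache offloading to a tiered store.
# TieredKVStore, prefix_key, and serve are illustrative names, not a real API.
import hashlib

class TieredKVStore:
    """Two-tier cache: small hot tier (host memory) backed by a larger
    'external' tier standing in for an object store like the X10000."""
    def __init__(self, mem_capacity):
        self.mem = {}                  # hot tier
        self.external = {}             # cold tier: evicted, not discarded
        self.mem_capacity = mem_capacity

    def put(self, key, kv_blob):
        if len(self.mem) >= self.mem_capacity:
            # Offload the oldest hot entry to external storage.
            old_key, old_blob = next(iter(self.mem.items()))
            del self.mem[old_key]
            self.external[old_key] = old_blob
        self.mem[key] = kv_blob

    def get(self, key):
        if key in self.mem:
            return self.mem[key]
        if key in self.external:
            blob = self.external[key]  # storage hit: still far cheaper than
            self.put(key, blob)        # recomputing prefill; promote to hot tier
            return blob
        return None

def prefix_key(prompt_tokens):
    """Content-address the prefix so identical prefixes share KV state."""
    return hashlib.sha256(" ".join(map(str, prompt_tokens)).encode()).hexdigest()

def serve(prompt_tokens, store):
    """Return ('hit'|'miss', kv_blob); a hit skips prefill recomputation."""
    key = prefix_key(prompt_tokens)
    blob = store.get(key)
    if blob is not None:
        return "hit", blob
    blob = f"kv-for-{key[:8]}"         # stand-in for real attention KV tensors
    store.put(key, blob)
    return "miss", blob
```

The TTFT gains in the findings come from the "hit" path: even when the KV state has been evicted from memory, pulling it back over a fast storage path beats rerunning prefill on the GPU.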
































🚀 Raising the bar for AI inference. The NVIDIA Blackwell architecture delivers world-class token throughput across every scale:
📈 New NVIDIA TensorRT-LLM upgrades boost MoE performance, delivering 2x+ token throughput on NVIDIA GB200 NVL72.
📈 NVIDIA HGX B200 delivers massive gains with multi-token prediction and NVFP4 precision.

Learn how #NVIDIABlackwell continues to push token throughput higher across AI models ➡️ nvda.ws/45BVWxx
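The NVFP4 precision mentioned above builds on the FP4 (E2M1) format, which can represent only eight magnitudes per sign (0, 0.5, 1, 1.5, 2, 3, 4, 6), so values are quantized in small blocks with a shared scale. The sketch below is a simplification under stated assumptions: real NVFP4 stores an FP8 scale per 16-element block plus a tensor-level scale, while here the scale is kept as a plain float for clarity.

```python
# Simplified FP4 (E2M1) block quantization in the spirit of NVFP4.
# Assumption: plain-float per-block scale instead of NVFP4's FP8 block scales.
FP4_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]   # E2M1 magnitudes
FP4_GRID = sorted({s * v for v in FP4_VALUES for s in (1, -1)})

def quantize_block(block):
    """Pick a scale so the block's max magnitude maps to 6.0 (the largest
    E2M1 value), then round each element to the nearest representable point."""
    amax = max(abs(x) for x in block) or 1.0
    scale = amax / 6.0
    q = [min(FP4_GRID, key=lambda g: abs(x / scale - g)) for x in block]
    return scale, q

def dequantize_block(scale, q):
    """Recover approximate values by rescaling the FP4 grid points."""
    return [scale * v for v in q]
```

Because each 4-bit weight is a quarter the size of FP16, activations and weights move through memory faster, which is one of the levers behind the throughput gains claimed for Blackwell.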


