


Harshit

@harshitspark
Building AI Agents | Crossfit Athlete | Ex - Founder@56secure | https://t.co/gnLoO902QC | @Ola












💾🚀 Run Llama-3.1-405B FP8 (410GB) on a single 180GB GPU #NVIDIA

Introducing FlexTensor: NVIDIA's new library that makes host RAM a transparent extension of your GPU memory.

One call: flextensor.offload(model). No model rewrites, no framework changes. Works with vLLM, HuggingFace, and any PyTorch model.

Traditional offloading is reactive: move data when you run out of memory, stall the GPU while you wait. FlexTensor instead profiles your model's layer access patterns, then solves a knapsack optimization to schedule prefetches that overlap with compute. By the time a layer needs its weights, they're already there.

The freed VRAM gives vLLM more room for KV cache, enabling 4x longer contexts (8K→32K) or 4x larger batches. For video generation (Wan2.2-T2V-A14B on GB200): +0.1% overhead.

Handles FP8, custom Triton kernels, and multi-GPU. Profiles saved to disk, so no warmup on repeated runs.

Check it out: github.com/ai-dynamo/flex…
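
For readers wondering what "solves a knapsack optimization" could look like, here is a toy sketch of that framing. It is not FlexTensor's API or its actual scheduler: the function name `choose_resident_layers`, the inputs, and the numbers are all hypothetical. It assumes each layer has a size in whole GB and a "stall" cost in milliseconds (the transfer time that cannot be hidden behind compute if the layer lives in host RAM), and it picks the layers to keep resident in VRAM so that the avoided stall is maximal within a memory budget.

```python
from typing import List, Tuple


def choose_resident_layers(
    sizes_gb: List[int], stall_ms: List[float], budget_gb: int
) -> Tuple[float, List[int]]:
    """0/1 knapsack: choose layers to keep in VRAM, maximizing stall avoided.

    Hypothetical illustration only; not how FlexTensor is implemented.
    """
    n = len(sizes_gb)
    # best[i][b] = max stall avoided using the first i layers within b GB.
    best = [[0.0] * (budget_gb + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        size, gain = sizes_gb[i - 1], stall_ms[i - 1]
        for b in range(budget_gb + 1):
            best[i][b] = best[i - 1][b]  # option 1: offload layer i-1
            if size <= b:
                keep = best[i - 1][b - size] + gain  # option 2: keep it resident
                if keep > best[i][b]:
                    best[i][b] = keep

    # Walk the table backwards to recover which layers were kept resident.
    chosen: List[int] = []
    b = budget_gb
    for i in range(n, 0, -1):
        if best[i][b] != best[i - 1][b]:
            chosen.append(i - 1)
            b -= sizes_gb[i - 1]
    chosen.reverse()
    return best[n][budget_gb], chosen


if __name__ == "__main__":
    # Made-up profile: four layers, 9 GB of VRAM left for weights.
    total, resident = choose_resident_layers([4, 3, 2, 5], [6.0, 5.0, 1.0, 9.0], 9)
    print(f"keep layers {resident} resident, avoiding {total} ms of stall")
```

Layers left out of the resident set are the ones a scheduler would prefetch from host RAM ahead of use. A real system would also have to model transfer bandwidth and compute timelines, which this sketch ignores.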

The new model from Meta is already looking like a disappointment: overoptimized for public benchmark numbers to the detriment of everything else. Knowing how to evaluate models in a way that correlates with actual usefulness is a core competency for AI labs, and any new lab is unlikely to be successful without first figuring that out.

Stanford @CS153Systems '26, Session 3 (Full lecture)
The Future of Voice Systems with @matiii from @ElevenLabs

00:00 Welcome and Intro
01:31 Origin Story on Discord
05:15 The Dubbing Problem
07:44 Pipeline and Early Pivot
12:38 Building the First Model
15:24 Compute Costs and Patents
17:34 Roadmap Through 2025
22:00 Cascaded vs Fused Agents
30:38 Collaboration Over Competition
35:05 Revenue Growth and Team Design
37:56 Predictable Deployment Engine
42:32 Voice Safety and Watermarking
44:27 Research Bottlenecks Personalization
46:24 Training Tradeoffs Cascade vs Fuse
48:20 Five Year Vision Platform
51:08 Impact Work ALS and Ukraine
54:40 China Distillation and Openness
59:24 Studios AI Voice Economics
01:03:04 On Device Models and Platform Gap
01:04:36 Enterprise Tooling and Wrap Up







We're aware people are hitting usage limits in Claude Code way faster than expected. Actively investigating, will share more when we have an update!