Kyle Kranen
@KranenKyle
142 posts

Engineering Leader for Planetary Scale Inference with NVIDIA Dynamo
San Francisco · Joined March 2021
77 Following · 481 Followers
Kyle Kranen retweeted
Florian Brand@xeophon·
Amazing post and great timing w.r.t. ant's post yesterday. We must build open AI so we don't get locked in by the vendors who will decide who gets which capabilities, and the West has to realize that open models are important and support open-model efforts (like @arcee_ai, @NVIDIAAI).
[image]
Bill Gurley@bgurley

A new @bgurley blog post! I have been thinking about how sophisticated executives are using open source in super creative ways. Started writing this three years ago. Excited to finish it up and publish it! And with the new @p3institute brand. substack.com/home/post/p-19…

dstack@dstackai·
dstack 0.20.20 is out 🚀 New in services: Prefill-Decode disaggregated inference with @NVIDIA Dynamo, with workers running @lmsysorg SGLang, @vllm_project, or TensorRT-LLM. Also: ⚡️ Easier provisioning and management of @awscloud clusters that use EFA ⚡️ Sharing gateways across projects github.com/dstackai/dstac…
Kyle Kranen@KranenKyle·
We’re in the early innings of agentic optimizations, both for GPU and CPU! There’s a lot of structure in agentic workloads that we can leverage for speedups but aren’t yet: better overlapping, parallelization, KV programming, and more.
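A minimal sketch of one such optimization, assuming an asyncio-based agent loop and hypothetical tools: when a turn produces several independent tool calls, dispatching them concurrently instead of sequentially means the turn waits on the slowest tool rather than the sum of all of them.

```python
import asyncio
import time

# Hypothetical tools; each simulates I/O-bound latency (API call, file read, etc.).
async def web_search(query: str) -> str:
    await asyncio.sleep(0.4)
    return f"results for {query!r}"

async def read_file(path: str) -> str:
    await asyncio.sleep(0.2)
    return f"contents of {path}"

async def run_turn(tool_calls):
    # Sequential baseline: total latency is the sum of the tool latencies.
    t0 = time.perf_counter()
    for fn, arg in tool_calls:
        await fn(arg)
    sequential = time.perf_counter() - t0

    # Overlapped: independent calls are dispatched together, so the turn
    # only waits for the slowest tool rather than the sum.
    t0 = time.perf_counter()
    await asyncio.gather(*(fn(arg) for fn, arg in tool_calls))
    overlapped = time.perf_counter() - t0
    print(f"sequential {sequential:.2f}s vs overlapped {overlapped:.2f}s")

asyncio.run(run_turn([(web_search, "dynamo disaggregation"), (read_file, "README.md")]))
```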
Hao Kang@GT_HaoKang·
We are also working with the NVIDIA Dynamo team to test the stability of agentic infrastructure at 100+ GPU scale. Stay tuned!
Kyle Kranen@KranenKyle·
@GT_HaoKang Let’s get the Blackwell racks printing tokens 😀 Really excited to explore agentic optimizations with you!
Hao Kang@GT_HaoKang·
ThunderAgent has contributed a coding-agentic RL training recipe to SkyRL, achieving a 3.01× rollout speedup with no accuracy loss! 🚄 Using this stack, we successfully trained a 32B coding model on 5 H100 nodes! ThunderAgent is an efficient agentic serving runtime; it was accepted to ICML 2026 as a Spotlight paper and has been used in TogetherAI and other industry products. More from our team is coming this August. Agents are reshaping the LLM infrastructure stack. code: github.com/ThunderAgent-o… pr: github.com/ergt10/SkyRL/t… paper: arxiv.org/pdf/2602.13692 @NovaSkyAI @istoica05 @charlie_ruan @togethercompute @NVIDIAAI @DachengLi177
[2 images]
Kyle Kranen@KranenKyle·
Interestingly, long context attention actually enables more opportunities to stream weights into HBM, decreasing the memory requirements for weights stored in HBM at any given time, which pairs well with managing the larger KV cache. Check out our work on this:
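A rough illustration of the idea (not Dynamo's implementation), assuming PyTorch on a CUDA GPU: the next layer's weights are prefetched from pinned host memory on a side stream while the current layer computes, so only a layer or two of weights needs to be resident in HBM at once. With long-context attention, per-layer compute is large enough to hide the copies.

```python
import torch

# Requires a CUDA GPU; layer count and sizes are illustrative only.
assert torch.cuda.is_available()
device = torch.device("cuda")
copy_stream = torch.cuda.Stream()

n_layers, d = 8, 4096
# Weights live in pinned host memory so host->HBM copies can run
# asynchronously and overlap with compute on the default stream.
host_weights = [torch.randn(d, d).pin_memory() for _ in range(n_layers)]

x = torch.randn(8192, d, device=device)  # stand-in for long-context activations
resident = host_weights[0].to(device, non_blocking=True)

for i in range(n_layers):
    if i + 1 < n_layers:
        # Prefetch the next layer's weights on the side stream while
        # the current layer computes.
        with torch.cuda.stream(copy_stream):
            prefetched = host_weights[i + 1].to(device, non_blocking=True)
    # Stand-in for the layer's attention + MLP work; with long contexts this
    # is the expensive part that hides the copy above.
    x = torch.relu(x @ resident)
    if i + 1 < n_layers:
        # Don't start the next layer until its weights have landed in HBM.
        torch.cuda.current_stream().wait_stream(copy_stream)
        prefetched.record_stream(torch.cuda.current_stream())
        resident = prefetched

torch.cuda.synchronize()
print(x.shape)
```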
0xSero@0xSero·
My next big thing.
[image]
Kyle Kranen@KranenKyle·
@0xSero This is awesome! Let us know if you have any Dynamo questions or suggestions!
Kyle Kranen@KranenKyle·
@richardczl We have some neat CUDA checkpoint work for faster starts in NVIDIA Dynamo!
Richard Chen@richardczl·
Two things stand out in this data: SGLang cold start without snapshots (83s) is already ahead at the baseline, and snapshots bring things down dramatically. CUDA context checkpointing is the right place to attack this problem. Cold start latency is one of those costs that quietly kills production economics at scale. @modal @lmsysorg
Modal@modal

New replicas of @vllm_project and @sgl_project servers start up 3-10x faster on Modal. Read the article to learn how -- from GPU health management to CUDA context checkpointing.

himanshu@himanshustwts·
The harness of Claude Code is very interesting. A random unstable header at the start of the prompt was breaking KV-cache reuse on a 52k-token context. NVIDIA stripped it out and TTFT dropped by 5x.
[image]
NVIDIA AI@NVIDIAAI

Most agentic stacks run into the same problems pretty quickly: reasoning and tool parsing drift across turns, KV cache reuse falls apart, or tools fire too late. We’ve been hardening Dynamo’s harness-facing path so @Claudeai Code, @OpenClaw, and @openai Codex-style agent patterns behave reliably on custom stacks and inference endpoints:
• Stable prompts for KV reuse and lower TTFT
• Interleaved reasoning + tool calls preserved across turns
• Streaming tool dispatch instead of end-of-turn buffering
• Harness behavior aligned with real multi-turn agent runtimes
If you’re building your own agent stack or serving endpoint, this blog goes through the infrastructure issues that tend to show up in practice and the patterns we’ve been using to fix them. Tech blog ➡️nvda.ws/4dj5KzF

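A toy simulation of why that header matters, with a whitespace "tokenizer" and block size chosen purely for illustration (this is not Dynamo's prefix cache): prefix caching keys each block on everything before it, so a volatile header at position 0 invalidates every block, while the same header placed after the stable system prompt leaves the long prefix reusable.

```python
import time

BLOCK = 16  # tokens per cache block, as in paged/prefix caching

def tokens(text: str) -> list[str]:
    # Whitespace "tokenizer", purely for illustration.
    return text.split()

def cached_prefix_blocks(cache: set, toks: list) -> int:
    """Count leading blocks of `toks` whose full prefix is already cached."""
    hits = 0
    for i in range(0, len(toks) - BLOCK + 1, BLOCK):
        if tuple(toks[: i + BLOCK]) in cache:  # keyed on the entire prefix
            hits += 1
        else:
            break
    return hits

def add_to_cache(cache: set, toks: list) -> None:
    for i in range(0, len(toks) - BLOCK + 1, BLOCK):
        cache.add(tuple(toks[: i + BLOCK]))

system = "you are a coding agent " * 200  # long, stable instructions

for layout in ("volatile-first", "stable-first"):
    cache: set = set()
    for turn in range(2):
        header = f"session {time.time_ns()} "  # changes on every request
        prompt = header + system if layout == "volatile-first" else system + header
        toks = tokens(prompt + f"user turn {turn}")
        print(layout, "turn", turn, "reused blocks:", cached_prefix_blocks(cache, toks))
        add_to_cache(cache, toks)
```

With the volatile header first, the second turn reuses zero blocks; with the stable system prompt first, nearly the whole prefix is reused, which is why TTFT drops so sharply.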
Kyle Kranen retweeted
ishan@0xishand·
One of the hardest parts of LLM inference is staying compliant with the various API specs and keeping up with the nuances of every new reasoning/tool-call parser. We now ship 3 standalone crates (dynamo protocols, tokenizers, and parsers) so that everyone can build on and contribute to this battle-hardened foundation. Links in next tweet:
NVIDIA AI@NVIDIAAI

Most agentic stacks run into the same problems pretty quickly: reasoning and tool parsing drift across turns, KV cache reuse falls apart, or tools fire too late. We’ve been hardening Dynamo’s harness-facing path so @Claudeai Code, @OpenClaw, and @openai Codex-style agent patterns behave reliably on custom stacks and inference endpoints:
• Stable prompts for KV reuse and lower TTFT
• Interleaved reasoning + tool calls preserved across turns
• Streaming tool dispatch instead of end-of-turn buffering
• Harness behavior aligned with real multi-turn agent runtimes
If you’re building your own agent stack or serving endpoint, this blog goes through the infrastructure issues that tend to show up in practice and the patterns we’ve been using to fix them. Tech blog ➡️nvda.ws/4dj5KzF

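A minimal sketch of the streaming-dispatch pattern from the quoted post, using a hypothetical <tool>...</tool> wire format and simulated latencies: tool calls are launched the moment they parse out of the stream, rather than buffered until the end of the turn.

```python
import asyncio
import json

async def model_stream():
    # Hypothetical token stream: a tool call appears mid-turn, then more text.
    chunks = ['Let me check. <tool>{"name": "search", '
              '"args": {"q": "dynamo"}}</tool> Meanwhile, ...', ' done.']
    for c in chunks:
        await asyncio.sleep(0.1)  # network / decode latency
        yield c

async def run_tool(call: dict) -> str:
    await asyncio.sleep(0.3)      # hypothetical tool latency
    return f"tool {call['name']} finished"

async def streaming_dispatch():
    buf, tasks = "", []
    async for chunk in model_stream():
        buf += chunk
        # Dispatch as soon as a complete <tool>...</tool> span is available,
        # instead of buffering until the end of the turn.
        while "<tool>" in buf and "</tool>" in buf:
            start = buf.index("<tool>") + len("<tool>")
            end = buf.index("</tool>")
            tasks.append(asyncio.create_task(run_tool(json.loads(buf[start:end]))))
            buf = buf[end + len("</tool>"):]
    for result in await asyncio.gather(*tasks):
        print(result)

asyncio.run(streaming_dispatch())
```

The tool runs concurrently with the remainder of the model's turn, so its latency is largely hidden instead of added on at the end.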
Kyle Kranen@KranenKyle·
Making sure that streaming harness behavior matches reference API behavior is essential if your inference stack is going to support harnesses. Check out how we’ve done it for Dynamo! developer.nvidia.com/blog/streaming…
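One way to check that kind of parity, sketched against OpenAI-style chat-completion streaming chunks (the chunk payloads below are hand-written for illustration, not captured from any endpoint): reassemble the streamed deltas into a full message and assert it equals the non-streaming reference response.

```python
def merge_stream(chunks: list[dict]) -> dict:
    """Reassemble a full assistant message from OpenAI-style streaming deltas."""
    message = {"role": "assistant", "content": "", "tool_calls": []}
    for chunk in chunks:
        delta = chunk["choices"][0]["delta"]
        message["content"] += delta.get("content") or ""
        for tc in delta.get("tool_calls", []):
            idx = tc["index"]
            while len(message["tool_calls"]) <= idx:
                message["tool_calls"].append(
                    {"type": "function", "function": {"name": "", "arguments": ""}})
            fn = tc.get("function", {})
            message["tool_calls"][idx]["function"]["name"] += fn.get("name", "")
            message["tool_calls"][idx]["function"]["arguments"] += fn.get("arguments", "")
    return message

# Hand-written chunks standing in for a streamed response.
streamed = [
    {"choices": [{"delta": {"content": "Checking "}}]},
    {"choices": [{"delta": {"content": "now.", "tool_calls": [
        {"index": 0, "function": {"name": "search", "arguments": '{"q":'}}]}}]},
    {"choices": [{"delta": {"tool_calls": [
        {"index": 0, "function": {"arguments": ' "dynamo"}'}}]}}]},
]
# What the same request returns from the non-streaming reference endpoint.
reference = {"role": "assistant", "content": "Checking now.",
             "tool_calls": [{"type": "function",
                             "function": {"name": "search",
                                          "arguments": '{"q": "dynamo"}'}}]}

assert merge_stream(streamed) == reference, "stream/reference divergence"
print("streaming output matches reference")
```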
Lillian Ma@lillian_ma_·
Big day 🚀 As early adopters of TRT-LLM, Dynamo, and NIM, we’re at @nvidia’s Inference Codesign Day meeting the team IRL. As an inference infra provider, partnering with NVIDIA to bring world-class inference to our customers is exactly where we want to be. Heard TRT-LLM is expanding coverage for visual generative models 👀 the road ahead is going to be wild. @gmi_cloud @bj0hn5on @ReneeYao1 @NVIDIAAI @NVIDIAAIDev
[4 images]
Ying Sheng@ying11231·
Congrats @radixark ! From SGLang @lmsysorg to Miles, and to future products, RadixArk is dedicated to building a crucible capable of repeatedly producing cutting-edge AI, bringing the best of AI into every household.

We believe in a future of AI diversity and hope to drive the integration of AI into every aspect of production and daily life. In the future we envision, AI will become a partner to many companies and individuals, finding ways to self-evolve—in production, in daily companionship, and within virtual worlds.

Everything we have experienced and will continue to experience in the SGLang and Miles open-source communities is unforgettable and highly anticipated. It has been both demanding and exhilarating, allowing us to see friendship, the world, and the boundaries. Over the past six months, I have witnessed for the first time how a united team moves forward hand in hand, and how deeply passionate they are about creation. Each of us has taken on our respective roles and numerous new tasks for the first time; we are all stepping out of our comfort zones, growing, and creating at a rapid pace.

"It’s the step-by-step journey of a thousand miles that has carried us here today, and the same relentless march that will lead us into the tens of thousands of miles yet to come."

In an era where AI has made ordinary productivity cheaper, relentless, day-to-day refinement has increasingly become the rare key that drives innovation and the future. We hope this will forever remain the soul of RadixArk's culture: focused, uncompromising, humble, and fearless. The underlying logic of creation is not the deliberate pursuit of novelty, but rather independent thinking that remains unswayed by temptation, paired with a meticulous drive for perfection.
RadixArk@radixark

Today, we are thrilled to officially launch RadixArk with $100M in Seed funding at a $400M valuation. The round was led by @Accel and co-led by @sparkcapital.

RadixArk exists to make frontier AI infrastructure open and accessible to everyone. Today, the systems behind the most capable AI models are concentrated in a small number of companies. As a result, most AI teams are forced to rebuild training and inference stacks from scratch, duplicating the same infrastructure work instead of focusing on new models, products, and ideas. RadixArk was founded to change that. We are building an AI platform that makes it easier for teams to train and serve the best models at scale.

RadixArk comes from the open-source community. We started with SGLang, where many of us are core developers and maintainers, and expanded our work to Miles for large-scale RL and post-training. We will continue contributing to both projects and working with the community to make them the strongest open-source infrastructure foundations for frontier AI.

We would like to thank our long-term partners, contributors, and the broader SGLang community for believing in this mission. We're also grateful to @Accel and @sparkcapital, NVentures (Venture capital arm of @nvidia), Salience Capital, A&E Investment, @HOFCapital, @walden_catalyst, @AMD, LDVP, WTT Fubon Family, @MediaTek, Vocal Ventures, @Sky9Capital and our angel investors @ibab, @LipBuTan1, Hock Tan, @johnschulman2, @soumithchintala, @lilianweng, @oliveur, @Thom_Wolf, @LiamFedus, @robertnishihara, @ericzelikman, @OfficialLoganK, and @multiply_matrix among others.

Thanks for the exclusive interview with @MeghanBobrowsky at @WSJ about our vision.

The TWIML AI Podcast@twimlai·
In this episode, @philipkiely, head of AI education at @baseten, joins us to unpack the fast-evolving discipline of inference engineering. We explore why inference has become the stickiest and most critical workload in AI, how it blends GPU programming, applied research, and large-scale distributed systems, and where the line sits between inference and model serving. Philip shares how research-to-production can move in hours, not months, and why understanding “the knobs” of inference—batching, quantization, speculation, and KV cache reuse—lets teams design better products and SLAs. We trace the inference maturity journey from closed APIs to dedicated deployments and in-house platforms, discuss GPU lifecycles, and survey today’s runtime landscape, including vLLM, SGLang, and TensorRT LLM. Finally, we look ahead to agents and multimodality, making the case for specialized, workload-specific runtimes when performance and efficiency matter most.

🗒️ For the full list of resources for this episode, visit the show notes page: twimlai.com/go/766.

📖 CHAPTERS
00:00 - Introduction
03:40 - Why inference is the most important AI workload?
06:21 - Inference vs model serving
07:18 - Inference challenges
09:57 - Pace of inference research to production timeline
13:41 - Reasons to care about inference engineering
15:49 - Considerations in build vs buy decisions
22:08 - Product maturity cycle
27:14 - GPU lifecycles in inference maturity
32:14 - LLM-assisted inference
36:46 - Agents and multimodal models in specialized inference optimization
47:21 - Open source runtimes: vLLM, SGLang, and TensorRT LLM
49:50 - Specialized AI hardware
51:24 - Future trends and predictions
52:36 - Where to find the inference engineering book
Kyle Kranen@KranenKyle·
Some awesome work by the SGLang and NVIDIA teams to drive GB200 performance forwards!
SemiAnalysis@SemiAnalysis_

GB300 NVL72 Rack Scale Dynamo SGLang disaggregation has up to 6.5x better performance than B200 on DeepSeekv4 Pro 1.6T 🚀 The high-throughput configuration uses @deepseek_ai's MegaMoe kernels, which fully fuse & overlap EP dispatch, EP combine, and the GEMMs into a single kernel. This performance comes from the 10x engineers @BanghuaZ, Tom & the rest of the team at @radixark, @lmsysorg & @NVIDIAAI rapidly enabling it! Big shoutout to @CoreWeave for contributing temporary GB300 NVL72 racks toward open-source performance optimization for all to benefit!
