Kyle Kranen

225 posts

@KranenKyle

Engineering Leader for Planetary Scale Inference with NVIDIA Dynamo

San Francisco · Joined March 2021

95 Following · 621 Followers

Kyle Kranen@KranenKyle·2d

Baseten are cooking! Baseten have been working with us on Dynamo since 0.1, and have been nothing but incredible partners. Really excited to see the impact that Dynamo brought to this SOTA endpoint (2x TPS! For free!)

Philip Kiely@philipkiely

x.com/i/article/2069…

English

5.4K

Kyle Kranen reposted

Philip Kiely@philipkiely·2d

x.com/i/article/2069…

ZXX

132

1.4K

510.4K

Kyle Kranen@KranenKyle·2d

@luke_clancy1 This is super cool :)

English

217

luke clancy@luke_clancy1·3d

come host your company's dinners in my SF home.
we've hosted tons of dinners / mixers / other shenanigans here. now we want to give you access.
it's an epic venue.
- 4600 sq ft
- natural light
- haight ashbury
- upstairs + downstairs
- tons of nooks to chat in
can easily host a 50 person mixer; 25 person sit-down dinner. prob best for upscale mixer w/ food. we know a great chef + servers.
can beat (almost) every nice venue on price. DM me or @aidanmurphy if interested
upstairs: pic 1,2
downstairs: pic 3,4

English

137

29.4K

Kyle Kranen@KranenKyle·2d

@SuJinYan123 Can you ping me the PR? I’ll make sure we take a look :)

English

216

susun@SuJinYan123·3d

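Inference really is complexity-intensive work, and there is so much I haven't done yet. If this week goes well, there should be a blog post on speculative decoding. After that I don't want to work on small models anymore and plan to move to MoE, but MoE is a headache too. Should I write up P/D or not? Dynamo still hasn't looked at my PR, so maybe I should just go modify the vLLM router instead. I can't get the kernel tuning to work either, and metrics aren't done. Headache, headache.

Chinese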

1.9K

Kyle Kranen@KranenKyle·3d

@MeryemArik9 Can’t wait for a Fergus blog post on this :) SGLang cold start one was quite good!

English

Meryem Arik@MeryemArik9·3d

This is 100% true - our day 0 support is just about getting the model working & live - (We still price positive margin at this point while being most market competitive). And then as you say we optimize the deployment more over the next few days / weeks (more popular models get more optimization) - either we bank the extra margin or decrease prices further.

English

217

Kyle Kranen@KranenKyle·3d

1/ I see a lot of analysis of GLM 5.2 vs closed source models based on day 0 API pricing. Almost every day 0 model release I’ve been a part of has had *significant* room to improve purely with improvements in software (>30x in some cases).

English

2.9K

Kyle Kranen@KranenKyle·3d

@lunacleon As there are with any new technology! As the value of a new technology is proven over time, people become more comfortable with it. GMOs are still unpopular with many, but are credited with saving hundreds of millions (if not billions) of lives!

English

lunacleon@lunacleon·3d

@KranenKyle ah but then there’s regulatory hurdles + liability + cultural malaise to overcome

English

Kyle Kranen@KranenKyle·4d

In Machines of Loving Grace, Dario argues there are sets of problems where diffusion of AI capability will be slow due to being Amdahl’s bottlenecked by the real world. Meds, HW, construction, all fit this class of problem. Bullish on simulation that removes that bottleneck.

English

1.3K

Kyle Kranen@KranenKyle·3d

@alphatozeta8148 You stole my next tweet 😡

English

Dhruv Singal@alphatozeta8148·3d

@KranenKyle ++ and there is so much room to optimize for specific use cases when you exploit known patterns!

English

Kyle Kranen@KranenKyle·3d

4/ This is also true of closed source models. There are opportunities to *significantly* improve margins with fixed token pricing over time!

English

235

Kyle Kranen@KranenKyle·3d

3/ The better the model is, the more incentive there will be to optimize it in both closed and open source!

English

331

Kyle Kranen@KranenKyle·3d

@gabriel1 Recruit them for your startup! You can prove out your thesis here and now 😉

English

Kyle Kranen@KranenKyle·4d

@xeophon grep is certainly faster!

English

150

Florian Brand@xeophon·4d

I asked the clanker to find performance improvements and it deleted the whole project???

English

3.8K

Kyle Kranen@KranenKyle·4d

@peholderrieth @nvidia Welcome to NVIDIA!

English

518

Peter Holderrieth@peholderrieth·4d

Hi everyone! I’ve moved to the Bay Area for a summer research internship at @nvidia. Beyond exciting work, I'd love to meet new people doing exciting stuff (incl. stuff I don't work on myself rn!). If you’re around, I’d love to connect! Even if just for a jam session!

English

352

36.3K

Kyle Kranen@KranenKyle·4d

@mweinbach Make a pool with your 7 best friends to buy a DGX B200 🤔

English

5.4K

Max Weinbach@mweinbach·4d

The minimum to run the model is ~$20K in hardware and you get ~20 tok/s out
~$20K gets you around 34.6B tokens at a 12:1 input to output ratio assuming good token caching
If you ran the hardware 24/7, it would take roughly 5.5 years to break even

Jordan Nanos@JordanNanos

GLM 5.2 costs $1.40/4.40 per Mtok at 40 tok/sec and people seriously consider buying GPU rigs for it

English

130

1.5K

336K
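The break-even reasoning in the two posts above can be sketched as a small calculation. This is a rough illustration only: the function name and parameters are hypothetical, and it assumes the hardware runs 24/7, that input tokens are processed at a fixed ratio to output tokens, and that electricity and other operating costs are zero. The default prices are the list prices quoted above ($1.40 input, $4.40 output per Mtok); the posts do not state a discounted cached-input price, so none is assumed here.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def break_even_years(
    hardware_cost_usd: float,
    output_tok_per_s: float,
    input_to_output_ratio: float,
    input_price_per_mtok: float = 1.40,   # list price quoted in the thread
    output_price_per_mtok: float = 4.40,  # list price quoted in the thread
) -> float:
    """Years of 24/7 use before API spend would equal the hardware cost.

    Assumption (not from the posts): power and maintenance cost nothing.
    """
    # Millions of output tokens the rig produces in one year.
    output_mtok_per_year = output_tok_per_s * SECONDS_PER_YEAR / 1e6
    # Input tokens scale with output tokens at the given ratio.
    input_mtok_per_year = output_mtok_per_year * input_to_output_ratio
    # What the same yearly traffic would cost through the API.
    api_cost_per_year = (
        input_mtok_per_year * input_price_per_mtok
        + output_mtok_per_year * output_price_per_mtok
    )
    return hardware_cost_usd / api_cost_per_year


years_at_list_price = break_even_years(20_000, 20, 12)
```

At list prices this comes out to roughly 1.5 years. A cheaper cached-input price lengthens the break-even time, which may account for the longer figure in the post, though the exact price used there is not given.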

Kyle Kranen@KranenKyle·5d

@jonoringer Note that with 8 B200s you can run larger than BS=1, improving the arithmetic intensity and efficiency per user token of the model.

English

Jon Oringer@jonoringer·5d

sooo.. To match the inference speed and intelligence of a production-hosted Claude 3 Opus (or comparable 2026 frontier model), GLM-5.2 requires 8 NVIDIA Blackwell B200 or B300 GPUs running in FP8 quantization...

English

205

127.1K
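The point in the reply above about batch size and arithmetic intensity can be illustrated with a simplified roofline-style estimate. This is a sketch under loud assumptions: it counts only weight reads (no KV cache, no activations), treats the model as dense so that every weight is read once per decode step and shared by all sequences in the batch, and ignores MoE routing. The function name is hypothetical.

```python
def decode_arithmetic_intensity(batch_size: int, bytes_per_param: float = 1.0) -> float:
    """Approximate FLOPs per byte of weights read in one decode step.

    Assumptions (simplifications, not from the posts): dense model,
    weights-only memory traffic. bytes_per_param=1.0 corresponds to FP8.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    # Each weight contributes one multiply and one add per sequence in the batch.
    flops_per_param = 2.0 * batch_size
    # Each weight is read from memory once per step, regardless of batch size.
    return flops_per_param / bytes_per_param


single_user = decode_arithmetic_intensity(1)    # BS=1, FP8
batched = decode_arithmetic_intensity(64)       # BS=64, FP8
```

Under these assumptions intensity grows linearly with batch size, so the same weight traffic is amortized across more user tokens.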

Kyle Kranen@KranenKyle·5d

@ishgirwan Intelligent engine hparam sweeping is already done in prod! I’m talking about generating the E2E code (including kernel selection, overlapping, etc). Note that the concept of stable hackable primitives does some heavy lifting here.

English

121

Ish@ishgirwan·5d

@KranenKyle Will this be more like hyperparameter optimization for each model based on its deployment configs? What will these deployment configs be apart from kernels, tp? Also how can it be done efficiently?

English

123

Kyle Kranen@KranenKyle·5d

We feel remarkably close to auto-generating SOTA LLM inference engines to target single model single Pareto point deployments using some set of validated primitives (kernels, block manager, etc)! Seems very hill-climbable.

English

3.2K
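For readers unfamiliar with the "Pareto point" framing in the post above, here is a minimal sketch of picking the Pareto-optimal deployment configurations from a set of measurements. Everything in it is hypothetical: the `Config` type, the two axes (per-user speed and per-GPU throughput, both higher-is-better), and the numbers, which are made up purely for illustration and are not benchmark results.

```python
from typing import List, NamedTuple


class Config(NamedTuple):
    name: str
    tok_s_per_user: float  # interactivity (higher is better)
    tok_s_per_gpu: float   # throughput per GPU (higher is better)


def pareto_front(configs: List[Config]) -> List[Config]:
    """Return configs that no other config beats or ties on both axes."""
    front = []
    for c in configs:
        dominated = any(
            o.tok_s_per_user >= c.tok_s_per_user
            and o.tok_s_per_gpu >= c.tok_s_per_gpu
            and (o.tok_s_per_user > c.tok_s_per_user
                 or o.tok_s_per_gpu > c.tok_s_per_gpu)
            for o in configs
        )
        if not dominated:
            front.append(c)
    return sorted(front, key=lambda c: c.tok_s_per_user)


# Made-up illustrative measurements, not real data.
measurements = [
    Config("A", 10, 900),
    Config("B", 40, 600),
    Config("C", 40, 500),   # dominated by B
    Config("D", 120, 150),
    Config("E", 100, 100),  # dominated by D
]
front_names = [c.name for c in pareto_front(measurements)]
```

A "single Pareto point deployment" in this framing would be one chosen entry on that front, for example the configuration that maximizes per-GPU throughput at a required per-user speed.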

Kyle Kranen@KranenKyle·5d

@charles_irl Where? Can I have some?

English

254

Kyle Kranen@KranenKyle·5d

@willccbb Or holistic (optimize this Pareto) benchmark

English

152

will brown@willccbb·6d

a lot of the benches beloved by model connoisseurs are things like "PostTrainBench" and "WeirdML", and we're probably due for another good kernel benchmark soon
the labs will soon have to choose between "pushing the frontier" via headline numbers and self-commoditization

English

12K
