Kyle Kranen

225 posts

@KranenKyle

Engineering Leader for Planetary Scale Inference with NVIDIA Dynamo

San Francisco · Joined March 2021
95 Following · 621 Followers
Kyle Kranen@KranenKyle·
Baseten are cooking! Baseten have been working with us on Dynamo since 0.1, and have been nothing but incredible partners. Really excited to see the impact that Dynamo brought to this SOTA endpoint (2x TPS! For free!)
Philip Kiely@philipkiely

x.com/i/article/2069…

English
1
2
48
5.4K
luke clancy@luke_clancy1·
come host your company's dinners in my SF home. we've hosted tons of dinners / mixers / other shenanigans here. now we want to give you access. it's an epic venue.
- 4600 sq ft
- natural light
- haight ashbury
- upstairs + downstairs
- tons of nooks to chat in
can easily host a 50 person mixer; 25 person sit-down dinner. prob best for upscale mixer w/ food. we know a great chef + servers. can beat (almost) every nice venue on price. DM me or @aidanmurphy if interested
upstairs: pic 1,2
downstairs: pic 3,4
English
22
1
137
29.4K
susun@SuJinYan123·
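Inference really is complexity-intensive work, and there is so much I haven't gotten to. If this week goes smoothly there should be a blog post on speculative decoding. After that I don't want to work on small models anymore; I'm planning to move on to MoE, and MoE is a headache too. Should I write about P/D? Dynamo still hasn't looked at my PR, so maybe I should just go modify the vLLM router instead. I can't get the kernels tuned either, and the metrics aren't done. Headache, headache.
Chinese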
2
2
23
1.9K
Kyle Kranen@KranenKyle·
@MeryemArik9 Can’t wait for a Fergus blog post on this :) SGLang cold start one was quite good!
English
0
0
2
84
Meryem Arik@MeryemArik9·
This is 100% true - our day 0 support is just about getting the model working & live - (We still price positive margin at this point while being most market competitive). And then as you say we optimize the deployment more over the next few days / weeks (more popular models get more optimization) - either we bank the extra margin or decrease prices further.
English
1
0
1
217
Kyle Kranen@KranenKyle·
1/ I see a lot of analysis of GLM 5.2 vs closed source models based on day 0 API pricing. Almost every day 0 model release I’ve been a part of has had *significant* room to improve purely with improvements in software (>30x in some cases).
English
2
0
23
2.9K
Kyle Kranen@KranenKyle·
@lunacleon As there is with any new technology! As the value of a new technology is proven over time, people become more comfortable with it. GMOs are still unpopular with many, but are credited with saving hundreds of millions (if not billions) of lives!
English
0
0
0
13
lunacleon@lunacleon·
@KranenKyle ah but then there’s regulatory hurdles + liability + cultural malaise to overcome
English
1
0
0
18
Kyle Kranen@KranenKyle·
In Machines of Loving Grace, Dario argues there are sets of problems where diffusion of AI capability will be slow due to being Amdahl’s bottlenecked by the real world. Meds, HW, construction, all fit this class of problem. Bullish on simulation that removes that bottleneck.
English
1
1
21
1.3K
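The "Amdahl's bottlenecked" point in the post above can be made concrete with Amdahl's law: if only a fraction of a workflow is accelerated, the part left untouched caps the overall gain. The sketch below is illustrative only. The function name and the 50% example fraction are assumptions chosen for the demonstration, not figures from the post or from the essay it cites.

```python
def amdahl_speedup(accelerated_fraction: float, factor: float) -> float:
    """Overall speedup when only part of a workflow gets faster (Amdahl's law).

    accelerated_fraction: share of the original time that benefits (0..1).
    factor: how many times faster that share becomes.
    """
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be in [0, 1]")
    if factor <= 0.0:
        raise ValueError("factor must be positive")
    # Time that is not accelerated, e.g. real-world trials or construction.
    untouched = 1.0 - accelerated_fraction
    return 1.0 / (untouched + accelerated_fraction / factor)


if __name__ == "__main__":
    # Hypothetical split: half of the time is real-world work that AI cannot speed up.
    for factor in (10.0, 1_000.0, 1_000_000.0):
        print(factor, round(amdahl_speedup(0.5, factor), 3))
```

However large the factor gets, the overall speedup in this example stays below 2x, which is why shrinking the untouched real-world share (for instance through simulation) matters more than making the accelerated share even faster.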
Dhruv Singal@alphatozeta8148·
@KranenKyle ++ and there is so much room to optimize for specific use cases when you exploit known patterns!
English
1
0
0
17
Kyle Kranen@KranenKyle·
4/ This is also true of closed source models. There are opportunities to *significantly* improve margins with fixed token pricing over time!
English
0
0
7
235
Kyle Kranen@KranenKyle·
3/ The better the model is, the more incentive there will be to optimize it in both closed and open source!
English
1
0
8
331
Kyle Kranen@KranenKyle·
@gabriel1 Recruit them for your startup! You can prove out your thesis here and now 😉
English
0
0
0
21
Florian Brand@xeophon·
I asked the clanker to find performance improvements and it deleted the whole project???
English
10
1
56
3.8K
Peter Holderrieth@peholderrieth·
Hi everyone! I’ve moved to the Bay Area for a summer research internship at @nvidia. Beyond exciting work, I'd love to meet new people doing exciting stuff (incl. stuff I don't work on myself rn!). If you’re around, I’d love to connect! Even if just for a jam session!
English
26
9
352
36.3K
Kyle Kranen@KranenKyle·
@mweinbach Make a pool with your 7 best friends to buy a DGX B200 🤔
English
4
1
34
5.4K
Kyle Kranen@KranenKyle·
@jonoringer Note that with 8 B200s you can run larger than BS=1, improving the arithmetic intensity and efficiency per user token of the model.
English
1
0
16
6K
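The batch-size point in the reply above can be illustrated with a rough roofline-style estimate for a single weight matrix during decode. This is a minimal sketch under stated assumptions: the function name, the 8192 x 8192 layer shape, 1-byte (FP8) weights, and 2-byte activations are all hypothetical, and it ignores attention, KV cache traffic, and kernel overheads.

```python
def decode_arithmetic_intensity(
    batch_size: int,
    d_in: int,
    d_out: int,
    weight_bytes: float = 1.0,  # assumed FP8 weights
    act_bytes: float = 2.0,     # assumed 16-bit activations
) -> float:
    """Approximate FLOPs per byte moved for one (batch, d_in) x (d_in, d_out) matmul."""
    # One multiply and one add per weight, per token in the batch.
    flops = 2.0 * batch_size * d_in * d_out
    # Weights are read once per step regardless of batch size;
    # activations are read (d_in) and written (d_out) once per token.
    bytes_moved = d_in * d_out * weight_bytes + batch_size * (d_in + d_out) * act_bytes
    return flops / bytes_moved


if __name__ == "__main__":
    for batch_size in (1, 8, 64):
        intensity = decode_arithmetic_intensity(batch_size, 8192, 8192)
        print(batch_size, round(intensity, 1))
```

Because the weight read is shared across every sequence in the batch, the FLOPs per byte grow almost linearly with batch size at these shapes, so the same memory traffic serves more user tokens.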
Jon Oringer@jonoringer·
sooo.. To match the inference speed and intelligence of a production-hosted Claude 3 Opus (or comparable 2026 frontier model), GLM-5.2 requires 8 NVIDIA Blackwell B200 or B300 GPUs running in FP8 quantization...
English
17
16
205
127.1K
Kyle Kranen@KranenKyle·
@ishgirwan Intelligent engine hparam sweeping is already done in prod! I’m talking about generating the E2E code (including kernel selection, overlapping, etc). Note that the concept of stable hackable primitives does some heavy lifting here.
English
0
0
1
121
Ish@ishgirwan·
@KranenKyle Will this be more like hyperparameter optimization for each model based on its deployment configs? What will these deployment configs be apart from kernels, tp? Also, how can it be done efficiently?
English
1
0
0
123
Kyle Kranen@KranenKyle·
We feel remarkably close to auto-generating SOTA LLM inference engines to target single model single Pareto point deployments using some set of validated primitives (kernels, block manager, etc)! Seems very hill-climbable.
English
2
1
43
3.2K
will brown@willccbb·
a lot of the benches beloved by model connoisseurs are things like "PostTrainBench" and "WeirdML", and we're probably due for another good kernel benchmark soon
the labs will soon have to choose between "pushing the frontier" via headline numbers and self-commoditization
English
6
0
97
12K