Luis Ceze

1.4K posts

@luisceze

computer architect. marveled by biology. professor @uwcse. ceo @OctoAICloud. venture partner @madronaventures.

Joined May 2010
2.1K Following · 3.6K Followers
Luis Ceze retweeted
Tianqi Chen
Tianqi Chen@tqchenml·
📢 #MLSys2026 this year features contest tracks. Check out the announcement on optimizing FlashInfer-Bench LLM inference kernels for NVIDIA Blackwell GPUs 👉
Zihao Ye@ye_combinator

🚀 MLSys 2026 Contest - @nvidia Track is LIVE! Registration is now open for the FlashInfer-Bench Challenge! Submit high-performance GPU kernels for cutting-edge LLM architectures on NVIDIA Blackwell GPUs.
Three tracks:
* MoE (Mixture of Experts)
* DSA (DeepSeek Sparse Attention)
* GDN (Gated Delta Net)
Human experts AND AI agents welcome — evaluated separately. Let's see who builds the best kernels! 🤖
🎁 Prizes: Winners take home NVIDIA GPUs and are invited to present at MLSys 2026.
⚡ First 50 teams to register get free GPU credits from @modal - huge thanks for the sponsorship @charles_irl!
Whether you're a kernel wizard or building autonomous coding agents, we want to see what you've got.
🔗 Contest details: mlsys26.flashinfer.ai
See you at MLSys 2026! 🔥

0 replies · 11 reposts · 62 likes · 9.3K views
Luis Ceze
Luis Ceze@luisceze·
Calling for humans, AI and AI+humans to participate in this contest! Should be super fun.
Zihao Ye@ye_combinator


0 replies · 2 reposts · 8 likes · 2K views
Luis Ceze retweeted
Tianqi Chen
Tianqi Chen@tqchenml·
CuteDSL 4.3.1 is here 🚀 Major host-overhead optimization (10-40µs down to ~2µs in hot loops), streamlined PyTorch interop (pass torch.Tensors directly, no more conversions needed), and export for use in more languages and environments. All powered by the Apache TVM-FFI ABI.
9 replies · 61 reposts · 334 likes · 53K views
Luis Ceze retweeted
Tony Mongkolsmai
Tony Mongkolsmai@tonymongkolsmai·
Today we are releasing our first public beta of Nsight Python! The goal is to simplify the life of a Python developer by providing a Pythonic way to analyze your kernel code! Check it out and provide feedback! Nsight Python — nsight-python docs.nvidia.com/nsight-python/
10 replies · 48 reposts · 341 likes · 29.4K views
Luis Ceze
Luis Ceze@luisceze·
FlashInfer Bench’s evaluation of kernels against real-world setups will accelerate kernel development by both humans and agents - so cool! Can’t wait to see the advances that will come out of it.
Tianqi Chen@tqchenml

🚀 Excited to launch FlashInfer Bench. We believe AI has the potential to help build LLM systems. To accelerate the path, we need an open schema for critical workloads and an AI-driven virtuous circle. First-class integration with FlashInfer, SGLang and vLLM support 👉

0 replies · 4 reposts · 18 likes · 4.9K views
Luis Ceze retweeted
Shanli Xing
Shanli Xing@shanli_xing·
🤔 Can AI optimize the systems it runs on?
🚀 Introducing FlashInfer-Bench, a workflow that makes AI systems self-improving with agents:
- Standardized signature for LLM serving kernels
- Implement kernels with your preferred language
- Benchmark them against real-world serving workloads
- Fastest kernels get day-0 integrated into production
First-class integration with FlashInfer, SGLang (@lmsysorg), and vLLM (@vllm_project) at launch 🙌
Blog post: flashinfer.ai/2025/10/21/fla…
Leaderboard: bench.flashinfer.ai
3 replies · 44 reposts · 148 likes · 59.2K views
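To make the "standardized signature" idea above concrete, here is a minimal sketch. The signature and names here are hypothetical illustrations, not FlashInfer-Bench's actual schema: the point is that when every competing kernel implements one agreed-upon contract, a harness can benchmark any of them interchangeably against the same workload.

```python
import time
import numpy as np

# Hypothetical standardized signature: every competing kernel takes
# (q, k, v) and returns the attention output. The harness below depends
# only on this contract, never on a kernel's internals.
def naive_attention(q, k, v):
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def benchmark(kernel, q, k, v, iters=10):
    """Time any kernel that satisfies the shared signature."""
    kernel(q, k, v)  # warm-up call
    t0 = time.perf_counter()
    for _ in range(iters):
        out = kernel(q, k, v)
    return out, (time.perf_counter() - t0) / iters

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((64, 32)) for _ in range(3))
out, secs = benchmark(naive_attention, q, k, v)
```

Any faster implementation (Triton, CUDA, etc.) that honors the same signature could be dropped into `benchmark` unchanged, which is what makes leaderboard-style comparison and day-0 integration possible.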
Luis Ceze retweeted
Zhihao Jia
Zhihao Jia@JiaZhihao·
The #MLSys2026 submission deadline is only 2 weeks away (Oct 30)! Submit your best work on ML systems — spanning hardware, compilers, software, models, agents, and eval. This year features both Research and Industry Tracks! Join us in Seattle next spring! mlsys.org
0 replies · 14 reposts · 21 likes · 4.4K views
Luis Ceze retweeted
The AI Investor
The AI Investor@The_AI_Investor·
AMD Instinct MI355X was supposed to compete with NVIDIA Blackwell right? So much for AMD having an advantage in inference.
59 replies · 70 reposts · 447 likes · 84.5K views
Science girl
Science girl@sciencegirl·
Camouflage
41 replies · 72 reposts · 478 likes · 43.2K views
Luis Ceze retweeted
Zihao Ye
Zihao Ye@ye_combinator·
We’re thrilled that FlashInfer won a Best Paper Award at MLSys 2025! 🎉 This wouldn’t have been possible without the community — huge thanks to @lmsysorg’s SGLang for deep co-design (which is critical for inference kernel evolution) and stress-testing over the years, and to @vllm_project for integration support. With continued help from @NVIDIAAIDev, FlashInfer is becoming more stable and faster. Let’s keep building together!
NVIDIA AI Developer@NVIDIAAIDev

🎉 Congratulations to the FlashInfer team – their technical paper, "FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving," just won best paper at #MLSys2025. 🏆
🙌 We are excited to share that we are now backing FlashInfer as a supporter and contributor to the project. We’ve chosen FlashInfer to release our top LLM inference kernels, including those from TensorRT-LLM, making them easy to integrate into @vllm_project, SGLang (@lmsysorg), and custom inference engines.
It started as a collaborative research project at @uwcse, @CarnegieMellon, and OctoAI (acquired by NVIDIA) with the goal of creating a flexible LLM inference kernel library that is engine-agnostic, highly optimized, and easy to extend for new techniques such as algorithms for KV cache reuse. It is now a thriving open-source project with production deployments and contributions from research and development teams across the AI systems community.
Check out FlashInfer today to get started and see our first Blackwell kernels for DeepSeek MLA, available now: nvda.ws/4djKdq7
Congratulations again to Zihao Ye and all authors of the MLSys paper -- Lequn Chen, Wuwei Lin, Yineng Zhang, Stephanie Wang, Baris Kasikci, Arvind Krishnamurthy, Vinod Grover, Tianqi Chen, Ruihang Lai. And thank you to all community contributors; we look forward to continuing to grow this project.
FlashInfer paper: nvda.ws/4kj2Htc
Blackwell MLA kernel: nvda.ws/4jWjLW2

15 replies · 37 reposts · 233 likes · 39.1K views
Luis Ceze retweeted
Ying Sheng
Ying Sheng@ying11231·
Congrats to @ye_combinator @tqchenml @luisceze! FlashInfer has been the real power behind various inference frameworks! Hope to see more people join the community and build their own inference engines on top of it!
Zihao Ye@ye_combinator


1 reply · 4 reposts · 54 likes · 12.4K views
Luis Ceze
Luis Ceze@luisceze·
🚀🎉
NVIDIA AI Developer@NVIDIAAIDev


1 reply · 3 reposts · 10 likes · 1.4K views
Luis Ceze retweeted
Zihao Ye
Zihao Ye@ye_combinator·
LLMs are not all about tensor cores. Categorical sampling under filters (top-p/top-k/min-p) is a critical operator in LLMs as vocabulary sizes grow; FlashInfer uses a sorting-free rejection sampling algorithm for efficient sampling. Check out this great blog post written by @0xsling0 and see how traditional parallel algorithms (e.g. reduction/scan) still shine in the era of LLMs.
Shanli Xing@shanli_xing

🚀Meet flashinfer.sampling—our sorting-free GPU kernels for lightning-fast #LLM sampling. Our implementation achieves over 50% reduction in sampling time. Blog post: flashinfer.ai/2025/03/10/sam…

0 replies · 9 reposts · 39 likes · 4.8K views
Luis Ceze retweeted
Shanli Xing
Shanli Xing@shanli_xing·
🚀Meet flashinfer.sampling—our sorting-free GPU kernels for lightning-fast #LLM sampling. Our implementation achieves over 50% reduction in sampling time. Blog post: flashinfer.ai/2025/03/10/sam…
1 reply · 32 reposts · 181 likes · 31.3K views
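The sorting-free idea in the sampling posts above can be sketched on the CPU. This is a minimal NumPy illustration of rejection-based top-p sampling, not FlashInfer's actual fused CUDA kernel: propose a token from the full categorical distribution and accept it only if the probability mass strictly above it is below p, an O(V) membership test that needs no sort.

```python
import numpy as np

def top_p_rejection_sample(probs, p, rng, max_rounds=64):
    """Draw one token from the top-p (nucleus) distribution without sorting.

    Propose from the full distribution, then accept the proposal only if it
    lies inside the nucleus; membership is checked with an O(V) mass
    comparison instead of a sorted prefix scan. Accepted samples follow the
    renormalized top-p distribution exactly.
    """
    for _ in range(max_rounds):
        tok = int(rng.choice(len(probs), p=probs))    # propose ~ probs
        mass_above = probs[probs > probs[tok]].sum()  # mass of strictly more likely tokens
        if mass_above < p:                            # tok is inside the nucleus
            return tok
    return int(np.argmax(probs))  # fallback; astronomically unlikely to trigger

rng = np.random.default_rng(0)
probs = np.array([0.5, 0.3, 0.1, 0.06, 0.04])
# With p = 0.8 the nucleus is tokens {0, 1}; all other proposals are rejected.
samples = [top_p_rejection_sample(probs, p=0.8, rng=rng) for _ in range(2000)]
```

On a GPU the same accept/reject test becomes a parallel reduction over the vocabulary, which is why classic reduction/scan primitives remain central to fast LLM sampling.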
Luis Ceze retweeted
Tianqi Chen
Tianqi Chen@tqchenml·
Learn more about the latest advances in AI and systems, including LLM serving, efficient attentions, structured outputs, scaling up training, and more topics. Check out #MLSys2025. Accepted papers at mlsys.org/virtual/2025/p… and register today at mlsys.org/Register
4 replies · 24 reposts · 103 likes · 16.7K views
Luis Ceze retweeted
Zihao Ye
Zihao Ye@ye_combinator·
Check out the intra-kernel profiler in FlashInfer to visualize the timeline of each SM/warpgroup over the lifecycle of a CUDA persistent kernel: github.com/flashinfer-ai/… You can clearly see how tensor/CUDA core overlapping, variable-length load balancing, and fusion work.
2 replies · 32 reposts · 148 likes · 8.7K views
Luis Ceze
Luis Ceze@luisceze·
Great to see @OctoAICloud second only to @GroqInc -- given our service runs on off-the-shelf cloud @nvidia hardware. It is all about carefully balancing speed, quality, and cost from a whole-system, cross-stack perspective.
Alex Volkov@altryne

Wanna know whether different LLM providers serve the same LLama 3.1 70B? I sure did! So I ran a quick eval to get some surprising results + open sourced my code 👇 Check out my comparison between @GroqInc @FireworksAI_HQ @OctoAICloud @DeepInfra and @togethercompute

1 reply · 2 reposts · 11 likes · 7.1K views