Mengshiun

9 posts

@mengshyu

Joined April 2024
53 Following · 34 Followers
Mengshiun retweeted
Tianqi Chen (@tqchenml)
FlashInfer won #MLSys2025 best paper🏆, with backing from @NVIDIAAIDev to bring top LLM inference kernels to the community
NVIDIA AI Developer (@NVIDIAAIDev)

🎉 Congratulations to the FlashInfer team – their technical paper, "FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving," just won best paper at #MLSys2025. 🏆 🙌

We are excited to share that we are now backing FlashInfer – a supporter and contributor to the project. We’ve chosen FlashInfer to release our top LLM inference kernels, including those from TensorRT-LLM, making them easy to integrate into @vllm_project, SGLang (@lmsysorg), and custom inference engines.

It started as a collaborative research project at @uwcse, @CarnegieMellon, and OctoAI (acquired by NVIDIA) with the goal of creating a flexible LLM inference kernel library that is engine agnostic, highly optimized, and easy to extend for new techniques such as algorithms for KV cache reuse. It is now a thriving open source project with production deployments and contributions from research and development teams across the AI systems community.

Check out FlashInfer today to get started to see our first Blackwell kernels for DeepSeek MLA available now: nvda.ws/4djKdq7

Congratulations again to Zihao Ye and all authors of the MLSys paper -- Lequn Chen, Wuwei Lin, Yineng Zhang, Stephanie Wang, Baris Kasikci, Arvind Krishnamurthy, Vinod Grover, Tianqi Chen. And thank you to all community contributions, we look forward to continuing to grow this project.

FlashInfer paper: nvda.ws/4kj2Htc
Blackwell MLA kernel: nvda.ws/4jWjLW2

Mengshiun retweeted
Hongyi Jin (@HongyiJin258)
🚀Making cross-engine LLM serving programmable. Introducing LLM Microserving: a new RISC-style approach to design LLM serving API at sub-request level. Scale LLM serving with programmable cross-engine serving patterns, all in a few lines of Python. blog.mlc.ai/2025/01/07/mic…
Mengshiun retweeted
Yixin Dong (@yi_xin_dong)
🚀✨Introducing XGrammar: a fast, flexible, and portable engine for structured generation! 🤖Accurate JSON/grammar generation ⚡️3-10x speedup in latency 🤝Easy LLM engine integration ✅ Now in MLC-LLM, SGLang, WebLLM; vLLM & HuggingFace coming soon! blog.mlc.ai/2024/11/22/ach…
Mengshiun retweeted
Ruihang Lai (@ruihanglai)
The latency of LLM serving has become increasingly important. How to strike a latency-throughput balance? How do TP and spec decoding help? We are thrilled to share the latest benchmark results and lessons for low-latency LLM serving through MLCEngine. blog.mlc.ai/2024/10/10/opt…
Mengshiun (@mengshyu)
Llama-3.2 3B from @AIatMeta is now available on Android! Built with MLC LLM, this lightweight model is faster and more efficient, bringing advanced AI capabilities right to your device. 🦙📱 #AI #MobileAI Check out llm.mlc.ai/docs/deploy/an… for quick start instructions.
Mengshiun retweeted
Charlie Ruan (@charlie_ruan)
Excited to share WebLLM engine: a high-performance in-browser LLM inference engine! WebLLM offers local GPU acceleration via @WebGPU, fully OpenAI-compatible API, and built-in web workers support to separate backend executions. Check out the blog post: blog.mlc.ai/2024/06/13/web…
Mengshiun (@mengshyu)
The latest version of MLC LLM now supports the newly released Qwen2 model! Run it effortlessly on a $100 OrangePi. With Qwen2 0.5B at 17.5 tok/s and 1.5B at 8.9 tok/s, AI capabilities are more accessible than ever. Explore more at MLC LLM: llm.mlc.ai #MLC #LLM #Qwen2 #OrangePi
Ruihang Lai (@ruihanglai)

Announcing MLCEngine, a universal LLM deployment engine with ML Compilation. We rebuilt the engine with state-of-the-art serving optimizations and maximum local env portability. Fully OpenAI compatible for both cloud and local use cases. Check out the blog blog.mlc.ai/2024/06/07/uni…
