Mengshiun

9 posts

Mengshiun

@mengshyu

Katılım Nisan 2024

53 Takip Edilen34 Takipçiler

Mengshiun retweetledi

Tianqi Chen@tqchenml·13 May

FlashInfer won #MLSys2025 best paper🏆, with backing from @NVIDIAAIDev to bring top LLM inference kernels to the community

NVIDIA AI Developer@NVIDIAAIDev

🎉 Congratulations to the FlashInfer team – their technical paper, "FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving," just won best paper at #MLSys2025. 🏆 🙌 We are excited to share that we are now backing FlashInfer – a supporter and contributor to the project. We’ve chosen FlashInfer to release our top LLM inference kernels, including those from TensorRT-LLM, making them easy to integrate into @vllm_project, SGLang (@lmsysorg), and custom inference engines. It started as a collaborative research project at @uwcse, @CarnegieMellon, and OctoAI (acquired by NVIDIA) with the goal of creating a flexible LLM inference kernel library that is engine agnostic, highly optimized, and easy to extend for new techniques such as algorithms for KV cache reuse. It is now a thriving open source project with production deployments and contributions from research and development teams across the AI systems community. Check out FlashInfer today to get started to see our first Blackwell kernels for DeepSeek MLA available now: nvda.ws/4djKdq7 Congratulations again to Zihao Ye and all authors of the MLSys paper -- Lequn Chen, Wuwei Lin, Yineng Zhang, Stephanie Wang, Baris Kasikci, Arvind Krishnamurthy, Vinod Grover, Tianqi Chen. And thank you to all community contributions, we look forward to continuing to grow this project. FlashInfer paper: nvda.ws/4kj2Htc Blackwell MLA kernel: nvda.ws/4jWjLW2

English

140

11K

Mengshiun retweetledi

Hongyi Jin@HongyiJin258·7 Oca

🚀Making cross-engine LLM serving programmable. Introducing LLM Microserving: a new RISC-style approach to design LLM serving API at sub-request level. Scale LLM serving with programmable cross-engine serving patterns, all in a few lines of Python. blog.mlc.ai/2025/01/07/mic…

English

18.5K

Mengshiun retweetledi

Yixin Dong@yi_xin_dong·22 Kas

🚀✨Introducing XGrammar: a fast, flexible, and portable engine for structured generation! 🤖Accurate JSON/grammar generation ⚡️3-10x speedup in latency 🤝Easy LLM engine integration ✅ Now in MLC-LLM, SGLang, WebLLM; vLLM & HuggingFace coming soon! blog.mlc.ai/2024/11/22/ach…

English

259

72.5K

Mengshiun retweetledi

Ruihang Lai@ruihanglai·10 Eki

The latency of LLM serving has become increasingly important. How to strike a latency-throughput balance? How do TP and spec decoding help? We are thrilled to share the latest benchmark results and lessons for low-latency LLM serving through MLCEngine. blog.mlc.ai/2024/10/10/opt…

English

31.8K

Mengshiun@mengshyu·26 Eyl

Llama-3.2 3B from @AIatMeta is now available on Android! Built with MLC LLM, this lightweight model is faster and more efficient, bringing advanced AI capabilities right to your device. 🦙📱 #AI #MobileAI" Check out llm.mlc.ai/docs/deploy/an… for quick start instructions.

English

1.4K

Mengshiun@mengshyu·1 Ağu

Chatting with @GoogleDeepMind's Gemma 2 2B on Android using MLC LLM. It's pretty fast and accurate, beating GPT-3.5. Check out llm.mlc.ai/docs/deploy/an… for quick start instructions.

English

5.1K

Mengshiun retweetledi

Charlie Ruan@charlie_ruan·13 Haz

Excited to share WebLLM engine: a high-performance in-browser LLM inference engine! WebLLM offers local GPU acceleration via @WebGPU, fully OpenAI-compatible API, and built-in web workers support to separate backend executions. Check out the blog post: blog.mlc.ai/2024/06/13/web…

English

386

97.5K

Mengshiun@mengshyu·7 Haz

The latest version of MLC LLM now supports the newly released model Qwen2! Run it effortlessly on a $100 OrangePi. With Qwen2 0.5B 17.5 tok/s, 1.5B 8.9 tok/s, AI capabilities are more accessible than ever. Explore more at MLC LLM llm.mlc.ai #MLC #LLM #Qwen2 #OrangePi

Ruihang Lai@ruihanglai

Announcing MLCEngine, a universal LLM deployment engine with ML Compilation. We rebuilt the engine with state-of-the-art serving optimizations and maximum local env portability. Fully OpenAI compatible for both cloud and local use cases. Check out the blog blog.mlc.ai/2024/06/07/uni…

English

2.4K

Mengshiun@mengshyu·19 Nis

Deploy #Llama3 on $100 Orange Pi with GPU acceleration through MLC LLM. Try it out on your Orange Pi 👉 blog.mlc.ai/2023/08/09/GPU…

English

18.9K

Keşfet

@NVIDIAAIDev @AIatMeta @GoogleDeepMind @WebGPU @elonmusk @BarackObama @taylorswift13 @cristiano