Hongyi Jin

10 posts

@HongyiJin258

CS PhD Student @CSDatCMU

Joined April 2023
3 Following · 888 Followers
Hongyi Jin retweeted
Shanli Xing (@shanli_xing)
🤔 Can AI optimize the systems it runs on?
🚀 Introducing FlashInfer-Bench, a workflow that makes AI systems self-improving with agents:
- Standardized signature for LLM serving kernels
- Implement kernels with your preferred language
- Benchmark them against real-world serving workloads
- Fastest kernels get day-0 integrated into production
First-class integration with FlashInfer, SGLang (@lmsysorg), and vLLM (@vllm_project) at launch 🙌
Blog post: flashinfer.ai/2025/10/21/fla…
Leaderboard: bench.flashinfer.ai
3 replies · 45 reposts · 146 likes · 59.7K views
Hongyi Jin retweeted
CMU School of Computer Science
Huge thank you to @NVIDIADC for gifting a brand new #NVIDIADGX B200 to CMU’s Catalyst Research Group! This AI supercomputing system will afford Catalyst the ability to run and test their work on a world-class unified AI platform.
3 replies · 28 reposts · 140 likes · 81.8K views
Hongyi Jin (@HongyiJin258)
@haozhangml @tqchenml Thank you, Hao! You and the DistServe team did a great job exploring disaggregated LLM serving.
0 replies · 0 reposts · 2 likes · 71 views
Hongyi Jin (@HongyiJin258)
🚀 Making cross-engine LLM serving programmable. Introducing LLM Microserving: a new RISC-style approach to designing LLM serving APIs at the sub-request level. Scale LLM serving with programmable cross-engine serving patterns, all in a few lines of Python. blog.mlc.ai/2025/01/07/mic…
0 replies · 31 reposts · 64 likes · 18.5K views
Hongyi Jin retweeted
Bohan Hou (@bohanhou1998)
Running LLMs natively on your 🤖 @Android phone, following our release of the iOS app. With MLC-LLM and TVM Unity, we are able to optimize and deploy the model in 1 week! 6~7 tokens/sec on Galaxy S23. Demo: mlc.ai/mlc-llm/#andro… Check out the details: mlc.ai/blog/2023/05/0…
8 replies · 32 reposts · 86 likes · 23.7K views
Hongyi Jin retweeted
Bohan Hou (@bohanhou1998)
Can LLMs run natively on your iPhone📱? Our answer is yes, and we can do more! We are introducing MLC-LLM, an open framework that brings language models (LLMs) directly into a broad class of platforms (CUDA, Vulkan, Metal) with GPU acceleration! Demo: mlc.ai/mlc-llm/
30 replies · 170 reposts · 684 likes · 318.4K views
Hongyi Jin (@HongyiJin258)
@standot3 @TheShubhanshu @WebGPU If you mean accuracy, the answer is yes. We use a regular group quantization method, just like other projects. If you mean latency/efficiency, the answer is no. We wrote a careful schedule to speed up the dequantization.
0 replies · 0 reposts · 4 likes · 571 views
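For readers unfamiliar with the "regular group quantization" mentioned in the reply above, here is a minimal pure-Python sketch of the general idea: weights are split into fixed-size groups, and each group stores small integer codes plus its own scale and offset. The function names, the group size of 32, and the asymmetric min/max scheme are all assumptions for illustration; the thread does not say which variant the project uses.

```python
def quantize_group(values, bits=4):
    """Map one group of floats to unsigned `bits`-bit codes with a shared scale and offset."""
    levels = (1 << bits) - 1                # 15 distinct steps for 4-bit codes
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def group_quantize(weights, group_size=32, bits=4):
    """Quantize a flat weight list group by group (group_size is an assumed value)."""
    return [
        quantize_group(weights[start:start + group_size], bits)
        for start in range(0, len(weights), group_size)
    ]


def group_dequantize(groups):
    """Rebuild approximate floats from (codes, scale, offset) triples."""
    restored = []
    for codes, scale, lo in groups:
        restored.extend(c * scale + lo for c in codes)
    return restored


weights = [i / 10 - 3 for i in range(64)]
groups = group_quantize(weights)
restored = group_dequantize(groups)
```

Because every group carries its own scale, the rounding error of each weight is bounded by half of that group's scale, which is why accuracy loss stays comparable to other projects using the same kind of method.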
Hongyi Jin (@HongyiJin258)
Introducing WebLLM, an open-source chatbot that brings language models (LLMs) directly onto web browsers. We can now run instruction fine-tuned LLaMA (Vicuna) models natively in your browser tab via @WebGPU with no server support. Check out our demo at mlc.ai/web-llm.
41 replies · 429 reposts · 1.8K likes · 799.7K views
Hongyi Jin (@HongyiJin258)
@TheShubhanshu @WebGPU We compress the weights into int4 format, so they actually occupy about 4 GB of memory.
3 replies · 4 reposts · 27 likes · 6.1K views
Shubhanshu Mishra (@TheShubhanshu)
@HongyiJin258 @WebGPU Nice. Is it fetching the full 14 GB of model weights into the local cache, or is there some compression and quantization going on?
1 reply · 0 reposts · 7 likes · 8.1K views
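The exchange above goes from 14 GB of weights to about 4 GB after int4 compression. The back-of-the-envelope sketch below shows one way those numbers could line up. It assumes the 14 GB figure refers to 16-bit weights, and that the quantized format stores one 16-bit scale per group of 32 weights; neither detail is stated in the thread, and the function name is hypothetical.

```python
def quantized_size_gb(fp16_size_gb, bits=4, group_size=32, scale_bytes=2):
    """Estimate quantized weight size from a 16-bit size (all defaults are assumptions)."""
    n_params = fp16_size_gb * 1e9 / 2                   # 2 bytes per 16-bit weight
    packed_bytes = n_params * bits / 8                  # int4 packs two weights per byte
    scale_overhead = n_params / group_size * scale_bytes  # one scale per group
    return (packed_bytes + scale_overhead) / 1e9


estimate = quantized_size_gb(14)
print(f"{estimate:.2f} GB")
```

Under these assumptions the packed weights alone take 3.5 GB, and per-group metadata brings the total close to the "about 4 GB" quoted in the reply.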