Hongyi Jin

10 posts

@HongyiJin258

CS PhD Student @CSDatCMU

Joined April 2023

3 Following, 888 Followers

Hongyi Jin retweeted

Shanli Xing@shanli_xing·Oct 21

🤔 Can AI optimize the systems it runs on?

🚀 Introducing FlashInfer-Bench, a workflow that makes AI systems self-improving with agents:
- Standardized signature for LLM serving kernels
- Implement kernels with your preferred language
- Benchmark them against real-world serving workloads
- Fastest kernels get day-0 integrated into production

First-class integration with FlashInfer, SGLang (@lmsysorg), and vLLM (@vllm_project) at launch 🙌

Blog post: flashinfer.ai/2025/10/21/fla…
Leaderboard: bench.flashinfer.ai

146

59.7K

Hongyi Jin retweeted

CMU School of Computer Science@SCSatCMU·Apr 17

Huge thank you to @NVIDIADC for gifting a brand new #NVIDIADGX B200 to CMU’s Catalyst Research Group! This AI supercomputing system will afford Catalyst the ability to run and test their work on a world-class unified AI platform.

140

81.8K

Hongyi Jin@HongyiJin258·Jan 8

@haozhangml @tqchenml Thank you, Hao! You and the DistServe team did a great job exploring disaggregated LLM serving.

Hao Zhang@haozhangml·Jan 8

Great work that will greatly amplify the power of disaggregated LLM serving

Hongyi Jin@HongyiJin258

🚀 Making cross-engine LLM serving programmable. Introducing LLM Microserving: a new RISC-style approach to designing LLM serving APIs at the sub-request level. Scale LLM serving with programmable cross-engine serving patterns, all in a few lines of Python. blog.mlc.ai/2025/01/07/mic…

Hongyi Jin@HongyiJin258·Jan 7

18.5K

Hongyi Jin retweeted

Bohan Hou@bohanhou1998·May 9

Running LLMs natively on your 🤖 @Android phone, following our release of the iOS app. With MLC-LLM and TVM Unity, we were able to optimize and deploy the model in 1 week! 6~7 toks/sec on Galaxy S23. Demo: mlc.ai/mlc-llm/#andro… Check out the details: mlc.ai/blog/2023/05/0…

23.7K

Hongyi Jin retweeted

Bohan Hou@bohanhou1998·Apr 29

Can LLMs run natively on your iPhone📱? Our answer is yes, and we can do more! We are introducing MLC-LLM, an open framework that brings language models (LLMs) directly into a broad class of platforms (CUDA, Vulkan, Metal) with GPU acceleration! Demo: mlc.ai/mlc-llm/

170

684

318.4K

Hongyi Jin@HongyiJin258·Apr 15

@standot3 @TheShubhanshu @WebGPU If you mean accuracy, the answer is yes. We use a regular group quantization method just like other projects. If you mean latency/efficiency, the answer is no. We scheduled the dequantization carefully to speed it up.

571

stan-dot@standot3·Apr 15

@HongyiJin258 @TheShubhanshu @WebGPU Is the compression causing any performance drop? I heard that for some other project (can't find it now) it was significant.

832
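The reply above mentions "a regular group quantization method" without giving details. As a rough illustration only, here is a minimal pure-Python sketch of asymmetric 4-bit group quantization. The group size of 32, the min/max (asymmetric) scheme, and every function name are assumptions made for this example, not the WebLLM implementation.

```python
import math
from typing import List, Tuple

Group = Tuple[List[int], float, float]  # (codes, scale, zero offset)


def quantize_group(values: List[float], bits: int = 4) -> Group:
    """Map one group of floats to unsigned `bits`-bit codes (assumed scheme)."""
    levels = (1 << bits) - 1                 # 15 distinct steps for 4 bits
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [min(levels, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo


def dequantize_group(group: Group) -> List[float]:
    """Recover approximate floats from the codes of one group."""
    codes, scale, lo = group
    return [c * scale + lo for c in codes]


def quantize(weights: List[float], group_size: int = 32, bits: int = 4) -> List[Group]:
    """Split the weights into fixed-size groups and quantize each separately."""
    return [
        quantize_group(weights[i:i + group_size], bits)
        for i in range(0, len(weights), group_size)
    ]


def dequantize(groups: List[Group]) -> List[float]:
    """Concatenate the dequantized values of all groups."""
    out: List[float] = []
    for group in groups:
        out.extend(dequantize_group(group))
    return out


# Hypothetical weights, used only to exercise the round trip.
weights = [math.sin(i * 0.37) for i in range(128)]
groups = quantize(weights)
restored = dequantize(groups)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Because each group carries its own scale and offset, the rounding error of any weight is bounded by half of that group's scale, which is why accuracy loss depends on the value range within a group rather than across the whole tensor.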

Hongyi Jin@HongyiJin258·Apr 15

Introducing WebLLM, an open-source chatbot that brings language models (LLMs) directly onto web browsers. We can now run instruction fine-tuned LLaMA (Vicuna) models natively on your browser tab via @WebGPU with no server support. Check out our demo at mlc.ai/web-llm.

429

1.8K

799.7K

Hongyi Jin@HongyiJin258·Apr 15

@JefferyTatsuya @WebGPU Try upgrading Chrome to version 113 using the instructions on the website.

5.1K

Jeffery Kaneda　金田達也@JefferyTatsuya·Apr 15

@HongyiJin258 @WebGPU Only works on Mac? I got the following error:

7.2K

Hongyi Jin@HongyiJin258·Apr 15

@TheShubhanshu @WebGPU We are compressing the weights into int4 format, so the weights actually occupy about 4 GB of memory.

6.1K

Shubhanshu Mishra@TheShubhanshu·Apr 15

@HongyiJin258 @WebGPU Nice. Is it fetching the full 14 GB of model weights into the local cache, or is there some compression and quantization going on?

8.1K
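The 14 GB and "about 4 GB" figures in this exchange are consistent with simple arithmetic. The sketch below assumes a model of roughly 7 billion parameters (which is what 14 GB at 16 bits per weight implies), plus one 16-bit scale per group of 32 weights. The parameter count, group size, and scale width are assumptions for illustration, not numbers stated in the thread.

```python
from typing import Optional


def weight_bytes(
    num_params: float,
    bits_per_weight: int,
    group_size: Optional[int] = None,
    scale_bits: int = 16,
) -> float:
    """Estimate weight storage in bytes.

    If `group_size` is given, add one `scale_bits`-wide scale per group
    (an assumed layout for group quantization overhead).
    """
    total_bits = num_params * bits_per_weight
    if group_size is not None:
        total_bits += (num_params / group_size) * scale_bits
    return total_bits / 8


NUM_PARAMS = 7e9  # assumption: a roughly 7B-parameter model

fp16_gb = weight_bytes(NUM_PARAMS, 16) / 1e9                 # 14.0
int4_gb = weight_bytes(NUM_PARAMS, 4, group_size=32) / 1e9   # 3.9375

print(f"fp16: {fp16_gb:.2f} GB, int4 with group scales: {int4_gb:.2f} GB")
```

Under these assumptions the raw 4-bit codes take 3.5 GB and the per-group scales add roughly 0.44 GB, which lands close to the 4 GB quoted in the reply.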
