Lequn Chen
@abcdabcd987

49 posts

Faster and cheaper LLM inference.

Seattle, WA · Joined January 2012
630 Following · 1.5K Followers
Lequn Chen @abcdabcd987
@tskaerobot @Yuchenj_UW Upload all tax documents. Prompt "prepare my 2025 tax" and your information (like location, single or married, ...). Same as what you would send to a CPA. (If you don't know which docs are needed, just ask it.)
1 reply · 1 repost · 16 likes · 2.3K views
tsk @tskaerobot
@abcdabcd987 @Yuchenj_UW Wow. Can you recommend a tutorial? I paid a CPA $2000 and I think he didn't do a great job…
2 replies · 0 reposts · 0 likes · 2.5K views
Yuchen Jin @Yuchenj_UW
Anthropic killed this, Anthropic killed that, why can't Anthropic kill TurboTax?
179 replies · 135 reposts · 4.9K likes · 306.7K views
Lequn Chen @abcdabcd987
@iamup @AravSrinivas I uploaded all tax documents and also equity contracts. Same as what I sent to my CPA previously.
1 reply · 0 reposts · 0 likes · 29 views
@iamup
@AravSrinivas Does one have to upload full tax documents showing SSN, or just add the income etc. and get the tax return prepared? @abcdabcd987
1 reply · 0 reposts · 0 likes · 36 views
Aravind Srinivas @AravSrinivas
Perplexity Computer is more reliable than a CPA for filing taxes.
Lequn Chen @abcdabcd987
@Yuchenj_UW Perplexity Computer saved me $14k in tax. It found 2 double-taxation errors and 2 form-filling errors in my $2000 CPA's draft, which the CPA fully agreed with. In another thread, I let it compute the tax from scratch. It was correct to the cent.
35 replies · 36 reposts · 691 likes · 107.5K views
Lequn Chen @abcdabcd987
Wrote a blog post on why collective communication feels awkward for newer LLM workloads (disaggregated inference, RL weight update, MoE), why people don’t just use raw RDMA, how we approached it, and some behind-the-scenes stories. le.qun.ch/en/blog/2025/1…
4 replies · 29 reposts · 229 likes · 21.6K views
Lequn Chen @abcdabcd987
We divide the weight transfer process into pipeline stages to enable overlapped execution over different hardware resources (CPU->GPU memcpy, GPU computation, RDMA, Ethernet).
1 reply · 0 reposts · 3 likes · 436 views
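[Editor's note: a minimal sketch of this kind of stage pipeline, my illustration rather than the actual implementation. Each hardware resource gets its own worker thread, and weight shards flow through bounded queues, so shard i can be on the wire while shard i+1 is still being copied to the GPU. The stage functions (h2d_copy, gpu_transform, rdma_send) are hypothetical placeholders.]

import threading
import queue

def run_stage(work, inbox, outbox):
    # One worker per stage; each stage owns one hardware resource,
    # so the three stages execute concurrently on different shards.
    while True:
        item = inbox.get()
        if item is None:            # shutdown signal, forwarded downstream
            if outbox is not None:
                outbox.put(None)
            return
        out = work(item)
        if outbox is not None:
            outbox.put(out)

# Hypothetical stage bodies. Real ones would issue an async CPU->GPU memcpy,
# a GPU transform (e.g., dtype conversion), and an RDMA/Ethernet send.
def h2d_copy(shard):      return shard
def gpu_transform(shard): return shard
def rdma_send(shard):     pass

q1 = queue.Queue(maxsize=2)   # bounded queues provide backpressure
q2 = queue.Queue(maxsize=2)
src = queue.Queue()

threads = [
    threading.Thread(target=run_stage, args=(h2d_copy, src, q1)),
    threading.Thread(target=run_stage, args=(gpu_transform, q1, q2)),
    threading.Thread(target=run_stage, args=(rdma_send, q2, None)),
]
for t in threads:
    t.start()
for shard in range(8):        # 8 dummy weight shards
    src.put(shard)
src.put(None)                 # signal end of stream
for t in threads:
    t.join()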
Lequn Chen @abcdabcd987
We recently achieved 1.3-second cross-machine parameter update for Kimi-K2 (1T parameters), as opposed to a few minutes in popular frameworks.
1 reply · 2 reposts · 5 likes · 775 views
Lequn Chen retweeted
vLLM @vllm_project
How does @deepseek_ai Sparse Attention (DSA) work? It has 2 components: the Lightning Indexer and Sparse Multi-Latent Attention (MLA). The indexer keeps a small key cache of 128 per token (vs. 512 for MLA). It scores incoming queries against these keys and passes the top-2048 tokens to Sparse MLA.
DeepSeek @deepseek_ai
🚀 Introducing DeepSeek-V3.2-Exp — our latest experimental model! ✨ Built on V3.1-Terminus, it debuts DeepSeek Sparse Attention (DSA) for faster, more efficient training & inference on long context. 👉 Now live on App, Web, and API. 💰 API prices cut by 50%+! 1/n
11 replies · 107 reposts · 699 likes · 102.4K views
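[Editor's note: a rough PyTorch sketch of the selection step described in the tweet above, based on my reading of it. The function name, shapes, and dot-product scoring are illustrative assumptions, not DeepSeek's actual kernels: the indexer's compact per-token keys score the incoming query, and only the top-2048 token indices are handed to the sparse attention.]

import torch

def select_tokens(q_idx, indexer_keys, k=2048):
    # q_idx:        [d_idx]           indexer projection of the new query
    # indexer_keys: [seq_len, d_idx]  small per-token key cache (e.g. 128-dim,
    #                                 vs. the 512-dim latent kept for MLA)
    scores = indexer_keys @ q_idx              # [seq_len] relevance scores
    k = min(k, indexer_keys.shape[0])          # short prompts keep everything
    return torch.topk(scores, k).indices       # token ids fed to Sparse MLA

# Toy usage: 8192 cached tokens, 128-dim indexer keys.
keys = torch.randn(8192, 128)
kept = select_tokens(torch.randn(128), keys)
print(kept.shape)   # torch.Size([2048])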
Lequn Chen retweeted
Perplexity @perplexity_ai
Introducing Perplexity Search API
We've built a search index of billions of webpages to provide real-time, quality information from the web. Now developers have access to the full power of our index, providing the most accurate results in milliseconds. perplexity.ai/hub/blog/intro…
98 replies · 244 reposts · 2.2K likes · 635.5K views
Lequn Chen retweeted
Anyscale @anyscalecompute
Just got a sneak peek of the breakout sessions lineup for #RaySummit2025 – and it's 🔥 Sessions from:
🔹 @character_ai on Scaling LLM Post-Training
🔹 The State of @vllm_project in 2025
🔹 @Roblox on Training 3D Foundation Models with Ray
🔹 @xai on Scaling Image + Video Processing
🔹 @zoox on Reliable, Multimodal LLM Serving
🔹 @perplexity_ai on RDMA P2P for KvCache + MoE
Looking forward to learning from the teams actually building these systems. Come join us. Save 25% with code ANYJOIN25 → anyscale.com/ray-summit/202…
1 reply · 3 reposts · 12 likes · 5.5K views
Lequn Chen @abcdabcd987
@LigengZhu Glad that you enjoyed it! To be precise, it's EP64 on the inference side, around 30GB per inference GPU. So it's around 30GB / 1.3s = 23 GB/s.
0 replies · 2 reposts · 7 likes · 737 views
Ligeng Zhu @LigengZhu
Every RL infra researcher should read @abcdabcd987's blog. 1T / 1.3s / 16 nodes = 49GB/s, nearly reaching the peak of the IB bandwidth! For Kimi-K2 (1T params), with 256 GPUs in BF16 training and 128 GPUs in FP8 inference, weight updates take less than 1.3 seconds.
2 replies · 0 reposts · 24 likes · 1.6K views
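[Editor's note: a back-of-envelope check of the numbers in this thread. The arithmetic is mine; the 1 byte/param assumption comes from the FP8 inference weights mentioned above.]

# Kimi-K2: ~1T parameters; at FP8 (1 byte/param) that's roughly 1e12 bytes.
total_bytes = 1e12
seconds = 1.3
nodes = 16                  # 128 inference GPUs / 8 GPUs per node

per_node_gbps = total_bytes / seconds / nodes / 1e9
print(f"{per_node_gbps:.1f} GB/s per node")   # ~48 GB/s (tweet rounds to 49)

# Per-GPU view at EP64: ~30 GB of weights per inference GPU.
per_gpu_gbps = 30e9 / seconds / 1e9
print(f"{per_gpu_gbps:.1f} GB/s per GPU")     # ~23 GB/s, matching the reply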
Lequn Chen @abcdabcd987
@vwxyzjn Haha. Glad that you enjoy it :)
0 replies · 0 reposts · 1 like · 120 views
Lequn Chen @abcdabcd987
1.5 seconds is enough to transfer model weights from training nodes to RL rollout nodes (as opposed to 100s). Here's the full story of how I got there (not just the final solution): le.qun.ch/en/blog/2025/0…
8 replies · 91 reposts · 470 likes · 63.9K views