
Lequn Chen
@abcdabcd987
Faster and cheaper LLM inference.

@Yuchenj_UW Perplexity Computer saved me $14k in tax. It found 2 double-taxation errors and 2 form-filling errors in my $2,000 CPA's draft, all of which the CPA fully agreed with. In another thread, I let it compute the tax from scratch. It was correct to the cent.

Perplexity is the first to develop custom Mixture-of-Experts (MoE) kernels that make trillion-parameter models available with portability across cloud platforms. Our team has published this work on arXiv as Perplexity's first research paper. Read more: research.perplexity.ai/articles/enabl…
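For context, here is a minimal PyTorch sketch of the token-routing step that MoE kernels accelerate. It only illustrates the reference semantics (top-k gating, per-expert dispatch, weighted combine); Perplexity's actual kernels fuse and optimize these steps, and none of the names below come from the paper:

```python
# Reference semantics of MoE token routing (illustrative only; real
# kernels fuse these steps and avoid the Python-level expert loop).
import torch

def moe_forward(x, gate_w, experts, top_k=2):
    """x: [tokens, hidden]; gate_w: [hidden, n_experts];
    experts: list of callables, one feed-forward network per expert."""
    # 1. Gate: score every token against every expert, keep the top-k.
    probs = (x @ gate_w).softmax(dim=-1)             # [tokens, n_experts]
    weights, idx = torch.topk(probs, top_k, dim=-1)  # [tokens, top_k]
    weights = weights / weights.sum(dim=-1, keepdim=True)
    # 2. Dispatch: batch all tokens routed to the same expert together.
    out = torch.zeros_like(x)
    for e, expert in enumerate(experts):
        token_ids, slot = (idx == e).nonzero(as_tuple=True)
        if token_ids.numel() == 0:
            continue
        # 3. Combine: scale each expert's output by its gate weight.
        out[token_ids] += weights[token_ids, slot].unsqueeze(-1) * expert(x[token_ids])
    return out
```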

🚀 Introducing DeepSeek-V3.2-Exp — our latest experimental model! ✨ Built on V3.1-Terminus, it debuts DeepSeek Sparse Attention (DSA) for faster, more efficient training & inference on long contexts. 👉 Now live on App, Web, and API. 💰 API prices cut by 50%+! 1/n
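DSA's actual mechanism is specified in DeepSeek's report; as a generic illustration of the sparse-attention idea (each query attends only to a small, dynamically selected subset of keys), a toy top-k variant could look like this sketch, which is not DeepSeek's implementation:

```python
# Toy top-k sparse attention for a single query position.
# Illustrates "attend to a selected subset of keys" only; a real kernel
# selects indices first and never materializes scores for skipped keys.
import math
import torch

def topk_sparse_attention(q, k, v, keep=256):
    """q: [heads, d]; k, v: [heads, ctx, d]."""
    scores = torch.einsum("hd,hcd->hc", q, k) / math.sqrt(q.shape[-1])
    keep = min(keep, scores.shape[-1])
    cutoff = scores.topk(keep, dim=-1).values[:, -1:]  # per-head threshold
    scores = scores.masked_fill(scores < cutoff, float("-inf"))
    return torch.einsum("hc,hcd->hd", scores.softmax(dim=-1), v)
```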

1.5 seconds is long enough to transfer model weights from training nodes to RL rollout nodes (down from about 100 seconds). Here's the full story of how I got there (not just the final solution): le.qun.ch/en/blog/2025/0…
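The blog post walks through the whole journey; the underlying primitive is roughly a collective broadcast of the updated weights from a trainer rank to the rollout ranks. A minimal sketch with torch.distributed, assuming a NCCL process group spanning both sets of nodes (the post's actual pipeline is more sophisticated):

```python
# Minimal weight-sync sketch: rank 0 (trainer) pushes fresh weights to
# all rollout ranks. Assumes dist.init_process_group() has been called
# on every node and model tensors live on the local GPU.
import torch.distributed as dist

def sync_weights(model, src_rank=0, group=None):
    # Every rank calls this; broadcast overwrites non-source tensors
    # in place with the source rank's values.
    for tensor in model.state_dict().values():
        dist.broadcast(tensor, src=src_rank, group=group)
```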
