Jaydev (@JaydevTonde) - Twitter profile

Pinned Tweet

Jaydev@JaydevTonde·5d

I have been writing a small series on LLM inference with @vllm_project that can be a practical starting point for people trying to understand this space. Along with the explanations, I also ran benchmarks on realistic workloads across different GPUs and datasets to evaluate how these techniques perform in practice.

It covers:
- Major speculative decoding techniques
- Major quantization methods
- Distributed inference: DP / PP / TP
- Expert Parallelism and mixed parallel setups
- Practical optimization techniques like prefix caching, KV cache, and disaggregated prefill/decode

My goal was to explain how these techniques work and where they help, so it is easier to choose the right approach for a given workload.

This series is useful not only for people getting into LLM serving, but also for engineers who are already serving LLMs and want to optimize inference, improve throughput, reduce latency, or evaluate the right serving strategy.

English

7

15

188

11.1K

Jaydev@JaydevTonde·17h

I’m expanding my focus in LLM inference to explore how different optimization techniques perform across our favorite serving frameworks: vLLM, SGLang, and NVIDIA TensorRT-LLM.

English

0

5

109

Jaydev@JaydevTonde·3d

@philipkiely Great, I have already read the e-book and am planning to buy the hard copy as well.

English

0

3

521

Jaydev@JaydevTonde·5d

@armaanamatya @vllm_project Enjoy reading.

English

0

2

135

armaanamatya@armaanamatya·5d

@JaydevTonde @vllm_project Will read!

English

1

0

1

165

Jaydev@JaydevTonde·5d

@zkmmon @vllm_project I recommend using a GPU machine rather than a Mac. You can get the cheapest GPUs at @jarvislabsai with the required env setup.

English

0

1

130

kmmon@zkmmon·5d

@JaydevTonde @vllm_project While reading, I want to play with it. Do you think a local Mac will do the job here, or should I spin up a GPU machine in the cloud?

English

1

0

1

160

Jaydev@JaydevTonde·5d

Also, I learned a lot from the blogs of @gordic_aleksa, @rasbt, and @vllm_project, the inference books from @elliotarledge and @philipkiely, and the @RedHat_AI office hours.

English

0

7

600

Jaydev@JaydevTonde·5d

Links:
- Speculative Decoding in vLLM: Complete Guide to Faster LLM Inference: docs.jarvislabs.ai/blog/speculati…
- The Complete Guide to LLM Quantization with vLLM: Benchmarks & Best Practices: docs.jarvislabs.ai/blog/vllm-quan…
- Scaling LLM Inference: Data, Pipeline & Tensor Parallelism in vLLM: docs.jarvislabs.ai/blog/scaling-l…
- Expert Parallelism and Mixed Parallelism Strategies in vLLM: docs.jarvislabs.ai/blog/expert-pa…
- vLLM Optimization Techniques: 5 Practical Methods to Improve Performance: docs.jarvislabs.ai/blog/vllm-opti…

English

2

5

24

979

Jaydev@JaydevTonde·6d

github.com/jaytonde/Tokn

ZXX

0

1

60

Jaydev@JaydevTonde·6d

LLM decoding technique: sampling with temperature and top_p, implemented in PyTorch for Tokn (LLM inference server).
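
To make the technique concrete, here is a minimal sketch of temperature plus top_p (nucleus) sampling. It is written in plain Python with only the standard library, not PyTorch, and it is not the Tokn implementation. The function name `sample_top_p` and the choice to treat a non-positive temperature as greedy decoding are assumptions made for illustration.

```python
import math
import random


def sample_top_p(logits, temperature=1.0, top_p=0.9, rng=random):
    """Sample a token id from raw logits using temperature and top-p filtering."""
    if temperature <= 0:
        # Assumption of this sketch: non-positive temperature means greedy decoding.
        return max(range(len(logits)), key=lambda i: logits[i])

    # Temperature scaling followed by a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the smallest set of most likely tokens whose cumulative
    # probability reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # random.choices normalizes the weights of the kept tokens before sampling.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


if __name__ == "__main__":
    logits = [2.0, 1.0, 0.1, -1.0]
    print(sample_top_p(logits, temperature=0.8, top_p=0.9))
```

A lower temperature sharpens the distribution before filtering, so fewer tokens survive the top_p cut; a higher temperature flattens it and lets more candidates through.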

English

1

0

4

170

Jaydev retweeted

Alvin Foo@alvinfoo·Apr 10

What determines success in life.

English

2

25

121

10K

Jaydev@JaydevTonde·Apr 11

My learning experience with LLMs like GPT-5.4 and Opus-4.6: using LLMs, we can understand things better, but not always faster. If you just go to an LLM and prompt it with “teach me Speculative Decoding,” the model starts with explanations and examples, which are actually good. But if you ask follow-up questions, it starts repeating the same things again and again. In the end, you have read 10+ conversations, each the size of a complete blog post.

Instead, the best way to learn with AI is to take a blog or paper and start reading it. If you don’t understand a paragraph or a mathematical equation, put it into an LLM prompt and ask for an explanation in simple language, or for a block diagram in Markdown. After this, return to your blog or paper.

Don’t keep iterating prompt by prompt, because LLMs have no natural stopping point for generating and connecting ideas. To keep learning time-bound, follow this approach; otherwise we end up reading about things unrelated to our main goal.

English

0

2

68

Jaydev@JaydevTonde·Apr 11

Cool, we were waiting for this so we could include it in our speculative decoding benchmarks. We have already covered draft models, n-gram, EAGLE, suffix decoding, and MLP speculators. In our upcoming experiments, we plan to cover DFlash, PARD, and MTP. Here: docs.jarvislabs.ai/blog/speculati…

English

0

1

103

Zhijian Liu@zhijianliu_·Apr 8

DFlash just landed in both SGLang and vLLM! 🚀 More draft models dropping soon: GLM-5.1, Kimi-K2.5 (preview live now!), Qwen3.5-397B & 122B. Try it now ↓ SGLang: github.com/sgl-project/sg… (🙏 @_dcw02) vLLM: github.com/vllm-project/v… (🙏 @BenjaminCh44989)

Zhijian Liu@zhijianliu_

Holiday cooking finally ready to serve! 🥳 Introducing DFlash — speculative decoding with block diffusion. 🚀 6.2× lossless speedup on Qwen3-8B ⚡ 2.5× faster than EAGLE-3 Diffusion vs AR doesn’t have to be a fight. At today’s stage: • dLLMs = fast, highly parallel, but lossy • AR LLMs = accurate, sequential, but slow DFlash = diffusion drafts, AR verifies.
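
The "drafts, then verifies" pattern can be illustrated with a generic greedy draft-and-verify loop. This is not DFlash: the drafter below is a toy rule rather than a diffusion model, and `target_next`, `draft_next`, and `speculative_decode` are hypothetical names invented for this sketch. It only shows why verification keeps the output identical to plain autoregressive decoding.

```python
def target_next(tokens):
    # Hypothetical "large" model: a deterministic toy rule standing in
    # for greedy autoregressive decoding.
    return (tokens[-1] * 3 + len(tokens)) % 11


def draft_next(tokens):
    # Hypothetical cheap drafter: agrees with the target except when the
    # context length is a multiple of 4.
    guess = target_next(tokens)
    return guess if len(tokens) % 4 else (guess + 1) % 11


def speculative_decode(prompt, new_tokens, k=4):
    tokens = list(prompt)
    goal = len(prompt) + new_tokens
    while len(tokens) < goal:
        # 1) Draft k tokens cheaply.
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # 2) Verify each drafted position against the target
        #    (a real system does this in one batched forward pass).
        accepted = []
        for proposed in draft:
            expected = target_next(tokens + accepted)
            if proposed != expected:
                accepted.append(expected)  # fix the first wrong token, then stop
                break
            accepted.append(proposed)
        else:
            # Every draft token matched, so the target adds one bonus token.
            accepted.append(target_next(tokens + accepted))
        tokens.extend(accepted)
    return tokens[:goal]


def greedy_decode(prompt, new_tokens):
    tokens = list(prompt)
    for _ in range(new_tokens):
        tokens.append(target_next(tokens))
    return tokens


if __name__ == "__main__":
    print(speculative_decode([1, 2, 3], 20) == greedy_decode([1, 2, 3], 20))
```

Every accepted token is either confirmed or produced by the target rule, which is why the result matches plain greedy decoding regardless of how good the drafter is; drafter quality only changes how many tokens are accepted per round.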

English

14

46

415

53.1K

Jaydev@JaydevTonde·Apr 10

Started developing my own LLM Inference server from scratch. Tokn : github.com/jaytonde/Tokn

English

0

3

60

Jaydev@JaydevTonde·Apr 9

Chunked prefill brings two improvements to LLM inference:
1. Decode becomes non-blocking: decode requests are processed between chunks instead of waiting for the whole prompt to be prefilled.
2. A fixed chunk size, such as 512 tokens, makes compiled CUDA graphs reusable instead of recompiling the graph every time the tensor shape changes.
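
A small scheduling sketch can show both effects. It assumes a per-step token budget shared between running decode requests and one prefill chunk; `plan_steps` and `token_budget` are hypothetical names for illustration, not any framework's API.

```python
def plan_steps(prompt_len, token_budget, num_decoding):
    """Split one long prompt into per-step prefill chunks that share each
    step with the decode tokens of already-running requests."""
    if token_budget <= num_decoding:
        raise ValueError("token_budget must leave room for at least one prefill token")
    steps = []
    remaining = prompt_len
    while remaining > 0:
        # Decode requests get one token each; the rest of the budget goes to prefill.
        chunk = min(token_budget - num_decoding, remaining)
        steps.append({"decode_tokens": num_decoding, "prefill_tokens": chunk})
        remaining -= chunk
    return steps


if __name__ == "__main__":
    for i, step in enumerate(plan_steps(prompt_len=2000, token_budget=512, num_decoding=12)):
        total = step["decode_tokens"] + step["prefill_tokens"]
        print(i, step, "total tokens:", total)
```

With a 2,000-token prompt, a budget of 512, and 12 running decode requests, this yields four steps of 512 tokens each. The decode requests advance in every step instead of stalling behind a single 2,000-token prefill, and the step size stays constant, which is the property that makes a captured graph reusable.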

English

0

2

59

Jaydev@JaydevTonde·Apr 5

Scenarios for using disaggregated prefill/decode:
1. High-volume traffic: 100 million to 1 billion tokens per day.
2. Larger models: ~100 billion parameters.
3. Prefill-heavy traffic with long input sequences.

Mostly a good fit for frontier LLMs in code editors.
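
To put the traffic range in perspective, the daily token volumes can be converted to a sustained per-second rate. This is simple arithmetic on the numbers above and assumes traffic is spread evenly across the day, which real workloads rarely are.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def tokens_per_second(tokens_per_day):
    # Average rate, assuming perfectly even traffic over 24 hours.
    return tokens_per_day / SECONDS_PER_DAY


low = tokens_per_second(100_000_000)
high = tokens_per_second(1_000_000_000)
print(f"{low:,.0f} to {high:,.0f} tokens/s sustained")
# prints "1,157 to 11,574 tokens/s sustained"
```

Peak load is typically well above the daily average, so these figures are a lower bound on what the serving setup has to absorb.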

English

0

3

50

Jaydev@JaydevTonde·Apr 5

Series: Rubin
Coming this year

Series: Feynman
Coming in 2028

English

0

1

25

Jaydev@JaydevTonde·Apr 5

Series: Blackwell

Name: B200
Compute: 5 petaFLOPS
Memory: 192 GB
Bandwidth: 8 TB/s

Name: B300
Compute: 5 petaFLOPS
Memory: 288 GB
Bandwidth: 8 TB/s
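
One reason memory bandwidth matters for inference: at batch size 1, each decoded token requires reading roughly all model weights once, so bandwidth divided by weight size gives a rough upper bound on decode speed. The sketch below applies that rule of thumb to the B200 figures listed above (8 TB/s, 192 GB). The 70B-parameter model and the bytes-per-parameter values are hypothetical inputs chosen for illustration, and the estimate ignores KV cache traffic and compute limits.

```python
def max_decode_tokens_per_second(bandwidth_tb_s, params_billion, bytes_per_param):
    # Rough bound: every decoded token streams all weights from memory once.
    weight_bytes = params_billion * 1e9 * bytes_per_param
    return bandwidth_tb_s * 1e12 / weight_bytes


def weights_fit(memory_gb, params_billion, bytes_per_param):
    # Weights only; KV cache and activations need additional memory.
    return params_billion * bytes_per_param <= memory_gb


# Hypothetical 70B-parameter model on the B200 figures above.
fp8_rate = max_decode_tokens_per_second(8, 70, 1)   # about 114 tokens/s
bf16_rate = max_decode_tokens_per_second(8, 70, 2)  # about 57 tokens/s
print(round(fp8_rate), round(bf16_rate), weights_fit(192, 70, 2))
```

Batching raises total throughput well beyond this single-sequence bound, because one pass over the weights then serves many requests at once.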

English

1

0

1

36

Jaydev@JaydevTonde·Apr 5

@nvidia GPU series launched so far, with the specifications worth knowing for LLM inference and training.

English

1

0

3

49
