Jaydev

397 posts

@JaydevTonde

Senior Data Scientist at Wolters Kluwer, LLM Inference, Kaggle Competitions Expert, Author at @jarvislabsai

Pune, Maharashtra, India · Joined September 2023
842 Following · 219 Followers
Pinned Tweet
Jaydev
Jaydev@JaydevTonde·
I have been writing a small series on LLM inference with @vllm_project that can be a practical starting point for people trying to understand this space. Along with the explanations, I also ran benchmarks on realistic workloads across different GPUs and datasets to evaluate how these techniques perform in practice.

It covers:
- Major speculative decoding techniques
- Major quantization methods
- Distributed inference: DP / PP / TP
- Expert Parallelism and mixed parallel setups
- Practical optimization techniques like prefix caching, KV cache, and disaggregated prefill/decode

My goal was to explain how these techniques work and where they help, so it is easier to choose the right approach for a given workload. This series is useful not only for people getting into LLM serving, but also for engineers who are already serving LLMs and want to optimize inference, improve throughput, reduce latency, or evaluate the right serving strategy.
7
15
188
11.1K
Jaydev
Jaydev@JaydevTonde·
I’m expanding my focus in LLM inference to explore how different optimization techniques perform across our favorite serving frameworks: vLLM, SGLang, and NVIDIA TensorRT-LLM.
[image attached]
0
0
5
123
Jaydev
Jaydev@JaydevTonde·
@philipkiely Great, I have already read the e-book and am planning to buy the hard copy as well.
0
0
3
521
kmmon
kmmon@zkmmon·
@JaydevTonde @vllm_project While reading, I want to play with it. Do you think a local Mac will do the job here, or should I spin up a GPU machine in the cloud?
1
0
1
160
Jaydev
Jaydev@JaydevTonde·
Links:
- Speculative Decoding in vLLM: Complete Guide to Faster LLM Inference: docs.jarvislabs.ai/blog/speculati…
- The Complete Guide to LLM Quantization with vLLM: Benchmarks & Best Practices: docs.jarvislabs.ai/blog/vllm-quan…
- Scaling LLM Inference: Data, Pipeline & Tensor Parallelism in vLLM: docs.jarvislabs.ai/blog/scaling-l…
- Expert Parallelism and Mixed Parallelism Strategies in vLLM: docs.jarvislabs.ai/blog/expert-pa…
- vLLM Optimization Techniques: 5 Practical Methods to Improve Performance: docs.jarvislabs.ai/blog/vllm-opti…
2
5
24
980
Jaydev
Jaydev@JaydevTonde·
LLM decoding technique: sampling with temperature and top_p, implemented in PyTorch for Tokn (LLM inference server).
[image attached]
1
0
4
170
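A minimal sketch of what temperature plus top-p (nucleus) sampling looks like in PyTorch. The function name and toy logits are illustrative only, not Tokn's actual code:

```python
import torch

def sample_top_p(logits: torch.Tensor, temperature: float = 1.0,
                 top_p: float = 0.9, generator=None) -> int:
    """Sample one token id from `logits` using temperature + nucleus (top-p)."""
    # 1. Temperature: values < 1 sharpen the distribution, > 1 flatten it.
    probs = torch.softmax(logits / temperature, dim=-1)
    # 2. Sort by probability; the "nucleus" is the smallest prefix of tokens
    #    whose cumulative mass reaches top_p.
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Zero out tokens that start after the nucleus; the top token always survives.
    sorted_probs[cumulative - sorted_probs > top_p] = 0.0
    # 3. Renormalize and sample within the nucleus.
    sorted_probs = sorted_probs / sorted_probs.sum()
    choice = torch.multinomial(sorted_probs, num_samples=1, generator=generator)
    return int(sorted_idx[choice])

# With a tight nucleus, only the top token survives, so sampling is greedy:
logits = torch.tensor([2.0, 1.0, 0.1, -5.0])
print(sample_top_p(logits, temperature=1.0, top_p=0.5))  # 0
```

In the example, softmax gives the top token ~0.66 probability, so a top_p of 0.5 keeps only that token and sampling becomes deterministic.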
Jaydev reposted
Alvin Foo
Alvin Foo@alvinfoo·
What determines success in life.
2
25
121
10K
Jaydev
Jaydev@JaydevTonde·
My learning experience with LLMs like GPT-5.4 and Opus-4.6: using LLMs, we can understand things better, but not always faster. If you just go to an LLM and prompt it with "teach me speculative decoding," the model starts with explanations and examples, which are actually good. But if you ask follow-up questions, it starts repeating the same things again and again, and you end up reading 10+ conversations, each the size of a complete blog post. Instead, the best way to learn with AI is to take a blog or paper, start reading, and whenever you don't understand a paragraph or a mathematical equation, paste it into an LLM prompt and ask for an explanation in simple language or a block diagram in Markdown. Then return to the blog or paper. Don't keep iterating prompt by prompt, because LLMs have no natural stopping point for generating and connecting things. To keep learning time-bound, follow this approach rather than endless prompting; otherwise you will end up reading about things unrelated to your main goal.
0
0
2
68
Jaydev
Jaydev@JaydevTonde·
Cool, we were waiting for this to include it in our speculative decoding benchmarks. We have already covered draft models, n-gram, EAGLE, suffix decoding, and MLP speculators. In upcoming experiments, we plan to cover DFlash, PARD, and MTP. Here: docs.jarvislabs.ai/blog/speculati…
0
0
1
105
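As a mental model for the draft-then-verify loop all of these methods share, here is a toy greedy speculative step in Python. The two lambda "models" are made up for illustration and stand in for a real draft/target pair:

```python
def greedy_speculative_step(seq, draft_model, target_model, k):
    """One speculative decoding step: draft proposes k tokens, target verifies.

    Returns the tokens accepted this step (at least 1, at most k + 1).
    """
    # Draft model proposes k tokens autoregressively (cheap).
    proposed, ctx = [], list(seq)
    for _ in range(k):
        t = draft_model(ctx)
        proposed.append(t)
        ctx.append(t)
    # Target verifies: accept the longest prefix matching its own greedy picks.
    accepted, ctx = [], list(seq)
    for t in proposed:
        if target_model(ctx) != t:
            break
        accepted.append(t)
        ctx.append(t)
    # The target always emits one token itself: the correction at the first
    # mismatch, or a bonus token when every draft token was accepted.
    accepted.append(target_model(ctx))
    return accepted

# Toy models: the "true" next token is (last + 1) % 10.
target = lambda s: (s[-1] + 1) % 10
good_draft = lambda s: (s[-1] + 1) % 10   # always agrees with the target
bad_draft = lambda s: 7                   # always guesses 7

print(greedy_speculative_step([1], good_draft, target, k=4))  # [2, 3, 4, 5, 6]
print(greedy_speculative_step([1], bad_draft, target, k=4))   # [2]
```

In a real system the verification is one batched forward pass over all k draft tokens rather than a Python loop, which is where the speedup comes from; accuracy is preserved because every accepted token is one the target itself would have picked.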
Jaydev
Jaydev@JaydevTonde·
Chunked prefill makes two improvements in LLM inference:
1. Decode becomes non-blocking: decode requests get processed in between chunks rather than waiting for the whole prompt to be prefilled.
2. A fixed chunk size, e.g. 512 tokens, makes compiled CUDA graphs reusable instead of recompiling the graph every time the tensor shape changes.
0
0
2
59
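A toy scheduler sketch (hypothetical helper, not vLLM code) showing both effects: prefill proceeds in fixed-size chunks, and already-running decode requests get a token in every step instead of waiting for the full prompt:

```python
def chunked_prefill_schedule(prompt_len, chunk_size, num_decode_reqs):
    """Return the per-step batch composition for one incoming prompt."""
    steps, remaining = [], prompt_len
    while remaining > 0:
        chunk = min(chunk_size, remaining)
        # Each running decode request contributes one token to the same batch,
        # so decode latency is bounded by one chunk, not by the whole prompt.
        steps.append({"prefill_tokens": chunk, "decode_tokens": num_decode_reqs})
        remaining -= chunk
    return steps

# A 2048-token prompt with 512-token chunks and 3 running decode requests:
for step in chunked_prefill_schedule(2048, 512, 3):
    print(step)
# 4 steps, each {'prefill_tokens': 512, 'decode_tokens': 3}; every full step
# also hits the same 512-token shape, so a compiled graph can be reused.
```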
Jaydev
Jaydev@JaydevTonde·
Scenarios where disaggregated prefill/decode is worth using:
1. Large traffic volume: 100 million to 1 billion tokens per day.
2. Larger models: ~100 billion parameters.
3. Prefill-heavy traffic with long input sequences.
Mostly a good fit for frontier LLMs in code editors.
0
0
3
50
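To put the first threshold in perspective, a back-of-the-envelope conversion (assuming traffic is spread evenly over the day):

```python
# The upper end of the traffic range above, as a sustained request rate.
tokens_per_day = 1_000_000_000
seconds_per_day = 24 * 60 * 60            # 86_400
print(round(tokens_per_day / seconds_per_day))  # 11574 tokens/s sustained
```

Real traffic is bursty, so peak load is higher still; that is the regime where separating prefill and decode onto dedicated hardware starts to pay off.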
Jaydev
Jaydev@JaydevTonde·
Series: Rubin, coming this year.
Series: Feynman, coming in 2028.
0
0
1
25
Jaydev
Jaydev@JaydevTonde·
Series: Blackwell
B200: Compute 5 petaFLOPS, Memory 192 GB, Bandwidth 8 TB/s
B300: Compute 5 petaFLOPS, Memory 284 GB, Bandwidth 8 TB/s
1
0
1
36
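Why bandwidth is the spec to watch for decode: each generated token must read every weight from memory once, so memory bandwidth caps single-request decode speed. A rough upper bound using the B200 numbers above and a hypothetical 70B-parameter model at FP8 (1 byte per parameter):

```python
bandwidth_bytes_per_s = 8e12   # B200: 8 TB/s
model_bytes = 70e9             # 70B params at FP8 (1 byte each), illustrative
max_tokens_per_s = bandwidth_bytes_per_s / model_bytes
print(round(max_tokens_per_s))  # 114 tokens/s per request, upper bound
```

This ignores KV cache reads and assumes perfect bandwidth utilization, so real per-request decode speed is lower; batching recovers throughput by amortizing each weight read across many requests.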
Jaydev
Jaydev@JaydevTonde·
@nvidia GPU series launched so far, and the specifications to know for LLM inference and training.
[image attached]
1
0
3
49