Pinned Tweet
Sagnik


@vllm_project engineers are crazy. Most of the components are art.
I was going through the GPU memory part, like how they decide it: a 12 GB VRAM GPU at 95% utilization -> 11.4 GB. Within that budget, how do the whole model weights, all the KV cache, CUDA graphs, and other runtime memory load and run?
Their technique (rough code sketch below):
> load the model weights
> run a dummy forward pass (fake inputs)
> also track other memory reqs (kernels)
> now we have --> weights + others (CUDA/custom)
> the rest goes to the KV cache
> same for all models (text/mm)
> so proper GPU utilization
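here's the idea reconstructed in CUDA C++. this is my own sketch, not vLLM's actual code (vLLM does this in Python on top of PyTorch); load_model_weights and dummy_forward_pass are hypothetical placeholders:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    size_t free_b = 0, total_b = 0;
    cudaMemGetInfo(&free_b, &total_b);
    const double util = 0.95;                    // the gpu-memory-utilization knob
    size_t budget = (size_t)(total_b * util);    // 12 GB card -> ~11.4 GB budget

    size_t free_before = 0;
    cudaMemGetInfo(&free_before, &total_b);
    // load_model_weights();   // hypothetical: weight allocations happen here
    // dummy_forward_pass();   // hypothetical: peak activations + kernel workspaces
    cudaDeviceSynchronize();
    size_t free_after = 0;
    cudaMemGetInfo(&free_after, &total_b);

    // whatever the weights + runtime consumed is "non-KV"; the rest of the
    // budget becomes the KV cache, same recipe for text and multimodal models
    size_t non_kv = free_before - free_after;
    size_t kv_budget = budget > non_kv ? budget - non_kv : 0;
    printf("KV cache gets %zu bytes\n", kv_budget);
    return 0;
}
```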

@jbhuang0604 What an algo! I was just watching your FlashAttention video 😅

I'm building a GPT-2 inference engine from scratch in CUDA.
the project focuses on implementing and optimizing transformer inference kernels directly at the CUDA level, with an emphasis on reduction strategies, memory behavior, numerical stability, and kernel-level performance optimization.
by the end, this engine will take a prompt, tokenize it, run a full GPT-2 forward pass using custom CUDA kernels, and autoregressively generate text, with a KV cache for fast decoding. every operation from embedding lookup to sampling runs through kernels written and profiled from scratch.
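the KV cache part in one picture: each decode step appends the new token's K/V row, and attention reads the cache instead of recomputing past keys/values, so step t costs O(t*d) instead of O(t^2*d). a minimal sketch, assuming a per-head cache laid out [seq_len, head_dim] (names are mine, not the repo's):

```cuda
// append step t's key/value vectors into row t of the cache
__global__ void kv_cache_append(const float* k_new, const float* v_new,
                                float* k_cache, float* v_cache,
                                int t, int head_dim) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < head_dim) {
        k_cache[t * head_dim + i] = k_new[i];
        v_cache[t * head_dim + i] = v_new[i];
    }
}
```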
things planned for the repo:
- attention kernels
- causal masking
- KV cache, and a lot more.
things i've implemented so far (rough sketches of each right after the list):
- tiled matmul with shared memory
- online softmax
- 3-pass layernorm: benchmarked against Welford across all GPT-2 model sizes; using the 3-pass version as the primary LN kernel
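a standard shared-memory tiled matmul as a reference sketch (assumes square N x N row-major matrices with N divisible by TILE; the repo's kernel may differ in details):

```cuda
#define TILE 16

// C = A * B; each block computes one TILE x TILE tile of C, staging tiles of
// A and B through shared memory so each global element is loaded once per tile
__global__ void matmul_tiled(const float* A, const float* B, float* C, int N) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;
    for (int k0 = 0; k0 < N; k0 += TILE) {
        As[threadIdx.y][threadIdx.x] = A[row * N + (k0 + threadIdx.x)];
        Bs[threadIdx.y][threadIdx.x] = B[(k0 + threadIdx.y) * N + col];
        __syncthreads();
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * N + col] = acc;
}
```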
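the online softmax recurrence, written serially per row so the math is visible (launch with <<<1,1>>>; a real kernel parallelizes this as a reduction):

```cuda
#include <math.h>

// single pass keeps a running max m and running denominator d, rescaling d
// by exp(m_old - m_new) whenever the max grows (Milakov & Gimelshein, 2018)
__global__ void softmax_online(const float* x, float* y, int n) {
    float m = -INFINITY, d = 0.0f;
    for (int i = 0; i < n; ++i) {
        float m_new = fmaxf(m, x[i]);
        d = d * expf(m - m_new) + expf(x[i] - m_new);
        m = m_new;
    }
    for (int i = 0; i < n; ++i)
        y[i] = expf(x[i] - m) / d;
}
```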
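and the 3-pass LN structure, one thread per row for clarity (Welford fuses passes 1 and 2 into a single streaming pass, which is the trade-off the benchmark measured):

```cuda
// pass 1: mean, pass 2: variance, pass 3: normalize + affine
__global__ void layernorm_3pass(const float* x, float* y,
                                const float* gamma, const float* beta,
                                int rows, int cols, float eps) {
    int r = blockIdx.x * blockDim.x + threadIdx.x;
    if (r >= rows) return;
    const float* xr = x + r * cols;
    float mean = 0.0f;
    for (int c = 0; c < cols; ++c) mean += xr[c];
    mean /= cols;
    float var = 0.0f;
    for (int c = 0; c < cols; ++c) {
        float d = xr[c] - mean;
        var += d * d;
    }
    var /= cols;
    float inv_std = rsqrtf(var + eps);
    for (int c = 0; c < cols; ++c)
        y[r * cols + c] = (xr[c] - mean) * inv_std * gamma[c] + beta[c];
}
```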
benchmarks and profiling done on RTX 3050 laptop GPU.
repo:
github.com/Mog9/gpt2-infe…