James Dborin @JDborin

7 posts

Co-founder of Doubleword. We solve hard LLM inference problems.

Joined January 2021
469 Following · 43 Followers
Sid Jayakumar @sidfix
A very silly tweet follows. I was bored, on a plane, without wifi. And taking a break from the other entertainment. And realised I could probably do Foundation Model bingo. A game I invented 2 hours ago. I had to get creative for a few of them…
Sid Jayakumar tweet media
James Dborin @JDborin
@JeffreyUrban_ @MLOpsWorld Just saw this paper pop up: twitter.com/_akhaliq/statu… This is the sort of thing that would power these networks of resource sharing models.
AK @_akhaliq

S-LoRA: Serving Thousands of Concurrent LoRA Adapters
paper page: huggingface.co/papers/2311.03…

The "pretrain-then-finetune" paradigm is commonly adopted in the deployment of large language models. Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning method, is often employed to adapt a base model to a multitude of tasks, resulting in a substantial collection of LoRA adapters derived from one base model. We observe that this paradigm presents significant opportunities for batched inference during serving.

To capitalize on these opportunities, we present S-LoRA, a system designed for the scalable serving of many LoRA adapters. S-LoRA stores all adapters in the main memory and fetches the adapters used by the currently running queries to the GPU memory. To efficiently use the GPU memory and reduce fragmentation, S-LoRA proposes Unified Paging. Unified Paging uses a unified memory pool to manage dynamic adapter weights with different ranks and KV cache tensors with varying sequence lengths. Additionally, S-LoRA employs a novel tensor parallelism strategy and highly optimized custom CUDA kernels for heterogeneous batching of LoRA computation.

Collectively, these features enable S-LoRA to serve thousands of LoRA adapters on a single GPU or across multiple GPUs with a small overhead. Compared to state-of-the-art libraries such as HuggingFace PEFT and vLLM (with naive support of LoRA serving), S-LoRA can improve the throughput by up to 4 times and increase the number of served adapters by several orders of magnitude. As a result, S-LoRA enables scalable serving of many task-specific fine-tuned models and offers the potential for large-scale customized fine-tuning services.
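To make the core serving trick concrete, here is a rough sketch of the idea of holding all adapters in host memory and copying only the ones referenced by the running queries onto the GPU. Names like AdapterCache are invented for illustration; this is not S-LoRA's actual API, and it omits unified paging, tensor parallelism, and the custom CUDA kernels.

```python
import torch

class AdapterCache:
    """Toy sketch: LoRA adapters live in CPU RAM; only the adapters referenced
    by the currently running queries are resident in GPU memory."""

    def __init__(self, adapters_cpu):
        self.cpu = adapters_cpu   # adapter_id -> (A, B) low-rank matrices on CPU
        self.gpu = {}             # adapters currently resident on the GPU

    def fetch(self, adapter_ids, device="cuda"):
        # Copy requested adapters to the GPU (skip ones already there).
        for aid in adapter_ids:
            if aid not in self.gpu:
                A, B = self.cpu[aid]
                self.gpu[aid] = (A.to(device, non_blocking=True),
                                 B.to(device, non_blocking=True))
        # Drop adapters no longer referenced by the batch (crude eviction policy).
        for aid in list(self.gpu):
            if aid not in adapter_ids:
                del self.gpu[aid]
        return {aid: self.gpu[aid] for aid in adapter_ids}

def lora_output(x, base_weight, A, B, scaling=1.0):
    # Per-request output: frozen base projection plus the adapter's low-rank delta.
    return x @ base_weight.T + (x @ A @ B) * scaling
```

The real system additionally pages adapter weights and KV cache blocks out of one unified memory pool to limit fragmentation, which this sketch does not attempt.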

James Dborin @JDborin
@pommedeterre33 I think you are right - are you trying to quantize the kv cache to get longer sequence lengths without OOM?
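For context, a minimal sketch of the KV-cache-quantization idea: store keys/values as int8 with per-token scales (roughly half the bytes of FP16 per cached token, so roughly twice the sequence length fits in the same memory), and dequantize just before the attention matmul. This is an illustrative symmetric per-token scheme, not any particular library's implementation.

```python
import torch

def quantize_kv(kv_fp16):
    # kv_fp16: (seq_len, num_heads, head_dim) keys or values in float16.
    # Symmetric int8 quantization with one scale per (token, head).
    scale = kv_fp16.abs().amax(dim=-1, keepdim=True).clamp(min=1e-6) / 127.0
    kv_int8 = torch.round(kv_fp16 / scale).to(torch.int8)
    return kv_int8, scale          # ~2x smaller than fp16 plus small per-token scales

def dequantize_kv(kv_int8, scale):
    # Recover approximate fp16 values right before the attention matmul.
    return kv_int8.to(torch.float16) * scale
```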
Michaël Benesty @pommedeterre33
I see a bunch of GPTQ implementations (LLaMA, etc.) with q4 quantization. Am I right to think it just quantizes weights, and the KV cache is in FP16? (I have not yet read the GPTQ paper)
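For reference, GPTQ is weight-only quantization, so in a typical q4 setup the weight matrices are stored in 4-bit while activations and the KV cache stay in FP16. A minimal sketch of the storage idea follows (grouped symmetric quantization; GPTQ's actual algorithm additionally minimizes the quantization error layer by layer, which this does not show).

```python
import torch

def quantize_weights_4bit(w_fp16, group_size=128):
    # Weight-only quantization sketch: 4-bit integers plus per-group fp16 scales.
    # Assumes in_features is divisible by group_size.
    out_f, in_f = w_fp16.shape
    w = w_fp16.reshape(out_f, in_f // group_size, group_size)
    scale = w.abs().amax(dim=-1, keepdim=True).clamp(min=1e-6) / 7.0   # int4 range [-8, 7]
    q = torch.round(w / scale).clamp(-8, 7).to(torch.int8)             # packed to 4-bit in real kernels
    return q, scale

def dequantize_weights(q, scale, shape):
    # Dequantized (or fused into the matmul) at inference time; everything else stays fp16.
    return (q.to(torch.float16) * scale).reshape(shape)
```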
James Dborin @JDborin
@pommedeterre33 I think I understand, really cool idea! Loving the larger kernl project as well.
Michaël Benesty @pommedeterre33
@JDborin We really follow the program and execute it several times, following the grid, etc. If we transformed the execution too much, the interpreter could not be used for debugging.
Michaël Benesty @pommedeterre33
🚀@OpenAI's Triton is now in #PyTorch 2.0 for epic perf boosts! It balances low-level GPU control & auto-optimization, helping us crush some benchmarks with custom kernels. But manipulating GPU memory addresses? Sometimes tricky 😅 Our very useful hack: a Triton interpreter
James Dborin @JDborin
@pommedeterre33 So you are splitting the computation into blocks indexed by the program id, doing the PyTorch ops, and then combining them again at the end, using something like torch.cat?
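For illustration, emulating a simple Triton-style kernel in plain PyTorch could look like the sketch below: loop over the grid serially, compute each block with torch ops, and write each block's result into the right slice of the output (rather than concatenating at the end). This is a hypothetical toy, not the kernl interpreter itself.

```python
import torch

BLOCK = 4

def add_kernel_body(pid, x, y, out):
    # What one Triton "program" would do, written with PyTorch equivalents.
    offsets = pid * BLOCK + torch.arange(BLOCK)   # stands in for tl.arange
    mask = offsets < out.numel()                  # bounds check for the last block
    idx = offsets[mask]
    out[idx] = x[idx] + y[idx]                    # load/compute/store as plain indexing

def launch(x, y):
    out = torch.empty_like(x)
    grid = (x.numel() + BLOCK - 1) // BLOCK
    for pid in range(grid):                       # run the same body once per program id
        add_kernel_body(pid, x, y, out)
    return out

x = torch.arange(10, dtype=torch.float32)
print(launch(x, torch.ones(10)))                  # tensor([1., 2., ..., 10.])
```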
Michaël Benesty @pommedeterre33
@JDborin Basically every op is replaced by some PyTorch equivalent and we simulate the address management.
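One way "simulating the address management" could look in such an interpreter (purely hypothetical names, not kernl's actual code): represent a pointer as a flat tensor plus integer element offsets, so masked tl.load / tl.store become a gather and a scatter.

```python
import torch

class FakePtr:
    """A simulated pointer: the underlying flat storage plus element offsets."""
    def __init__(self, flat, offsets):
        self.flat = flat
        self.offsets = offsets

def make_ptr(tensor, offsets):
    # Mirrors `x_ptr + offs` inside a Triton kernel.
    return FakePtr(tensor.reshape(-1), offsets)

def load(ptr, mask):
    # tl.load equivalent: gather from the simulated addresses, zero where masked out.
    safe = torch.where(mask, ptr.offsets, torch.zeros_like(ptr.offsets))
    vals = ptr.flat[safe]
    return torch.where(mask, vals, torch.zeros_like(vals))

def store(ptr, values, mask):
    # tl.store equivalent: scatter into the simulated addresses.
    ptr.flat[ptr.offsets[mask]] = values[mask]
```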
James Dborin @JDborin
@pommedeterre33 Is the idea that Triton ops like tl.arange are replaced with PyTorch equivalents?
Michaël Benesty @pommedeterre33
The idea is simple - add an annotation on top of your kernel, and your code goes to PyTorch instead of the GPU compiler. Use Python debugger, print info, check addresses, and find errors with ease! 🎯
Michaël Benesty tweet media
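A hypothetical sketch of what such an annotation could look like: a decorator that, in debug mode, skips the GPU compiler and calls the kernel body as ordinary Python, with a stand-in `tl` namespace whose ops map to PyTorch equivalents, so pdb and print just work. Names here are invented, not kernl's actual API.

```python
import types
import torch

# Stand-in for triton.language: a few ops mapped to PyTorch/Python equivalents.
tl = types.SimpleNamespace(
    program_id=lambda axis: 0,                  # overwritten per launch below
    arange=lambda lo, hi: torch.arange(lo, hi),
)

def interpret(grid):
    """Debug 'annotation': run the kernel once per program id as plain Python
    instead of handing it to the GPU compiler."""
    def decorator(kernel):
        def launch(*args, **kwargs):
            for pid in range(grid):
                tl.program_id = lambda axis, _pid=pid: _pid
                kernel(*args, **kwargs)
        return launch
    return decorator

BLOCK = 4

@interpret(grid=3)
def scale_kernel(x, out, factor):
    pid = tl.program_id(0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < out.numel()
    out[offs[mask]] = x[offs[mask]] * factor    # breakpoints and print() work here

x, out = torch.arange(10.0), torch.empty(10)
scale_kernel(x, out, 2.0)
print(out)                                      # tensor([0., 2., 4., ..., 18.])
```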