Luka Ribar

10 posts

@luka_ribar

Research Scientist at Graphcore. Previously PhD & MEng at the University of Cambridge.

London, joined March 2022
111 Following, 26 Followers
Luka Ribar retweeted
Graphcore Research (@GCResearchTeam)
We've written an interactive deep dive on Llama 3.2 Vision, alongside a full plain-PyTorch implementation (link in 🧵). Here's an attention head from the vision encoder in action - the implicit segmentation is quite impressive!
GIF
Luka Ribar (@luka_ribar)
Happy to share our recent work on speeding up long-context LLM generation in llama.cpp! ✨ If you’re interested in inference and implementing efficient attention in C++, check it out here: graphcore-research.github.io/posts/llama-cp…
Luka Ribar retweeted
Graphcore Research (@GCResearchTeam)
Our April Papers of the Month is now live 🧐 This month the common thread is efficient LLM inference. Our favourite papers cover speculative-decoding + sparse KV (TriForce), 4-bit quantisation (QuaRot) & dynamic compute allocation (Mixture-of-Depths). 🧵 graphcore-research.github.io/papers-of-the-…
Luka Ribar retweeted
Graphcore Research (@GCResearchTeam)
Our latest edition of *Papers of the Month* is now available 📚 These are summaries of our team's favourite papers from March, including a new low-rank training procedure, GaLore, and the supposed "Era of 1-bit LLMs" (really 1.58 bits). Mini-version in 🧵 graphcore-research.github.io/papers-of-the-…
Luka Ribar retweeted
Charlie Blake (@thecharlieblake)
Proud to have played a small part in this paper - a new method for improving tokens/sec of transformer inference. The trick: sparse KV-cache access based on the current query. It uses a two-step top-k to find the best KVs, without ever loading the full cache (8x mem reduction). Go try it out😎
AK (@_akhaliq)

SparQ Attention: Bandwidth-Efficient LLM Inference
paper page: huggingface.co/papers/2312.04…

Generative large language models (LLMs) have opened up numerous novel possibilities, but due to their significant computational requirements their ubiquitous use remains challenging. Some of the most useful applications require processing large numbers of samples at a time and using long contexts, both significantly increasing the memory communication load of the models. We introduce SparQ Attention, a technique for increasing the inference throughput of LLMs by reducing the memory bandwidth requirements within the attention blocks through selective fetching of the cached history. Our proposed technique can be applied directly to off-the-shelf LLMs during inference, without requiring any modification to the pre-training setup or additional fine-tuning. We show how SparQ Attention can decrease the attention memory bandwidth requirements up to eight times without any loss in accuracy by evaluating Llama 2 and Pythia models on a wide range of downstream tasks.

Luka Ribar retweeted
Ivan Chelombiev (@savelichic)
1/n While everybody’s been busy packing for #NeurIPS2023, our team at @graphcoreai has been busy with this beauty. Let me introduce: ✨SparQ Attention✨ TL;DR This is a plug-and-play inference Attention block for pre-trained LLMs, which evaporates the KV cache bandwidth 🧵