Luka Ribar

10 posts

@luka_ribar

Research Scientist at Graphcore. Previously PhD & MEng at the University of Cambridge.

London · Joined March 2022

111 Following · 26 Followers

Luka Ribar retweeted

Graphcore Research@GCResearchTeam·2 Jan

We've written an interactive deep dive on Llama 3.2 Vision, alongside a full plain-PyTorch implementation (link in 🧵) Here's an attention head from the vision encoder in action - the implicit segmentation is quite impressive!

GIF

Luka Ribar@luka_ribar·18 Sep

Happy to share our recent work on speeding up long-context LLM generation in llama.cpp! ✨ If you’re interested in inference and implementing efficient attention in C++, check it out here: graphcore-research.github.io/posts/llama-cp…

Luka Ribar@luka_ribar·23 Jul

You can also read the full paper on arXiv: arxiv.org/abs/2312.04985

Luka Ribar@luka_ribar·23 Jul

Excited to present our SparQ Attention paper tomorrow at @icmlconf! If you're not around to chat to us in person, check out the recent blog post graphcore-research.github.io/graphcore-rese… written by Luke, explaining our method for speeding up long-sequence transformer inference!

Luka Ribar@luka_ribar·2 May

Very excited to talk about the work we've been doing on efficient transformer inference at ICLR & ICML!

Graphcore Research@GCResearchTeam

Thrilled to announce that our SparQ paper has been accepted to #ICML2024! ✨🎉 For those who can't wait, we'll also be at the ME-FoMo & PML4LRS workshops next week at @iclr_conf in Vienna. Keen to chat with anyone interested in efficient attention. twitter.com/savelichic/sta…

Luka Ribar retweeted

Graphcore Research@GCResearchTeam·1 May

Our April Papers of the Month is now live 🧐 This month the common thread is efficient LLM inference. Our favourite papers cover speculative-decoding + sparse KV (TriForce), 4-bit quantisation (QuaRot) & dynamic compute allocation (Mixture-of-Depths). 🧵 graphcore-research.github.io/papers-of-the-…

Luka Ribar retweeted

Graphcore Research@GCResearchTeam·5 Apr

Our latest edition of *Papers of the Month* is now available 📚 These are summaries of our team's favourite papers from March, including a new low-rank training procedure GaLore, and the supposed "Era of 1-bit LLMs" (really 1.58 bits) Mini-version in 🧵 graphcore-research.github.io/papers-of-the-…

Luka Ribar retweeted

Graphcore Research@GCResearchTeam·22 Mar

At 2pm today Graphcore researchers @luka_ribar & @savelichic will be presenting at @letsunifyai's popular reading group. We'll be covering our recent SparQ paper - a method for increasing LLM inference throughput by sparsifying attention. Live stream: youtube.com/watch?v=xq_8dg…

Luka Ribar retweeted

Charlie Blake@thecharlieblake·11 Dec

Proud to have played a small part in this paper - a new method for improving tokens/sec of transformer inference. The trick: sparse KV-cache access based on the current query. Uses two-step top-k to find the best KVs, without ever loading the full cache (8x mem reduction). Go try it out 😎

AK@_akhaliq

SparQ Attention: Bandwidth-Efficient LLM Inference paper page: huggingface.co/papers/2312.04… Generative large language models (LLMs) have opened up numerous novel possibilities, but due to their significant computational requirements their ubiquitous use remains challenging. Some of the most useful applications require processing large numbers of samples at a time and using long contexts, both significantly increasing the memory communication load of the models. We introduce SparQ Attention, a technique for increasing the inference throughput of LLMs by reducing the memory bandwidth requirements within the attention blocks through selective fetching of the cached history. Our proposed technique can be applied directly to off-the-shelf LLMs during inference, without requiring any modification to the pre-training setup or additional fine-tuning. We show how SparQ Attention can decrease the attention memory bandwidth requirements up to eight times without any loss in accuracy by evaluating Llama 2 and Pythia models on a wide range of downstream tasks.
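
As a rough illustration of the "two-step top-k" idea described in the tweet and abstract above, here is a minimal pure-Python sketch for a single query and a single attention head. It is not the paper's implementation: the function names, the parameters `r` (number of query components used for the cheap scoring pass) and `k` (number of cache entries fetched in full) and the exact structure are assumptions made for illustration, and any further corrections the published method applies are left out.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top_indices(scores, n):
    """Indices of the n largest entries of scores (ties keep original order)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:n]


def sparse_attention(q, K, V, r, k):
    """Hypothetical two-step top-k attention for one query vector q.

    K and V are lists of cached key/value vectors (one per past token).
    Returns (output_vector, kept_positions).
    """
    d = len(q)

    # Step 1: keep the r largest-magnitude query components and score every
    # cached key using only those components, so only r of the d values of
    # each key need to be read.
    dims = top_indices([abs(x) for x in q], r)
    approx = [sum(q[j] * key[j] for j in dims) for key in K]

    # Step 2: read the full key and value only for the k best positions and
    # run ordinary scaled dot-product attention over that subset.
    keep = top_indices(approx, k)
    scores = [sum(a * b for a, b in zip(q, K[i])) / math.sqrt(d) for i in keep]
    weights = softmax(scores)
    out = [sum(w * V[i][c] for w, i in zip(weights, keep))
           for c in range(len(V[0]))]
    return out, keep
```

With `r` equal to the head dimension and `k` equal to the cache length, this reduces to ordinary dense attention. Smaller values trade accuracy for fewer values read from the cache.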

Luka Ribar retweeted

Ivan Chelombiev@savelichic·11 Dec

1/n While everybody’s been busy packing for #NeurIPS2023, our team at @graphcoreai has been busy with this beauty. Let me introduce: ✨SparQ Attention✨ TL;DR This is a plug-and-play inference Attention block for pre-trained LLMs, which evaporates the KV cache bandwidth 🧵
