Aleksandar Botev
@botev_mg
26 posts

Research scientist at Google DeepMind.

Joined February 2024
12 Following · 226 Followers
Pinned Tweet
Aleksandar Botev @botev_mg:
We present Griffin: A hybrid model mixing a gated linear recurrence with local attention. This combination is extremely effective: it preserves all the efficiency benefits of linear RNNs and the expressiveness of transformers. Scaled up to 14B! arxiv.org/abs/2402.19427
2 replies · 36 reposts · 147 likes · 46.4K views
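To make the core idea concrete, here is a minimal sketch of a gated linear recurrence of the kind Griffin interleaves with local attention. This is an illustration, not the paper's actual RG-LRU block: the sqrt(1 - a²) input scaling follows the paper's description, but the shapes, gating parameterization, and initialization below are assumptions.

```python
import jax
import jax.numpy as jnp

def gated_linear_recurrence(x, a, b):
    """Runs h_t = a_t * h_{t-1} + b_t * x_t along the time axis.

    x, a, b have shape (seq_len, dim). The gate a_t lies in (0, 1), so the
    fixed-size state decays old information instead of growing a cache.
    """
    def step(h_prev, inputs):
        a_t, b_t, x_t = inputs
        h_t = a_t * h_prev + b_t * x_t
        return h_t, h_t  # carry the new state and also emit it

    h0 = jnp.zeros(x.shape[-1])
    _, hs = jax.lax.scan(step, h0, (a, b, x))
    return hs

# Toy usage: gates near 1 remember, gates near 0 forget.
key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (16, 4))
a = jax.nn.sigmoid(jax.random.normal(key, (16, 4)))
b = jnp.sqrt(1.0 - a**2)  # input scaling as described in the paper
print(gated_linear_recurrence(x, a, b).shape)  # (16, 4)
```

Because the per-step update is elementwise and the state has fixed size, the cost per generated token is constant in sequence length, which is where the efficiency claims come from.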
Aleksandar Botev @botev_mg:
If anyone is interested in working in an exciting team at the frontier of LLM research in London, please reach out to me or Sam.
Samuel L Smith @SamuelMLSmith:

The Training team @OpenAI is hiring researchers in London 🚀 Our twin missions are to train better LLMs and serve them more cheaply. Get in touch if you are excited to collaborate on architecture design, reliable scaling, and faster optimization.

3 replies · 2 reposts · 15 likes · 2.6K views
Aleksandar Botev reposted
Sophia @sopharicks:
It was a pleasure to host the talk with @botev_mg about the Griffin architecture (an alternative to the Transformer) and recall our internship days at OpenAI. Griffin handles long sequences well and is more efficient during inference. In some use cases, it can replace Transformers. Curious if the industry will adopt the hybrid model (Transformers + alternatives) over the years. Watch the lecture about Griffin on our YouTube channel: youtu.be/0Yi3yUjB-3M?si… #TechTalk #techtalks #MachineLearning #ArtificialInteligence #largelanguagemodels #LLMs
0 replies · 1 repost · 2 likes · 367 views
Aleksandar Botev reposted
Sophia @sopharicks:
Excited about the upcoming talks I'm hosting in the next couple of weeks. With @botev_mg, we'll be exploring Griffin, a novel architecture and an alternative to Transformers. And @aahmadian_ from @cohere @CohereForAI will talk about a new optimization method for RLHF. Details and registration are in the BuzzRobot newsletter: buzzrobot.substack.com/p/google-deepm…
0 replies · 1 repost · 4 likes · 346 views
Aleksandar Botev @botev_mg:
Our 9B Griffin model is finally open-sourced. Similar performance to the base Gemma model, but much faster! Throughput is through the roof 🤯 Available on Kaggle, Hugging Face and GitHub!
Samuel L Smith @SamuelMLSmith:

RecurrentGemma-9B is out! kaggle.com/models/google/… huggingface.co/google/recurre…
- Uses Griffin architecture, combining linear recurrence with local attention
- Downstream evals comparable to Mistral and Gemma
- Faster inference, especially for long sequences or large batch sizes
1/n

0 replies · 0 reposts · 6 likes · 264 views
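As a quick way to try the release, here is a minimal sketch using the standard Hugging Face `transformers` generation API. It assumes a recent `transformers` version with RecurrentGemma support; the model id is taken from the release announcement, and the prompt is arbitrary.

```python
# pip install transformers accelerate
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/recurrentgemma-9b"  # base checkpoint from this release
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("The Griffin architecture mixes", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```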
Aleksandar Botev @botev_mg:
@JagersbergKnut @burkov I think this depends a lot on whether you are looking at latency or throughput: RG uses a lot less memory, and hence can fit a larger batch size, which shows up only in throughput.
0 replies · 0 reposts · 1 like · 8 views
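A back-of-envelope illustration of the memory argument: a transformer's KV cache grows with sequence length, while a recurrent state does not. All sizes below are made up for illustration and are not the actual Gemma/RecurrentGemma configs.

```python
# Per-sequence inference memory at 16-bit precision (2 bytes per value).
bytes_per_val = 2
n_layers, n_kv_heads, head_dim = 32, 16, 128  # made-up transformer config
seq_len = 8192

# Transformer: the K and V caches grow linearly with sequence length.
kv_cache = n_layers * 2 * seq_len * n_kv_heads * head_dim * bytes_per_val

# Recurrent block: one fixed-size state per layer, independent of seq_len.
state_dim = 4096  # made-up recurrent state width
rnn_state = n_layers * state_dim * bytes_per_val

print(f"KV cache:  {kv_cache / 2**20:9.2f} MiB")   # 2048.00 MiB at 8k tokens
print(f"RNN state: {rnn_state / 2**20:9.2f} MiB")  # 0.25 MiB at any length
```

Griffin's local-attention layers do keep a small cache, but it is bounded by the attention window rather than the full sequence, so the per-sequence footprint stays roughly constant, freeing memory for larger batches.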
Knut Jägersberg @JagersbergKnut:
@botev_mg @burkov Yeah, this is a sickness. Also, I see inference is not really that much faster, except in some scenarios.
1 reply · 0 reposts · 1 like · 19 views
Aleksandar Botev @botev_mg:
@JagersbergKnut @burkov Actually, both models have roughly 7B non-embedding parameters and around 1.5B embedding parameters, totalling 8.58B each. The only discrepancy, with respect to parameters, is in the naming of the two models.
1 reply · 0 reposts · 1 like · 21 views
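For a sense of where embedding counts of this size come from: the embedding table is vocab_size × d_model. The numbers below are assumptions in the Gemma-family ballpark, not the released configs.

```python
vocab_size = 256_000  # assumed, Gemma-family ballpark
d_model = 3_072       # assumed model width

per_matrix = vocab_size * d_model  # one embedding table: ~0.79B parameters
total = 2 * per_matrix             # input + output embeddings counted separately
print(f"{total / 1e9:.2f}B embedding parameters")  # ~1.57B
```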
Knut Jägersberg @JagersbergKnut:
@burkov Looking at Gemma's numbers, I'd say it looks like a rough match, though not quite, since RecurrentGemma needs to use more parameters. However, inference seems to be way, way faster.
1 reply · 0 reposts · 1 like · 50 views
Aleksandar Botev reposted
Jeethu Rao @jeethu:
Looks like Google has just silently released a 2B recurrent linear-attention model (non-transformer, aka the Griffin architecture). This is a bigger deal than CodeGemma, IMO. AFAIK, the closest thing to this is RWKV. huggingface.co/google/recurre… arxiv.org/abs/2402.19427
9 replies · 87 reposts · 481 likes · 63.5K views
Aleksandar Botev reposted
Mihir Kale @maninblack815:
Happy to share - blah blah blah. Gemma + Griffin = RecurrentGemma Competitive quality with Gemma-2B and much better throughput, especially for long sequences. Cracked model from cracked team! Check it out below 👇
Soham De @sohamde_:

Releasing RecurrentGemma - one of the strongest 2B-param open models designed for fast inference on long sequences and massive throughput! Both pre-trained and IT checkpoints available + code - try them out here!
Code: github.com/google-deepmin…
Weights: kaggle.com/models/google/…

2 replies · 9 reposts · 55 likes · 20.5K views
Aleksandar Botev reposted
Jyrki Alakuijala 🇺🇦:
Our usually compression-centric team helped with the C++ implementation. Gemma runs on the Highway library, originally built for HighwayHash and developed further and open-sourced in the JPEG XL effort.
Samuel L Smith @SamuelMLSmith:

Announcing RecurrentGemma! github.com/google-deepmin…
- A 2B model with open weights based on Griffin
- Replaces transformer with mix of gated linear recurrences and local attention
- Competitive with Gemma-2B on downstream evals
- Higher throughput when sampling long sequences

0 replies · 2 reposts · 7 likes · 1.1K views
Aleksandar Botev reposted
Nando de Freitas @NandoDF:
I’m very proud of our team for open sourcing RecurrentGemma. Yes, recurrence is back and it results in huge gains at inference time. Just look at the impressive throughput plot below. For details, please see the paper and GitHub page: Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models by Soham De, Samuel L. Smith, Anushan Fernando, Aleksandar Botev, George Cristian-Muraru, Albert Gu, Ruba Haroun, Leonard Berrada, Yutian Chen, Srivatsan Srinivasan, Guillaume Desjardins, Arnaud Doucet, David Budden, Yee Whye Teh, Razvan Pascanu, Nando De Freitas, Caglar Gulcehre arxiv.org/pdf/2402.19427… github.com/google-deepmin…
[Image: throughput plot]
1 reply · 30 reposts · 153 likes · 16K views
Aleksandar Botev reposted
Soham De @sohamde_:
Just got back from vacation, and super excited to finally release Griffin - a new hybrid LLM mixing RNN layers with Local Attention - scaled up to 14B params! arxiv.org/abs/2402.19427 My co-authors have already posted about our amazing results, so here's a 🧵 on how we got there!
12 replies · 65 reposts · 305 likes · 48.5K views
Aleksandar Botev reposted
Lucas Beyer (bl16) @giffmana:
It's not just LLMs. We had essentially final SigLIP models for many months before the paper. We had essentially final PaLI-3 models for something more than half a year before the paper. It's not always like this, but if a paper "feels late" it's probably just bigco delays.
Caglar Gulcehre @caglarml:

From the community's reaction to the Griffin paper, most people are unaware of how long it takes to publish an LLM paper at Google. We already had most of the results in the Griffin paper, including the final model and most of the writeup, before I left in September.

4 replies · 8 reposts · 78 likes · 18.5K views
Aleksandar Botev @botev_mg:
@srush_nlp @SamuelMLSmith So Pallas sits on top of Triton and Mosaic, which are the GPU and TPU backends respectively. The custom linear scan we implemented doesn't go through Triton at all. That said, it does indeed just use the `lax.control_flow.for_loop` primitive, which works with references.
0 replies · 0 reposts · 4 likes · 138 views
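For readers following along, here is a minimal sketch of a sequential (linear) scan written against public JAX APIs. It uses `jax.lax.fori_loop` rather than the reference-based `lax.control_flow.for_loop` primitive mentioned above, so it illustrates the recurrence itself, not the team's actual Pallas kernel.

```python
import jax
import jax.numpy as jnp

def linear_scan(a, b):
    """Sequentially computes h_t = a_t * h_{t-1} + b_t, storing every state."""
    seq_len, dim = b.shape

    def body(t, carry):
        h_prev, hs = carry
        h_t = a[t] * h_prev + b[t]
        return h_t, hs.at[t].set(h_t)  # functional in-place write

    init = (jnp.zeros(dim), jnp.zeros((seq_len, dim)))
    _, hs = jax.lax.fori_loop(0, seq_len, body, init)
    return hs

a = jnp.full((8, 2), 0.9)  # decay gates
b = jnp.ones((8, 2))       # inputs
print(linear_scan(a, b)[-1])  # h_7 = sum over k of 0.9^k ~= 5.695
```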
Sasha Rush @srush_nlp:
@SamuelMLSmith Actually curious how you implement the linear scan in Pallas? Is it just a Triton for loop, or is there a custom scan primitive?
1 reply · 1 repost · 3 likes · 2K views
Sasha Rush @srush_nlp:
New Griffin paper is really interesting and contains a lot of implementation details arxiv.org/abs/2402.19427 . The implementation is in Pallas, which is a JAX-like frontend to Triton/TPU lowering. They show that an associative scan is inherently worse than a linear scan in this context. (Not sure if this is TPU-specific.)
4 replies · 40 reposts · 279 likes · 35K views
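The associative-scan alternative being compared computes the same recurrence in parallel by composing (a, b) pairs under a binary operator; a minimal sketch with `jax.lax.associative_scan`:

```python
import jax
import jax.numpy as jnp

def combine(left, right):
    """Composes two affine steps h -> a*h + b (apply left, then right)."""
    a_l, b_l = left
    a_r, b_r = right
    return a_r * a_l, a_r * b_l + b_r

def parallel_linear_recurrence(a, b):
    # Prefix-composes all steps in parallel. Element t holds the cumulative
    # map h_{-1} -> h_t, and with h_{-1} = 0 the cumulative b term is h_t.
    _, hs = jax.lax.associative_scan(combine, (a, b))
    return hs

a = jnp.full((8, 2), 0.9)
b = jnp.ones((8, 2))
print(parallel_linear_recurrence(a, b)[-1])  # matches the sequential scan
```

Both schedules give mathematically identical results; the paper's point is about which one is faster on real hardware, where the sequential scan won in their setting.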
Aleksandar Botev @botev_mg:
Making all these models efficient required significant engineering effort, spanning careful model design, careful decisions about how we shard the models, and a custom Pallas kernel for the RNN scan. This was all achieved by the work of our whole team.
1 reply · 0 reposts · 12 likes · 705 views
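On the sharding side, a generic sketch of JAX SPMD sharding with `jax.sharding`. This shows the general mechanism only; the team's actual partitioning strategy is not described in this thread, and the mesh shape and tensor sizes below are assumptions (it expects 8 available devices).

```python
import jax
import jax.numpy as jnp
from jax.experimental import mesh_utils
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Assumes 8 devices, arranged as 2-way data x 4-way model parallelism.
mesh = Mesh(mesh_utils.create_device_mesh((2, 4)), axis_names=("data", "model"))

# Weight matrix: split its output features across the "model" axis.
w = jax.device_put(jnp.zeros((4096, 4096)), NamedSharding(mesh, P(None, "model")))

# Activations: split the batch across the "data" axis.
x = jax.device_put(jnp.zeros((16, 4096)), NamedSharding(mesh, P("data", None)))

# jit propagates the shardings and compiles one SPMD program.
y = jax.jit(lambda x, w: x @ w)(x, w)
print(y.sharding)  # batch split over "data", features over "model"
```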