Aleksandar Botev
@botev_mg
26 posts

Research scientist at Google DeepMind.

Joined February 2024
12 Following · 226 Followers
Pinned Tweet
Aleksandar Botev @botev_mg:
We present Griffin: A hybrid model mixing a gated linear recurrence with local attention. This combination is extremely effective: it preserves all the efficiency benefits of linear RNNs and the expressiveness of transformers. Scaled up to 14B! arxiv.org/abs/2402.19427
2 replies · 36 reposts · 147 likes · 46.4K views
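To make the core idea concrete, here is a minimal sketch of a gated linear recurrence of the kind Griffin interleaves with local attention. This is an illustration, not the paper's actual RG-LRU block: the sqrt(1 - a²) input scaling follows the paper's description, but the shapes, gating parameterization, and initialization below are assumptions.

```python
import jax
import jax.numpy as jnp

def gated_linear_recurrence(x, a, b):
    """Runs h_t = a_t * h_{t-1} + b_t * x_t along the time axis.

    x, a, b have shape (seq_len, dim). The gate a_t lies in (0, 1), so the
    fixed-size state decays old information instead of growing a cache.
    """
    def step(h_prev, inputs):
        a_t, b_t, x_t = inputs
        h_t = a_t * h_prev + b_t * x_t
        return h_t, h_t  # carry the new state and also emit it

    h0 = jnp.zeros(x.shape[-1])
    _, hs = jax.lax.scan(step, h0, (a, b, x))
    return hs

# Toy usage: gates near 1 remember, gates near 0 forget.
key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (16, 4))
a = jax.nn.sigmoid(jax.random.normal(key, (16, 4)))
b = jnp.sqrt(1.0 - a**2)  # input scaling as described in the paper
print(gated_linear_recurrence(x, a, b).shape)  # (16, 4)
```

Because the per-step update is elementwise and the state has fixed size, the cost per generated token is constant in sequence length, which is where the efficiency claims come from.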
Aleksandar Botev @botev_mg:
If anyone is interested in working in an exciting team at the frontier of LLM research in London, please reach out to me or Sam.
Samuel L Smith @SamuelMLSmith:

The Training team @OpenAI is hiring researchers in London 🚀 Our twin missions are to train better LLMs and serve them more cheaply. Get in touch if you are excited to collaborate on architecture design, reliable scaling, and faster optimization.

3 replies · 2 reposts · 15 likes · 2.6K views
Aleksandar Botev reposted
Sophia @sopharicks:
It was a pleasure to host the talk with @botev_mg about the Griffin architecture (an alternative to the Transformer) and recall our internship days at OpenAI. Griffin handles long sequences well and is more efficient during inference. In some use cases, it can replace Transformers. Curious if the industry will adopt the hybrid model (Transformers + alternatives) over the years. Watch the lecture about Griffin on our YouTube channel: youtu.be/0Yi3yUjB-3M?si… #TechTalk #techtalks #MachineLearning #ArtificialInteligence #largelanguagemodels #LLMs
0 replies · 1 repost · 2 likes · 367 views
Aleksandar Botev reposted
Sophia @sopharicks:
Excited about the upcoming talks I'm hosting in the next couple of weeks. With @botev_mg, we'll be exploring Griffin, a novel architecture and an alternative to Transformers. And @aahmadian_ from @cohere @CohereForAI will talk about a new optimization method for RLHF. Details and registration are in the BuzzRobot newsletter: buzzrobot.substack.com/p/google-deepm…
0 replies · 1 repost · 4 likes · 346 views
Aleksandar Botev @botev_mg:
Our 9B Griffin model is finally open-sourced. Similar performance to the base Gemma model, but much faster! Throughput is through the roof 🤯 Available on Kaggle, Hugging Face and GitHub!
Samuel L Smith @SamuelMLSmith:

RecurrentGemma-9B is out! kaggle.com/models/google/… huggingface.co/google/recurre…
- Uses Griffin architecture, combining linear recurrence with local attention
- Downstream evals comparable to Mistral and Gemma
- Faster inference, especially for long sequences or large batch sizes
1/n

0 replies · 0 reposts · 6 likes · 264 views
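As a quick way to try the release, here is a minimal sketch using the standard Hugging Face `transformers` generation API. It assumes a recent `transformers` version with RecurrentGemma support; the model id is taken from the release announcement, and the prompt is arbitrary.

```python
# pip install transformers accelerate
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/recurrentgemma-9b"  # base checkpoint from this release
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("The Griffin architecture mixes", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```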
Aleksandar Botev @botev_mg:
@JagersbergKnut @burkov I think this depends a lot on whether you are looking at latency or throughput: RG uses a lot less memory, and hence can fit a larger batch size, which shows up only in throughput.
0 replies · 0 reposts · 1 like · 8 views
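A back-of-envelope illustration of the memory argument: a transformer's KV cache grows with sequence length, while a recurrent state does not. All sizes below are made up for illustration and are not the actual Gemma/RecurrentGemma configs.

```python
# Per-sequence inference memory at 16-bit precision (2 bytes per value).
bytes_per_val = 2
n_layers, n_kv_heads, head_dim = 32, 16, 128  # made-up transformer config
seq_len = 8192

# Transformer: the K and V caches grow linearly with sequence length.
kv_cache = n_layers * 2 * seq_len * n_kv_heads * head_dim * bytes_per_val

# Recurrent block: one fixed-size state per layer, independent of seq_len.
state_dim = 4096  # made-up recurrent state width
rnn_state = n_layers * state_dim * bytes_per_val

print(f"KV cache:  {kv_cache / 2**20:9.2f} MiB")   # 2048.00 MiB at 8k tokens
print(f"RNN state: {rnn_state / 2**20:9.2f} MiB")  # 0.25 MiB at any length
```

Griffin's local-attention layers do keep a small cache, but it is bounded by the attention window rather than the full sequence, so the per-sequence footprint stays roughly constant, freeing memory for larger batches.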
Knut Jägersberg @JagersbergKnut:
@botev_mg @burkov Yeah, this is a sickness. Also, I see inference is not really that much faster, except in some scenarios.
1 reply · 0 reposts · 1 like · 19 views
Aleksandar Botev @botev_mg:
@JagersbergKnut @burkov Actually, both models have roughly 7B non-embedding parameters and around 1.5B embedding parameters, totalling 8.58B each. The only discrepancy, with respect to parameters, is in the naming of the two models.
1 reply · 0 reposts · 1 like · 21 views
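For a sense of where embedding counts of this size come from: the embedding table is vocab_size × d_model. The numbers below are assumptions in the Gemma-family ballpark, not the released configs.

```python
vocab_size = 256_000  # assumed, Gemma-family ballpark
d_model = 3_072       # assumed model width

per_matrix = vocab_size * d_model  # one embedding table: ~0.79B parameters
total = 2 * per_matrix             # input + output embeddings counted separately
print(f"{total / 1e9:.2f}B embedding parameters")  # ~1.57B
```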
Knut Jägersberg @JagersbergKnut:
@burkov Looking at Gemma's numbers, I'd say it looks like a rough match, though not quite, since RecurrentGemma needs to use more parameters. However, inference seems to be way, way faster.
1 reply · 0 reposts · 1 like · 50 views
Aleksandar Botev reposted
Jeethu Rao @jeethu:
Looks like Google has just silently released a 2B recurrent linear-attention model (non-transformer, aka the Griffin architecture). This is a bigger deal than CodeGemma, IMO. AFAIK, the closest thing to this is RWKV. huggingface.co/google/recurre… arxiv.org/abs/2402.19427
9 replies · 87 reposts · 481 likes · 63.5K views
Aleksandar Botev reposted
Mihir Kale @maninblack815:
Happy to share - blah blah blah. Gemma + Griffin = RecurrentGemma Competitive quality with Gemma-2B and much better throughput, especially for long sequences. Cracked model from cracked team! Check it out below 👇
Soham De @sohamde_:

Releasing RecurrentGemma - one of the strongest 2B-param open models designed for fast inference on long sequences and massive throughput! Both pre-trained and IT checkpoints available + code - try them out here!
Code: github.com/google-deepmin…
Weights: kaggle.com/models/google/…

2 replies · 9 reposts · 55 likes · 20.5K views
Aleksandar Botev reposted
Jyrki Alakuijala 🇺🇦:
Our usually compression-centric team helped with the C++ implementation. Gemma runs on the Highway library, originally built for HighwayHash and developed further and open-sourced in the JPEG XL effort.
Samuel L Smith @SamuelMLSmith:

Announcing RecurrentGemma! github.com/google-deepmin…
- A 2B model with open weights based on Griffin
- Replaces transformer with mix of gated linear recurrences and local attention
- Competitive with Gemma-2B on downstream evals
- Higher throughput when sampling long sequences

0 replies · 2 reposts · 7 likes · 1.1K views
Aleksandar Botev reposted
Nando de Freitas @NandoDF:
I’m very proud of our team for open sourcing RecurrentGemma. Yes, recurrence is back and it results in huge gains at inference time. Just look at the impressive throughput plot below. For details, please see the paper and GitHub page: Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models by Soham De, Samuel L. Smith, Anushan Fernando, Aleksandar Botev, George Cristian-Muraru, Albert Gu, Ruba Haroun, Leonard Berrada, Yutian Chen, Srivatsan Srinivasan, Guillaume Desjardins, Arnaud Doucet, David Budden, Yee Whye Teh, Razvan Pascanu, Nando De Freitas, Caglar Gulcehre arxiv.org/pdf/2402.19427… github.com/google-deepmin…
[Image: throughput plot]
1 reply · 30 reposts · 153 likes · 16K views
Aleksandar Botev reposted
Soham De @sohamde_:
Just got back from vacation, and super excited to finally release Griffin - a new hybrid LLM mixing RNN layers with Local Attention - scaled up to 14B params! arxiv.org/abs/2402.19427 My co-authors have already posted about our amazing results, so here's a 🧵 on how we got there!
12 replies · 65 reposts · 305 likes · 48.5K views
Aleksandar Botev reposted
Lucas Beyer (bl16) @giffmana:
It's not just LLMs. We had essentially final SigLIP models for many months before the paper. We had essentially final PaLI-3 models for something more than half a year before the paper. It's not always like this, but if a paper "feels late" it's probably just bigco delays.
Caglar Gulcehre @caglarml:

From the community's reaction to the Griffin paper, most people are unaware of how long it takes to publish an LLM paper at Google. We already had most of the results in the Griffin paper, including the final model and most of the writeup, before I left in September.

4 replies · 8 reposts · 78 likes · 18.5K views
Aleksandar Botev @botev_mg:
@srush_nlp @SamuelMLSmith So Pallas sits on top of Triton and Mosaic, which are the GPU and TPU backends respectively. The custom linear scan we implemented doesn't go through Triton at all. That said, it does indeed just use the `lax.control_flow.for_loop` primitive, which works with references.
0 replies · 0 reposts · 4 likes · 138 views
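For readers following along, here is a minimal sketch of a sequential (linear) scan written against public JAX APIs. It uses `jax.lax.fori_loop` rather than the reference-based `lax.control_flow.for_loop` primitive mentioned above, so it illustrates the recurrence itself, not the team's actual Pallas kernel.

```python
import jax
import jax.numpy as jnp

def linear_scan(a, b):
    """Sequentially computes h_t = a_t * h_{t-1} + b_t, storing every state."""
    seq_len, dim = b.shape

    def body(t, carry):
        h_prev, hs = carry
        h_t = a[t] * h_prev + b[t]
        return h_t, hs.at[t].set(h_t)  # functional in-place write

    init = (jnp.zeros(dim), jnp.zeros((seq_len, dim)))
    _, hs = jax.lax.fori_loop(0, seq_len, body, init)
    return hs

a = jnp.full((8, 2), 0.9)  # decay gates
b = jnp.ones((8, 2))       # inputs
print(linear_scan(a, b)[-1])  # h_7 = sum over k of 0.9^k ~= 5.695
```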
Sasha Rush @srush_nlp:
@SamuelMLSmith Actually curious how you implement the linear scan in Pallas? Is it just a Triton for loop, or is there a custom scan primitive?
1 reply · 1 repost · 3 likes · 2K views
Sasha Rush @srush_nlp:
New Griffin paper is really interesting and contains a lot of implementation details arxiv.org/abs/2402.19427 . The implementation is in Pallas, which is a JAX-like frontend to Triton/TPU lowering. They show that an associative scan is inherently worse than a linear scan in this context. (Not sure if this is TPU-specific.)
4 replies · 40 reposts · 279 likes · 35K views
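The associative-scan alternative being compared computes the same recurrence in parallel by composing (a, b) pairs under a binary operator; a minimal sketch with `jax.lax.associative_scan`:

```python
import jax
import jax.numpy as jnp

def combine(left, right):
    """Composes two affine steps h -> a*h + b (apply left, then right)."""
    a_l, b_l = left
    a_r, b_r = right
    return a_r * a_l, a_r * b_l + b_r

def parallel_linear_recurrence(a, b):
    # Prefix-composes all steps in parallel. Element t holds the cumulative
    # map h_{-1} -> h_t, and with h_{-1} = 0 the cumulative b term is h_t.
    _, hs = jax.lax.associative_scan(combine, (a, b))
    return hs

a = jnp.full((8, 2), 0.9)
b = jnp.ones((8, 2))
print(parallel_linear_recurrence(a, b)[-1])  # matches the sequential scan
```

Both schedules give mathematically identical results; the paper's point is about which one is faster on real hardware, where the sequential scan won in their setting.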
Aleksandar Botev @botev_mg:
Making all these models efficient required significant engineering effort, spanning careful model design, careful decisions about how we shard the models, and a custom Pallas kernel for the RNN scan. This was all achieved by the work of our whole team.
1 reply · 0 reposts · 12 likes · 705 views
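On the sharding side, a generic sketch of JAX SPMD sharding with `jax.sharding`. This shows the general mechanism only; the team's actual partitioning strategy is not described in this thread, and the mesh shape and tensor sizes below are assumptions (it expects 8 available devices).

```python
import jax
import jax.numpy as jnp
from jax.experimental import mesh_utils
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Assumes 8 devices, arranged as 2-way data x 4-way model parallelism.
mesh = Mesh(mesh_utils.create_device_mesh((2, 4)), axis_names=("data", "model"))

# Weight matrix: split its output features across the "model" axis.
w = jax.device_put(jnp.zeros((4096, 4096)), NamedSharding(mesh, P(None, "model")))

# Activations: split the batch across the "data" axis.
x = jax.device_put(jnp.zeros((16, 4096)), NamedSharding(mesh, P("data", None)))

# jit propagates the shardings and compiles one SPMD program.
y = jax.jit(lambda x, w: x @ w)(x, w)
print(y.sharding)  # batch split over "data", features over "model"
```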