Filip Pavetić

32 posts

@FPavetic

Joined April 2022
136 Following · 51 Followers

Filip Pavetić retweeted
Antoine Yang @AntoineYang2
Thrilled to share our latest advances in video understanding 📽️: Gemini 2.5 Pro is a truly magical model to play with, excelling in traditional video analysis and unlocking new use cases I could not imagine a few months ago🪄 More in 🧵 and @Google blog: developers.googleblog.com/en/gemini-2-5-…

Filip Pavetić retweeted
Antoine Yang @AntoineYang2
Gemini 2.0 Flash's video understanding is here 🚀 Think: search in videos via timecodes, extract text from moving camera footage, analyze screen recordings in real-time interactions with native audio out 🔊 Come and try it aistudio.google.com 😀 youtu.be/Mot-JEU26GQ?si…
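
The same capabilities are reachable programmatically. A hedged sketch, assuming the google-genai Python SDK and its Client/files.upload/models.generate_content interface; the file name, model string, and prompt are illustrative placeholders rather than anything from the tweet:

```python
# Hedged sketch of the timecode-search use case, assuming the google-genai
# Python SDK (pip install google-genai); file name, model string, and prompt
# are placeholders, not taken from the tweet.
from google import genai

client = genai.Client()  # assumes GEMINI_API_KEY is set in the environment

# Upload the video; large files may need polling via
# client.files.get(name=video.name) until the file state is ACTIVE.
video = client.files.upload(file="screen_recording.mp4")

response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=[video, "List timecodes (MM:SS) where on-screen text appears, "
                     "and transcribe that text."],
)
print(response.text)
```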

Filip Pavetić retweeted
Ibrahim Alabdulmohsin | إبراهيم العبدالمحسن
Attending #NeurIPS2024? If you're interested in multimodal systems, building inclusive & culturally aware models, and how fractals relate to LLMs, we have 3 posters for you. I look forward to presenting them on behalf of our GDM team @ Zurich & collaborators. Details below (1/4)

Filip Pavetić retweeted
Lucas Beyer (bl16) @giffmana
🧶PaLI-3 achieves SOTA across many vision-language (and video!) tasks while being 10x smaller than its predecessor PaLI-X. At only 5B parameters, it's also smaller (and stronger) than the concurrent Fuyu-8B model, though sadly we cannot release the model (props to @AdeptAILabs)

Filip Pavetić retweeted
Carlos Riquelme @rikelhood
Sparsity is one of the most promising areas in deep learning (tokens follow different routes in the model). However, these discrete decisions are messy to handle & optimize. Today we introduce Soft-MoE. The idea is simple: Don't route tokens, route linear combinations of them.
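
The routing idea compresses into a few lines. A minimal NumPy sketch of a Soft-MoE layer as described in the paper (arxiv.org/abs/2308.00951), not the authors' implementation: dispatch weights are a softmax over tokens, so each slot is a convex combination of tokens; combine weights are a softmax over slots, so each output token is a convex combination of slot outputs. The toy tanh experts stand in for the real expert MLPs.

```python
# Minimal NumPy sketch of a Soft-MoE layer (see arxiv.org/abs/2308.00951);
# shapes and the toy tanh "experts" are illustrative, real experts are MLPs.
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def soft_moe(tokens, phi, expert_weights):
    """tokens: (n, d); phi: (d, num_experts * slots_per_expert)."""
    logits = tokens @ phi                      # token-slot affinities
    dispatch = softmax(logits, axis=0)         # per slot: weights over tokens
    combine = softmax(logits, axis=1)          # per token: weights over slots
    slots = dispatch.T @ tokens                # each slot = soft mix of tokens
    per_expert = slots.shape[0] // len(expert_weights)
    outs = [np.tanh(slots[i * per_expert:(i + 1) * per_expert] @ w)
            for i, w in enumerate(expert_weights)]
    return combine @ np.concatenate(outs)      # back to one output per token

# Usage: 16 tokens of width 8, 4 experts with 2 slots each.
rng = np.random.default_rng(0)
x = rng.normal(size=(16, 8))
phi = rng.normal(size=(8, 4 * 2))
experts = [0.1 * rng.normal(size=(8, 8)) for _ in range(4)]
print(soft_moe(x, phi, experts).shape)  # (16, 8)
```

Because every weight comes out of a softmax, no token is ever hard-dropped, which is what removes the messy discrete routing decisions the tweet mentions.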

Filip Pavetić retweeted
Joan Puigcerver @joapuipe
Introducing Soft MoE! Sparse MoEs are a popular method for increasing the model size without increasing its cost, but they come with several issues. Soft MoEs avoid them and significantly outperform ViT and different Sparse MoEs on image classification. arxiv.org/abs/2308.00951

Filip Pavetić retweeted
Piotr Padlewski @PiotrPadlewski
I will be at ICML presenting ViT-22B! Feel free to grab me if you want to chat about it.

Filip Pavetić retweeted
Mostafa Dehghani @m__dehghani
NaViT (arxiv.org/abs/2307.06304) sets us free from square boxes and lets us think outside the box! Let creativity flow and go for the natural designs we've always wanted in ViTs. I share a few cool ideas that are made possible with NaViT: twitter.com/m__dehghani/st…
Quoting Mostafa Dehghani @m__dehghani:
"What do you think are the primary limitations or design choices that feel unnatural when it comes to using Transformers for computer vision (images, videos, ...)?"
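
Concretely, the freedom comes from packing. A minimal NumPy sketch of the Patch n' Pack idea (arxiv.org/abs/2307.06304), purely illustrative: each image keeps its native shape, is cut into patches, and the variable-length patch sequences share one packed sequence, with per-token example ids providing the within-image attention mask.

```python
# Minimal NumPy sketch of NaViT-style Patch n' Pack (arxiv.org/abs/2307.06304):
# native-aspect-ratio images are patchified and packed into one sequence,
# with per-token example ids giving the within-image attention mask.
# Illustrative only; the real model also packs position ids, drops tokens, etc.
import numpy as np

def patchify(img, p):
    """img: (h, w, c), h and w divisible by p -> (h*w/p**2, p*p*c)."""
    h, w, c = img.shape
    x = img.reshape(h // p, p, w // p, p, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, p * p * c)

def pack(images, p):
    tokens = [patchify(img, p) for img in images]
    ids = np.concatenate([np.full(len(t), i) for i, t in enumerate(tokens)])
    mask = ids[:, None] == ids[None, :]        # attend only within one image
    return np.concatenate(tokens), mask

# Two images with different, non-square shapes share one packed sequence.
rng = np.random.default_rng(0)
imgs = [rng.normal(size=(4, 8, 3)), rng.normal(size=(8, 4, 3))]
toks, mask = pack(imgs, p=4)
print(toks.shape, mask.shape)  # (4, 48) (4, 4)
```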

Filip Pavetić retweeted
Neil Houlsby @neilhoulsby
At CVPR? Three papers from the Google DeepMind (formerly Brain) Vision team in Berlin/Zürich/Amsterdam (+collaborators) are there. If interested in the work or the team, track down the authors!

Filip Pavetić retweeted
Piotr Padlewski @PiotrPadlewski
Quick summary of our recent work on scaling Vision Transformers: solving stability issues, making training more efficient, and some cool results. ai.googleblog.com/2023/03/scalin…

Filip Pavetić retweeted
Google AI @GoogleAI
Learn about ViT-22B, the result of our latest work on scaling vision transformers to create the largest dense vision model. With improvements to both the stability and efficiency of training, ViT-22B advances the state of the art on many vision tasks → ai.googleblog.com/2023/03/scalin…

Filip Pavetić retweeted
Basil Mustafa @_basilM
2️⃣2️⃣🅱️: We trained a 22B parameter ViT model, and scale continues to prove its merit! I want to zero in on an aspect of this which, however, is useful at all scales: a method for improving training stability in transformers. arxiv.org/abs/2302.05442
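
That stability method is query/key normalization: per the ViT-22B paper (arxiv.org/abs/2302.05442), LayerNorm is applied to queries and keys before the attention logits, which prevents the uncontrolled growth of the logits the authors observed at scale. A minimal single-head NumPy sketch, omitting learned LayerNorm parameters and multi-head plumbing:

```python
# Minimal single-head NumPy sketch of query/key normalization as used in
# ViT-22B (arxiv.org/abs/2302.05442): LayerNorm on Q and K before the dot
# product keeps attention logits bounded. Learned LN scale/bias omitted.
import numpy as np

def layernorm(x, eps=1e-6):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def qk_norm_attention(x, wq, wk, wv):
    q = layernorm(x @ wq)                      # normalize queries...
    k = layernorm(x @ wk)                      # ...and keys
    v = x @ wv
    logits = q @ k.T / np.sqrt(q.shape[-1])    # bounded: q, k rows are normalized
    a = np.exp(logits - logits.max(-1, keepdims=True))
    return (a / a.sum(-1, keepdims=True)) @ v  # softmax(logits) @ v

rng = np.random.default_rng(0)
x = rng.normal(size=(10, 16))
wq, wk, wv = (rng.normal(size=(16, 16)) for _ in range(3))
print(qk_norm_attention(x, wq, wk, wv).shape)  # (10, 16)
```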

Filip Pavetić retweeted
Andreas Steiner @AndreasPSteiner
Scaling Vision Transformers to 22 billion parameters continues to improve ImageNet and OOD classification. And while ImageNet top1-accuracy seems to saturate short of 91% after fine-tuning, ObjectNet accuracy continues to increase, resulting in better effective robustness.
Quoting Mostafa Dehghani @m__dehghani:
"This was a collaboration with an amazing group of people including @PiotrPadlewski, @_basilM, @m__dehghani, @JonathanHeek, @jmgilmer, @AndreasPSteiner, @MJLM3, @mcaron31, @ibomohsin, @RJenatton, @rikelhood, @mechcoder, @anuragarnab, @brainshawn, @giffmana, @mtschannen,..."

Filip Pavetić retweeted
Mostafa Dehghani @m__dehghani
1/ There is huge headroom for improving the capabilities of our vision models, and given the lessons we've learned from LLMs, scaling is a promising bet. We are introducing ViT-22B, the largest vision backbone reported to date: arxiv.org/abs/2302.05442

Filip Pavetić retweeted
Basil Mustafa @_basilM
Beep beep! Introducing LIMoE, the Language Image Mixture of Experts: a single model, processing both modalities for contrastive image-text modelling. Cruises straight to 84.1% 0shot ImageNet accuracy without any modality-specific architectures or pre-training. (1/10)
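
The objective behind that zero-shot number is the standard contrastive image-text loss; LIMoE's contribution is the single shared mixture-of-experts encoder, not the loss itself. A minimal NumPy sketch of the symmetric contrastive (CLIP-style) objective, with illustrative embeddings and temperature:

```python
# Minimal NumPy sketch of the symmetric contrastive (CLIP-style) image-text
# loss that LIMoE trains with; embeddings and temperature are illustrative.
import numpy as np

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    img = img_emb / np.linalg.norm(img_emb, axis=-1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=-1, keepdims=True)
    logits = img @ txt.T / temperature         # (n, n); pair i matches pair i
    n = len(logits)
    def xent(l):                               # mean cross-entropy, diagonal targets
        logp = l - l.max(-1, keepdims=True)
        logp = logp - np.log(np.exp(logp).sum(-1, keepdims=True))
        return -logp[np.arange(n), np.arange(n)].mean()
    return 0.5 * (xent(logits) + xent(logits.T))

rng = np.random.default_rng(0)
print(contrastive_loss(rng.normal(size=(8, 32)), rng.normal(size=(8, 32))))
```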

Filip Pavetić retweeted
Google AI @GoogleAI
Stop by the Google booth at #ECCV2022 at 3:30 pm today to see a demo presented by Austin Stone, @MJLM3 and @agritsenko about OWL-ViT, a simple and scalable approach for open-vocabulary object detection and image-conditioned detection. Try it yourself at bit.ly/owl-vit-demo.