Matthias Minderer

Michael Tschannen@mtschannen

12

39

6.1K

Matthias Minderer retweetledi

Alexander Kolesnikov@__kolesnikov__·2 Ara

I always dreamed of a model that simultaneously 1. optimizes NLL of raw pixel data, 2. generates competitive high-res. natural images, 3. is practical. But it seemed too good to be true. Until today! Our new JetFormer model (arxiv.org/abs/2411.19722) ticks on all of these. 🧵

Have you ever wondered how to train an autoregressive generative transformer on text and raw pixels, without a pretrained visual tokenizer (e.g. VQ-VAE)? We have been pondering this during summer and developed a new model: JetFormer 🌊🤖 arxiv.org/abs/2411.19722 A thread 👇 1/

English

5

41

289

68.1K

Matthias Minderer@MJLM3·4 Ara

@XiaohuaZhai @OpenAI @giffmana @__kolesnikov__ Congrats to all of you!

English

0

3

324

Xiaohua Zhai@XiaohuaZhai·4 Ara

Life update📢: After an amazing decade at Google/DeepMind, I’m thrilled to announce that I’ll be joining @OpenAI in a few weeks! I’m excited for the opportunity to co-build the OpenAI Zürich office alongside my close collaborators @giffmana and @__kolesnikov__ 🚀

English

89

36

1.4K

215K

Matthias Minderer@MJLM3·9 Şub

@unsorsodicorda @_akhaliq Different metrics. It's an increasing problem with LVIS in the literature. There's LVIS AP on the full val set, which generally produces the lowest numbers. That's what we report. Then there's minimal, and also "fixed" AP, both giving higher numbers. Some report those as "LVIS".

English

0

1

130

andrea panizza@unsorsodicorda·8 Şub

@MJLM3 @_akhaliq According to this table the mAP for OWLv2-L14 is 44.6<45, but in your plot it's close to 50 (and definitely > 45). Why? #pretrained-checkpoints" target="_blank" rel="nofollow noopener">github.com/google-researc…

English

0

189

AK@_akhaliq·2 Şub

Tencent releases YOLO-World Real-Time Open-Vocabulary Object Detection demo: huggingface.co/spaces/steveng… method excels in detecting a wide range of objects in a zero-shot manner with high efficiency. On the challenging LVIS dataset, YOLO-World achieves 35.4 AP with 52.0 FPS on V100, which outperforms many state-of-the-art methods in terms of both accuracy and speed. Furthermore, the fine-tuned YOLO-World achieves remarkable performance on several downstream tasks, including object detection and open-vocabulary instance segmentation.

English

62

331

43.6K

Matthias Minderer@MJLM3·9 Şub

@unsorsodicorda @_akhaliq @Thom_Wolf Note that this HF benchmark includes the text tower. Image only (with precomputed text embeddings) would be significantly faster.

English

0

100

Matthias Minderer@MJLM3·9 Şub

@unsorsodicorda @_akhaliq @Thom_Wolf I haven't benchmarked the HF implementation myself, but this page reports 22.395ms / image on V100 for, I assume, OWLv1 B/32 at res 768, which is very roughly equivalent in compute to B/16 at res 400. So it is at least roughly in the same ballpark: huggingface.co/docs/transform…

English

0

1

113

Matthias Minderer retweetledi

Ibrahim Alabdulmohsin | إبراهيم العبدالمحسن@ibomohsin·6 Şub

How is next-token prediction capable of such intelligent behavior? I’m very excited to share our work, where we study the fractal structure of language. TLDR: thinking of next-token prediction in language as “word statistics” is a big oversimplification! arxiv.org/abs/2402.01825

English

16

107

530

125.2K

Matthias Minderer@MJLM3·6 Şub

@ahatamiz1 @giffmana (1) No, so it would not have been trivial to reproduce, although the pretrained checkpoints I used and fine-tuning code is available. (2) Maybe the strong size augmentation+mosaics used during OWL-ViT training helps low-res performance? Need to investigate this further.

English

1

35

Ali Hatamizadeh@ahatamiz1·5 Şub

@giffmana i have two comments on this: (1) were OWL-ViT v2 series+ finetuned on O365 open-sourced ? wonder if the authors had access to reproduce these results (2) OWL-ViT v2 uses an isotropic ViT, what contributes to this amazing tradeoff ? usually multi-resolution extractors are faster

English

4

1

4

1.1K

Lucas Beyer (bl16)@giffmana·5 Şub

"Pretty neat" how omitting (intentionally or not) an existing method can completely change the key takeaway of a figure. Good thing it's marked as "work still in progress" in the arxiv comment, so this should be easy to fix. Interesting to already have a domain name then 🤔

Matthias Minderer@MJLM3

@_akhaliq I added OWL-ViT v2 to the plot. A single OWLv2 B/16 model, finetuned on O365+VG, covers all speed/accuracy combinations: Simply adjust the inference resolution to match your latency requirements. No re-training needed. arxiv.org/abs/2306.09683

English

5

7

65

33K

Matthias Minderer@MJLM3·6 Şub

@giffmana @ahatamiz1 The O365+VG-finetuned checkpoints are indeed not available (yet). Happy to work with the authors to make the results reproducible easily.

English

0

1

167

Lucas Beyer (bl16)@giffmana·6 Şub

@ahatamiz1 ohhh, actually I think you may have a point. I thought that the "FT" here was that fine-tuning, but looking at the paper, that was only LVIS correct @MJLM3 ? If this wasn't available I should tone down my criticism.

English

0

3

827

Matthias Minderer@MJLM3·5 Şub

@arankomatsuzaki I added OWL-ViT v2 to the plot. A single OWLv2 B/16, finetuned on O365+VG, covers all speed/accuracy combinations: Simply adjust the inference resolution to match your latency requirements. No re-training needed. arxiv.org/abs/2306.09683

English

1

9

531

Aran Komatsuzaki@arankomatsuzaki·31 Oca

YOLO-World: Real-Time Open-Vocabulary Object Detection Outperforms many SotA methods in terms of both accuracy and speed arxiv.org/abs/2401.17270

English

Excited to announce DORSal: a 3D structured diffusion model for generation and object-level editing of 3D scenes. DORSal is “geometry-free” and learns 3D scene structure purely from data – no expensive volume rendering! 🖥️ sjoerdvansteenkiste.com/dorsal/ 📜 arxiv.org/abs/2306.08068 1/6

44

238

33.5K

Matthias Minderer retweetledi

Niels Rogge@NielsRogge·13 Eki

Excited to share that @Google's OWLv2 model is now available in 🤗 Transformers! This model is one of the strongest zero-shot object detection models out there, improving upon OWL-ViT v1 which was released last year🔥 How? By self-training on web-scale data of over 1B examples⬇️

English

4

57

280

80K

Matthias Minderer retweetledi

Thomas Kipf@tkipf·2 Eki

I'll give a talk on object-centric models for video and 3D at the @ICCVConference Workshop on Large-scale Video Object Segmentation! Today @ 3:30pm (Room S02) Website: youtube-vos.org/challenge/2023/ I'll cover DORSal (see below) & recent work from our team on structured video models.

Thomas Kipf@tkipf

English

11

58

12.9K

Matthias Minderer@MJLM3·26 Eyl

We just open-sourced OWL-ViT v2, our improved open-vocabulary object detector that uses self-training to reach >40% zero-shot LVIS APr. Check out the paper, code, and pretrained checkpoints: arxiv.org/abs/2306.09683 github.com/google-researc…. With @agritsenko and @neilhoulsby.

English

25

90

14.4K

Matthias Minderer@MJLM3·21 Tem

@neilhoulsby Congrats Neil!

English

1

51

Neil Houlsby@neilhoulsby·18 Tem

🥈at Ironman Switzerland! Overwhelmed with the pace of AI development? An engaging hobby is a great way to stay enthusiastic. For me, it's endurance training. Bonus: long rides are a perfect time to ruminate on research ideas!

English

18

0

128

21.6K

Matthias Minderer@MJLM3·14 Tem

@rom1504 @m__dehghani @_basilM @neilhoulsby @JonathanHeek @mcaron31 @AndreasPSteiner @joapuipe @ibomohsin @PiotrPadlewski @MarioLucic_ You could also keep all pixels and add a little bit of padding

English

0

1

63

Romain Beaumont@rom1504·14 Tem

@m__dehghani @_basilM @neilhoulsby @JonathanHeek @mcaron31 @MJLM3 @AndreasPSteiner @joapuipe @ibomohsin @PiotrPadlewski @MarioLucic_ Amazing. I always disliked having to pick among the resizing methods. This still requires losing some pixels to fit the patch sizes though right? Weight and width are not always both a multiple of the patch size.

English

0

1

379

Mostafa Dehghani@m__dehghani·13 Tem

1/ Excited to share "Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution". NaViT breaks away from the CNN-designed input and modeling pipeline, sets a new course for ViTs, and opens up exciting possibilities in their development. arxiv.org/abs/2307.06304

English

13

147

607

143.9K

Matthias Minderer@MJLM3·14 Tem

Check out NaViT, a Vision Transformer that processes images at their native resolution. Apart from improving efficiency and performance of image-level tasks, pretraining at native resolution also produces better backbones for localization tasks like object detection.

AK@_akhaliq

Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution paper page: huggingface.co/papers/2307.06… The ubiquitous and demonstrably suboptimal choice of resizing images to a fixed resolution before processing them with computer vision models has not yet been successfully challenged. However, models such as the Vision Transformer (ViT) offer flexible sequence-based modeling, and hence varying input sequence lengths. We take advantage of this with NaViT (Native Resolution ViT) which uses sequence packing during training to process inputs of arbitrary resolutions and aspect ratios. Alongside flexible model usage, we demonstrate improved training efficiency for large-scale supervised and contrastive image-text pretraining. NaViT can be efficiently transferred to standard tasks such as image and video classification, object detection, and semantic segmentation and leads to improved results on robustness and fairness benchmarks. At inference time, the input resolution flexibility can be used to smoothly navigate the test-time cost-performance trade-off. We believe that NaViT marks a departure from the standard, CNN-designed, input and modelling pipeline used by most computer vision models, and represents a promising direction for ViTs.

English

10

54

21.5K

Matthias Minderer retweetledi

AK@_akhaliq·13 Tem

Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution paper page: huggingface.co/papers/2307.06… The ubiquitous and demonstrably suboptimal choice of resizing images to a fixed resolution before processing them with computer vision models has not yet been successfully challenged. However, models such as the Vision Transformer (ViT) offer flexible sequence-based modeling, and hence varying input sequence lengths. We take advantage of this with NaViT (Native Resolution ViT) which uses sequence packing during training to process inputs of arbitrary resolutions and aspect ratios. Alongside flexible model usage, we demonstrate improved training efficiency for large-scale supervised and contrastive image-text pretraining. NaViT can be efficiently transferred to standard tasks such as image and video classification, object detection, and semantic segmentation and leads to improved results on robustness and fairness benchmarks. At inference time, the input resolution flexibility can be used to smoothly navigate the test-time cost-performance trade-off. We believe that NaViT marks a departure from the standard, CNN-designed, input and modelling pipeline used by most computer vision models, and represents a promising direction for ViTs.

English

51

217

119.5K

Matthias Minderer retweetledi

Aran Komatsuzaki@arankomatsuzaki·19 Haz

Scaling Open-Vocabulary Object Detection Proposes OWLv2, which achieves SotA open-vocabulary detection already at 10M examples and further large improvement by scaling to over 1B examples. arxiv.org/abs/2306.09683

English