Matthias Minderer

76 posts

@MJLM3

Research Scientist at @GoogleResearch.

Zürich, Switzerland · Joined July 2009
92 Following · 498 Followers
Matthias Minderer retweeted
Ibrahim Alabdulmohsin | إبراهيم العبدالمحسن
🔥 Excited to introduce RINS - a technique that boosts model performance by recursively applying early layers during inference, without increasing model size or training FLOPs! It significantly improves not only LMs but also multimodal systems like SigLIP. (1/N)
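To make the idea in the tweet above concrete, here is a minimal, hedged sketch in PyTorch of re-applying a model's early layers recursively at inference. The layer split (4 early / 8 late blocks) and the recursion count are illustrative assumptions, not the paper's recipe; the point is that the early parameters are shared across recursions, so model size stays fixed while inference compute grows.

```python
import torch
import torch.nn as nn

class RecursiveEncoder(nn.Module):
    """Toy encoder that re-applies its early blocks several times (sketch only)."""

    def __init__(self, d_model=256, n_early=4, n_late=8, recursions=2):
        super().__init__()
        layer = lambda: nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.early = nn.ModuleList([layer() for _ in range(n_early)])
        self.late = nn.ModuleList([layer() for _ in range(n_late)])
        self.recursions = recursions

    def forward(self, x):
        # Re-apply the early stack `recursions` times; parameters are shared,
        # so only inference compute increases, not parameter count.
        for _ in range(self.recursions):
            for blk in self.early:
                x = blk(x)
        for blk in self.late:
            x = blk(x)
        return x

x = torch.randn(2, 16, 256)
print(RecursiveEncoder()(x).shape)  # torch.Size([2, 16, 256])
```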
Matthias Minderer retweeted
Alexander Kolesnikov@__kolesnikov__·
I always dreamed of a model that simultaneously 1. optimizes NLL of raw pixel data, 2. generates competitive high-res. natural images, 3. is practical. But it seemed too good to be true. Until today! Our new JetFormer model (arxiv.org/abs/2411.19722) ticks all of these boxes. 🧵
Michael Tschannen@mtschannen

Have you ever wondered how to train an autoregressive generative transformer on text and raw pixels, without a pretrained visual tokenizer (e.g. VQ-VAE)? We have been pondering this during summer and developed a new model: JetFormer 🌊🤖 arxiv.org/abs/2411.19722 A thread 👇 1/

Xiaohua Zhai@XiaohuaZhai·
Life update📢: After an amazing decade at Google/DeepMind, I’m thrilled to announce that I’ll be joining @OpenAI in a few weeks! I’m excited for the opportunity to co-build the OpenAI Zürich office alongside my close collaborators @giffmana and @__kolesnikov__ 🚀
Matthias Minderer@MJLM3·
@unsorsodicorda @_akhaliq Different metrics. It's an increasing problem with LVIS in the literature. There's LVIS AP on the full val set, which generally produces the lowest numbers. That's what we report. Then there's minival AP, and also "fixed" AP, both giving higher numbers. Some report those as "LVIS".
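For reference, a hedged sketch of the "full val set" protocol mentioned above, assuming the open-source lvis-api package. The file paths are placeholders; the minival and "fixed" AP variants use different annotation subsets and evaluation rules, which is why the reported numbers diverge.

```python
# Sketch: standard LVIS v1 AP over the full validation set (lvis-api assumed).
from lvis import LVIS, LVISResults, LVISEval

gt = LVIS("lvis_v1_val.json")             # full val annotations (placeholder path)
dt = LVISResults(gt, "detections.json")   # your model's detections (placeholder path)
evaluator = LVISEval(gt, dt, iou_type="bbox")
evaluator.run()
evaluator.print_results()                 # AP averaged over all LVIS categories
```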
andrea panizza@unsorsodicorda·
@MJLM3 @_akhaliq According to this table, the mAP for OWLv2-L14 is 44.6 < 45, but in your plot it's close to 50 (and definitely > 45). Why? github.com/google-researc… (#pretrained-checkpoints)
AK@_akhaliq·
Tencent releases YOLO-World Real-Time Open-Vocabulary Object Detection demo: huggingface.co/spaces/steveng… The method excels in detecting a wide range of objects in a zero-shot manner with high efficiency. On the challenging LVIS dataset, YOLO-World achieves 35.4 AP with 52.0 FPS on V100, which outperforms many state-of-the-art methods in terms of both accuracy and speed. Furthermore, the fine-tuned YOLO-World achieves remarkable performance on several downstream tasks, including object detection and open-vocabulary instance segmentation.
Matthias Minderer@MJLM3·
@ahatamiz1 @giffmana (1) No, so it would not have been trivial to reproduce, although the pretrained checkpoints I used and the fine-tuning code are available. (2) Maybe the strong size augmentation + mosaics used during OWL-ViT training helps low-res performance? Need to investigate this further.
Ali Hatamizadeh@ahatamiz1·
@giffmana I have two comments on this: (1) Were the OWL-ViT v2 checkpoints finetuned on O365 open-sourced? I wonder if the authors had access to reproduce these results. (2) OWL-ViT v2 uses an isotropic ViT; what contributes to this amazing tradeoff? Usually multi-resolution extractors are faster.
Lucas Beyer (bl16)@giffmana·
"Pretty neat" how omitting (intentionally or not) an existing method can completely change the key takeaway of a figure. Good thing it's marked as "work still in progress" in the arxiv comment, so this should be easy to fix. Interesting to already have a domain name then 🤔
Matthias Minderer@MJLM3

@_akhaliq I added OWL-ViT v2 to the plot. A single OWLv2 B/16 model, finetuned on O365+VG, covers all speed/accuracy combinations: Simply adjust the inference resolution to match your latency requirements. No re-training needed. arxiv.org/abs/2306.09683

Matthias Minderer@MJLM3·
@giffmana @ahatamiz1 The O365+VG-finetuned checkpoints are indeed not available (yet). Happy to work with the authors to make the results easily reproducible.
Lucas Beyer (bl16)@giffmana·
@ahatamiz1 Ohhh, actually I think you may have a point. I thought that the "FT" here was that fine-tuning, but looking at the paper, that was only LVIS, correct @MJLM3? If this wasn't available, I should tone down my criticism.
Matthias Minderer@MJLM3·
@arankomatsuzaki I added OWL-ViT v2 to the plot. A single OWLv2 B/16, finetuned on O365+VG, covers all speed/accuracy combinations: Simply adjust the inference resolution to match your latency requirements. No re-training needed. arxiv.org/abs/2306.09683
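A back-of-the-envelope illustration of the resolution/latency trade-off mentioned above: for a ViT-B/16 detector, the number of image patches grows with the square of the input side length, and compute grows at least proportionally with the patch count. The resolutions below are arbitrary examples, not values from the paper.

```python
# Illustrative only: patch counts for a /16 patch size at a few resolutions.
PATCH = 16
for side in (448, 672, 896, 1120):
    tokens = (side // PATCH) ** 2
    rel = tokens / (448 // PATCH) ** 2
    print(f"{side:>4} px -> {tokens:>4} patches (~{rel:.2f}x the 448 px token count)")
```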
Aran Komatsuzaki@arankomatsuzaki·
YOLO-World: Real-Time Open-Vocabulary Object Detection. Outperforms many SotA methods in terms of both accuracy and speed. arxiv.org/abs/2401.17270
Matthias Minderer retweeted
Niels Rogge@NielsRogge·
Excited to share that @Google's OWLv2 model is now available in 🤗 Transformers! This model is one of the strongest zero-shot object detection models out there, improving upon OWL-ViT v1 which was released last year🔥 How? By self-training on web-scale data of over 1B examples⬇️
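A hedged usage sketch for the release above: zero-shot detection with OWLv2 through the 🤗 Transformers API. The checkpoint name, query prompts, image URL, and score threshold are illustrative assumptions; check the model card for the exact identifiers and recommended settings.

```python
import requests
import torch
from PIL import Image
from transformers import Owlv2Processor, Owlv2ForObjectDetection

# Assumed checkpoint name; see the Hugging Face model card for available variants.
ckpt = "google/owlv2-base-patch16-ensemble"
processor = Owlv2Processor.from_pretrained(ckpt)
model = Owlv2ForObjectDetection.from_pretrained(ckpt)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # example image
image = Image.open(requests.get(url, stream=True).raw)
texts = [["a photo of a cat", "a photo of a remote control"]]    # free-form queries

inputs = processor(text=texts, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
results = processor.post_process_object_detection(
    outputs, threshold=0.3, target_sizes=target_sizes
)
for score, label, box in zip(results[0]["scores"], results[0]["labels"], results[0]["boxes"]):
    print(f"{texts[0][label.item()]}: {score:.2f} at {box.tolist()}")
```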
Matthias Minderer retweeted
Thomas Kipf@tkipf·
I'll give a talk on object-centric models for video and 3D at the @ICCVConference Workshop on Large-scale Video Object Segmentation! Today @ 3:30pm (Room S02) Website: youtube-vos.org/challenge/2023/ I'll cover DORSal (see below) & recent work from our team on structured video models.
Thomas Kipf@tkipf

Excited to announce DORSal: a 3D structured diffusion model for generation and object-level editing of 3D scenes. DORSal is “geometry-free” and learns 3D scene structure purely from data – no expensive volume rendering! 🖥️ sjoerdvansteenkiste.com/dorsal/ 📜 arxiv.org/abs/2306.08068 1/6

Neil Houlsby@neilhoulsby·
🥈at Ironman Switzerland! Overwhelmed with the pace of AI development? An engaging hobby is a great way to stay enthusiastic. For me, it's endurance training. Bonus: long rides are a perfect time to ruminate on research ideas!
Mostafa Dehghani@m__dehghani·
1/ Excited to share "Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution". NaViT breaks away from the CNN-designed input and modeling pipeline, sets a new course for ViTs, and opens up exciting possibilities in their development. arxiv.org/abs/2307.06304
Matthias Minderer@MJLM3·
Check out NaViT, a Vision Transformer that processes images at their native resolution. Apart from improving efficiency and performance of image-level tasks, pretraining at native resolution also produces better backbones for localization tasks like object detection.
AK@_akhaliq

Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution paper page: huggingface.co/papers/2307.06… The ubiquitous and demonstrably suboptimal choice of resizing images to a fixed resolution before processing them with computer vision models has not yet been successfully challenged. However, models such as the Vision Transformer (ViT) offer flexible sequence-based modeling, and hence varying input sequence lengths. We take advantage of this with NaViT (Native Resolution ViT) which uses sequence packing during training to process inputs of arbitrary resolutions and aspect ratios. Alongside flexible model usage, we demonstrate improved training efficiency for large-scale supervised and contrastive image-text pretraining. NaViT can be efficiently transferred to standard tasks such as image and video classification, object detection, and semantic segmentation and leads to improved results on robustness and fairness benchmarks. At inference time, the input resolution flexibility can be used to smoothly navigate the test-time cost-performance trade-off. We believe that NaViT marks a departure from the standard, CNN-designed, input and modelling pipeline used by most computer vision models, and represents a promising direction for ViTs.

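To make the "sequence packing" idea above concrete, here is a toy sketch (not the NaViT implementation): images of different resolutions and aspect ratios are patchified and packed into one fixed-length token sequence, with an example-id array so attention and pooling can later be masked per image. Patch size, sequence length, and the greedy packing policy are all illustrative assumptions.

```python
import numpy as np

PATCH, SEQ_LEN = 16, 64
DIM = PATCH * PATCH * 3

def patchify(img):
    """Split an (H, W, 3) image into flattened PATCH x PATCH patches."""
    h, w, c = img.shape
    p = img.reshape(h // PATCH, PATCH, w // PATCH, PATCH, c)
    return p.transpose(0, 2, 1, 3, 4).reshape(-1, DIM)

tokens = np.zeros((SEQ_LEN, DIM), dtype=np.float32)
example_ids = np.full(SEQ_LEN, -1)        # -1 marks padding tokens
cursor = 0
images = [np.random.rand(64, 96, 3),      # different resolutions and
          np.random.rand(32, 128, 3),     # aspect ratios, no resizing
          np.random.rand(80, 48, 3)]
for i, img in enumerate(images):
    p = patchify(img)
    if cursor + len(p) > SEQ_LEN:         # greedy packing: stop when full
        break
    tokens[cursor:cursor + len(p)] = p
    example_ids[cursor:cursor + len(p)] = i
    cursor += len(p)

print(example_ids)  # maps each packed token to the image it came from
```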
Matthias Minderer retweeted
Aran Komatsuzaki@arankomatsuzaki·
Scaling Open-Vocabulary Object Detection. Proposes OWLv2, which achieves SotA open-vocabulary detection already at 10M examples, with further large improvements from scaling to over 1B examples. arxiv.org/abs/2306.09683