Shangbang Long

34 posts

@ShangbangLong

Research Scientist @ Google DeepMind. Multimodal understanding and generation; world models. AGI for ALL.

Joined August 2022

217 Following · 457 Followers

Pinned Tweet

Shangbang Long@ShangbangLong·Apr 23

🚀 Excited to announce Vision Banana 🍌 and our new paper: “Image Generators are Generalist Vision Learners”. We turn Nano Banana Pro into a state-of-the-art visual generation and understanding model. 🖼️ Check out our gallery at vision-banana.github.io 🧵 (1/N) continue ⬇️

Shangbang Long@ShangbangLong·Apr 26

Our research was hugely inspired by the previous work that shows zero-shot understanding and reasoning capabilities in video generators like Veo 3. Video models might be even better vision (and action) learners 🧐.

Shane Gu@shaneguML

Exciting work from colleagues at Google. Nano Banana as a generalist vision learner. We need more AI that natively thinks in pixels / space.

Shangbang Long@ShangbangLong·Apr 26

@shaneguML Thank you Shane for sharing our work! Your previous work was incredibly inspiring to us.

Shane Gu@shaneguML·Apr 26

Exciting work from colleagues at Google. Nano Banana as a generalist vision learner. We need more AI that natively thinks in pixels / space.

Shane Gu@shaneguML

Symbols, space, and time can represent most of the "information". In this eval paper, we show how video models are generalist "space-time reasoners". It's like "let's think step by step" in LLMs in 2022. Veo3 is like GPT-3 in 2020, and can't wait for its thinking/RL moment.

Shangbang Long@ShangbangLong·Apr 24

@AntonObukhov1 Thank you Anton for your kind words 😊

Shangbang Long@ShangbangLong·Apr 24

@alcinos26 Thank you Nicolas! Apologies for this. We'll address this.

Nicolas Carion@alcinos26·Apr 24

That being said, this is impressive work, and I congratulate the authors for pulling it off! Unifying everything is a researcher's dream, and this is a step in that direction. I am quite excited to see where this line of work goes, and whether inference time can be made practical.

Nicolas Carion@alcinos26·Apr 24

In this age of PR, it's common to see bombastic claims like "beating SAM3". However I take issue with this chart which is quite dishonest IMHO. I would have expected more academic honesty from researchers I deeply respect @sainingxie, @vgabeur, @jalayrac @jon_barron. A quick 🧵

Saining Xie@sainingxie

the idea of (using image generators to solve perception tasks) is pretty straightforward, and there have been many interesting results over the past couple of years. so why this moment matters? because for the first time, a single generalist model is actually beating top domain-specific models like SAM3 and DepthAnything3. those specialized models usually take years to develop and rely on pretty complex recipes in training and data. yet, as history often shows, such capabilities can instead emerge from general, scalable pretraining. in this case, image editing turns out to be a really effective pretraining paradigm, and all of the dense labeling problems can just be reframed as post-training on top of that. [2/n]

Shangbang Long retweeted

Nithish Kannen at ICLR 2026@NithishKannen·Apr 24

Vision Banana 🍌 is here in Rio at @iclr_conf. I'll be at the Google Booth tomorrow at 10 AM doing a Demo at the @GoogleDeepMind Kiosk for folks to try out the model. I have some cool demos but I wanna do BYOImages. Looking forward to seeing folks!

Shangbang Long@ShangbangLong·Apr 24

@songyoupeng @KeranRong They should treat me to that garlic beef dish for free

Songyou Peng@songyoupeng·Apr 23

@ShangbangLong @KeranRong you should show them this and at least get a free lunch!

Songyou Peng@songyoupeng·Apr 23

Yay, finally! Introducing Vision Banana🍌 from @GoogleDeepMind, our unified model that outperforms SoTA specialist models on various vision tasks! By treating 2D/3D vision tasks as image generation, we unlock a new foundation for CV. Project page: vision-banana.github.io (1/5)

Shangbang Long@ShangbangLong·Apr 23

@KeranRong @songyoupeng It's actually Eilleen's Kitchen. My wife and I really love it - we go there so often that the owner gives us a free dessert / drink every time 🤣 I'm gonna charge them an ad fee 😁

Keran R@KeranRong·Apr 23

@songyoupeng Xie Bao restaurant??

Shangbang Long@ShangbangLong·Apr 23

@awsaf49 @vgabeur Thank you, we are aware of this paper (cited as well). As explained in the paper, we are not the first to explore such a direction. Diception is not the first either. Instead, we show that this simple approach can beat real sota methods such as sam3 and depth anything.

Awsaf@awsaf49·Apr 23

really cool work. but isn’t this already explored in diception (arxiv.org/abs/2502.17157)? they also start from pretrained image generators (stable diffusion) and finetune for multiple vision tasks, curious what the key difference is here. also wondering about the cost side: diffusion / AR models are typically slower and more expensive than standard vision models (segmentation, depth, etc). I don’t think "diception" really compared this either. and if the goal is generalist learners, could strong ssl-style vision backbones be a better fit for some of these tasks, since they can often be adapted with single-pass inference? curious what image generation pretraining gives here that a strong ssl foundation model would not.

Valentin Gabeur@vgabeur·Apr 23

Introducing Vision Banana🍌: an image generator that achieves SOTA on segmentation, depth prediction, and surface normal estimation 🚀 🖼️Project page: vision-banana.github.io 📜Technical report: arxiv.org/abs/2604.20329 🧵👇

Saining Xie@sainingxie·Apr 23

vision🍌 is here vision-banana.github.io if you got into computer vision the way I did, starting with pixel-level labeling tasks like segmentation, edges, depth, or surface normals, you’ll probably feel the same seeing these results -- something big has quietly shifted, and it’s going to change how we approach these problems for good 🧵

Shangbang Long@ShangbangLong·Apr 23

@sainingxie Thank you Saining, for envisioning this path. I am excited about where it leads us 🫶

Shangbang Long@ShangbangLong·Apr 23

@vgabeur Let’s go Vision Banana!

Shangbang Long@ShangbangLong·Apr 23

@howardzzh Definitely exceeded my expectations at the beginning of this project as well…

Howard Zhou@howardzzh·Apr 23

When I became a Computer Vision student many years ago, I would've never imagined, even in my wildest dream, that one day some of the hardest vision problems would be solved by an image generator. Congratulations to the team for achieving this remarkable milestone!

Shangbang Long@ShangbangLong

Shangbang Long retweeted

Shuyang (Kevin) Sun@Kevin_SSY·Apr 23

Are we finally witnessing the GPT-3 moment for computer vision? We just dropped Vision Banana 🍌 , a vision foundation model that seamlessly unifies generation and perception by treating all vision tasks as just another image generation problem. 1/N #googledeepmind #nanobanana

Shangbang Long@ShangbangLong·Apr 23

🧵 (N/N) Remember to check our demo by @NithishKannen at ICLR! It’s located right at Google’s Booth. research.google/conferences-an…

Shangbang Long@ShangbangLong·Apr 23

🧵 (8/N) The Vision Banana work is greatly inspired and guided by our leads: @oliver_wang2, @sainingxie, @howardzzh, Kaiming, Tom, @jalayrac, and @RSoricut.
