
What's surprising: Vision Banana keeps its original image generation ability AND achieves state-of-the-art zero-shot performance across tasks. 👉 No task-specific heads. 👉 No special losses. (Yes, the boring table below👇) (3/5)
Chaitanya (Chay) Ryali




Here is a thread with some clarifications regarding our evaluation of Vision Banana on the SA-Co/gold segmentation benchmark. [1/n] 🧵👇




the idea of (using image generators to solve perception tasks) is pretty straightforward, and there have been many interesting results over the past couple of years. so why does this moment matter? because for the first time, a single generalist model is actually beating top domain-specific models like SAM3 and DepthAnything3. those specialized models usually take years to develop and rely on pretty complex training and data recipes. yet, as history often shows, such capabilities can instead emerge from general, scalable pretraining. in this case, image editing turns out to be a really effective pretraining paradigm, and all of the dense labeling problems can just be reframed as post-training on top of that. [2/n]

In this age of PR, it's common to see bombastic claims like "beating SAM3". However, I take issue with this chart, which is quite dishonest IMHO. I would have expected more academic honesty from researchers I deeply respect: @sainingxie, @vgabeur, @jalayrac, @jon_barron. A quick 🧵













SAM 3 is not a referring expression segmentation model - this is by design. SAM 3 solves Promptable Concept Segmentation (PCS): segmenting objects using simple noun phrases (general categories + basic attributes) and optional exemplars. It's also a robust, composable primitive that works seamlessly with MLLMs. For example, SAM 3 + Gemini 2.5 Pro achieves zero-shot SOTA on RefCOCO+/RefCOCOg referring expression benchmarks. Wondering if this wasn't sufficiently clear from the paper (our bad if so). But also wondering why only referring expression is benchmarked and not any of the benchmarks we reported on—the ones SAM 3 is actually designed for?


data labeling is dead. long live distillation. from data to object detection endpoint in 90 seconds. link: rapid.roboflow.com






