Chaitanya (Chay) Ryali (@wrong_whp) - Twitter Profile

@gabriberton Since vision 🍌 can't handle negatives, it can't actually do instance segmentation. Yet here's a lead of the project claiming it's SOTA at this, beating SAM 3 at something vision 🍌 can't actually do. So, pretty far from correct. x.com/i/status/20473…

Songyou Peng@songyoupeng

What's surprising: Vision Banana keeps its original image generation ability AND achieves state-of-the-art zero-shot performance across tasks. 👉 No task-specific heads. 👉 No special losses. (Yes, the boring table below👇) (3/5)

0

1

428

Gabriele Berton@gabriberton·Apr 25

Vision Banana's first author replies to the criticism. My summary and thoughts 🧵 SAM 3 came with SA-Co, a segmentation dataset that also contains negatives, e.g. given a photo of dogs, the text asks to segment cats (see colander below). The output should be an empty mask [1/N]

Valentin Gabeur@vgabeur

Here is a thread with some precisions regarding our evaluation of Vision Banana on the SA-Co/gold segmentation benchmark. [1/n] 🧵👇

4

12

121

26.5K
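The point about negatives in the thread above can be made concrete with a toy example. The sketch below is not the SA-Co evaluation code. It assumes a hypothetical presence-level score (a Matthews correlation coefficient over "is the mask non-empty?") and made-up data, and only illustrates why a model that always returns a mask is penalized once negative prompts are included.

```python
import math

def presence_mcc(pred_nonempty, gt_nonempty):
    """Matthews correlation between predicted and true 'object present' flags.

    Hypothetical helper for illustration; returns 0.0 when the score is undefined.
    """
    pairs = list(zip(pred_nonempty, gt_nonempty))
    tp = sum(1 for p, g in pairs if p and g)
    tn = sum(1 for p, g in pairs if not p and not g)
    fp = sum(1 for p, g in pairs if p and not g)
    fn = sum(1 for p, g in pairs if not p and g)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

# Made-up ground truth: True = prompt matches something, False = negative prompt.
gt = [True, True, False, False, True, False]

# A model that cannot return an empty mask predicts "present" for every prompt.
always_mask = [True] * len(gt)

# A model that handles most negatives (one false positive on the last prompt).
handles_negatives = [True, True, False, False, True, True]

score_always = presence_mcc(always_mask, gt)
score_handles = presence_mcc(handles_negatives, gt)
print(score_always, score_handles)
```

On positives alone the two toy models are indistinguishable; the difference only appears when negative prompts are part of the evaluation.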

Chaitanya (Chay) Ryali@wrong_whp·Apr 25

@vgabeur @alcinos26 @sainingxie @jalayrac @jon_barron Thanks for following up Valentin! Since vision 🍌 is not capable of handling negatives, it's not capable of open vocabulary instance segmentation, yet it's claimed to be SOTA at this by a project lead, so I hope you can see how this can be misleading. x.com/i/status/20473…

Songyou Peng@songyoupeng

What's surprising: Vision Banana keeps its original image generation ability AND achieves state-of-the-art zero-shot performance across tasks. 👉 No task-specific heads. 👉 No special losses. (Yes, the boring table below👇) (3/5)

0

1

233

Valentin Gabeur@vgabeur·Apr 25

@alcinos26 @sainingxie @jalayrac @jon_barron Thanks for the feedback Nico, I listed justifications for our decisions in the following thread: x.com/vgabeur/status…

Valentin Gabeur@vgabeur

Here is a thread with some precisions regarding our evaluation of Vision Banana on the SA-Co/gold segmentation benchmark. [1/n] 🧵👇

1

0

5

1.7K

Nicolas Carion@alcinos26·Apr 24

In this age of PR, it's common to see bombastic claims like "beating SAM3". However I take issue with this chart which is quite dishonest IMHO. I would have expected more academic honesty from researchers I deeply respect @sainingxie, @vgabeur, @jalayrac @jon_barron. A quick 🧵

Saining Xie@sainingxie

the idea of (using image generators to solve perception tasks) is pretty straightforward, and there have been many interesting results over the past couple of years. so why this moment matters? because for the first time, a single generalist model is actually beating top domain-specific models like SAM3 and DepthAnything3. those specialized models usually take years to develop and rely on pretty complex recipes in training and data. yet, as history often shows, such capabilities can instead emerge from general, scalable pretraining. in this case, image editing turns out to be a really effective pretraining paradigm, and all of the dense labeling problems can just be reframed as post-training on top of that. [2/n]

5

9

179

44.3K

Chaitanya (Chay) Ryali@wrong_whp·Apr 24

@gabriberton Not true for segmentation — they evaluated on <0.3% of the benchmark using the wrong metric. Full breakdown here: x.com/alcinos26/stat…

Nicolas Carion@alcinos26

In this age of PR, it's common to see bombastic claims like "beating SAM3". However I take issue with this chart which is quite dishonest IMHO. I would have expected more academic honesty from researchers I deeply respect @sainingxie, @vgabeur, @jalayrac @jon_barron. A quick 🧵

1

3

832

Gabriele Berton@gabriberton·Apr 24

Vision Banana outperforms SAM3 on most segmentation tasks, and it is SOTA on normals and monocular metric depth estimation. And the craziest thing is that it doesn't even take the camera intrinsics as input! [3/n]

2

4

52

12.2K

Gabriele Berton@gabriberton·Apr 24

A team of cracked @GoogleDeepMind colleagues just released Vision Banana. A brief thread about Vision Banana, what it means for the future of AI, and the future of image understanding 🧵

3

6

69

5.5K

Chaitanya (Chay) Ryali@wrong_whp·Apr 24

@alcinos26 Unless I'm misunderstanding, they evaluate on 500 (image, noun-phrase) pairs, not 500 images, so even worse - evaluating on < 0.3% of a long-tailed benchmark 😅

0

69

Nicolas Carion@alcinos26·Apr 24

First, they chose to run only on 500 images, for... computational reasons? Is google running out of TPUs? The benchmark has 15.8k - this is a feature, not a bug. The real world is very diverse and long-tailed, it's impossible to get accurate stats on such a small subset. 1/x

3

1

40

3.2K
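The long-tail argument in the tweet above can be illustrated with a small simulation. This is a rough sketch under stated assumptions, not a reproduction of the benchmark: the concept count, pair count, and Zipf-like weights are all hypothetical, and the function name is invented for illustration. It estimates what share of the distinct concepts in a long-tailed set of (image, phrase) pairs a 500-pair subset actually touches.

```python
import random

def coverage_of_subset(num_concepts, num_pairs, subset_size, seed=0):
    """Fraction of distinct concepts in a synthetic benchmark seen by a random subset.

    All sizes are hypothetical; concept frequencies follow a Zipf-like 1/rank law.
    """
    rng = random.Random(seed)
    weights = [1.0 / (rank + 1) for rank in range(num_concepts)]
    # Each pair is represented only by the concept id of its phrase.
    pairs = rng.choices(range(num_concepts), weights=weights, k=num_pairs)
    subset = rng.sample(pairs, subset_size)
    return len(set(subset)) / len(set(pairs))

frac = coverage_of_subset(num_concepts=5000, num_pairs=100_000, subset_size=500)
print(f"Subset covers {frac:.1%} of the concepts present in the full set")
```

Because rare concepts dominate the count of distinct concepts, a small random subset mostly re-samples the frequent head, so per-concept statistics from it say little about the tail.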

Chaitanya (Chay) Ryali retweeted

AI at Meta@AIatMeta·Mar 27

We’re releasing SAM 3.1: a drop-in update to SAM 3 that introduces object multiplexing to significantly improve video processing efficiency without sacrificing accuracy. We’re sharing this update with the community to help make high-performance applications feasible on smaller, more accessible hardware. 🔗 Model Checkpoint: go.meta.me/8dd321 🔗 Codebase: go.meta.me/b0a9fb

106

273

2.2K

334.6K

Chaitanya (Chay) Ryali retweeted

Kate Saenko@kate_saenko_·Apr 11

Excited to share SA-FARI which will be presented as an oral at CVPR 26! conservationxlabs.com/sa-fari My team at Meta collaborated with ConservationX Labs to create the largest open video dataset for wildlife detection -- with @Surisdi @wrong_whp @YuanTingHu1

1

14

100

5.9K

Chaitanya (Chay) Ryali retweeted

Meta Newsroom@MetaNewsroom·Dec 17

New on @instagram Edits: AI-powered video effects, enabled by our new SAM3 model, make it easier to blur an object, tag an outfit, outline, and more. about.fb.com/news/2025/04/i…

22

51

426

57.5K

Chaitanya (Chay) Ryali retweeted

AI at Meta@AIatMeta·Dec 16

🔉 Introducing SAM Audio, the first unified model that isolates any sound from complex audio mixtures using text, visual, or span prompts. We’re sharing SAM Audio with the community, along with a perception encoder model, benchmarks and research papers, to empower others to explore new forms of expression and build applications that were previously out of reach. 🔗 Learn more: go.meta.me/568e5d

406

915

6.4K

1.2M

Chaitanya (Chay) Ryali@wrong_whp·Dec 15

@georgiagkioxari Nice work! SAM 3 also extensively leveraged VLMs as verifiers for grounding data - producing near human-annotation (HQ) level data. Great to see this direction gaining momentum!

0

1

65

Georgia Gkioxari@georgiagkioxari·Dec 15

Most people are hyped about LLMs as generators/actors. But IMO their real superpower is being verifiers/critics. And in computer vision this is especially true: today’s VLMs still struggle on lots of core vision tasks, yet they’re incredibly useful as feedback engines...check Damiano's work for more details x.com/marsilidamiano…

8

10

261

278.8K

Chaitanya (Chay) Ryali@wrong_whp·Dec 13

@nikshepsvn @vikhyatk Nah, they don't have meaningful comparison to SAM 3. x.com/wrong_whp/stat…

Chaitanya (Chay) Ryali@wrong_whp

SAM 3 is not a referring expression segmentation model - this is by design. SAM 3 solves Promptable Concept Segmentation (PCS): segmenting objects using simple noun phrases (general categories + basic attributes) and optional exemplars. It's also a robust, composable primitive that works seamlessly with MLLMs. For example, SAM 3 + Gemini 2.5 Pro achieves zero-shot SOTA on RefCOCO+/RefCOCOg referring expression benchmarks. Wondering if this wasn't sufficiently clear from the paper (our bad if so). But also wondering why only referring expression is benchmarked and not any of the benchmarks we reported on—the ones SAM 3 is actually designed for?

0

40

nikshep@nikshepsvn·Dec 12

@vikhyatk bro what are these numbers, nice work

2

0

150

vik@vikhyatk·Dec 12

Now available on FAL! fal.ai/models/fal-ai/…

moondream@moondreamai

We’re introducing Segmentation. SVG masks from prompt, points, or box. SOTA on benchmarks. moondream.ai/skills/segment…

7

6

148

16K

Chaitanya (Chay) Ryali@wrong_whp·Dec 13

@giffmana @CoolMFcat @vikhyatk Not necessarily - they've been posting misleading benchmarks repeatedly despite corrections. Can only surmise it's deliberate. x.com/wrong_whp/stat…

Chaitanya (Chay) Ryali@wrong_whp

SAM 3 is not a referring expression segmentation model - this is by design. SAM 3 solves Promptable Concept Segmentation (PCS): segmenting objects using simple noun phrases (general categories + basic attributes) and optional exemplars. It's also a robust, composable primitive that works seamlessly with MLLMs. For example, SAM 3 + Gemini 2.5 Pro achieves zero-shot SOTA on RefCOCO+/RefCOCOg referring expression benchmarks. Wondering if this wasn't sufficiently clear from the paper (our bad if so). But also wondering why only referring expression is benchmarked and not any of the benchmarks we reported on—the ones SAM 3 is actually designed for?

0

2

100

Lucas Beyer (bl16)@giffmana·Nov 26

@CoolMFcat @vikhyatk (but I give it a reasonable chance that vik DID do the right thing, because he has seen me tweet about this in the past!)

1

0

8

262

vik@vikhyatk·Nov 26

SOTA referral segmentation ✅

moondream@moondreamai

We’re introducing Segmentation. SVG masks from prompt, points, or box. SOTA on benchmarks. moondream.ai/skills/segment…

14

9

170

31.6K

Chaitanya (Chay) Ryali retweeted

Dilum Sanjaya@DilumSanjaya·Dec 13

Found the perfect sport to stress test Meta's SAM3 person segmentation. Dense crowds, extreme motion, zero structure. This is as tough as it gets, and SAM3 nailed it.

45

130

2K

157.5K

Chaitanya (Chay) Ryali retweeted

Dilum Sanjaya@DilumSanjaya·Dec 11

Tested Meta's SAM 3 on some low quality dashcam footage and expected the segmentation to fall apart, but it still picked up every vehicle and even spotted people on the roadside that I hadn't noticed at all.

42

119

1.7K

220.6K

Chaitanya (Chay) Ryali retweeted

AA@measure_plan·Dec 10

typing "player with a red shirt" was all it took to train this computer vision model. just a few minutes to train the model, export to python, and create this video. i'll keep testing roboflow rapid and will report back with the results. in the meantime, enjoy the magic of Zlatan

SkalskiP@skalskip92

data labeling is dead. long live distillation. from data to object detection endpoint in 90 seconds. link: rapid.roboflow.com

12

33

491

239K

Chaitanya (Chay) Ryali retweeted

Kyle Walker@kyle_e_walker·Dec 8

The new SAM3 model from @Meta is blowing my mind. Shown here: detecting putting greens, pools, and cars in Scottsdale from simple text prompts via @Mapbox imagery. R, Shiny, mapgl for the UI; Python backend via @giswqs's segment-geospatial package (thanks Qiusheng!)

15

62

606

43.8K

Chaitanya (Chay) Ryali@wrong_whp·Dec 3

"laundry on the bed", "left light", "right couch" - are not "basic attributes", they are relationships. Easy way to see this: if you masked out other objects, e.g. the bed or one of the couches or lights, the referring phrase becomes incorrect for the target. On the subset that does look like PCS, it can be even better than GT as shown. Not sure what's not to buy. "Ref"COCO is a referring expression benchmark. A different task. Would you train on COCO and deploy on a RefCOCO-like task?

0

1

4

186

Chaitanya (Chay) Ryali retweeted

Qiusheng Wu@giswqs·Dec 3

🚀 Video Segmentation and Object Tracking with SAM 3! Learn how to segment and track objects in any video using text and point prompts with Meta’s powerful SAM 3 (Segment Anything Model 3)! Whether you're removing unwanted objects or adding new ones, this tutorial walks you through everything from start to finish.
✅ What You’ll Learn:
How to use text prompts for object segmentation
Use point-based prompts to add or remove objects
Easily track any object across multiple video frames
Real-world examples using SamGeo
📌 Useful Resources:
🔗 GitHub Repository (SamGeo): github.com/opengeos/segme…
🔗 Notebook Example: samgeo.gishub.org/examples/sam3_…
🔗 Meta SAM 3 Overview: ai.meta.com/sam3
📺 Check out the full video tutorial at youtube.com/@giswqs/videos
#SAM3 #GeoAI #Geospatial #OpenSource #Python #DataScience

9

144

946

52.7K

Ethan Reid@EthanReidMorro·Dec 3

Really impressed with SAM3, but having trouble buying the PCS argument. What would you call the RefCOCO benchmark: PCS or referral segmentation? RefCOCO uses simple noun phrases (general categories + basic attributes) but it is not called a PCS benchmark. These samples are from the RefCOCO train set:

2

0

1

156

moondream@moondreamai·Dec 3

Moondream’s new segmentation just dropped. Prompt: “dirty laundry items on the bed.” Moondream: pixel-perfect + actually understands the scene. SAM 3: grabs the floor.

28

92

1.2K

60K

Chaitanya (Chay) Ryali retweeted

Qiusheng Wu@giswqs·Dec 2

🌍 Unlock powerful GeoAI workflows with SAM 3! In this step-by-step tutorial, I demonstrate how to segment remote sensing imagery using text prompts and bounding boxes, powered by Meta’s SAM 3 (Segment Anything Model 3). You’ll learn how to run image segmentation on satellite and aerial imagery, extract objects of interest, and export the results to geospatial formats like GeoTIFF for further GIS or Python analysis.
🔗 GitHub Repository (SamGeo): github.com/opengeos/segme…
🔗 Notebook Example: samgeo.gishub.org/examples/sam3_…
👉 Check out the full video tutorial at youtube.com/@giswqs/videos
#SAM3 #GeoAI #Geospatial #OpenSource #Python #DataScience

5

44

293

11.7K
