Aadarsh Sahoo

264 posts

Aadarsh Sahoo
@SahooAadarsh

Human, from Earth.

Los Angeles, CA · Joined December 2018
2.4K Following · 380 Followers
Pinned Tweet
Aadarsh Sahoo @SahooAadarsh ·
Perception is actionable. Humans don't just see objects, we see affordances and constraints. "Something to sit on." "Region unsafe to walk." "Something that will tip if I bump it." But today’s vision models mostly see… labels. So we built ConverSeg: Conversational Image Segmentation 🧵 glab-caltech.github.io/converseg/
7 replies · 21 retweets · 95 likes · 12.6K views
Aadarsh Sahoo retweeted
Pulkit Agrawal @pulkitology ·
Until now, robotics stopped where the human hand begins. Strong, but not delicate. Precise, but not adaptive. Repetitive, but not creative. The human hand wasn’t a benchmark — it was a boundary. We are crossing it. @EkaRobotics. Coming soon.
24 replies · 38 retweets · 492 likes · 70.9K views
Aadarsh Sahoo retweeted
Shangbang Long @ShangbangLong ·
🚀 Excited to announce Vision Banana 🍌 and our new paper: “Image Generators are Generalist Vision Learners”. We turn Nano Banana Pro into a state-of-the-art visual generation and understanding model. 🖼️ Check out our gallery at vision-banana.github.io 🧵 (1/N) continue ⬇️
21 replies · 71 retweets · 429 likes · 59K views
Aadarsh Sahoo retweeted
Jitendra MALIK @JitendraMalikCV ·
With Emmanuel Dupoux scp.net/persons/dupoux/ and Yann LeCun @ylecun, we consider a cognitive-science-inspired AI. We analyse how autonomous learning works in living organisms, and propose a roadmap for reproducing it in artificial systems. lnkd.in/eNWDmuqT
9 replies · 77 retweets · 449 likes · 64.4K views
Aadarsh Sahoo retweeted
Ziqi Ma @ziqi__ma ·
Today’s video world models “simulate” the world by generating pixel frame observations🖼️. Can they continue to simulate the world when observations are interrupted, such as by occlusion, illumination dimming, or camera lookaway? To probe this question, we release STEVO-Bench, which holistically evaluates whether image-/text-to-video models and camera-controlled video models can correctly evolve states under observation control. Check out our website, blog and paper for how they fail!
5 replies · 14 retweets · 87 likes · 8.6K views
Aadarsh Sahoo retweeted
Wildminder @wildmindai ·
CIS: SAM2 + Qwen2.5-VL to segment by physics, safety, and affordance. Good complex reasoning; understands prompts like "breakable items on the table" or "furniture blocking the walkway." glab-caltech.github.io/converseg/
2 replies · 13 retweets · 80 likes · 4.2K views
angela @af_gao ·
@SahooAadarsh this is awesome! really great work aadarsh :)
1 reply · 0 retweets · 1 like · 155 views
Aadarsh Sahoo @SahooAadarsh ·
We also introduce: ConverSeg-Net, a single-pass conversational segmentation model baseline. It combines strong segmentation priors (SAM2) with visual-language reasoning (Qwen2.5-VL-3B) via lightweight adapters, and stays competitive while handling these "beyond labels" prompts.
[media attached]
1 reply · 0 retweets · 2 likes · 293 views
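For readers curious what "lightweight adapters" over frozen backbones could look like in practice, here is a minimal PyTorch sketch. Every module, name, and dimension below is a hypothetical stand-in for illustration, not the actual ConverSeg-Net code; the real fusion design, feature shapes, and mask decoder are in the paper and project page linked above.

```python
# Hedged sketch: single-pass conversational segmentation with lightweight adapters.
# All modules are illustrative stand-ins, NOT the released ConverSeg-Net implementation.
import torch
import torch.nn as nn

class FrozenVisionBackbone(nn.Module):
    """Stand-in for a frozen SAM2-style image encoder (kept fixed during training)."""
    def __init__(self, dim=256):
        super().__init__()
        self.conv = nn.Conv2d(3, dim, kernel_size=16, stride=16)  # 16x16 patches
    def forward(self, images):                    # (B, 3, H, W)
        feats = self.conv(images)                 # (B, C, H/16, W/16)
        return feats.flatten(2).transpose(1, 2)   # (B, N_patches, C)

class FrozenLanguageModel(nn.Module):
    """Stand-in for a frozen Qwen2.5-VL-style encoder producing prompt embeddings."""
    def __init__(self, vocab=32000, dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
    def forward(self, token_ids):                 # (B, T)
        return self.embed(token_ids)              # (B, T, D)

class Adapter(nn.Module):
    """The only trainable part: project text into the vision space and cross-attend."""
    def __init__(self, txt_dim=512, vis_dim=256, heads=8):
        super().__init__()
        self.proj = nn.Linear(txt_dim, vis_dim)
        self.xattn = nn.MultiheadAttention(vis_dim, heads, batch_first=True)
        self.mask_head = nn.Linear(vis_dim, 1)
    def forward(self, vis_tokens, txt_tokens):
        kv = self.proj(txt_tokens)                 # keys/values: prompt tokens
        fused, _ = self.xattn(vis_tokens, kv, kv)  # queries: image patches
        return self.mask_head(fused).squeeze(-1)   # (B, N_patches) mask logits

# One forward pass: image + free-form prompt -> per-patch mask logits.
vision, language, adapter = FrozenVisionBackbone(), FrozenLanguageModel(), Adapter()
images = torch.randn(1, 3, 224, 224)
prompt = torch.randint(0, 32000, (1, 12))          # e.g. "breakable items on the table"
logits = adapter(vision(images), language(prompt))
print(logits.shape)                                # torch.Size([1, 196])
```

The design intuition this sketch tries to capture: both backbones stay frozen, so the adapter is cheap to train, yet the cross-attention lets an open-ended language prompt select which image regions to segment in a single forward pass.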
Aadarsh Sahoo retweeted
Damiano Marsili @marsilidamiano ·
Our paper, VALOR, got accepted at #ICLR2026! We explore improving visual reasoning using multimodal verifiers - all without any ground truth annotations! More details below 👇 Excited to see everyone in Rio!
Damiano Marsili @marsilidamiano
(1/N): Can we improve visual reasoning models without annotations? In VALOR, we introduce an annotation-free training framework that boosts both visual reasoning and object grounding by training with multimodal verifiers instead of human labels
2 replies · 6 retweets · 29 likes · 4.8K views
Aadarsh Sahoo retweeted
Raphi Kang @RaphiKang ·
🤓 How do LVLMs/LMMs reason about space and time? This was the central question of our #ICLR2026 paper, “Linear Mechanisms For Spatiotemporal Reasoning In Vision Language Models”. I’m very excited to finally share it :D 🥳🥳 A thread: [1/7]
[media attached]
2 replies · 12 retweets · 63 likes · 3.5K views
Aadarsh Sahoo retweeted
Damiano Marsili @marsilidamiano ·
(1/6) Do these images show the same vacuum cleaner? They are certainly similar, but a human will notice the differences in dustbin geometry, design, and color accents. In contrast, open-source VLMs struggle at this task. Our recent work TWIN poses the question: Can we fix this?
[media attached]
1 reply · 9 retweets · 24 likes · 9.2K views
Aadarsh Sahoo retweeted
Ziqi Ma @ziqi__ma ·
Generative models shouldn’t just generate. They should be steerable by your commands. Meet Steer3D🕹️: edit generated 3D assets with text📝 in one forward pass. Trained on only 100k synthetic data, it shows that we can make generative models responsive to signals from another modality🎛️. Check out: glab-caltech.github.io/steer3d/
8 replies · 55 retweets · 403 likes · 32.6K views
Aadarsh Sahoo retweeted
Damiano Marsili @marsilidamiano ·
(1/N): Can we improve visual reasoning models without annotations? In VALOR, we introduce an annotation-free training framework that boosts both visual reasoning and object grounding by training with multimodal verifiers instead of human labels
4 replies · 8 retweets · 73 likes · 9.1K views
Aadarsh Sahoo retweeted
Yisong Yue @yisongyue ·
My student @SaberaTalukder and I are creating a new startup that deeply rethinks how we architect and engage with multimodal models. 🚀 We are chatting with investors at #NeurIPS2025, and if you want to get on our radar, DM Sabera.
[media attached]
17 replies · 11 retweets · 207 likes · 75.2K views
Aadarsh Sahoo retweeted
Kate Saenko @kate_saenko_ ·
🚀 Excited to share that my team at Meta just launched Segment Anything 3! SAM 3 doubles the performance of existing models on open-vocabulary instance segmentation on our new SA-Co benchmark, with 207K unique object labels. Huge congrats to the team, so proud of this work!
AI at Meta @AIatMeta
Today we’re excited to unveil a new generation of Segment Anything Models:
1️⃣ SAM 3 enables detecting, segmenting and tracking of objects across images and videos, now with short text phrases and exemplar prompts. 🔗 Learn more about SAM 3: go.meta.me/591040
2️⃣ SAM 3D brings the model collection into the 3rd dimension to enable precise reconstruction of 3D objects and people from a single 2D image. 🔗 Learn more about SAM 3D: go.meta.me/305985
These models offer innovative capabilities and unique tools for developers and researchers to create, experiment and uplevel media workflows.
4 replies · 9 retweets · 92 likes · 9.2K views