Aadarsh Sahoo

264 posts

Aadarsh Sahoo
@SahooAadarsh

Human, from Earth.

Los Angeles, CA · Joined December 2018
2.4K Following · 380 Followers
Pinned Tweet
Aadarsh Sahoo @SahooAadarsh ·
Perception is actionable. Humans don't just see objects, we see affordances and constraints. "Something to sit on." "Region unsafe to walk." "Something that will tip if I bump it." But today’s vision models mostly see… labels. So we built ConverSeg: Conversational Image Segmentation 🧵 glab-caltech.github.io/converseg/
7 replies · 21 retweets · 95 likes · 12.6K views
Aadarsh Sahoo retweeted
Pulkit Agrawal @pulkitology ·
Until now, robotics stopped where the human hand begins. Strong, but not delicate. Precise, but not adaptive. Repetitive, but not creative. The human hand wasn’t a benchmark — it was a boundary. We are crossing it. @EkaRobotics. Coming soon.
24 replies · 38 retweets · 492 likes · 70.9K views
Aadarsh Sahoo retweeted
Shangbang Long @ShangbangLong ·
🚀 Excited to announce Vision Banana 🍌 and our new paper: “Image Generators are Generalist Vision Learners”. We turn Nano Banana Pro into a state-of-the-art visual generation and understanding model. 🖼️ Check out our gallery at vision-banana.github.io 🧵 (1/N) continue ⬇️
21 replies · 71 retweets · 429 likes · 59K views
Aadarsh Sahoo retweeted
Jitendra MALIK @JitendraMalikCV ·
With Emmanuel Dupoux scp.net/persons/dupoux/ and Yann LeCun @ylecun, we consider a cognitive-science-inspired AI. We analyse how autonomous learning works in living organisms, and propose a roadmap for reproducing it in artificial systems. lnkd.in/eNWDmuqT
9 replies · 77 retweets · 449 likes · 64.4K views
Aadarsh Sahoo retweeted
Ziqi Ma @ziqi__ma ·
Today’s video world models “simulate” the world by generating pixel frame observations🖼️. Can they continue to simulate the world when observations are interrupted, such as by occlusion, illumination dimming, or camera lookaway? To probe this question, we release STEVO-Bench, which holistically evaluates whether image-/text-to-video models and camera-controlled video models can correctly evolve states under observation control. Check out our website, blog and paper for how they fail!
5 replies · 14 retweets · 87 likes · 8.6K views
Aadarsh Sahoo retweeted
Wildminder @wildmindai ·
CIS: SAM2 + Qwen2.5-VL to segment by physics, safety, and affordance. Good complex reasoning; understands prompts like "breakable items on the table" or "furniture blocking the walkway." glab-caltech.github.io/converseg/
2 replies · 13 retweets · 80 likes · 4.2K views
angela @af_gao ·
@SahooAadarsh this is awesome! really great work aadarsh :)
1 reply · 0 retweets · 1 like · 155 views
Aadarsh Sahoo @SahooAadarsh ·
We also introduce: ConverSeg-Net, a single-pass conversational segmentation model baseline. It combines strong segmentation priors (SAM2) with visual-language reasoning (Qwen2.5-VL-3B) via lightweight adapters, and stays competitive while handling these "beyond labels" prompts.
[media attached]
1 reply · 0 retweets · 2 likes · 293 views
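For readers curious what "lightweight adapters" over frozen backbones could look like in practice, here is a minimal PyTorch sketch. Every module, name, and dimension below is a hypothetical stand-in for illustration, not the actual ConverSeg-Net code; the real fusion design, feature shapes, and mask decoder are in the paper and project page linked above.

```python
# Hedged sketch: single-pass conversational segmentation with lightweight adapters.
# All modules are illustrative stand-ins, NOT the released ConverSeg-Net implementation.
import torch
import torch.nn as nn

class FrozenVisionBackbone(nn.Module):
    """Stand-in for a frozen SAM2-style image encoder (kept fixed during training)."""
    def __init__(self, dim=256):
        super().__init__()
        self.conv = nn.Conv2d(3, dim, kernel_size=16, stride=16)  # 16x16 patches
    def forward(self, images):                    # (B, 3, H, W)
        feats = self.conv(images)                 # (B, C, H/16, W/16)
        return feats.flatten(2).transpose(1, 2)   # (B, N_patches, C)

class FrozenLanguageModel(nn.Module):
    """Stand-in for a frozen Qwen2.5-VL-style encoder producing prompt embeddings."""
    def __init__(self, vocab=32000, dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
    def forward(self, token_ids):                 # (B, T)
        return self.embed(token_ids)              # (B, T, D)

class Adapter(nn.Module):
    """The only trainable part: project text into the vision space and cross-attend."""
    def __init__(self, txt_dim=512, vis_dim=256, heads=8):
        super().__init__()
        self.proj = nn.Linear(txt_dim, vis_dim)
        self.xattn = nn.MultiheadAttention(vis_dim, heads, batch_first=True)
        self.mask_head = nn.Linear(vis_dim, 1)
    def forward(self, vis_tokens, txt_tokens):
        kv = self.proj(txt_tokens)                 # keys/values: prompt tokens
        fused, _ = self.xattn(vis_tokens, kv, kv)  # queries: image patches
        return self.mask_head(fused).squeeze(-1)   # (B, N_patches) mask logits

# One forward pass: image + free-form prompt -> per-patch mask logits.
vision, language, adapter = FrozenVisionBackbone(), FrozenLanguageModel(), Adapter()
images = torch.randn(1, 3, 224, 224)
prompt = torch.randint(0, 32000, (1, 12))          # e.g. "breakable items on the table"
logits = adapter(vision(images), language(prompt))
print(logits.shape)                                # torch.Size([1, 196])
```

The design intuition this sketch tries to capture: both backbones stay frozen, so the adapter is cheap to train, yet the cross-attention lets an open-ended language prompt select which image regions to segment in a single forward pass.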
Aadarsh Sahoo retweeted
Damiano Marsili @marsilidamiano ·
Our paper, VALOR, got accepted at #ICLR2026! We explore improving visual reasoning using multimodal verifiers - all without any ground truth annotations! More details below 👇 Excited to see everyone in Rio!
Damiano Marsili @marsilidamiano
(1/N): Can we improve visual reasoning models without annotations? In VALOR, we introduce an annotation-free training framework that boosts both visual reasoning and object grounding by training with multimodal verifiers instead of human labels
2 replies · 6 retweets · 29 likes · 4.8K views
Aadarsh Sahoo retweeted
Raphi Kang @RaphiKang ·
🤓 How do LVLMs/LMMs reason about space and time? This was the central question of our #ICLR2026 paper, “Linear Mechanisms For Spatiotemporal Reasoning In Vision Language Models”. I’m very excited to finally share it :D 🥳🥳 A thread: [1/7]
[media attached]
2 replies · 12 retweets · 63 likes · 3.5K views
Aadarsh Sahoo retweeted
Damiano Marsili @marsilidamiano ·
(1/6) Do these images show the same vacuum cleaner? They are certainly similar, but a human will notice the differences in dustbin geometry, design, and color accents. In contrast, open-source VLMs struggle at this task. Our recent work TWIN poses the question: Can we fix this?
[media attached]
1 reply · 9 retweets · 24 likes · 9.2K views
Aadarsh Sahoo retweeted
Ziqi Ma @ziqi__ma ·
Generative models shouldn’t just generate. They should be steerable by your commands. Meet Steer3D🕹️: edit generated 3D assets with text📝 in one forward pass. Trained on only 100k synthetic data, it shows that we can make generative models responsive to signals from another modality🎛️. Check out: glab-caltech.github.io/steer3d/
8 replies · 55 retweets · 403 likes · 32.6K views
Aadarsh Sahoo retweeted
Damiano Marsili @marsilidamiano ·
(1/N): Can we improve visual reasoning models without annotations? In VALOR, we introduce an annotation-free training framework that boosts both visual reasoning and object grounding by training with multimodal verifiers instead of human labels
4 replies · 8 retweets · 73 likes · 9.1K views
Aadarsh Sahoo retweeted
Yisong Yue @yisongyue ·
My student @SaberaTalukder and I are creating a new startup that deeply rethinks how we architect and engage with multimodal models. 🚀 We are chatting with investors at #NeurIPS2025, and if you want to get on our radar, DM Sabera.
[media attached]
17 replies · 11 retweets · 207 likes · 75.2K views
Aadarsh Sahoo retweeted
Kate Saenko @kate_saenko_ ·
🚀 Excited to share that my team at Meta just launched Segment Anything 3! SAM 3 doubles the performance of existing models on open-vocabulary instance segmentation on our new SA-Co benchmark, with 207K unique object labels. Huge congrats to the team, so proud of this work!
AI at Meta @AIatMeta
Today we’re excited to unveil a new generation of Segment Anything Models:
1️⃣ SAM 3 enables detecting, segmenting and tracking of objects across images and videos, now with short text phrases and exemplar prompts. 🔗 Learn more about SAM 3: go.meta.me/591040
2️⃣ SAM 3D brings the model collection into the 3rd dimension to enable precise reconstruction of 3D objects and people from a single 2D image. 🔗 Learn more about SAM 3D: go.meta.me/305985
These models offer innovative capabilities and unique tools for developers and researchers to create, experiment and uplevel media workflows.
4 replies · 9 retweets · 92 likes · 9.2K views