Jona Ruthardt (@jonaruthardt) - Twitter Profile

Pinned Tweet

Jona Ruthardt@jonaruthardt·11 Apr

What if you could 𝘵𝘦𝘭𝘭 vision encoders 𝘸𝘩𝘢𝘵 to encode? Check out our latest work where we introduce 𝗦𝘁𝗲𝗲𝗿𝗮𝗯𝗹𝗲 𝗩𝗶𝘀𝘂𝗮𝗹 𝗥𝗲𝗽𝗿𝗲𝘀𝗲𝗻𝘁𝗮𝘁𝗶𝗼𝗻𝘀.

Manu Gaur@gaur_manu

Pretrained ViTs like DINOv2 or CLIP are great, but they produce fixed, generic representations that encode the most salient visual concepts (e.g., "cat"). In human vision, prior priming with language changes how people parse an image. We believe visual encoders should do the same 🚨 Introducing Steerable Visual Representations, a new family of visual features you can steer with text towards specific visual concepts.

Jona Ruthardt@jonaruthardt·14 Apr

Today, I had the chance to present our latest work on 𝗦𝘁𝗲𝗲𝗿𝗮𝗯𝗹𝗲 𝗩𝗶𝘀𝘂𝗮𝗹 𝗥𝗲𝗽𝗿𝗲𝘀𝗲𝗻𝘁𝗮𝘁𝗶𝗼𝗻𝘀 at @SonyAI_global in their Tokyo office. This wraps up our @FunAILab visit to Japan. Grateful for the engaging discussions and new connections along the way.

Jona Ruthardt@jonaruthardt·13 Apr

@nikparth1 @y_m_asano Indeed, using SteerViT as the vision encoder of an MLLM is a promising direction (especially from an efficiency angle). To your second point: steerability improves when scaling MLLMs (e.g., 88.2 on CORE for InternVL3.5-8B), but representation quality remains below that of vision-centric encoders.

Nikhil Parthasarathy@nikparth1·12 Apr

@y_m_asano Very cool work! I've always thought something similar about how visual representations should be conditioned on higher-level signals. Does this kind of steering still help when the encoder is used in an MLLM? Relatedly, do larger MLLMs (e.g., >2B) end up closing the gap with scale?

Yuki@y_m_asano·10 Apr

Humans understand images differently if you tell them what to look for (see image). But generic visual representations are... generic, and mostly focus on salient objects. To tackle this, we thought hard and now introduce Steerable Visual Representations. See the thread from @gaur_manu below! 👇

Jona Ruthardt@jonaruthardt·13 Apr

@_rabiulawal @gaur_manu Thanks for the pointer! We ran your slot-attention model on our benchmarks too. On CORE, it definitely shows steerability (59.0 vs. 40.6 for DINOv2-S) but doesn't quite reach SteerViT at 93.6. Also, its linear probing accuracy drops by 17.3 compared to vanilla DINOv2, vs. a drop of 9.1 for SteerViT.

Rabiul Awal@_rabiulawal·10 Apr

@gaur_manu Nice work, congrats! We've explored this before with slot attention in our CVPR'25 paper: arxiv.org/abs/2503.21747

Manu Gaur@gaur_manu·10 Apr

Pretrained ViTs like DINOv2 or CLIP are great, but they produce fixed, generic representations that encode the most salient visual concepts (e.g., "cat"). In human vision, prior priming with language changes how people parse an image. We believe visual encoders should do the same 🚨 Introducing Steerable Visual Representations, a new family of visual features you can steer with text towards specific visual concepts.

Jona Ruthardt@jonaruthardt·13 Apr

@qu3tzalify @giffmana @gaur_manu But it goes further: SteerViT also steers the dense representations. Consider this example where specifying a certain person leads to clear separation from other people in the PCA feature visualization. This helps semantic discrimination in downstream tasks (e.g., segmentation).
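For readers unfamiliar with the PCA feature visualization mentioned in this reply, here is a minimal sketch of the general technique, not the authors' actual plotting code. It assumes the dense features are available as an (N, D) array of patch embeddings; the function name `pca_rgb`, the 14x14 patch grid, and the random features are all hypothetical stand-ins.

```python
import numpy as np

def pca_rgb(features, n_components=3):
    """Project (N, D) patch features onto their top principal components
    and rescale each component to [0, 1] so it can be shown as a colour."""
    centered = features - features.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions, sorted by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt[:n_components].T
    lo = projected.min(axis=0, keepdims=True)
    hi = projected.max(axis=0, keepdims=True)
    return (projected - lo) / np.maximum(hi - lo, 1e-12)

# Hypothetical stand-in for real encoder output: a 14x14 grid of 64-dim patches.
rng = np.random.default_rng(0)
features = rng.normal(size=(14 * 14, 64))
rgb = pca_rgb(features).reshape(14, 14, 3)  # one RGB colour per patch
```

With real features, patches that the encoder treats as semantically similar end up with similar colours, which is what makes a separated person visible in such a plot.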

Jona Ruthardt@jonaruthardt·13 Apr

@qu3tzalify @giffmana @gaur_manu True, DINO patch features also encode non-salient parts. But if your task requires one object-level embedding, you'd have to build a pooling pipeline (e.g., with SAM) to determine which local embeddings to consider. SteerViT does it implicitly within the vision encoder itself.
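The external pooling pipeline described in this reply could look roughly like the sketch below: average only the patch embeddings that fall inside an object mask. This is an illustration under assumptions, not code from the paper. The mask would come from a segmenter such as SAM in practice; here it is hand-written, and `masked_average_pool` is a hypothetical helper name.

```python
import numpy as np

def masked_average_pool(patch_embeddings, mask):
    """Average the (H, W, D) patch embeddings selected by a boolean (H, W) mask."""
    weights = mask.astype(patch_embeddings.dtype)
    total = weights.sum()
    if total == 0:
        raise ValueError("mask selects no patches")
    return (patch_embeddings * weights[..., None]).sum(axis=(0, 1)) / total

# Toy 2x2 patch grid with 3-dim embeddings (values are placeholders).
patch_embeddings = np.arange(2 * 2 * 3, dtype=np.float64).reshape(2, 2, 3)
# Assumed object mask; a real pipeline would obtain this from a segmentation model.
mask = np.array([[True, False], [False, True]])
object_embedding = masked_average_pool(patch_embeddings, mask)
```

The point of the reply is that this extra mask-and-pool step is what a text-steered encoder is meant to make unnecessary.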

Jona Ruthardt@jonaruthardt·13 Apr

@MHR7DYN @gaur_manu The SteerViT variant used for most experiments in the paper builds directly on top of a frozen DINOv2. Therefore, comparing these two models best shows the capabilities of our method. But since SteerViT is backbone-agnostic, it would be easy to train a variant for DINOv3 too.

Mahir Daiyan@MHR7DYN·11 Apr

@gaur_manu Hey, why not compare with the attention maps from DINOv3?

Jona Ruthardt@jonaruthardt·13 Apr

@lsyuan0322 @gaur_manu Have a look at our interactive demo: huggingface.co/spaces/JonaRut… There's an example with a cat.

Ash_Ella@lsyuan0322·12 Apr

@gaur_manu Nice work! Can it distinguish between the cat's left and right ears?

Jona Ruthardt retweeted

Yuki@y_m_asano·5 Apr

Steerable Visual Representations. 👇 From @jonaruthardt, @gaur_manu, @RamananDeva, @MakarandTapaswi and me :). More info soon.

DailyPapers@HuggingPapers

Steerable Visual Representations SteerViT lets you control Vision Transformers with natural language. By injecting text directly into the encoder via lightweight cross-attention, you can steer attention toward any object while preserving representation quality.
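To make "injecting text directly into the encoder via lightweight cross-attention" concrete, here is a minimal single-head sketch in NumPy. It is an assumption-laden illustration, not the SteerViT architecture: the tanh-gated residual, the dimensions, the random weights, and the name `text_cross_attention` are all hypothetical choices made for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def text_cross_attention(patches, text, w_q, w_k, w_v, gate=0.0):
    """Patch tokens (N, D) attend to text tokens (T, D_t).
    The update is added as a gated residual (gating is an assumption here)."""
    q = patches @ w_q                      # (N, d) queries from image patches
    k = text @ w_k                         # (T, d) keys from text tokens
    v = text @ w_v                         # (T, D) values from text tokens
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)  # (N, T)
    return patches + np.tanh(gate) * (attn @ v)

rng = np.random.default_rng(0)
N, T, D, D_t, d = 16, 4, 32, 24, 8         # hypothetical sizes
patches = rng.normal(size=(N, D))          # stand-in for frozen ViT patch tokens
text = rng.normal(size=(T, D_t))           # stand-in for text encoder output
w_q = rng.normal(size=(D, d)) * 0.1
w_k = rng.normal(size=(D_t, d)) * 0.1
w_v = rng.normal(size=(D_t, D)) * 0.1

unsteered = text_cross_attention(patches, text, w_q, w_k, w_v, gate=0.0)
steered = text_cross_attention(patches, text, w_q, w_k, w_v, gate=1.0)
```

With the gate at zero the layer is an identity map, so the frozen backbone's features pass through unchanged; a design like this is one plausible way to add text conditioning while limiting damage to representation quality.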

Jona Ruthardt retweeted

DailyPapers@HuggingPapers·4 Apr

Steerable Visual Representations SteerViT lets you control Vision Transformers with natural language. By injecting text directly into the encoder via lightweight cross-attention, you can steer attention toward any object while preserving representation quality.

Jona Ruthardt@jonaruthardt·27 Jan

Curious to learn more? 📝 Paper: jonaruthardt.github.io/assets/pdf/Sha… 🌐 Project Page: jonaruthardt.github.io/project/ShareL… 💻 Code: github.com/JonaRuthardt/S…

Jona Ruthardt@jonaruthardt·27 Jan

Big thanks to @gjburghouts, @SergeBelongie, and @y_m_asano for the collaboration and guidance throughout this project.

Jona Ruthardt@jonaruthardt·27 Jan

Ever asked yourself: 𝙃𝙤𝙬 𝙙𝙤 𝙩𝙚𝙭𝙩-𝙤𝙣𝙡𝙮 𝙇𝙇𝙈𝙨 "𝙨𝙚𝙚" 𝙩𝙝𝙚 𝙫𝙞𝙨𝙪𝙖𝙡 𝙬𝙤𝙧𝙡𝙙? Our new TMLR paper finds that language capability predicts visual alignment of LLM features (r=0.768). This means: LLM progress boosts vision(-language) models as well. 🧵⬇️
