Jona Ruthardt

18 posts

@jonaruthardt

AI Research | PhD student @ Fundamental AI Lab | Vision-Language Modelling and Multi-Modal Learning

Nuremberg · Joined December 2009
102 Following · 37 Followers
Pinned Tweet
Jona Ruthardt @jonaruthardt
Today, I had the chance to present our latest work on 𝗦𝘁𝗲𝗲𝗿𝗮𝗯𝗹𝗲 𝗩𝗶𝘀𝘂𝗮𝗹 𝗥𝗲𝗽𝗿𝗲𝘀𝗲𝗻𝘁𝗮𝘁𝗶𝗼𝗻𝘀 at @SonyAI_global in their Tokyo office. This wraps up our @FunAILab visit to Japan. Grateful for the engaging discussions and new connections along the way.
0
0
1
59
Jona Ruthardt @jonaruthardt
@nikparth1 @y_m_asano Indeed, using SteerViT as the vision encoder of an MLLM is a promising direction (esp. from an efficiency angle). To your 2nd point: steerability improves when scaling MLLMs (e.g., 88.2 on CORE for InternVL3.5-8B). But representation quality remains below that of vision-centric encoders.
0
0
3
39
Nikhil Parthasarathy @nikparth1
@y_m_asano Very cool work! I've always thought something similar about how visual representations should be conditioned on higher-level signals. Does this kind of steering still help when the encoder is used in an MLLM? Relatedly, do larger MLLMs (e.g., >2B) end up closing the gap with scale?
2
0
1
63
Yuki @y_m_asano
Humans understand images differently if you tell them what to look for (see image). But generic visual representations are... generic, and mostly focus on salient objects. To tackle this, we thought hard and now introduce Steerable Visual Representations. See the thread from @gaur_manu below! 👇
Manu Gaur @gaur_manu

Pretrained ViTs like DINOv2 or CLIP are great, but they produce fixed, generic representations that encode the most salient visual concepts (e.g., "cat"). In human vision, prior priming with language changes how people parse an image. We believe visual encoders should do the same 🚨 Introducing Steerable Visual Representations, a new family of visual features you can steer with text towards specific visual concepts.

4
16
198
16.2K
Jona Ruthardt @jonaruthardt
@_rabiulawal @gaur_manu Thanks for the pointer! We ran your slot-attention model on our benchmarks too. On CORE, it definitely shows steerability (59.0 vs. 40.6 for DINOv2-S) but doesn't quite reach SteerViT at 93.6. Also, linear probing accuracy drops by 17.3 (vs. 9.1 for SteerViT) compared to vanilla DINOv2.
1
0
0
55
Manu Gaur @gaur_manu
Pretrained ViTs like DINOv2 or CLIP are great, but they produce fixed, generic representations that encode the most salient visual concepts (e.g., "cat"). In human vision, prior priming with language changes how people parse an image. We believe visual encoders should do the same 🚨 Introducing Steerable Visual Representations, a new family of visual features you can steer with text towards specific visual concepts.
13
135
899
148.7K
Jona Ruthardt @jonaruthardt
@qu3tzalify @giffmana @gaur_manu But it goes further: SteerViT also steers the dense representations. Consider this example where specifying a certain person leads to clear separation from other people in the PCA feature visualization. This helps semantic discrimination in downstream tasks (e.g., segmentation).
1
0
1
51
Jona Ruthardt @jonaruthardt
@qu3tzalify @giffmana @gaur_manu True, DINO patch features also encode non-salient parts. But if your task requires one object-level embedding, you'd have to build a pooling pipeline (e.g., with SAM) to determine which local embeddings to consider. SteerViT does it implicitly within the vision encoder itself.
1
0
1
37
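The "pooling pipeline" mentioned in the reply above can be pictured with a minimal sketch: given per-patch embeddings and a binary object mask (which in practice might come from a segmenter such as SAM), average only the selected patches. The function name, the toy embeddings, and the mask are all hypothetical and chosen for illustration; this is not code from the paper.

```python
def masked_mean_pool(patch_embeddings, mask):
    """Average only the patch embeddings whose mask entry is truthy."""
    selected = [emb for emb, keep in zip(patch_embeddings, mask) if keep]
    if not selected:
        raise ValueError("mask selects no patches")
    dim = len(selected[0])
    return [sum(emb[d] for emb in selected) / len(selected) for d in range(dim)]


# Toy data (hypothetical): four patches with 2-d embeddings.
patch_embeddings = [[1.0, 0.0], [3.0, 2.0], [10.0, 10.0], [0.0, 4.0]]
object_mask = [1, 1, 0, 0]  # assumed to come from an external segmenter

# Object-level embedding: mean over the masked patches only.
object_embedding = masked_mean_pool(patch_embeddings, object_mask)
# Generic embedding: mean over every patch, regardless of the object.
global_embedding = masked_mean_pool(patch_embeddings, [1, 1, 1, 1])
```

The point of the reply is that this external mask-then-pool step becomes unnecessary when the encoder itself is steered by text.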
Jona Ruthardt @jonaruthardt
@MHR7DYN @gaur_manu The SteerViT variant used for most experiments in the paper builds directly on top of a frozen DINOv2. Therefore, comparing these two models best shows the capabilities of our method. But as SteerViT is backbone-agnostic, it is straightforward to train a variant for DINOv3 too.
0
0
2
54
Mahir Daiyan @MHR7DYN
@gaur_manu Hey, why not compare with the attention maps from DINOv3?
1
0
0
566
Ash_Ella @lsyuan0322
@gaur_manu Nice work! Can it distinguish between the cat's left and right ears?
1
0
0
159
Jona Ruthardt retweeted
DailyPapers @HuggingPapers
Steerable Visual Representations SteerViT lets you control Vision Transformers with natural language. By injecting text directly into the encoder via lightweight cross-attention, you can steer attention toward any object while preserving representation quality.
4
30
204
23.9K
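As a rough picture of what "injecting text via lightweight cross-attention" could look like, here is a minimal pure-Python sketch: image patch tokens act as queries, text tokens act as keys and values, and the result is added back to the patch tokens. The tanh gate, the identity projections, and all names and numbers are assumptions made for illustration; the actual SteerViT layer may be designed differently.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(patches, text_tokens, gate):
    """Residual cross-attention with patches as queries, text as keys/values.

    Identity Q/K/V projections are used to keep the sketch short (assumption).
    With gate = 0.0 the patch tokens pass through unchanged.
    """
    dim = len(patches[0])
    scale = 1.0 / math.sqrt(dim)
    steered = []
    for patch in patches:
        scores = [scale * sum(p * t for p, t in zip(patch, tok)) for tok in text_tokens]
        weights = softmax(scores)
        context = [sum(w * tok[d] for w, tok in zip(weights, text_tokens)) for d in range(dim)]
        steered.append([patch[d] + math.tanh(gate) * context[d] for d in range(dim)])
    return steered


# Toy tokens (hypothetical): three 2-d patch tokens, two 2-d text tokens.
patches = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
text_tokens = [[2.0, 0.0], [0.0, 2.0]]

unsteered = cross_attend(patches, text_tokens, gate=0.0)
steered = cross_attend(patches, text_tokens, gate=1.0)
```

In this sketch a closed gate reproduces the original patch tokens exactly, which is one simple way a text-conditioned layer can sit on top of a frozen backbone without disturbing its features at initialization.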
Jona Ruthardt @jonaruthardt
Ever asked yourself: 𝙃𝙤𝙬 𝙙𝙤 𝙩𝙚𝙭𝙩-𝙤𝙣𝙡𝙮 𝙇𝙇𝙈𝙨 "𝙨𝙚𝙚" 𝙩𝙝𝙚 𝙫𝙞𝙨𝙪𝙖𝙡 𝙬𝙤𝙧𝙡𝙙? Our new TMLR paper finds that language capability predicts visual alignment of LLM features (r=0.768). This means: LLM progress boosts vision(-language) models as well. 🧵⬇️
1
1
11
2.3K