Raphi Kang

32 posts

@RaphiKang

Caltech PhD doing Computer Vision / Mechanistic Interpretability things. MIT '23

Joined October 2021
126 Following · 74 Followers
Pinned Tweet
Raphi Kang @RaphiKang
🤓 How do LVLMs/LMMs reason about space and time? This was the central question of our #ICLR2026 paper, “Linear Mechanisms For Spatiotemporal Reasoning In Vision Language Models”. I’m very excited to finally share it :D 🥳🥳 A thread: [1/7]
[image]
2 replies · 12 reposts · 63 likes · 3.5K views
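The thread frames spatiotemporal reasoning as a question about linear structure in a VLM's activations. A minimal sketch of that style of analysis, assuming synthetic stand-in hidden states rather than anything from the paper: a linear probe tests whether a spatial relation is linearly decodable from one layer's representations.

```python
# Hedged sketch: testing whether a spatial relation (e.g. "left of" vs.
# "right of") is linearly decodable from a VLM's hidden states. All data
# here is a synthetic placeholder, not the paper's actual setup.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Placeholder: hidden states at some layer for 200 prompts, hidden size 64.
# In practice these would be extracted from a real LVLM with forward hooks.
H = rng.normal(size=(200, 64))
# Binary labels: 1 if object A is left of object B in the image, else 0.
y = rng.integers(0, 2, size=200)
# Inject a linear signal along one direction so the probe has something to find.
direction = rng.normal(size=64)
H += np.outer(y - 0.5, direction)

probe = LogisticRegression(max_iter=1000).fit(H[:150], y[:150])
print("probe accuracy:", probe.score(H[150:], y[150:]))
# High held-out accuracy is evidence the relation is encoded (near-)linearly.
```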
Raphi Kang retweeted
David Bau @davidbau
NetHack is one of the most complex and longest-lived open source programs ever written, and after 46 years, v5.0 shipped today. nethack.org/common/index.h… And ... it is a VERY cool large codebase to work with in the LLM era.
[image]
18 replies · 194 reposts · 1K likes · 102K views
Raphi Kang retweeted
Aadarsh Sahoo @SahooAadarsh
Perception is actionable. Humans don't just see objects, we see affordances and constraints. "Something to sit on." "Region unsafe to walk." "Something that will tip if I bump it." But today’s vision models mostly see… labels. So we built ConverSeg: Conversational Image Segmentation 🧵 glab-caltech.github.io/converseg/
7 replies · 21 reposts · 95 likes · 12.6K views
Raphi Kang retweeted
Bahareh Tolooshams @BTolooshams
Call for Reviewers | FOMORep @ ICML 2026

We are organizing the first workshop on the Geometry of Foundation Model Representations (FOMORep, under submission) at ICML 2026. Fill out this form if you are interested in serving on the Program Committee, whose role would be to review submissions: forms.gle/9f5zCP564LuARu…

FOMORep focuses on understanding the geometry of representations learned by foundation models: specifically, what geometric structures these representations acquire, why they arise, and how they relate to performance and robustness. The workshop brings together researchers from representation learning, geometric machine learning, deep learning, and applied mathematics.

Thanks for supporting the growing community for understanding foundation models with a geometric lens.

Organizing Committee:
Guy Gilboa (Technion)
Raphi Kang (Caltech) @RaphiKang
Uri Shaham (Bar-Ilan University) @UXShaham
Yue Song (Caltech & Tsinghua University) @YueSong48287250
Bahareh Tolooshams (University of Alberta)
Yossi Levi (Technion)
0 replies · 2 reposts · 12 likes · 1.2K views
Raphi Kang retweeted
Damiano Marsili @marsilidamiano
Our paper, VALOR, got accepted at #ICLR2026! We explore improving visual reasoning using multimodal verifiers - all without any ground truth annotations! More details below 👇 Excited to see everyone in Rio!
Damiano Marsili @marsilidamiano

(1/N): Can we improve visual reasoning models without annotations? In VALOR, we introduce an annotation-free training framework that boosts both visual reasoning and object grounding by training with multimodal verifiers instead of human labels

2 replies · 6 reposts · 29 likes · 4.8K views
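The core idea, training against a multimodal verifier's score rather than human annotations, can be sketched generically. Everything below (`reasoner`, `verifier`, the reward loop) is a hypothetical stand-in, not VALOR's actual models or objective.

```python
# Hedged sketch of the general "train with a verifier instead of labels" loop.
# `reasoner` and `verifier` are placeholders; the paper's method differs.
from dataclasses import dataclass
import random

@dataclass
class Sample:
    image_id: str
    question: str

def reasoner(sample: Sample, temperature: float) -> str:
    # Placeholder policy: would be a vision-language model generating an answer.
    return random.choice(["two cats", "three cats"])

def verifier(sample: Sample, answer: str) -> float:
    # Placeholder multimodal verifier: scores answer consistency with the image
    # (e.g. via grounding checks), returning a reward in [0, 1]. No human label.
    return 1.0 if answer == "two cats" else 0.0

def collect_rewards(batch, k=4):
    # Sample k candidate answers per question and score each with the verifier;
    # the (answer, reward) pairs can then drive RL or rejection finetuning.
    out = []
    for s in batch:
        cands = [reasoner(s, temperature=1.0) for _ in range(k)]
        out.append([(a, verifier(s, a)) for a in cands])
    return out

print(collect_rewards([Sample("img_0", "How many cats?")]))
```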
Raphi Kang retweeted
vincent! @vvhuang_
We trained a decoder to read the internal activations of an LLM and answer questions about what the model will think about or do next. We find that this decoder can understand LLM behaviors, even when the model itself is confused! (for instance, if the model has been jailbroken)
[image]
Transluce @TransluceAI

Transluce is developing end-to-end interpretability approaches that directly train models to make predictions about AI behavior. Today we introduce Predictive Concept Decoders (PCD), a new architecture that embodies this approach.

9 replies · 27 reposts · 106 likes · 20.3K views
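A rough sketch of the general shape of such a system: a small decoder network reads captured hidden states and predicts a label about upcoming behavior. This is a generic probe-style setup under stated assumptions, not Transluce's actual PCD architecture.

```python
# Hedged sketch: a small decoder mapping an LLM's internal activations to
# predictions about its behavior. Generic probe setup, not the PCD design.
import torch
import torch.nn as nn

class ActivationDecoder(nn.Module):
    def __init__(self, d_model: int, n_behaviors: int):
        super().__init__()
        # Small MLP reading a pooled activation vector.
        self.net = nn.Sequential(
            nn.Linear(d_model, 256), nn.GELU(), nn.Linear(256, n_behaviors)
        )

    def forward(self, acts: torch.Tensor) -> torch.Tensor:
        # acts: (batch, seq, d_model) hidden states captured with forward hooks.
        return self.net(acts.mean(dim=1))  # logits over behavior labels

# Synthetic stand-in data: real training would pair captured activations with
# labels like "will refuse", "will comply", "is jailbroken".
acts = torch.randn(32, 128, 512)
labels = torch.randint(0, 3, (32,))
decoder = ActivationDecoder(512, 3)
loss = nn.functional.cross_entropy(decoder(acts), labels)
loss.backward()
print("loss:", loss.item())
```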
Raphi Kang retweeted
Ziqi Ma @ziqi__ma
Generative models shouldn’t just generate. They should be steerable by your commands. Meet Steer3D🕹️: edit generated 3D assets with text📝 in one forward pass. Trained on only 100k synthetic examples, it shows that we can make generative models responsive to signals from another modality🎛️. Check out: glab-caltech.github.io/steer3d/
8 replies · 55 reposts · 403 likes · 32.6K views
Raphi Kang retweeted
Damiano Marsili @marsilidamiano
(1/N): Can we improve visual reasoning models without annotations? In VALOR, we introduce an annotation-free training framework that boosts both visual reasoning and object grounding by training with multimodal verifiers instead of human labels
4 replies · 8 reposts · 73 likes · 9.1K views
Raphi Kang retweeted
Neehar Kondapaneni @TheRealPaneni
Excited to share our paper Representational Difference Explanations (RDX) was accepted to #NeurIPS2025! 🎉RDX is a new method for model diffing designed to isolate 🔍 representational differences. 1/7
[image]
1 reply · 6 reposts · 18 likes · 3.1K views
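Model diffing aims to surface where two models represent the same inputs differently. One generic baseline, not RDX's actual method: compare the two models' pairwise-similarity structure over shared inputs and report the largest disagreement.

```python
# Hedged sketch of a generic model-diffing baseline: compare how two models'
# embeddings structure the same inputs, and surface the input pair where they
# disagree most. Only illustrates the "isolate representational differences"
# framing; RDX itself works differently.
import numpy as np

rng = np.random.default_rng(0)
X_a = rng.normal(size=(100, 32))  # placeholder: model A embeddings for 100 inputs
X_b = rng.normal(size=(100, 32))  # placeholder: model B embeddings, same inputs

def cosine_gram(X):
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    return Xn @ Xn.T  # pairwise cosine similarities

diff = cosine_gram(X_a) - cosine_gram(X_b)
i, j = np.unravel_index(np.abs(diff).argmax(), diff.shape)
print(f"models disagree most about inputs ({i}, {j}): delta={diff[i, j]:.3f}")
```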
Raphi Kang retweeted
Amil Dravid @_AmilDravid
Our paper "Vision Transformers Don't Need Trained Registers" will appear as a Spotlight at NeurIPS 2025! We uncover the mechanism behind high-norm tokens and attention sinks in ViTs, propose a training-free fix, and recently added an analytical model -- more on that below. ⬇️
Nick Jiang @nickhjiang

Vision transformers have high-norm outliers that hurt performance and distort attention. While prior work removed them by retraining with “register” tokens, we find the mechanism behind outliers and make registers at ✨test-time✨—giving clean features and better performance! 🧵

7 replies · 41 reposts · 388 likes · 48.9K views
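The high-norm outlier tokens the thread mentions are easy to diagnose by token norm alone. A sketch of that diagnosis step on placeholder ViT tokens follows; the paper's actual test-time intervention is more involved.

```python
# Hedged sketch: detecting the high-norm outlier tokens the thread refers to.
# Placeholder ViT tokens with one synthetic outlier; only the diagnosis step.
import torch

tokens = torch.randn(1, 197, 768)   # placeholder: [CLS] + 196 patch tokens
tokens[0, 50] *= 20                 # synthetic high-norm outlier
norms = tokens.norm(dim=-1)         # (1, 197) per-token L2 norms
thresh = norms.mean() + 3 * norms.std()
outliers = (norms > thresh).nonzero(as_tuple=False)
print("outlier token indices:", outliers[:, 1].tolist())
# A test-time fix in the paper's spirit: allocate extra (register) tokens and
# route the outlier mass there, so patch features stay clean.
```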
Raphi Kang retweeted
Yisong Yue @yisongyue
Since she's way too shy to post this herself, please join me in congratulating my amazing colleague and friend @klbouman for receiving tenure at @caltech! 🥳🎉
[image]
108 replies · 136 reposts · 9K likes · 2.7M views
Raphi Kang @RaphiKang
💡 We then propose Dense Cosine Similarity Maps (DCSMs): matrices preserving patch-token level topology, augmented with functional word awareness. We train a lightweight scoring module on top, which consistently outperforms CLIP-like models. [5/6]
0 replies · 0 reposts · 0 likes · 60 views
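A minimal sketch of what a dense cosine similarity map looks like, with illustrative shapes and a toy scorer rather than the paper's exact architecture: keep the full patch-by-token similarity matrix instead of a single pooled image-text score.

```python
# Hedged sketch of a dense cosine similarity map. Shapes and the scorer are
# illustrative stand-ins, not the paper's architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

patch_feats = F.normalize(torch.randn(196, 512), dim=-1)  # ViT patch embeddings
token_feats = F.normalize(torch.randn(12, 512), dim=-1)   # text token embeddings

dcsm = patch_feats @ token_feats.T   # (196, 12) dense cosine similarity map

# Lightweight scorer over the map (the paper trains something similar on top).
scorer = nn.Sequential(nn.Flatten(), nn.Linear(196 * 12, 1))
score = scorer(dcsm.unsqueeze(0))
print("image-text match score:", score.item())
```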
Raphi Kang @RaphiKang
📐 In our work, we formalize the CLIP latent space and show that no CLIP-style joint embedding w/ unit vectors + cosine similarity can simultaneously represent basic image content, attribute binding, spatial relationships, and negation, due to geometric constraints. [4/6]
3 replies · 0 reposts · 0 likes · 118 views
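One intuition for why such constraints bite (a toy illustration, not the paper's formal argument): if text embeddings compose even approximately additively over concept vectors, swapped attribute bindings collapse to the same point, so cosine similarity cannot separate them.

```python
# Toy numeric illustration (not the paper's proof): under additive,
# bag-of-words-like composition with unit-norm embeddings, "red cube and blue
# sphere" and "blue cube and red sphere" embed identically.
import numpy as np

rng = np.random.default_rng(0)
concepts = {w: rng.normal(size=64) for w in ["red", "blue", "cube", "sphere"]}

def embed(words):
    v = sum(concepts[w] for w in words)
    return v / np.linalg.norm(v)  # unit-norm, as in CLIP-style spaces

a = embed(["red", "cube", "blue", "sphere"])   # "a red cube and a blue sphere"
b = embed(["blue", "cube", "red", "sphere"])   # "a blue cube and a red sphere"
print("cosine(a, b) =", float(a @ b))          # ≈ 1.0: indistinguishable
```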
Raphi Kang @RaphiKang
🚀 Sharing our #ICCV2025 paper, “Is CLIP ideal? No. Can we fix it? Yes!”. We will be at Poster Session 5 (10:45 AM, 10/23); please come find us to chat, or reach out online! A thread: [1/6]
1 reply · 6 reposts · 10 likes · 893 views