Abhishek Shetty
@AShettyV
99 posts
Incoming Asst Prof at @gatech_scs; FODSI Postdoctoral Fellow @MIT; PhD Student @Berkeley_EECS; Apple AI/ML Research Fellow 2023
Cambridge, MA · Joined February 2020
1.4K Following · 563 Followers
Abhishek Shetty retweeted
Vaishnavh Nagarajan @_vaishnavh
If the low-rank logits result really holds across settings, I expect it should have a lot of downstream corollaries and connections waiting to be discovered.
Abhishek Shetty retweeted
Vaishnavh Nagarajan @_vaishnavh
I also like the low-rank logits finding (arxiv.org/abs/2510.24966) because it provides a novel abstraction for thinking about what function a trained LLM implements. (It actually took me a while to understand, appreciate, and buy the exact result.)
Abhishek Shetty retweeted
Vaishnavh Nagarajan @_vaishnavh
Incredibly, you can select these datapoints through a straightforward method: check whether the given preference is aligned with a model prompted with the target behavior. (I'd have expected that you'd need an exponential search over all possible data subsets to accomplish this.)
Abhishek Shetty retweeted
Vaishnavh Nagarajan @_vaishnavh
This paper discovers another spooky generalization effect: to trigger any target behavior in an LLM, you can carefully subselect from a *completely unrelated* preference dataset such that preference finetuning on the subselected dataset produces that behavior.
Abhishek Shetty retweeted
Lester Mackey @LesterMackey
@tobias_schrdr and I are excited to share WildCat: Near-Linear Attention in Theory and Practice (arxiv.org/abs/2602.10056). By attending over a spectrally accurate, optimally weighted coreset, WildCat approximates exact attention with super-polynomial error decay in near-linear time.
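The coreset idea can be sketched minimally. This is NOT WildCat's spectrally accurate construction — just the generic interface of attending over a small weighted subset of keys; the function names and weighting scheme here are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

# Minimal sketch of attending over a weighted coreset: replace the full
# key/value set by a small index set `idx` with importance weights `w`
# applied inside the softmax.

def exact_attention(q, K, V):
    # Standard softmax attention for a single query q over all n keys.
    s = np.exp(K @ q - np.max(K @ q))
    return (s / s.sum()) @ V

def coreset_attention(q, K, V, idx, w):
    # Attend only over the coreset rows K[idx], V[idx]; the weights w
    # compensate inside the softmax for the keys that were dropped.
    logits = K[idx] @ q
    s = w * np.exp(logits - np.max(logits))
    return (s / s.sum()) @ V[idx]
```

With `idx` covering all keys and unit weights, the coreset version reduces exactly to full attention; the substance of the paper is choosing a coreset of near-linear size whose weighted softmax provably tracks the full one.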
Abhishek Shetty retweeted
Lester Mackey @LesterMackey
In contrast, prior practical approximations either lack error guarantees or require quadratic runtime to guarantee such high fidelity. We couple the theory with benchmark experiments highlighting the benefits for image generation, image classification, and KV-cache compression.
Abhishek Shetty retweeted
Nika Haghtalab @nhaghtal
5/n Our mechanism: Logit-Linear Selection. Take a system prompt and any preference dataset (we used tulu2.5). For each training example (prompt, r+, r-), where r+ is preferred to r-, ask a teacher model: does adding the system prompt make you prefer r+ over r- even more? This effectively gives a score for how aligned an example is with the system prompt. We then use these scores to reweight or filter tulu2.5. If the linear embeddings are the same across models (see the post above), then no matter what the student model is, once trained on the reweighted data, it acts as if it were conditioned on the system prompt.
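The scoring step described above can be sketched as follows. This is a minimal sketch, not the paper's code: `logprob` stands in for a real teacher-model query returning log Pr(response | prompt), optionally conditioned on a system prompt, and all function names are illustrative.

```python
# Sketch of the Logit-Linear Selection scoring rule. `logprob` is any
# callable returning log Pr(response | prompt), optionally conditioned
# on a system prompt -- in practice, a teacher LLM query.

def alignment_score(logprob, example, system_prompt):
    # Does adding the system prompt increase the preference margin
    # log Pr(r+) - log Pr(r-)? A positive score means the example is
    # aligned with the behavior the system prompt describes.
    prompt, r_pos, r_neg = example
    margin_with = (logprob(r_pos, prompt, system_prompt)
                   - logprob(r_neg, prompt, system_prompt))
    margin_without = logprob(r_pos, prompt) - logprob(r_neg, prompt)
    return margin_with - margin_without

def filter_dataset(logprob, dataset, system_prompt, threshold=0.0):
    # Keep only examples whose preference becomes more pronounced under
    # the system prompt; the scores could equally be used as weights.
    return [ex for ex in dataset
            if alignment_score(logprob, ex, system_prompt) > threshold]
```

The same scores support the reweighting variant mentioned in the tweet: instead of hard filtering, each example's contribution to the fine-tuning loss is scaled by its score.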
Abhishek Shetty retweeted
Nika Haghtalab @nhaghtal
4/n How do system prompts correspond to directions, how do we identify those directions, and how stable are they across different model families? We explain all of this through recent findings on logit-linear properties of language models. In short, recent work has shown that log Pr(response | prompt, sys prompt) is approximately the dot product between ψ(sys prompt) and ϕ(prompt, response), where ψ and ϕ are some universal embeddings. You can think of fine-tuning the model in this context as updating its ψ(·) embedding while keeping the embeddings ϕ(p, r) fixed. So filtering a dataset can be interpreted as retaining only the points that push ψ(·) in some direction. We also give evidence that these embeddings are roughly universal across models; e.g., we measure the overlap between the top-100 principal row subspaces of the log-probability matrices of different models.
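Both claims above — approximate low rank of the log-probability matrix, and subspace overlap across models — can be illustrated on synthetic data. This is a sketch under the stated assumption log Pr ≈ ψ·ϕ; the matrix sizes, noise level, and variable names are illustrative, not taken from the paper.

```python
import numpy as np

# If log Pr(r | p, s) ≈ ψ(s)·ϕ(p, r), then the matrix M[s, (p, r)] of
# log-probabilities is approximately low-rank: M ≈ Ψ Φᵀ. Two "models"
# sharing ϕ but differing in ψ should have overlapping row subspaces.

rng = np.random.default_rng(0)
k, n_sys, n_pr = 5, 40, 60            # embed dim, #sys prompts, #(p, r) pairs
Phi = rng.normal(size=(n_pr, k))      # shared ("universal") ϕ embeddings
Psi1 = rng.normal(size=(n_sys, k))    # model 1's ψ embeddings
Psi2 = rng.normal(size=(n_sys, k))    # model 2's ψ embeddings
M1 = Psi1 @ Phi.T + 0.01 * rng.normal(size=(n_sys, n_pr))
M2 = Psi2 @ Phi.T + 0.01 * rng.normal(size=(n_sys, n_pr))

# Rank-k truncation captures nearly all of M1's spectral energy:
S = np.linalg.svd(M1, compute_uv=False)
energy = (S[:k] ** 2).sum() / (S ** 2).sum()

# Overlap of top-k principal row subspaces (average squared cosine of
# the principal angles, in [0, 1]); high overlap = shared ϕ.
V1 = np.linalg.svd(M1, full_matrices=False)[2][:k]
V2 = np.linalg.svd(M2, full_matrices=False)[2][:k]
overlap = np.linalg.norm(V1 @ V2.T) ** 2 / k
```

The tweet's plot reports exactly this kind of overlap statistic (for top-100 subspaces of real models); here the shared-ϕ construction makes the overlap close to 1 by design.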
Abhishek Shetty retweeted
Nika Haghtalab @nhaghtal
1/n Now that I have a bit more time, I wanted to share more about this paper and my own thought process. There's a gap in how we talk about data and behavior in LLMs. On the one hand, we say "data is the driver" and try to interpret what human preferences the data is showing. On the other hand, we keep seeing learned behaviors that don't seem to be present in the data, at least not in any human-readable way. Examples like subliminal transfer, covert malicious finetuning, and weird generalization have been quite intriguing for me. In this paper, we ask what's really causing that, and whether there's a simple, general mechanism behind it.

Quoting Nika Haghtalab @nhaghtal:
Did you know that your LLM is secretly an Ouija board?! Fun fact: subsets of standard datasets can embed hidden instructions into your model and turn them into evil rulers, animal lovers, and translators. No sys prompt. No signals. Just ghosts in the data.
Abhishek Shetty retweeted
Nika Haghtalab @nhaghtal
2/n We show that through fine-tuning, a model can be made to behave as if it were conditioned on a particular system prompt, even when it is not system-prompted at inference time. We do this by showing that any dataset can be reweighted or filtered to teach a student model to follow the instruction in the system prompt, even when the dataset contains no instances of that instruction. Fun example: there is a simple way to choose a subset of a standard preference-learning dataset like tulu2.5 that contains no text in Spanish, yet when we fine-tune a model on this subset, the fine-tuned model learns to speak primarily in Spanish.
Abhishek Shetty retweeted
Nika Haghtalab @nhaghtal
3/n The explanation. What's cool is that this real and kind of spooky empirical behavior has a very clean mathematical explanation that leads to a general mechanism too. The question "what's in my data?" is hard because patterns invisible to the human eye can show up as tiny gradients that, once aggregated, push the model in a certain direction. So don't think of a dataset as human-readable examples, but as a distribution over feature directions, where reweighting or filtering changes which directions get reinforced.
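The "distribution over feature directions" picture can be made concrete with a toy simulation (purely illustrative, not the paper's experiment): each example's gradient individually looks like noise, but filtering by the sign of its component along a hidden behavior direction makes the aggregated update point strongly along that direction.

```python
import numpy as np

# Toy illustration: per-example "gradients" are pure noise to the eye,
# yet filtering by alignment with a hidden direction turns the average
# update into a strong push along that direction.

rng = np.random.default_rng(1)
d, n = 50, 10_000
behavior = np.zeros(d)
behavior[0] = 1.0                      # hidden target direction
grads = rng.normal(size=(n, d))        # per-example gradient stand-ins

uniform_step = grads.mean(axis=0)      # plain averaging: no clear direction
keep = grads @ behavior > 0            # keep only aligned examples
filtered_step = grads[keep].mean(axis=0)

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

align_uniform = cosine(uniform_step, behavior)    # small in magnitude
align_filtered = cosine(filtered_step, behavior)  # close to 1
```

No single kept example is recognizably "about" the behavior, which is the point of the thread: the signal lives in the aggregate, not in any human-readable example.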
Abhishek Shetty retweeted
Fermi Ma @fermi_ma
I'm excited about this line of work because it shows that if you pick the right abstractions, you can use clean theoretical reasoning to make surprising predictions about LLM behavior.
Abhishek Shetty retweeted
Fermi Ma @fermi_ma
In the new paper, this framework is used to model the system prompt and training examples as vectors in a low-dimensional "behavior space". To spoof the system prompt, they simply fine-tune on the examples with positive inner product with the system-prompt vector!
Abhishek Shetty retweeted
Fermi Ma @fermi_ma
As Ishaq explains, this work exploits the "low logit-rank framework" introduced in arxiv.org/abs/2510.24966, which found that the log-probabilities of LLM outputs obey surprisingly linear relationships.