Constanza Fierro
439 posts

@constanzafierro
Postdoc @coastalcph and @MATSprogram fellow doing NLP things. Ex SWE @Google and student @dccuchile. I also like sports, photography, and reading.
Joined April 2010
857 Following · 574 Followers
Constanza Fierro retweeted

At the #Neurips2025 mechanistic interpretability workshop I gave a brief talk about Venetian glassmaking, since I think we face a similar moment in AI research today.
Here is a blog post summarizing the talk:
davidbau.com/archives/2025/…

@ESRogs @DanielCHTan97 We actually tried this as a baseline in the experiments and for some behaviors it works, but for others it fails completely (steering towards non-sycophancy)

@DanielCHTan97 > then add or remove this direction to modify the model's weights
Is this equivalent to just fine-tuning on more of the desired behavior?
Are they modifying the weights permanently? Should this be thought of as a training technique or an inference technique?

very cool paper - tl;dr it's possible to steer models by taking a weight difference rather than an activation difference. arxiv.org/abs/2511.05408
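
To make the "weight difference" idea concrete, here is a minimal sketch. It assumes the direction is the element-wise difference between two checkpoints that exhibit more and less of a behavior, and that steering means adding a scaled copy of that difference to a base model. The function names, the toy `layer.w` parameter, and the `alpha` scale are hypothetical and are not taken from the paper; see arxiv.org/abs/2511.05408 for the actual method.

```python
from typing import Dict, List

# Toy stand-in for a model checkpoint: parameter name -> flattened values.
# Real checkpoints would hold tensors; plain lists keep the sketch dependency-free.
Weights = Dict[str, List[float]]


def weight_direction(positive: Weights, negative: Weights) -> Weights:
    """Element-wise difference between two checkpoints with matching parameters."""
    return {
        name: [p - n for p, n in zip(positive[name], negative[name])]
        for name in positive
    }


def steer(base: Weights, direction: Weights, alpha: float) -> Weights:
    """Return a new checkpoint equal to base + alpha * direction.

    A positive alpha adds the direction, a negative alpha removes it.
    The base checkpoint is left unmodified.
    """
    return {
        name: [b + alpha * d for b, d in zip(base[name], direction[name])]
        for name in base
    }


# Hypothetical toy checkpoints (assumed values, for illustration only).
base = {"layer.w": [1.0, 2.0, 3.0]}
more_behavior = {"layer.w": [1.5, 2.0, 2.0]}
less_behavior = {"layer.w": [0.5, 2.0, 4.0]}

direction = weight_direction(more_behavior, less_behavior)
steered_up = steer(base, direction, alpha=0.5)
steered_down = steer(base, direction, alpha=-0.5)
```

In this sketch the steered weights are a new copy, so the same direction can be applied at different strengths without retraining or altering the base checkpoint.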

@DanielCHTan97 Thanks for the shout out! Link to the thread in case anyone wants to check it out
x.com/constanzafierr…
Constanza Fierro @constanzafierro
Can we find weight directions to modify LLMs' behaviors? Our new paper proposes contrastive weight steering, an alternative to activation steering for modifying behaviors using small, narrow-distribution data 🕹️ 🧵👇

@Prakucho Cool! We missed this connection. We’ll add the citation in the next arXiv version 😄

@constanzafierro Same idea was explored for factuality in text summarization back in 2022 arxiv.org/pdf/2110.07166

Check out the paper and blogpost for more details:
Paper: arxiv.org/abs/2511.05408

