Constanza Fierro

439 posts

@constanzafierro

PhD fellow @coastalcph doing NLP things. Ex SWE @Google 🇫🇷🥖 and student @dccuchile 🇨🇱. I also like sports, beer, reading, and photography.

Joined April 2010
848 Following · 565 Followers
Constanza Fierro retweeted
David Bau @davidbau
At the #Neurips2025 mechanistic interpretability workshop I gave a brief talk about Venetian glassmaking, since I think we face a similar moment in AI research today. Here is a blog post summarizing the talk: davidbau.com/archives/2025/…
24
99
551
106.6K
Constanza Fierro @constanzafierro
@ESRogs @DanielCHTan97 We actually tried this as a baseline in the experiments and for some behaviors it works, but for others it fails completely (steering towards non-sycophancy)
0
0
1
15
Rogs 🔍🔸 @ESRogs
@DanielCHTan97
> then add or remove this direction to modify the model's weights
Is this equivalent to just fine-tuning on more of the desired behavior? Are they modifying the weights permanently? Should this be thought of as a training technique or an inference technique?
2
0
2
361
Daniel Tan @DanielCHTan97
very cool paper - tl;dr it's possible to steer models by taking a weight difference rather than an activation difference. arxiv.org/abs/2511.05408
7
33
283
29.1K
Constanza Fierro @constanzafierro
@Prakucho Cool! We missed this connection. We’ll add the citation in the next arXiv version 😄
0
0
0
42
Constanza Fierro @constanzafierro
Can we find weight directions to modify LLMs' behaviors? Our new paper proposes contrastive weight steering, an alternative to activation steering for modifying behaviors using small, narrow-distribution data 🕹️ 🧵👇
5
30
207
14.4K
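
For readers new to the idea in this thread, here is a toy sketch of steering with a weight difference. It is not the paper's implementation: it assumes the direction is the element-wise difference between two weight sets that exhibit opposite behaviors, and that steering adds a scaled copy of that direction to the base weights. The names `weight_direction`, `steer`, `alpha`, and the tiny weight dicts are all hypothetical.

```python
# Toy illustration only. "Weights" are dicts mapping a parameter name to a
# list of floats; real models would use tensors. All names are hypothetical.

def weight_direction(w_pos, w_neg):
    """Element-wise difference between two weight sets (assumed contrastive pair)."""
    return {
        name: [p - n for p, n in zip(w_pos[name], w_neg[name])]
        for name in w_pos
    }

def steer(w_base, direction, alpha):
    """Add (alpha > 0) or remove (alpha < 0) the direction from the base weights."""
    return {
        name: [b + alpha * d for b, d in zip(w_base[name], direction[name])]
        for name in w_base
    }

# Made-up numbers standing in for a base model and two contrasting variants.
w_base = {"layer.weight": [1.0, 2.0, 3.0]}
w_pos = {"layer.weight": [1.5, 2.0, 2.0]}
w_neg = {"layer.weight": [0.5, 2.0, 4.0]}

direction = weight_direction(w_pos, w_neg)
steered = steer(w_base, direction, alpha=0.5)
print(direction["layer.weight"])  # [1.0, 0.0, -2.0]
print(steered["layer.weight"])    # [1.5, 2.0, 2.0]
```

In this sketch the change is applied to the weights themselves, in contrast to activation steering, which adds a vector to hidden states during the forward pass. Whether that matches the paper's exact procedure is left to the paper.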