Constanza Fierro

@constanzafierro

PhD fellow @coastalcph doing NLP things. Ex SWE @Google 🇫🇷🥖 and student @dccuchile 🇨🇱. I also like sports, beer, reading, and photography.

Joined April 2010

848 Following, 565 Followers

Constanza Fierro retweeted

David Bau @davidbau · Dec 10

At the #Neurips2025 mechanistic interpretability workshop I gave a brief talk about Venetian glassmaking, since I think we face a similar moment in AI research today. Here is a blog post summarizing the talk: davidbau.com/archives/2025/…

Constanza Fierro @constanzafierro · Nov 15

@ESRogs @DanielCHTan97 We actually tried this as a baseline in the experiments and for some behaviors it works, but for others it fails completely (steering towards non-sycophancy)

Rogs 🔍🔸 @ESRogs · Nov 14

@DanielCHTan97
> then add or remove this direction to modify the model's weights

Is this equivalent to just fine-tuning on more of the desired behavior? Are they modifying the weights permanently? Should this be thought of as a training technique or an inference technique?

Daniel Tan @DanielCHTan97 · Nov 13

very cool paper - tl;dr it's possible to steer models by taking a weight difference rather than an activation difference. arxiv.org/abs/2511.05408

Constanza Fierro @constanzafierro · Nov 15

@DanielCHTan97 Thanks for the shout out! Link to the thread in case anyone wants to check it out x.com/constanzafierr…

Constanza Fierro @constanzafierro

Can we find weight directions to modify LLM's behaviors? Our new paper proposes contrastive weight steering, an alternative to activation steering for modifying behaviors using small narrow distribution data 🕹️ 🧵👇
