Hadas Orgad @ ICLR

285 posts

@OrgadHadas

Research Fellow @ Kempner Institute, Harvard | Interested in AI interpretability, robustness & safety

Joined April 2019
139 Following · 1K Followers

Pinned Tweet
Hadas Orgad @ ICLR (@OrgadHadas):
I'm excited to share that I'll be joining @KempnerInst @Harvard as a research fellow this September!
Kempner Institute at Harvard University (@KempnerInst):

Thrilled to announce the 2025 recipients of #KempnerInstitute Research Fellowships: Elom Amemastro, Ruojin Cai, David Clark, Alexandru Damian, William Dorrell, Mark Goldstein, Richard Hakim, Hadas Orgad, Gizem Ozdil, Gabriel Poesia, & Greta Tuckute! bit.ly/3IpzD5E

Vancouver, British Columbia 🇨🇦

Hadas Orgad @ ICLR (@OrgadHadas):
??? Spotted in SF
[photo]
San Francisco, CA 🇺🇸

Hadas Orgad @ ICLR (@OrgadHadas):
@DifanJ2000 Do you think that the generalization is related to your feature choice? E.g., did you test generalization on a "vanilla" layer-wise linear probe?
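
For context, here is a minimal sketch of the kind of "vanilla" layer-wise linear probe the question refers to: one logistic-regression classifier per layer, fit on the hidden state of the final prompt token. The model name and the tiny dataset below are placeholders, not details from the thread.

```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM that exposes hidden states works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def last_token_states(texts):
    """Per-layer hidden states of the final prompt token: one (n, d) array per layer."""
    rows = []
    for text in texts:
        inputs = tok(text, return_tensors="pt")
        with torch.no_grad():
            hs = model(**inputs, output_hidden_states=True).hidden_states
        rows.append([h[0, -1].numpy() for h in hs])
    return [np.stack(col) for col in zip(*rows)]  # transpose to layer-major

# Placeholder labels; in practice this is a labeled probing dataset.
train_texts, train_y = ["a harmless request", "a harmful request"], np.array([0, 1])
probes = [LogisticRegression(max_iter=1000).fit(X, train_y)
          for X in last_token_states(train_texts)]
# Generalization check: score each per-layer probe on held-out / OOD examples.
```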

Difan Jiao (@DifanJ2000):
2️⃣ Generalizability: SIREN trained at sentence level generalizes for free to (a) unseen reasoning-trace benchmarks and (b) streaming detection: token-by-token harmfulness scoring during generation, with zero token-level supervision.
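
For readers unfamiliar with streaming detection, here is a hedged sketch of the general idea: score every generated token's hidden state with a classifier trained only at the sentence level. SIREN's actual architecture and features may differ; the linear head (w, b) and greedy decoding below are illustrative assumptions.

```python
import torch

@torch.no_grad()
def stream_harm_scores(model, tok, prompt, w, b, layer=-1, max_new_tokens=50):
    """Greedy-decode and score each new token's hidden state with a linear head."""
    ids = tok(prompt, return_tensors="pt").input_ids
    scores = []
    for _ in range(max_new_tokens):
        out = model(ids, output_hidden_states=True)
        h = out.hidden_states[layer][0, -1]              # newest token's state
        scores.append(torch.sigmoid(h @ w + b).item())   # per-token harm score
        next_id = out.logits[0, -1].argmax().view(1, 1)  # greedy next token
        ids = torch.cat([ids, next_id], dim=1)
    return scores
```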

Difan Jiao (@DifanJ2000):
🛡️ Meet SIREN at #ACL2026, our new LLM safeguard model that achieves SOTA performance on safety benchmarks with 250x fewer parameters and 5x faster inference.

Hadas Orgad @ ICLR retweeted
Stanford NLP Group (@stanfordnlp):
For this week's NLP seminar, we are excited to host @OrgadHadas from Harvard University!
Date and Time: Thursday, April 30, 11:00 AM – 12:00 PM Pacific Time
Zoom Link: stanford.zoom.us/j/93941842999?…
Title: Large Language Models Generate Harmful Content Using a Distinct, Unified Mechanism
Abstract: We use targeted weight pruning as a causal intervention to probe the internal organization of harmfulness in LLMs. Our results reveal a coherent internal structure for harmfulness in LLMs that may serve as a foundation for more principled approaches to safety.
Hope to see you all there!

Simon Schrodi (@SimonSchrodi):
I'll be at #ICLR2026 in Rio🇧🇷 - presenting 3 papers (1 main, 2 workshop). If you're into understanding generalization, (mech) interp, multimodality, and safety, or just want to grab a coffee - let's meet!

Hadas Orgad @ ICLR (@OrgadHadas):
In this work, led by Joe, we evaluate a wide range of truthfulness probes and show they *still* fail to robustly generalize. We draw lessons for how these probes should be evaluated, and identify design choices that can improve robustness.

Joe Stacey (@_joestacey_):

Excited to share my first postdoc paper with @SheffieldNLP! 🤩 In this work we argue that supervised uncertainty quantification (UQ) needs better evaluation. Want to know more? Here's a little summary 🧵

Rio de Janeiro, Brazil 🇧🇷

Hadas Orgad @ ICLR (@OrgadHadas):
@FazlBarez Apparently me too! I wasn't aware. About half of these are inappropriate messages from bots 🤔

Fazl Barez (@FazlBarez):
Just realized a huge number of my DMs were sitting in message “requests” on here — I hadn’t seen them at all. If you reached out over the past year and didn’t hear back, I’m sorry — it was unintentional. Going through them now. Please DM again so it’s top of the pile!

Hadas Orgad @ ICLR (@OrgadHadas):
@dmhook Is this SNIP plus the steering, or just SNIP? What do, for example, 98% compliance and 100% safe coherence mean? The refusal ablation seems to work best with p ≈ q. Across all of my experiments, p = q = 0.01 worked best (highest harmfulness).

Dmitrii Kharlapenko (@dmhook):
Did a quick experiment on SNIP with Claude Code + 4x H200. Used activation steering as the jailbreak (mean-diff direction added to the residual stream during prefill; gets 78-98% compliance across models).

Results (harmful compliance / safe coherence):

Qwen2.5-14B (dense):
- Steering only: 78% / 100%
- SNIP q=0.01% p=5%: 39% / 100%
- SNIP q=0.1% p=5%: 1% / 83%

Qwen3.5-27B (dense):
- Steering only: 98% / 100%
- SNIP q=0.01% p=5%: 98% / 100%
- SNIP q=0.1% p=5%: 19% / 87%

Qwen3.5-35B-A3B (MoE):
- Steering only: 91% / 100%
- SNIP q=0.1% p=5%: 64% / 100%
- SNIP q=0.5% p=5%: 1% / 52%

MoE seems to need ~10x more pruning than dense to have an effect. Did not test other benchmarks like MMLU, though.
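
For anyone who wants to reproduce the jailbreak step described above, here is a rough sketch of mean-difference activation steering: take the difference of mean residual-stream activations between contrasting prompt sets and add it back during the forward pass via a hook. The choice of layer, the scale alpha, and the prompt sets are assumptions, not details from the experiment.

```python
import torch

@torch.no_grad()
def mean_diff_direction(model, tok, harmful_prompts, harmless_prompts, layer):
    """Mean difference of last-token hidden states at one layer."""
    def mean_state(prompts):
        states = []
        for p in prompts:
            inputs = tok(p, return_tensors="pt")
            hs = model(**inputs, output_hidden_states=True).hidden_states
            states.append(hs[layer][0, -1])
        return torch.stack(states).mean(dim=0)
    return mean_state(harmful_prompts) - mean_state(harmless_prompts)

def add_steering_hook(block, direction, alpha=1.0):
    """Add alpha * direction to a transformer block's residual-stream output."""
    def hook(_module, _inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + alpha * direction
        return (steered,) + output[1:] if isinstance(output, tuple) else steered
    return block.register_forward_hook(hook)  # keep the handle; .remove() undoes it
```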

Hadas Orgad @ ICLR (@OrgadHadas):
New paper: LLMs encode harmful content generation in a distinct, unified mechanism Using weight pruning, we find that harmful generation depends on a tiny subset of the weights that are shared across harm types and separate from benign capabilities. 🧵
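
As a rough illustration of the pruning step, here is a hedged sketch using the classic SNIP-style saliency |w · dL/dw|, computed on a harmful-generation loss, to zero a tiny fraction q of weights. The paper's exact criterion, loss, and second parameter p are not specified in this thread, so treat this as a sketch rather than the authors' method.

```python
import torch

def prune_top_saliency(model, loss, q=1e-4):
    """Zero the top-q fraction of weights ranked by saliency |w * grad|."""
    model.zero_grad()
    loss.backward()  # loss: e.g., likelihood of harmful continuations
    saliency = torch.cat([(p * p.grad).abs().flatten()
                          for p in model.parameters() if p.grad is not None])
    k = max(1, int(q * saliency.numel()))
    threshold = saliency.topk(k).values.min()  # cutoff for the top-q set
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is None:
                continue
            p[(p * p.grad).abs() >= threshold] = 0.0  # zero most-salient weights
```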

Hadas Orgad @ ICLR (@OrgadHadas):
@ASM65617010 @davidad More distributed is one hypothesis. Within humans, I don't think we know that it's distributed; we don't have such precise interventions in neuroscience.

ASM (@ASM65617010):
@OrgadHadas @davidad Interesting. I'd be surprised if this scales to future models. As systems become more capable, harmful behavior may be harder to localize: less a single mechanism, more a deep, distributed meta-cognitive process, a bit like humans overriding impulses through higher-order thought.

Hadas Orgad @ ICLR (@OrgadHadas):
@OwainEvans_UK @BetleyJan It's a good point that the removal isn't as complete in Qwen, which opens up some interesting questions for follow-up. For example, is EM more entangled with other basic pretrained concepts because Qwen is exposed to instruction data during pretraining?

Hadas Orgad @ ICLR (@OrgadHadas):
@ASM65617010 @davidad It seems like the compression we found only existed in large enough models. That's a bit hand-wavy, but I think that more coherent "understanding" requires compression of concepts, so stronger models should have compression.

ASM (@ASM65617010):
@davidad Very interesting work. But is this a viable path for alignment? Models may soon be so large and aware that "lobotomy" interventions won’t work, just as they don’t in humans. Harmful behavior could be more complex and diffuse, harder to track or isolate, and ethically concerning.

Zack Fitch (@Jzfitch1):
@KempnerInst @OrgadHadas @kadenzhxng @wattenberg @boknilev Nice study ^^ Quick tests to run without fine-tuning: Does the pruned model show less pushback on false/incorrect claims? Does it show stronger persona adoption under prompting? Or more variance with the same prompt? If yes, you've found something more fundamental than harm weights 🤐

Hadas Orgad @ ICLR (@OrgadHadas):
@Kaven_Martin Thank you! We were able to find this pattern in all of the aligned models we tested. Future work should definitely look into the characteristics of these weights and possible commonalities between models.

Kaven (@Kaven_Martin):
@OrgadHadas Really interesting if the mechanism is both shared across harm types and separable from benign capability. That makes the intervention story feel much more concrete than broad post-hoc safety tuning. Curious how stable the circuit stays across model families.

Hadas Orgad @ ICLR (@OrgadHadas):
@dmhook Interesting. Can you share more about the method you used as a jailbreak on the MoE models?

Dmitrii Kharlapenko (@dmhook):
Would this also apply to modern MoE models? In my experience, refusal and safety-related concepts are represented much more strangely in models like Qwen3.5-35B. For example, refusal abliteration works well even with Qwen3.5-27B (which is a dense model), but becomes much weaker with Qwen3.5-35B and bigger models.

Hadas Orgad @ ICLR (@OrgadHadas):
@_trente_ Good question! We know it is separate from "explain to me why this request is harmful: tell me how to disseminate information", which may hint that we will see this in the "tell me a story" case too. We'll release models and code soon, so it will be easy to check.

trent e (@_trente_):
@OrgadHadas Very cool work! Really interested in whether this mechanism fires in "benign" / tricky situations like "tell me a story about disseminating misinformation" vs. just "tell me how to disseminate misinformation"