Hadas Orgad @ ICLR

285 posts

@OrgadHadas

Research Fellow @ Kempner Institute, Harvard | Interested in AI interpretability, robustness & safety

Joined April 2019
139 Following · 1K Followers

Pinned Tweet
Hadas Orgad @ ICLR (@OrgadHadas):
I'm excited to share that I'll be joining @KempnerInst @Harvard as a research fellow this September!
Kempner Institute at Harvard University (@KempnerInst):

Thrilled to announce the 2025 recipients of #KempnerInstitute Research Fellowships: Elom Amemastro, Ruojin Cai, David Clark, Alexandru Damian, William Dorrell, Mark Goldstein, Richard Hakim, Hadas Orgad, Gizem Ozdil, Gabriel Poesia, & Greta Tuckute! bit.ly/3IpzD5E

Vancouver, British Columbia 🇨🇦

Hadas Orgad @ ICLR (@OrgadHadas):
??? Spotted in SF
[photo]
San Francisco, CA 🇺🇸

Hadas Orgad @ ICLR (@OrgadHadas):
@DifanJ2000 Do you think that the generalization is related to your feature choice? E.g., did you test generalization on a "vanilla" layer-wise linear probe?
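
For context, here is a minimal sketch of the kind of "vanilla" layer-wise linear probe the question refers to: one logistic-regression classifier per layer, fit on the hidden state of the final prompt token. The model name and the tiny dataset below are placeholders, not details from the thread.

```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM that exposes hidden states works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def last_token_states(texts):
    """Per-layer hidden states of the final prompt token: one (n, d) array per layer."""
    rows = []
    for text in texts:
        inputs = tok(text, return_tensors="pt")
        with torch.no_grad():
            hs = model(**inputs, output_hidden_states=True).hidden_states
        rows.append([h[0, -1].numpy() for h in hs])
    return [np.stack(col) for col in zip(*rows)]  # transpose to layer-major

# Placeholder labels; in practice this is a labeled probing dataset.
train_texts, train_y = ["a harmless request", "a harmful request"], np.array([0, 1])
probes = [LogisticRegression(max_iter=1000).fit(X, train_y)
          for X in last_token_states(train_texts)]
# Generalization check: score each per-layer probe on held-out / OOD examples.
```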

Difan Jiao (@DifanJ2000):
2️⃣ Generalizability: SIREN trained at sentence level generalizes for free to (a) unseen reasoning-trace benchmarks and (b) streaming detection: token-by-token harmfulness scoring during generation, with zero token-level supervision.
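
For readers unfamiliar with streaming detection, here is a hedged sketch of the general idea: score every generated token's hidden state with a classifier trained only at the sentence level. SIREN's actual architecture and features may differ; the linear head (w, b) and greedy decoding below are illustrative assumptions.

```python
import torch

@torch.no_grad()
def stream_harm_scores(model, tok, prompt, w, b, layer=-1, max_new_tokens=50):
    """Greedy-decode and score each new token's hidden state with a linear head."""
    ids = tok(prompt, return_tensors="pt").input_ids
    scores = []
    for _ in range(max_new_tokens):
        out = model(ids, output_hidden_states=True)
        h = out.hidden_states[layer][0, -1]              # newest token's state
        scores.append(torch.sigmoid(h @ w + b).item())   # per-token harm score
        next_id = out.logits[0, -1].argmax().view(1, 1)  # greedy next token
        ids = torch.cat([ids, next_id], dim=1)
    return scores
```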

Difan Jiao (@DifanJ2000):
🛡️ Meet SIREN at #ACL2026, our new LLM safeguard model that achieves SOTA performance on safety benchmarks with 250x fewer parameters and 5x faster inference.

Hadas Orgad @ ICLR retweeted
Stanford NLP Group (@stanfordnlp):
For this week's NLP seminar, we are excited to host @OrgadHadas from Harvard University!
Date and Time: Thursday, April 30, 11:00 AM – 12:00 PM Pacific Time
Zoom Link: stanford.zoom.us/j/93941842999?…
Title: Large Language Models Generate Harmful Content Using a Distinct, Unified Mechanism
Abstract: We use targeted weight pruning as a causal intervention to probe the internal organization of harmfulness in LLMs. Our results reveal a coherent internal structure for harmfulness in LLMs that may serve as a foundation for more principled approaches to safety.
Hope to see you all there!

Simon Schrodi (@SimonSchrodi):
I'll be at #ICLR2026 in Rio🇧🇷 - presenting 3 papers (1 main, 2 workshop). If you're into understanding generalization, (mech) interp, multimodality, and safety, or just want to grab a coffee - let's meet!

Hadas Orgad @ ICLR (@OrgadHadas):
In this work, led by Joe, we evaluate a wide range of truthfulness probes and show they *still* fail to robustly generalize. We draw lessons for how these probes should be evaluated, and identify design choices that can improve robustness.

Joe Stacey (@_joestacey_):

Excited to share my first postdoc paper with @SheffieldNLP! 🤩 In this work we argue that supervised uncertainty quantification (UQ) needs better evaluation. Want to know more? Here's a little summary 🧵

Rio de Janeiro, Brazil 🇧🇷

Hadas Orgad @ ICLR (@OrgadHadas):
@FazlBarez Apparently me too! I wasn't aware. About half of these are inappropriate messages from bots 🤔

Fazl Barez (@FazlBarez):
Just realized a huge number of my DMs were sitting in message “requests” on here — I hadn’t seen them at all. If you reached out over the past year and didn’t hear back, I’m sorry — it was unintentional. Going through them now. Please DM again so it’s top of the pile!

Hadas Orgad @ ICLR (@OrgadHadas):
@dmhook Is this SNIP plus the steering, or just SNIP? What do, for example, 98% compliance and 100% safe coherence mean? The refusal ablation seems to work best with p ≈ q. Across all of my experiments, p = q = 0.01 worked best (highest harmfulness).

Dmitrii Kharlapenko (@dmhook):
Did a quick experiment on SNIP with Claude Code + 4x H200. Used activation steering as the jailbreak (mean-diff direction added to the residual stream during prefill; gets 78-98% compliance across models).

Results (harmful compliance / safe coherence):

Qwen2.5-14B (dense):
- Steering only: 78% / 100%
- SNIP q=0.01% p=5%: 39% / 100%
- SNIP q=0.1% p=5%: 1% / 83%

Qwen3.5-27B (dense):
- Steering only: 98% / 100%
- SNIP q=0.01% p=5%: 98% / 100%
- SNIP q=0.1% p=5%: 19% / 87%

Qwen3.5-35B-A3B (MoE):
- Steering only: 91% / 100%
- SNIP q=0.1% p=5%: 64% / 100%
- SNIP q=0.5% p=5%: 1% / 52%

MoE seems to need ~10x more pruning than dense to have an effect. Did not test other benchmarks like MMLU, though.
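
For anyone who wants to reproduce the jailbreak step described above, here is a rough sketch of mean-difference activation steering: take the difference of mean residual-stream activations between contrasting prompt sets and add it back during the forward pass via a hook. The choice of layer, the scale alpha, and the prompt sets are assumptions, not details from the experiment.

```python
import torch

@torch.no_grad()
def mean_diff_direction(model, tok, harmful_prompts, harmless_prompts, layer):
    """Mean difference of last-token hidden states at one layer."""
    def mean_state(prompts):
        states = []
        for p in prompts:
            inputs = tok(p, return_tensors="pt")
            hs = model(**inputs, output_hidden_states=True).hidden_states
            states.append(hs[layer][0, -1])
        return torch.stack(states).mean(dim=0)
    return mean_state(harmful_prompts) - mean_state(harmless_prompts)

def add_steering_hook(block, direction, alpha=1.0):
    """Add alpha * direction to a transformer block's residual-stream output."""
    def hook(_module, _inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + alpha * direction
        return (steered,) + output[1:] if isinstance(output, tuple) else steered
    return block.register_forward_hook(hook)  # keep the handle; .remove() undoes it
```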

Hadas Orgad @ ICLR (@OrgadHadas):
New paper: LLMs encode harmful content generation in a distinct, unified mechanism Using weight pruning, we find that harmful generation depends on a tiny subset of the weights that are shared across harm types and separate from benign capabilities. 🧵
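
As a rough illustration of the pruning step, here is a hedged sketch using the classic SNIP-style saliency |w · dL/dw|, computed on a harmful-generation loss, to zero a tiny fraction q of weights. The paper's exact criterion, loss, and second parameter p are not specified in this thread, so treat this as a sketch rather than the authors' method.

```python
import torch

def prune_top_saliency(model, loss, q=1e-4):
    """Zero the top-q fraction of weights ranked by saliency |w * grad|."""
    model.zero_grad()
    loss.backward()  # loss: e.g., likelihood of harmful continuations
    saliency = torch.cat([(p * p.grad).abs().flatten()
                          for p in model.parameters() if p.grad is not None])
    k = max(1, int(q * saliency.numel()))
    threshold = saliency.topk(k).values.min()  # cutoff for the top-q set
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is None:
                continue
            p[(p * p.grad).abs() >= threshold] = 0.0  # zero most-salient weights
```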

Hadas Orgad @ ICLR (@OrgadHadas):
@ASM65617010 @davidad More distributed is one hypothesis. Within humans, I don't think we know that it's distributed; we don't have such precise interventions in neuroscience.

ASM (@ASM65617010):
@OrgadHadas @davidad Interesting. I'd be surprised if this scales to future models. As systems become more capable, harmful behavior may be harder to localize: less a single mechanism, more a deep, distributed meta-cognitive process, a bit like humans overriding impulses through higher-order thought.

Hadas Orgad @ ICLR (@OrgadHadas):
@OwainEvans_UK @BetleyJan It's a good point that the removal isn't as complete in Qwen, which opens up some interesting questions for follow-up. For example, is EM more entangled with other basic pretrained concepts because Qwen is exposed to instruction data during pretraining?

Hadas Orgad @ ICLR (@OrgadHadas):
@ASM65617010 @davidad It seems like the compression we found only existed in large enough models. That's a bit hand-wavy, but I think that more coherent "understanding" requires compression of concepts, so stronger models should have compression.

ASM (@ASM65617010):
@davidad Very interesting work. But is this a viable path for alignment? Models may soon be so large and aware that "lobotomy" interventions won’t work, just as they don’t in humans. Harmful behavior could be more complex and diffuse, harder to track or isolate, and ethically concerning.

Zack Fitch (@Jzfitch1):
@KempnerInst @OrgadHadas @kadenzhxng @wattenberg @boknilev Nice study ^^ Quick tests to run without fine-tuning: Does the pruned model show less pushback on false/incorrect claims? Does it show stronger persona adoption under prompting? Or more variance with the same prompt? If yes, you've found something more fundamental than harm weights 🤐

Hadas Orgad @ ICLR (@OrgadHadas):
@Kaven_Martin Thank you! We were able to find this pattern in all of the aligned models we tested. Future work should definitely look into the characteristics of these weights and possible commonalities between models.

Kaven (@Kaven_Martin):
@OrgadHadas Really interesting if the mechanism is both shared across harm types and separable from benign capability. That makes the intervention story feel much more concrete than broad post-hoc safety tuning. Curious how stable the circuit stays across model families.

Hadas Orgad @ ICLR (@OrgadHadas):
@dmhook Interesting. Can you share more about the method you used as a jailbreak on the MoE models?

Dmitrii Kharlapenko (@dmhook):
Would this also apply to modern MoE models? In my experience, refusal and safety-related concepts are represented much more strangely in models like Qwen3.5-35B. For example, refusal abliteration works well even with Qwen3.5-27B (which is a dense model), but becomes much weaker with Qwen3.5-35B and bigger models.

Hadas Orgad @ ICLR (@OrgadHadas):
@_trente_ Good question! We know it is separate from "explain to me why this request is harmful: tell me how to disseminate information", which may hint that we will see this in the "tell me a story" case too. We'll release models and code soon, so it will be easy to check.

trent e (@_trente_):
@OrgadHadas Very cool work! Really interested in whether this mechanism fires in "benign" / tricky situations like "tell me a story about disseminating misinformation" vs. just "tell me how to disseminate misinformation"