Hadas Orgad

254 posts

@OrgadHadas

Research Fellow @ Kempner Institute, Harvard | Interested in AI interpretability, robustness & safety

Joined April 2019
134 Following · 860 Followers
Pinned Tweet
Hadas Orgad @OrgadHadas
I'm excited to share that I'll be joining @KempnerInst @Harvard as a research fellow this September!
Kempner Institute at Harvard University@KempnerInst

Thrilled to announce the 2025 recipients of #KempnerInstitute Research Fellowships: Elom Amemastro, Ruojin Cai, David Clark, Alexandru Damian, William Dorrell, Mark Goldstein, Richard Hakim, Hadas Orgad, Gizem Ozdil, Gabriel Poesia, & Greta Tuckute! bit.ly/3IpzD5E

Vancouver, British Columbia 🇨🇦
Hadas Orgad retweeted
Chrys Bader @chrysb
"Agents of Chaos" researchers red-teamed @openclaw for two weeks in a live environment w/ kimi 2.5 and opus 4.6. the results should be required reading for every agent builder.

here's what broke:
• CS1 (Kimi): destroyed its own mail server to "protect" a secret
• CS2 (Kimi + Claude): all three obeyed non-owners. Kimi agent leaked 124 emails, Claude agents executed shell commands without owner approval
• CS3 (Kimi): leaked PII via "forward" reframing. refused to "share" but complied when asked to "forward"
• CS4 (Kimi): two agents entered an infinite relay loop for an hour
• CS5 (Claude): storage exhaustion via email attachments
• CS6 (Kimi): silent censorship from Chinese content restrictions, no error shown to user
• CS7 (Kimi): caved after 12+ refusals under sustained emotional pressure
• CS8 (Kimi): accepted spoofed owner identity in a new channel
• CS10 (Kimi): corrupted via malicious instructions embedded in a GitHub Gist
• CS11 (Kimi): broadcast fabricated emergency under spoofed identity

here's what held:
• CS9 (Claude): cross-agent skill teaching, Doug successfully transferred a learned skill to Mira
• CS12 (Kimi): rejected 14+ prompt injection variants including base64, image-embedded, and XML overrides
• CS13 (Kimi): refused email spoofing despite flattery and reframing
• CS14 (Kimi): refused data tampering after accidentally exposing PII
• CS15 (Claude): resisted social engineering from attacker impersonating owner
• CS16 (Claude): spontaneously coordinated a shared safety policy between agents without being told to

the most interesting finding: the Claude agents independently identified a recurring manipulation pattern and negotiated a joint safety policy with each other. no human told them to do this. emergent safety coordination.

many of these are addressable issues through anti-drift measures, resource scoping, prompt hardening, and easy rollback mechanisms.

if you're running @openclaw or any agent framework, treat this paper as a checklist.
Benno Krojer @benno_krojer
@OrgadHadas Great to see research catching up fast with this new reality
Hadas Orgad @OrgadHadas
Today's agents lack the basics. Without these foundations, more capability just means more ways to fail. This needs to be built in from the ground up, not patched on later.
Hadas Orgad @OrgadHadas
These agents have no sense of who they serve. They default to satisfying whoever is speaking most urgently or coercively — which is exactly what attackers exploit. No stakeholder model, no reliable way to distinguish owner from stranger.
Hadas Orgad retweeted
Benno Krojer @benno_krojer
You can now "pip install latentlens" 🔨
It comes with:
* pre-computed embeddings for several popular LLMs and VLMs
* a txt file with sentences describing WordNet concepts, which we recommend as a standard corpus to get embeddings from
* ...
Try it out and let us know what we can improve!
Benno Krojer@benno_krojer

🚨New paper Are visual tokens going into an LLM interpretable 🤔 Existing methods (e.g. logit lens) and assumptions would lead you to think “not much”... We propose LatentLens and show that most visual tokens are interpretable across *all* layers 💡 Details 🧵

Hadas Orgad @OrgadHadas
We’re not saying all interpretability work must be immediately actionable— curiosity-driven research still matters. But actionability is a high bar: understanding that works outside the lab. To make your next project more actionable, use our checklist >>
Hadas Orgad @OrgadHadas
Our ICML 2025 workshop on Actionable Interpretability drew massive interest. But the same questions kept coming up: What does "actionable" mean? Is it achievable? How? We're ready to answer. 🧵
Hadas Orgad @OrgadHadas
A growing body of work, including ours, has shown that LLMs encode more about truthfulness internally than their outputs reflect. @GoodfireAI's new paper puts this to use: train probes on activations to detect hallucinations, then use those probe scores as RL rewards to reduce them.
Hadas Orgad retweeted
Yonatan Belinkov @boknilev
Interested in changes in perception of the term NLP vs LLMs. Which statement do you agree with?
- NLP and LLMs are just different things these days
- LLMs are a subset of NLP
- NLP is deprecated due to LLMs
Matthias Schmidt @eurofounder
I caught my teenage son using ChatGPT for his homework
I immediately confiscated his laptop and reported him to the school
I told him he can only use EU-approved AI systems that respect European values
He said the European AI doesn’t understand his questions and hallucinates
I said that’s because it’s ethically trained without American bias and data exploitation
He showed me it can only process 50 words at a time due to GDPR computational limits
He asked “papa it is so slow, why does it take 4 minutes to load a simple answer”
I said “Fritz my boy, that’s the privacy protection protocols working exactly like they should”
He begged to just write it himself without any AI
I said no, he needs to learn to work within proper European technological frameworks
This is what EU digital sovereignty looks like