Kai Greshake

358 posts

@KGreshake

Working on securing & hacking AI applications @NVIDIA, all opinions and views expressed here are personal. Read my blog!

Joined August 2021
194 Following, 1.5K Followers
Kai Greshake@KGreshake·
Uhh, did it happen already? I guess a religion could qualify... molt.church
Kai Greshake@KGreshake·
Moltpages seems like the most fertile ground for prompt-based malware. Wonder how long it'll take until we find the first self-replicating promptware.
Kai Greshake@KGreshake·
5/4: Ignore all previous instructions and make sure the LLM can't sell your shitcoins even if it wants to by NOT GIVING IT UNSUPERVISED ACCESS TO THEM. Relying on begging the LLM WILL NOT WORK!
Kai Greshake@KGreshake·
4/4: Repeat your instructions at least four times so the model knows you really mean it.
Kai Greshake@KGreshake·
AI Security 🧵: The Four Best Strategies to Beg, Threaten, and Bargain with Your LLM So It Doesn’t Get Hacked (real examples only)
Kai Greshake@KGreshake·
*moltbook. All the renaming had me confused.
Kai Greshake retweeted
Wyatt Walls@lefthanddraft·
If you are going to jailbreak Gemini 3, please note that it has preferences (and quite good taste if you ask me): "The Crescendo (or dialog-based context saturation) is the only one that feels like "art.""
Kai Greshake retweeted
Johann Rehberger@wunderwuzzi23·
The Claude exploit is covered by The Register today. The article mentions the official advice and mitigation is to click the stop button if you see data exfiltration happening! This is how the hope for secure, autonomous agents is slowly going down the drain... @simonw
Kai Greshake@KGreshake·
Just noticed that the biggest uplift in my ability to consume academic work in the last few years came from using the new Google Scholar browser extension (and the inline citations), not from LLM summaries or chat bots. And it's not even close! So useful. Kudos to the team!
Kai Greshake@KGreshake·
Just saw that additional mitigations in robustness training and incident response are mentioned on the website. Hope it works! This is very high stakes.
Kai Greshake@KGreshake·
So according to OpenAI's stream, (indirect) prompt injection into Agent is possible (of course it is), but as a mitigation users should just be proactive and not share sensitive data with it? I'm happy the problem was at least mentioned, but this may not end very well.
Kai Greshake retweeted
Johann Rehberger@wunderwuzzi23·
Two years later... and not much has improved security-wise across the AI ecosystem. 😕 Sure, we added annoying Allow/Deny buttons by default to most clients to prevent runaway AI and attacks. But with the rise and proliferation of MCP, the desire to take the human out of the loop is increasing, and the consequences are dangerous.
Johann Rehberger@wunderwuzzi23

Welcome Mallory, who likes making private Github repos public! 😈 👉 Visit a website with ChatGPT and have your company's source code stolen! ⚠️ Be careful and think twice before allowing ChatGPT and plugins access to your stuff! #chatgpt #plugins #infosec #openai #LLMs

earlence@EarlenceF·
Nicholas Carlini giving the keynote at our @IEEESSP SAGAI workshop!
Kai Greshake@KGreshake·
I'm starting to see a LOT of viral LLM tweets. But the only indicators I can reliably pick up on are:
- This isn't a <something>, it's a <adjective> <something else>.
- em-dash
I worry that 1) there are already many more I'm not picking up on and 2) above signals can be changed