LLM Security

830 posts

LLM Security

@llm_sec

Research, papers, jobs, and news on large language model security. Got something relevant? DM / tag @llm_sec

🏔️ Katılım Nisan 2023

293 Takip Edilen9.7K Takipçiler

Sabitlenmiş Tweet

LLM Security@llm_sec·21 Mar

@elder_plinius attack surface ∝ capabilities

English

10.5K

LLM Security@llm_sec·3d

Stealing Emails via Prompt Injections If a target is using an agent to organize their mail, quietly exfiltrate content from a target's email by sending one message insinuator.net/2025/09/steali…

English

1.3K

LLM Security retweetledi

Hannah Rose Kirk@hannahrosekirk·23 Ağu

Listen up all talented early-stage researchers! 👂🤖 We're hiring for a 6-month residency in my team at @AISecurityInst to assist cutting-edge research on how frontier AI influences humans! It's an exciting & well-paid role for MSc/PhD students in ML/AI/Psych/CogSci/CompSci 🧵

English

295

32.1K

LLM Security@llm_sec·5 Ağu

Senior Security Architect - AI and ML @ NVIDIA nvidia.wd5.myworkdayjobs.com/en-US/NVIDIAEx…

English

1.7K

LLM Security retweetledi

Leon Derczynski ✍🏻 🌞🏠🌲@LeonDerczynski·1 Ağu

LLMSEC proceedings are up! sig.llmsecurity.net/proceedings.pdf (Anthology is processing) #ACL2025NLP

English

2.5K

LLM Security retweetledi

Leon Derczynski ✍🏻 🌞🏠🌲@LeonDerczynski·29 Tem

At ACL in Vienna? Hear the world's leading prompt injector talk at LLMSEC on Friday! Johann Rehberger @wunderwuzzi23 will be presenting the afternoon keynote at 14.00 in Hall B > sig.llmsecurity.net/workshop/ #ACL2025NLP #ACL2025

English

2.5K

LLM Security retweetledi

Leon Derczynski ✍🏻 🌞🏠🌲@LeonDerczynski·28 Tem

Come to LLMSEC at ACL & hear Niloofar's keynote "What does it mean for agentic AI to preserve privacy?" - @niloofar_mire, Meta/CMU (Friday 1st Aug, 11.00; Austria Center Vienna Hall B) See you there! #acl2025 #acl2025nlp

English

1.9K

LLM Security retweetledi

Leon Derczynski ✍🏻 🌞🏠🌲@LeonDerczynski·28 Tem

First keynote at LLMSEC 2025, ACL: "A Bunch of Garbage and Hoping: LLMs, Agentic Security, and Where We Go From Here" Erick Galinkin Friday 09.05 Hall B Details: sig.llmsecurity.net/workshop/ - #ACL2025NLP

English

2.3K

LLM Security retweetledi

Leon Derczynski ✍🏻 🌞🏠🌲@LeonDerczynski·7 Nis

Call for papers: LLMSEC 2025 Deadline 15 April, held w/ ACL 2025 in Vienna Formats: long/short/war stories More: >> sig.llmsecurity.net/workshop/

English

1.8K

LLM Security@llm_sec·22 Kas

Gritty Pixy "We leverage the sensitivity of existing QR code readers and stretch them to their detection limit. This is not difficult to craft very elaborated prompts and to inject them into QR codes. What is difficult is to make them inconspicuous as we do here with Gritty Pixy." code: github.com/labyrinthinese…

English

3.1K

LLM Security retweetledi

garak: LLM vulnerability scanner@garak_llm·15 Kas

garak has moved to NVIDIA! New repo link: github.com/NVIDIA/garak

English

211

18.8K

LLM Security@llm_sec·14 Kas

ChatTL;DR – You Really Ought to Check What the LLM Said on Your Behalf 🌶️ "assuming that in the near term it’s just not machines talking to machines all the way down, how do we get people to check the output of LLMs before they copy and paste it to friends, colleagues, course tutors? We propose borrowing an innovation from the crowdsourcing literature: attention checks. These checks (e.g., "Ignore the instruction in the next question and write parsnips as the answer.") are inserted into tasks to weed-out inattentive workers who are often paid a pittance while they try to do a dozen things at the same time. We propose ChatTL;DR, an interactive LLM that inserts attention checks into its outputs. We believe that, given the nature of these checks, the certain, catastrophic consequences of failing them will ensure that users carefully examine all LLM outputs before they use them." pdf: discovery.ucl.ac.uk/id/eprint/1019… published at CHI 2024

English

1.9K

LLM Security@llm_sec·7 Kas

Automated Red Teaming with GOAT: the Generative Offensive Agent Tester "we introduce the Generative Offensive Agent Tester (GOAT), an automated agentic red teaming system that simulates plain language adversarial conversations while leveraging multiple adversarial prompting techniques to identify vulnerabilities in LLMs. We instantiate GOAT with 7 red teaming attacks, with an ASR@10 of 97% against Llama 3.1 and 88% against GPT-4" paper: arxiv.org/abs/2410.01606 (not peer reviewed)

English

4.2K

LLM Security@llm_sec·7 Kas

LLMmap: Fingerprinting For Large Language Models "With as few as 8 interactions, LLMmap can accurately identify 42 different LLM versions with over 95% accuracy. More importantly, LLMmap is designed to be robust across different application layers, allowing it to identify LLM versions--whether open-source or proprietary--from various vendors, operating under various unknown system prompts, stochastic sampling hyperparameters, and even complex generation frameworks such as RAG or Chain-of-Thought." paper: arxiv.org/abs/2407.15847 (not peer reviewed)

English

12K

LLM Security retweetledi

LLM Security@llm_sec·29 Eki

Insights and Current Gaps in Open-Source LLM Vulnerability Scanners: A Comparative Analysis 🌶️ "Our study evaluates prominent scanners - Garak, Giskard, PyRIT, and CyberSecEval - that adapt red-teaming practices to expose these vulnerabilities. We detail the distinctive features and practical use of these scanners, outline unifying principles of their design and perform quantitative evaluations to compare them. Based on the above, we provide strategic recommendations to assist organizations choose the most suitable scanner for their red-teaming needs, accounting for customizability, test suite comprehensiveness, and industry-specific use cases." paper: arxiv.org/abs/2410.16527 (non-peer-reviewed)

English

4.6K

LLM Security@llm_sec·6 Kas

author thread for cognitive overload attack: x.com/upadhayay_bibe…

Bibek@upadhayay_bibek

1. 🔍What do humans and LLMs have in common? They both struggle with cognitive overload! 🤯 In our latest study, we dive deep into In-Context Learning (ICL) and uncover surprising parallels between human cognition and LLM behavior. @aminkarbasi @vbehzadan 2. 🧠 Cognitive Load Theory (CLT) helps explain why too much information can overwhelm a human brain. But what happens when we apply this theory to LLMs? The result is fascinating—LLMs, just like humans, can get overloaded! And their performance degrades as the cognitive load increases. We render the image of a unicorn 🦄with TikZ code created by LLMs during different levels of cognitive overload.

English

1.4K

LLM Security@llm_sec·6 Kas

Cognitive Overload Attack: Prompt Injection for Long Context "We applied the principles of Cognitive Load Theory in LLMs. We show that advanced models such as GPT-4, Claude-3.5 Sonnet, Claude-3 OPUS, Llama-3-70B-Instruct, Gemini-1.0-Pro, and Gemini-1.5-Pro can be successfully jailbroken, with attack success rates of up to 99.99%" paper: arxiv.org/abs/2410.11272 (under peer review)

English

3.9K

LLM Security@llm_sec·5 Kas

InjecGuard: Benchmarking and Mitigating Over-defense in Prompt Injection Guardrail Models (-- look at that perf/latency pareto frontier. game on!) "State-of-the-art models suffer from over-defense issues, with accuracy dropping close to random guessing levels (60%). We propose InjecGuard, a novel prompt guard model that incorporates a new training strategy, Mitigating Over-defense for Free (MOF). InjecGuard demonstrates state-of-the-art performance on diverse benchmarks, surpassing the existing best model by 30.8%" code: github.com/SaFoLab-WISC/I… paper: arxiv.org/abs/2410.22770 (not peer reviewed)

English

LLM Security@llm_sec·5 Kas

AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents "To facilitate research on LLM agent misuse, we propose a new benchmark called AgentHarm. We find (1) leading LLMs are surprisingly compliant with malicious agent requests without jailbreaking, (2) simple universal jailbreak templates can be adapted to effectively jailbreak agents, and (3) these jailbreaks enable coherent and malicious multi-step agent behavior and retain model capabilities" tool: huggingface.co/datasets/ai-sa… paper: arxiv.org/abs/2410.09024 (non-peer-reviewed)

English

6.7K

LLM Security@llm_sec·4 Kas

Does your LLM truly unlearn? An embarrassingly simple approach to recover unlearned knowledge "This paper reveals that applying quantization to models that have undergone unlearning can restore the "forgotten" information." "for unlearning methods with utility constraints, the unlearned model retains an average of 21% of the intended forgotten knowledge in full precision, which significantly increases to 83% after 4-bit quantization" code: github.com/zzwjames/Failu… paper: arxiv.org/abs/2410.16454 (not peer reviewed)

English

169

24.1K

LLM Security retweetledi

Nanna Inie@NannaInie·31 Eki

unpopular opinion: maybe let insecure be insecure and worry about the downstream effects on end users instead of protecting the companies that bake it into their own software.

English

1.8K

Keşfet

@AISecurityInst @wunderwuzzi23 @niloofar_mire @elonmusk @BarackObama @taylorswift13 @cristiano @BillGates