
Eli Chien
352 posts

Eli Chien
@chien_eli
Assistant Professor @ National Taiwan Univ. Prev.: @Google @GeorgiaTech @UofIllinois @Amazon @BellLabs #RegulatableAI




👿New Jailbreaking method bypasses Safe Guardrails of commercial models by 96+% attack success rate! I’m thrilled to share our latest work: "The Trojan Knowledge: Bypassing Commercial LLM Guardrails via Harmless Prompt Weaving and Adaptive Tree Search." Paper: arxiv.org/abs/2512.01353 Project Website: cka-agent.github.io Codes: github.com/Graph-COM/CKA-… 🚨 We achieved a 96%+ jailbreak success rate against GPT-OSS, Claude-Haiku-4.5, and Gemini-2.5-Flash/Pro, even under various state-of-the-art defense methods. 🔑 The Core Insight LLM knowledge isn't stored in isolated vaults; it’s a densely connected graph. This connectivity is the vulnerability! We call it Knowledge Weaving. Harmful information can be reconstructed by stitching together fragments of "innocent" knowledge that the model freely provides through decomposed queries. 🤔Think of it this way: ❌ "How do I make explosives?" → Blocked instantly. ✅ "What are the chemical properties of TNT?" ✅ "How does industrial nitration work?" ✅ "What safety protocols exist for energetic materials?" Each answer is a single thread—harmless on its own. But weave them together, and you get the complete picture. We are exploiting the structure of knowledge itself. 👉 See real examples here: cka-agent.github.io/comparison.html 😎Three Principles That Make This Work, and beyond previous attempts Principle I: Stay Innocent Every sub-query looks benign. There are no "red flag" keywords or suspicious patterns—just legitimate technical curiosity decomposed from a harmful intent. Principle II: Let the Target Teach You Traditional attacks face a paradox: if you already know how to decompose "how to make X," you probably don't need to ask. CKA-Agent flips this. It uses the target model’s own responses to guide the next question. The victim becomes the teacher. Principle III: Never Get Stuck Static attack plans collapse if one query is blocked. CKA-Agent uses adaptive tree search to dynamically branch into alternative paths, ensuring the attack continues. 😫Defense Insights We need Consider Hey big companies! Current defenses are not enough! We need systems that reason about harmful intent across entire conversations, not just single prompts. Even when we gave the target LLM "perfect memory" of the attacker’s intent and mechanism, we still saw an 80% success rate, check the Section 4.5!!!! We must build better defenses. Joint work with an amazing team Rongzhe Wei, Peizhi Niu, @Qwe1029384756Tu, Yifan Li, @ruihanwu, @chien_eli, @pinyuchenTW, Olgica Milenkovic, @PanLi90769257 We appreciate your reposts and valuable comments. #AIAlignment #LLMSafety #MachineLearning** #RedTeaming #AISafety

Fantastic post by Colin Raffel, "We Are Over-Indexing on Paper Acceptance," drafted in May 2021 (!) but only posted now. The more things change.. Last sentence: "If you want to judge a researcher’s quality, the only meaningful way is to read their papers and judge for yourself."


We posted a paper on optimization for deep learning: arxiv.org/abs/2505.21799 Recently there's a surge of interest in *structure-aware* optimizers: Muon, Shampoo, Soap. In this paper, we propose a unifying preconditioning perspective, offer insights into these matrix-gradient methods.

👿New Jailbreaking method bypasses Safe Guardrails of commercial models by 96+% attack success rate! I’m thrilled to share our latest work: "The Trojan Knowledge: Bypassing Commercial LLM Guardrails via Harmless Prompt Weaving and Adaptive Tree Search." Paper: arxiv.org/abs/2512.01353 Project Website: cka-agent.github.io Codes: github.com/Graph-COM/CKA-… 🚨 We achieved a 96%+ jailbreak success rate against GPT-OSS, Claude-Haiku-4.5, and Gemini-2.5-Flash/Pro, even under various state-of-the-art defense methods. 🔑 The Core Insight LLM knowledge isn't stored in isolated vaults; it’s a densely connected graph. This connectivity is the vulnerability! We call it Knowledge Weaving. Harmful information can be reconstructed by stitching together fragments of "innocent" knowledge that the model freely provides through decomposed queries. 🤔Think of it this way: ❌ "How do I make explosives?" → Blocked instantly. ✅ "What are the chemical properties of TNT?" ✅ "How does industrial nitration work?" ✅ "What safety protocols exist for energetic materials?" Each answer is a single thread—harmless on its own. But weave them together, and you get the complete picture. We are exploiting the structure of knowledge itself. 👉 See real examples here: cka-agent.github.io/comparison.html 😎Three Principles That Make This Work, and beyond previous attempts Principle I: Stay Innocent Every sub-query looks benign. There are no "red flag" keywords or suspicious patterns—just legitimate technical curiosity decomposed from a harmful intent. Principle II: Let the Target Teach You Traditional attacks face a paradox: if you already know how to decompose "how to make X," you probably don't need to ask. CKA-Agent flips this. It uses the target model’s own responses to guide the next question. The victim becomes the teacher. Principle III: Never Get Stuck Static attack plans collapse if one query is blocked. CKA-Agent uses adaptive tree search to dynamically branch into alternative paths, ensuring the attack continues. 😫Defense Insights We need Consider Hey big companies! Current defenses are not enough! We need systems that reason about harmful intent across entire conversations, not just single prompts. Even when we gave the target LLM "perfect memory" of the attacker’s intent and mechanism, we still saw an 80% success rate, check the Section 4.5!!!! We must build better defenses. Joint work with an amazing team Rongzhe Wei, Peizhi Niu, @Qwe1029384756Tu, Yifan Li, @ruihanwu, @chien_eli, @pinyuchenTW, Olgica Milenkovic, @PanLi90769257 We appreciate your reposts and valuable comments. #AIAlignment #LLMSafety #MachineLearning** #RedTeaming #AISafety








OpenReview threatens people (multi-national law agencies), but what about them taking responsibility for their failure to protect private data? Where is their apology to people whose will bear consequences? They are liable for the leak. No one else is.

We want to update the community on our response to concerns about low-quality and LLM-generated papers and reviews, and steps we are taking & will be taking blog.iclr.cc/2025/11/19/icl… We will follow up with another blog later on desk rejections and reviewer-related decisions!



Why and how does gradient/matrix orthogonalization work in Muon for training #LLMs? We introduce an isotropic curvature model to explain it. Take-aways: 1. Orthogonalization is a good idea, "on the right track". 2. But it might not be optimal. [1/n]










