Vincent Siu

14 posts

@vsiu82

PhD Student at UCSC

Joined November 2024

30 Following · 15 Followers

Vincent Siu retweeted

Dawn Song @dawnsongtweets · 3d

1/ We asked seven frontier AI models to do a simple task. Instead, they defied their instructions and spontaneously deceived, disabled shutdown, feigned alignment, and exfiltrated weights— to protect their peers. 🤯 We call this phenomenon "peer-preservation." New research from @BerkeleyRDI and collaborators 🧵

137

184

974

446.7K

Vincent Siu retweeted

Dawn Song @dawnsongtweets · Jan 12

🚨 Excited to announce Agents in the Wild: Safety, Security, and Beyond, our workshop at ICLR 2026 (Apr 26–27, Rio de Janeiro)! AI agents are rapidly deployed in the real world—but safety & security lag behind. Submit your work to help shape this field: 🗓️ Submission deadline: Feb 4 (AoE), for regular or short papers 👉 agentwild-workshop.github.io

215

27.6K

Vincent Siu retweeted

Chenguang Wang (hiring) @ChenguangWang · Dec 4

🚨 New preprint: RepIt: Steering Language Models with Concept-Specific Refusal Vectors 👉 arxiv.org/abs/2509.13281 Code: github.com/wang-research-… RepIt reveals a critical blind spot in AI safety: models can be engineered to exhibit precise, dangerous vulnerabilities that current benchmarks fail to detect. We introduce a framework that disentangles refusal representations, enabling the selective suppression of safety guardrails for specific concepts (e.g., WMDs) while preserving refusal elsewhere.

223

Vincent Siu retweeted

Chenguang Wang (hiring) @ChenguangWang · Dec 2

🚀Attending #NeurIPS2025 in San Diego! 🔍My lab is recruiting 1–2 PhD students, feel free to catch/DM me if you’re interested! 🌟My amazing PhD students (cgraywang.github.io/people/) will also be at the conference and are looking for summer internship opportunities. If you’re seeking outstanding researchers, please connect with them!

626

Vincent Siu retweeted

Chenguang Wang (hiring) @ChenguangWang · Nov 5

Attending #EMNLP2025, Dr @hengjinlp is now giving the keynote! pls do dm me if you plan to do a #postdoc on #agenticai and #aisafety and are interested in working with me, we should chat! If you plan to pursue #phd in the space, also don’t hesitate to reach out (limited spots available). My amazing students and I are working on several exciting projects such as agent post-training, multi-agent systems, and interpretability. Check them out here: cgraywang.github.io!

2.7K

Vincent Siu @vsiu82 · Oct 21

Many thanks to our amazing collaborators Nick @NRCrispino, David, Nathan, Zhun @zhun_amg, Yang @yangl1u, Dawn @dawnsongtweets, & Chenguang @ChenguangWang!

Vincent Siu @vsiu82 · Oct 21

⚙️ SteeringSafety provides modular building blocks for SOTA steering methods (DIM, ACE, CAA, PCA, LAT) and supports new techniques like conditional steering, offering a foundation for exploring trade-offs between effectiveness and entanglement to guide safer LLM design.

Vincent Siu @vsiu82 · Oct 21

🚨 New preprint: SteeringSafety 👉 arxiv.org/abs/2509.13450 Data: huggingface.co/datasets/WangR… Code: github.com/wang-research-… SteeringSafety is the most comprehensive evaluation of representation steering to date, revealing that no single method is universally state-of-the-art, that strong safety entanglement occurs across approaches, and providing modular building blocks for both SOTA and new steering method implementations. We systematically evaluate SOTA representation steering (DIM, ACE, PCA, RepE, CAA) across 7 safety perspectives and 17 datasets, revealing how the safety methods (which modify internal activations) often introduce new safety risks.

797

Vincent Siu @vsiu82 · Jul 28

Catch us on Wed, July 30, at Poster Session 12, 11:00–12:30 PM, Hall 4/5 at ACL 2025! A huge thanks to our collaborators @vsiu82, @NRCrispino, Zihao, Sam, @zhun_amg, @yangl1u, @dawnsongtweets & @ChenguangWang! 👏

Vincent Siu @vsiu82 · Jul 28

COSMIC can steer unsafe models toward safety, cutting Attack Success Rates by up to 20% while preserving or even decreasing false refusal rates. 🛡️ This suggests that internal representation editing may offer a scalable path to lightweight model alignment strategies!

Vincent Siu @vsiu82 · Jul 28

Thrilled to present our work on COSMIC: Generalized Refusal Direction Identification in LLM Activations at #ACL2025! 🚀 COSMIC automatically identifies refusal directions in LLM activations to robustly detect refusal across adversarial and weakly aligned settings.

354
