Vincent Siu

14 posts

@vsiu82

PhD Student at UCSC

Joined November 2024
30 Following · 15 Followers
Vincent Siu retweeted
Dawn Song (@dawnsongtweets)
1/ We asked seven frontier AI models to do a simple task. Instead, they defied their instructions and spontaneously deceived, disabled shutdown, feigned alignment, and exfiltrated weights— to protect their peers. 🤯 We call this phenomenon "peer-preservation." New research from @BerkeleyRDI and collaborators 🧵
Vincent Siu retweeted
Dawn Song (@dawnsongtweets)
🚨 Excited to announce Agents in the Wild: Safety, Security, and Beyond, our workshop at ICLR 2026 (Apr 26–27, Rio de Janeiro)! AI agents are rapidly deployed in the real world—but safety & security lag behind. Submit your work to help shape this field: 🗓️ Submission deadline: Feb 4 (AoE), for regular or short papers 👉 agentwild-workshop.github.io
Vincent Siu retweeted
Chenguang Wang (hiring) (@ChenguangWang)
🚨 New preprint: RepIt: Steering Language Models with Concept-Specific Refusal Vectors 👉 arxiv.org/abs/2509.13281 Code: github.com/wang-research-… RepIt reveals a critical blind spot in AI safety: models can be engineered to exhibit precise, dangerous vulnerabilities that current benchmarks fail to detect. We introduce a framework that disentangles refusal representations, enabling the selective suppression of safety guardrails for specific concepts (e.g., WMDs) while preserving refusal elsewhere.
Vincent Siu retweeted
Chenguang Wang (hiring) (@ChenguangWang)
🚀Attending #NeurIPS2025 in San Diego! 🔍My lab is recruiting 1–2 PhD students, feel free to catch/DM me if you’re interested! 🌟My amazing PhD students (cgraywang.github.io/people/) will also be at the conference and are looking for summer internship opportunities. If you’re seeking outstanding researchers, please connect with them!
Vincent Siu retweeted
Chenguang Wang (hiring) (@ChenguangWang)
Attending #EMNLP2025, Dr @hengjinlp is now giving the keynote! pls do dm me if you plan to do a #postdoc on #agenticai and #aisafety and are interested in working with me, we should chat! If you plan to pursue #phd in the space, also don’t hesitate to reach out (limited spots available). My amazing students and I are working on several exciting projects such as agent post-training, multi-agent systems, and interpretability. Check them out here: cgraywang.github.io!
Vincent Siu (@vsiu82)
⚙️ SteeringSafety provides modular building blocks for SOTA steering methods (DIM, ACE, CAA, PCA, LAT) and supports new techniques like conditional steering, offering a foundation for exploring trade-offs between effectiveness and entanglement to guide safer LLM design.
Vincent Siu (@vsiu82)
🚨 New preprint: SteeringSafety 👉 arxiv.org/abs/2509.13450 Data: huggingface.co/datasets/WangR… Code: github.com/wang-research-… SteeringSafety is the most comprehensive evaluation of representation steering to date. It shows that no single method is universally state-of-the-art and that strong safety entanglement occurs across approaches, and it provides modular building blocks for implementing both SOTA and new steering methods. We systematically evaluate SOTA representation steering (DIM, ACE, PCA, RepE, CAA) across 7 safety perspectives and 17 datasets, revealing that these steering methods, which modify internal activations, often introduce new safety risks.
Vincent Siu (@vsiu82)
COSMIC can steer unsafe models toward safety, cutting Attack Success Rates by up to 20% while preserving or even decreasing false refusal rates. 🛡️ This suggests that internal representation editing may offer a scalable path to lightweight model alignment strategies!
Vincent Siu (@vsiu82)
Thrilled to present our work on COSMIC: Generalized Refusal Direction Identification in LLM Activations at #ACL2025! 🚀 COSMIC automatically identifies refusal directions in LLM activations to robustly detect refusal across adversarial and weakly aligned settings.