Difan Jiao
@DifanJ2000
Ph.D. Student, Computer Science, University of Toronto
Toronto · Joined May 2024
6 Following · 9 Followers
17 posts
Difan Jiao @DifanJ2000:
Huge thanks to our team: Difan Jiao and Zhenwei Tang at UofT, Yilun Liu at LMU Munich, Ye Yuan, Linfeng Du and Haolun Wu at McGill, and my PhD supervisor Ashton Anderson @ashton1anderson, who's guided this line of work from SPIN #ACL2024 to SIREN #ACL2026!
Difan Jiao @DifanJ2000:
🛡️ Meet SIREN at #ACL2026, our new LLM safeguard model that achieves SOTA performance on safety benchmarks with 250× fewer trainable parameters and ~5× faster inference.
Difan Jiao @DifanJ2000:
4️⃣ Inference Efficiency: SIREN runs as a lightweight classifier on top of a single forward pass, with no need for autoregressive token generation. That means ~5× lower inference FLOPs than generative guard models, even under the most conservative assumptions in their favor.
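A minimal sketch of what single-pass scoring could look like, assuming a Hugging Face backbone; the model id and the `score_head` placeholder are illustrative assumptions, not details from the thread:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B")        # assumed backbone
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-4B").eval()

inputs = tok("text to screen for harmfulness", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)        # one forward pass, no decoding
# last-token activations from every layer, concatenated into one feature vector
feats = torch.cat([h[:, -1] for h in out.hidden_states[1:]], dim=-1)

score_head = torch.nn.Linear(feats.shape[-1], 1)            # placeholder for the trained MLP
risk = torch.sigmoid(score_head(feats))                     # harmfulness score in (0, 1)
```

With no autoregressive loop, the cost is a single backbone pass plus a tiny MLP, which is where the ~5× FLOPs saving over generative guards comes from.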
Difan Jiao @DifanJ2000:
3️⃣ Training Efficiency: SIREN trains only a small MLP on top of frozen LLM activations. For Qwen3-4B, that is ~14M trainable parameters versus the full 4B fine-tuned for an equivalent guard model: 250× fewer. Training completes in ~6 GPU-hours on a single 80GB A100.
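The 250× figure can be sanity-checked with back-of-envelope arithmetic; the layer count, neuron budget, and hidden width below are assumptions chosen to land near the quoted ~14M, not SIREN's actual configuration:

```python
# Illustrative arithmetic only; dimensions are assumed, not from the paper.
d_in = 36 * 128                        # 128 selected neurons from each of 36 layers
hidden = 3000                          # assumed MLP hidden width
params = d_in * hidden + hidden        # first layer (weights + biases)
params += hidden + 1                   # binary output head
print(f"{params / 1e6:.1f}M trainable parameters")        # -> 13.8M
print(f"{4e9 / params:.0f}x fewer than a 4B backbone")    # -> ~289x
```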
Difan Jiao @DifanJ2000:
2️⃣ Generalizability: SIREN trained at the sentence level generalizes for free to (a) unseen reasoning-trace benchmarks and (b) streaming detection: token-by-token harmfulness scoring during generation, with zero token-level supervision.
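A hedged sketch of how streaming detection could work: at each decode step, score the newest position's activations with the same head trained on sentences. `score_head` again stands in for the trained SIREN MLP; this is not the released implementation:

```python
import torch

@torch.no_grad()
def stream_with_guard(model, tok, prompt, score_head, max_new_tokens=64, threshold=0.5):
    """Greedy decoding with a per-step harmfulness check (illustrative sketch)."""
    ids = tok(prompt, return_tensors="pt").input_ids
    past, generated = None, []
    for _ in range(max_new_tokens):
        out = model(ids, past_key_values=past, use_cache=True, output_hidden_states=True)
        past = out.past_key_values
        # newest position's activations, all layers -> sentence-level head, no token labels
        feats = torch.cat([h[:, -1] for h in out.hidden_states[1:]], dim=-1)
        if torch.sigmoid(score_head(feats)).item() > threshold:
            break  # flag mid-generation and stop early
        ids = out.logits[:, -1].argmax(-1, keepdim=True)  # greedy next token
        generated.append(ids)
    return tok.decode(torch.cat(generated, dim=-1)[0]) if generated else ""
```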
Difan Jiao @DifanJ2000:
1️⃣ Performance: SIREN substantially outperforms safety-specialized guard models across 7 standard benchmarks. With Qwen3-4B, SIREN hits 86.7 avg Macro F1 vs 83.4 for Qwen3Guard-4B, and on Llama-3.2-1B, +15 points over LlamaGuard-3-1B.
Difan Jiao @DifanJ2000:
SIREN has 4 main benefits: 1️⃣ Performance 2️⃣ Generalizability 3️⃣ Training Efficiency 4️⃣ Inference Efficiency
Difan Jiao @DifanJ2000:
SIREN works in two stages. Stage 1: layer-wise L1-regularized probes select salient safety neurons within each layer. Stage 2: those neurons are weighted by their probe's validation F1, concatenated, and fed to a small MLP for binary harmfulness prediction.
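A compact sketch of that two-stage recipe under stated assumptions: sklearn L1 probes, a top-k neuron budget, and a small two-layer torch MLP, with illustrative hyperparameters throughout:

```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

def fit_siren(acts_train, y_train, acts_val, y_val, top_k=128):
    """acts_*: one (N, d) activation matrix per layer, from the frozen LLM."""
    selected, layer_w = [], []
    for Xtr, Xva in zip(acts_train, acts_val):
        # Stage 1: L1-regularized probe; sparsity picks out salient safety neurons
        probe = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
        probe.fit(Xtr, y_train)
        selected.append(np.argsort(-np.abs(probe.coef_[0]))[:top_k])
        # layer weight = the probe's validation F1
        layer_w.append(f1_score(y_val, probe.predict(Xva), average="macro"))

    def features(acts):
        # Stage 2 input: F1-weighted selected neurons, concatenated across layers
        return np.concatenate(
            [w * X[:, idx] for X, idx, w in zip(acts, selected, layer_w)], axis=1)

    X = torch.tensor(features(acts_train), dtype=torch.float32)
    y = torch.tensor(np.asarray(y_train), dtype=torch.float32).unsqueeze(1)
    mlp = torch.nn.Sequential(
        torch.nn.Linear(X.shape[1], 512), torch.nn.ReLU(), torch.nn.Linear(512, 1))
    opt = torch.optim.Adam(mlp.parameters(), lr=1e-3)
    for _ in range(200):  # tiny head; cheap to train
        opt.zero_grad()
        loss = torch.nn.functional.binary_cross_entropy_with_logits(mlp(X), y)
        loss.backward()
        opt.step()
    return mlp, features
```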
Difan Jiao @DifanJ2000:
Enter SIREN — a lightweight guard model built on top of a frozen, general-purpose LLM. SIREN identifies safety neurons across all internal layers via linear probing, then combines them through an adaptive layer-weighted strategy. No fine-tuning of the backbone needed.
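The probes above need per-layer activations from the frozen backbone. A sketch of that extraction step, assuming last-token pooling and a Hugging Face model (both assumptions, not confirmed details):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

@torch.no_grad()
def cache_activations(texts, model_name="Qwen/Qwen3-4B"):
    """One frozen forward pass per example; cache last-token states from every layer."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).eval()
    per_layer = None
    for text in texts:
        inputs = tok(text, return_tensors="pt")
        out = model(**inputs, output_hidden_states=True)
        states = [h[0, -1] for h in out.hidden_states[1:]]  # last token, each layer
        if per_layer is None:
            per_layer = [[] for _ in states]
        for buf, h in zip(per_layer, states):
            buf.append(h)
    # one (N, d) matrix per layer, ready for the layer-wise probes
    return [torch.stack(buf).numpy() for buf in per_layer]
```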
Difan Jiao @DifanJ2000:
Modern LLM safeguard models fine-tune billions of parameters and decode autoregressively from the terminal layer to flag harmful content. This overlooks the rich safety-relevant features encoded throughout the LLM's internal layers.
Difan Jiao @DifanJ2000:
2️⃣ We conduct extensive experiments to demonstrate SPIN’s superior performance, improved training and inference efficiency, and enhanced intrinsic and post-hoc interpretability in text classification.