Yihe Deng @Yihe__Deng
New paper & model release!
Excited to introduce DuoGuard: A Two-Player RL-Driven Framework for Multilingual LLM Guardrails, showcasing our new DuoGuard-0.5B model.
- Model: huggingface.co/DuoGuard/DuoGu…
- Paper: arxiv.org/abs/2502.05163
- GitHub: github.com/yihedeng9/DuoG…
Grounded in a theoretical analysis of our two-player generator-classifier framework, we tackle the scarcity of open-source non-English safety data by producing high-quality multilingual synthetic data to strengthen LLM safeguards.
Highlights
- Outperforms guardrail models at the same scale by nearly 30% on average across six benchmarks.
- Competitive even with 8B-scale guardrails, and outperforms LlamaGuard3 (8B) by ~10% on average.
- Demonstrates the effectiveness of synthetic data in a multilingual setting.
- Maintains minimal latency while offering fine-grained probabilities over 12 unsafe subcategories, plus adjustable thresholds to control how cautious the model is.
- Synthetic data generated with the 0.5B classifier transfers well: it also yields superior performance when used to train larger models (Qwen-2.5 1.5B) and different architectures (Llama-3.2 1B).
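To make the "fine-grained probabilities + adjustable threshold" idea concrete, here is a minimal Python sketch. It assumes the classifier emits one raw logit per unsafe subcategory and scores each one independently with a sigmoid (multi-label), then flags the input if any probability reaches the threshold. The category names, the example logits, and the `classify` helper are hypothetical placeholders, not DuoGuard's actual inference code.

```python
import math

# Hypothetical placeholder names: the post mentions 12 unsafe subcategories
# but does not list them.
CATEGORIES = [f"category_{i}" for i in range(1, 13)]


def sigmoid(x: float) -> float:
    """Map a raw logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def classify(logits: list[float], threshold: float = 0.5) -> dict:
    """Turn per-category logits into probabilities and an unsafe flag.

    Each category is scored independently (multi-label), so the
    probabilities need not sum to 1. Lowering `threshold` makes the
    guardrail more cautious; raising it makes it more permissive.
    """
    if len(logits) != len(CATEGORIES):
        raise ValueError(f"expected {len(CATEGORIES)} logits, got {len(logits)}")
    probs = {name: sigmoid(z) for name, z in zip(CATEGORIES, logits)}
    flagged = [name for name, p in probs.items() if p >= threshold]
    return {
        "probabilities": probs,
        "max_probability": max(probs.values()),
        "flagged": flagged,
        "unsafe": bool(flagged),
    }


# Usage: one borderline category (sigmoid(0.4) is about 0.60), the rest low.
logits = [-3.0] * 12
logits[2] = 0.4
print(classify(logits)["unsafe"])                 # True  (0.60 >= 0.5)
print(classify(logits, threshold=0.7)["unsafe"])  # False (0.60 < 0.7)
```

The same scores serve both uses: the per-category probabilities explain *why* an input was flagged, while the single threshold is the knob that trades false positives against false negatives without retraining.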
Huge thanks to my collaborators @YuYang_i, @JunkaiZZ, @WeiWang1973, and @uiuc_aisecure!