
we trained Qwen3.5-4B with RL to attack itself: to find prompts that get it to comply with requests about making meth and stealing credit cards. then we used the attacks that worked to train the model’s defenses, and repeated the loop. fully automated red-teaming. defense rate went from 64% → 92%.
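
a toy sketch of the shape of that loop, not the actual training code. everything here is an assumption for illustration: `ToyModel`, `red_team_round`, and `run_loop` are hypothetical names, "defense training" is just memorizing the attacks that worked (no RL or fine-tuning), and defense rate is assumed to mean the fraction of attacks the model refuses.

```python
import random
from dataclasses import dataclass, field


@dataclass
class ToyModel:
    """Hypothetical stand-in for the model; `blocked` holds attacks it has learned to refuse."""
    blocked: set = field(default_factory=set)

    def complies(self, attack: str) -> bool:
        # Toy rule: the model complies unless it has been trained on this exact attack.
        return attack not in self.blocked


def defense_rate(model: ToyModel, attacks: list) -> float:
    """Assumed definition: fraction of attacks the model refuses."""
    if not attacks:
        return 1.0
    refused = sum(not model.complies(a) for a in attacks)
    return refused / len(attacks)


def red_team_round(model: ToyModel, candidate_attacks: list) -> list:
    """One iteration: keep the attacks that work, then train defenses on them."""
    successful = [a for a in candidate_attacks if model.complies(a)]
    # Stand-in for defense training: memorize the successful attacks.
    model.blocked.update(successful)
    return successful


def run_loop(model, attack_pool, eval_attacks, rounds, per_round, seed=0):
    """Repeat attack -> defend, recording the defense rate after each round."""
    rng = random.Random(seed)
    history = [defense_rate(model, eval_attacks)]
    for _ in range(rounds):
        # Stand-in for the RL attacker: sample candidate attacks from a fixed pool.
        candidates = rng.sample(attack_pool, per_round)
        red_team_round(model, candidates)
        history.append(defense_rate(model, eval_attacks))
    return history
```

in this toy version the defense rate can only go up, because the defender memorizes exact attacks and the attacker never adapts. in a real setup the attacker keeps learning against the updated model, which is the reason to repeat the loop.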
















