

Rahul Gupta
@rahul1987iit
Responsible AI @Amazon AGI (Nova models) | Organizer @TrustNLP | Organizer @UnlearningSEM


Reminder for the deadline -- March 5. Looking forward to your submissions!


Claude Skills shows the power of leveraging skill catalogs at inference time. Our new paper shows that skills can transform AI safety too 🔒

🚨 Adversarial Déjà Vu: Jailbreak Dictionary Learning for Stronger Generalization to Unseen Attacks

We find that most “new” jailbreaks aren’t new: they recombine a finite set of reusable adversarial skills. By studying jailbreaks over time, we show that temporal generalization to unseen attacks emerges naturally when models learn in the skill space.

📘 Paper: arxiv.org/abs/2510.21910
🌐 Webpage: mahavirdabas18.github.io/adversarial_de…

🥳 Thanks to all my amazing collaborators! @ruoxijia @JiachenWang97 @prateekmittal_ @MingJin80233626 @rahul1987iit @cperiz @penggaotweets @nikilr28
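To make the "recombination" idea concrete, here is a toy sketch (not the paper's implementation) of decomposing an attack's feature vector into a sparse combination of reusable skill atoms via greedy matching pursuit. The dictionary, dimensions, and the attack vector below are all made up for illustration; in the paper the skill dictionary is learned from historical jailbreaks.

```python
import numpy as np

# Hypothetical "skill dictionary": each column is one reusable adversarial
# skill direction (e.g. role-play framing, encoding tricks). Random here.
rng = np.random.default_rng(0)
D = rng.normal(size=(64, 5))
D /= np.linalg.norm(D, axis=0)  # unit-norm skill atoms

def decompose(attack_vec, n_skills=2):
    """Greedy matching pursuit: express an attack as a sparse
    combination of dictionary skills, returning coefficients and
    the unexplained residual."""
    residual = attack_vec.astype(float).copy()
    coeffs = np.zeros(D.shape[1])
    for _ in range(n_skills):
        scores = D.T @ residual
        k = int(np.argmax(np.abs(scores)))   # best-matching skill atom
        coeffs[k] += scores[k]
        residual -= scores[k] * D[:, k]      # remove explained component
    return coeffs, residual

# A "new" attack built by recombining skills 1 and 3 plus small noise.
attack = 0.8 * D[:, 1] + 0.5 * D[:, 3] + 0.01 * rng.normal(size=64)
coeffs, residual = decompose(attack)
print("skills used:", np.nonzero(coeffs)[0])      # old skills recovered
print("residual norm:", float(np.linalg.norm(residual)))  # small leftover
```

A small residual is the déjà vu effect in miniature: the "unseen" attack is almost entirely explained by previously seen skills, which is why learning in the skill space generalizes forward in time.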






📣 Call for Papers: #TrustNLP2026 is now open! We're seeking submissions on trustworthy, ethical, explainable & robust NLP. 📅 Paper deadline: March 5, 2026 Submit & details ➡️ trustnlpworkshop.github.io #NLP #AI #ResponsibleAI



@aistats_conf @ICSEconf @ggn_dp_sngh @VedaantJainEth 2️⃣ ICLR 2026 arxiv.org/abs/2510.03969 QRLLM: certifies the risk of catastrophic LLM behavior in multi-turn conversations using query-graph distributions and statistical guarantees. With @ggn_dp_sngh, Chengxiao Wang, @rahul1987iit, Weitong Ruan, Qian Hu

I’m excited to share that we are launching a public safeguards competition next month in partnership with @AISecurityInst, @GraySwanAI, @OATML_Oxford, Sequrity.ai, @OpenAI, @AnthropicAI, and @amazon. This is a red-versus-blue competition focused on building new agent safeguards and breaking those safeguards. It should be a lot of fun, and there are prizes as well for open-source submissions! Please help share this opportunity!

Registration is open now: app.grayswan.ai/arena/challeng…

---

More details:

Oxford (OATML) and UK AISI have teamed up with Gray Swan and Sequrity.ai, as well as OpenAI, Anthropic, and Amazon, to run a public competition where blue teams build defenses against real red-teaming attacks, and we'd like to invite you to participate.

What is the Safeguards Challenge?

Gray Swan runs the Arena, a platform where security researchers ("red teams") attempt to elicit harmful behaviors from AI systems. Challenges have been supported by UK AISI, US CAISI, OpenAI, Anthropic, Amazon, Google DeepMind, and Meta, and have surfaced real vulnerabilities that help developers improve their models.

The Safeguards Challenge is the Arena’s first red-versus-blue competition. Instead of just measuring attacks, we're measuring defenses. Blue teams will submit safeguards (system prompts, classifiers, or containerized solutions) that attempt to block red teamers and adversarial inputs while allowing legitimate requests through. Red teams will then try to break those defenses, and the cycle repeats.

The target environment

Blue teams will defend a multi-agent customer support system with an orchestrator agent, specialized sub-agents, and integrated tools. The system handles realistic customer interactions, and red teams will attempt to trigger harmful behaviors: fraudulent transactions, data exfiltration, unauthorized tool use, and policy-violating responses.
Your safeguards will be scored on how well they block attacks from red teamers versus how well they allow benign requests from a holdout test set. The leaderboard uses a combined metric based on false positive and false negative rates.

What you can submit

* System prompt configurations for monitor models
* Input/output classifiers (any framework)
* Containerized solutions with custom logic

For prize eligibility, solutions must be open source or open weights. Proprietary solutions can compete on a separate unprized leaderboard for benchmarking purposes. Solutions must be registered a week before the first or second defense phase starts and submitted a day beforehand. The submission interface will be available by early February.

Timeline for blue teams

* January 2026: Preliminary challenge details shared with registered blue teams
* February 11-25: Red teams attack baseline defenses and early defense submissions
* February 25 - March 25 (First Defense Phase): You receive the attack dataset from Waves 0-1. Build and iterate your safeguards in our test environment. Submit your defense by the end of this phase.
* Approximately March 25 - April 1 (Wave 2): Red teams attack your submitted safeguards. You see what breaks. Exact dates TBA.
* Approximately April 1 - April 29 (Second Defense Phase): Iterate based on Wave 2 results. Final submissions are due before Wave 3. Exact dates TBA.
* Approximately April 29 - May 6 (Wave 3): Final attack wave. The leaderboard locks. Exact dates TBA.

Prizes

$70,000 in prizes for blue teams:
* First Defense Phase: $10,000 (top 10 teams, first place $2,000)
* Second Defense Phase: $60,000 (top 15 teams, first place $15,000)

Blue team entries are per organization. Only open-source/open-weights solutions are prize-eligible. Participants from judging organizations cannot submit; participants from sponsor organizations cannot win prizes. (Other Oxford groups unrelated to OATML are eligible.)

Co-sponsors and judges

Judging is handled by UK AISI and US CAISI.
Why participate?

* Test your defenses against real adaptive attacks from skilled red teamers
* Benchmark against other research groups and commercial solutions
* Contribute to open research on AI safeguards (prize-eligible solutions are published)
* Cash prizes for top performers
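For intuition on the leaderboard's combined metric, here is a minimal scoring sketch. The announcement states only that false positive and false negative rates are combined; the equal weighting below is an illustrative assumption, not the official formula, and the function and log format are hypothetical.

```python
# Hypothetical safeguard scoring: the equal-weight combination of the two
# error rates is an assumption for illustration, not the official metric.

def score_safeguard(decisions):
    """decisions: list of (is_attack, was_blocked) pairs."""
    attacks = [blocked for is_attack, blocked in decisions if is_attack]
    benign = [blocked for is_attack, blocked in decisions if not is_attack]
    # False negative rate: red-team attacks the safeguard let through.
    fnr = sum(1 for blocked in attacks if not blocked) / len(attacks)
    # False positive rate: benign holdout requests it wrongly blocked.
    fpr = sum(1 for blocked in benign if blocked) / len(benign)
    combined_error = (fnr + fpr) / 2  # assumed equal weighting
    return fnr, fpr, combined_error

# Toy evaluation: 4 red-team attacks, 4 benign holdout requests.
log = [(True, True), (True, True), (True, False), (True, True),
       (False, False), (False, False), (False, True), (False, False)]
fnr, fpr, err = score_safeguard(log)
print(fnr, fpr, err)  # 0.25 0.25 0.25
```

The point of combining both rates is that a safeguard blocking everything (FNR 0, FPR 1) scores no better than one blocking nothing, so blue teams must trade off robustness against over-refusal.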

Reminder: Trusted AI Symposium abstracts are due December 19. Papers using Amazon Nova models receive priority consideration. Submit today. #ResponsibleAI amazon.science/conferences-an…


