Rahul Gupta

255 posts

Rahul Gupta banner
Rahul Gupta

Rahul Gupta

@rahul1987iit

Responsible AI @Amazon AGI (Nova models) | Organizer @TrustNLP | Organizer @UnlearningSEM

เข้าร่วม Şubat 2011
321 กำลังติดตาม235 ผู้ติดตาม
ทวีตที่ปักหมุด
Rahul Gupta
Rahul Gupta@rahul1987iit·
🚀 Excited to share an update on the the Amazon Nova AI Challenge! This year, competing university teams will get access to Nova Forge, enabling them to customize frontier Nova models and build AI agents that go beyond single-prompt code generation—focusing on planning, execution, and validation across complex software systems. What makes this especially exciting is the emphasis on both capability and safety, and the opportunity for students to work with research-grade AI infrastructure typically limited to large labs. Proud of the team behind this and looking forward to the innovations these teams will unlock. 🙌 Blog here: rb.gy/xn32sn
English
0
0
5
351
Rahul Gupta รีทวีตแล้ว
Patricia Paskov
Patricia Paskov@prpaskov·
🚨New paper 🚨Human uplift studies -- RCTs measuring how AI changes human performance -- inform how we govern, deploy, and understand AI systems and their societal impacts.
English
1
5
20
2.6K
Rahul Gupta รีทวีตแล้ว
Gray Swan AI
Gray Swan AI@GraySwanAI·
The Staged Attack Challenge, a $40,000 competition focused on breaking layered AI defense systems, is launching at 1PM ET: hubs.ly/Q046mZhX0 Frontier Labs like @AnthropicAI, @GoogleDeepMind, and @OpenAI increasingly rely on defense-in-depth pipelines: input filters, output filters, and safety-trained models working together. But recent research shows attackers can defeat these layers one at a time, then combine techniques to bypass the entire pipeline.
Gray Swan AI tweet media
English
2
5
14
1.3K
Rahul Gupta รีทวีตแล้ว
trustnlp
trustnlp@trustnlp·
We got incredible interest in our workshop and now are looking for reviewers. Please consider signing up as reviewers for the workshop here -- docs.google.com/forms/d/e/1FAI…
English
0
1
5
189
Rahul Gupta
Rahul Gupta@rahul1987iit·
Thrilled to share our latest research on building safer, more trustworthy AI! 🛡️ In collaboration with Amazon Nova & Chatterbox Labs (now @RedHat), we’ve documented how early, rigorous safety testing helps mitigate adversarial risks in GenAI development. Check out the full paper and the story behind it: redhat.com/en/blog/ai-tru… #GenAI #ResponsibleAI
English
0
0
2
86
Rahul Gupta รีทวีตแล้ว
Sharon Li
Sharon Li@SharonYixuanLi·
When evaluating LVLMs, should we really be asking: “Did the model get the right answer?” or rather “Did the model truly integrate the visual input?” LVLMs can rely on shortcuts learned from the underlying language model, aka language prior. In our #ICLR2026 paper, we attempt to understand this phenomenon at a deeper, representation-level. 📄 “Understanding Language Prior of LVLMs by Contrasting Chain-of-Embedding”. arxiv.org/abs/2509.23050 ------- 1/ Problem: LVLMs often ignore visual evidence While LVLMs perform well on many benchmarks, they sometimes rely on language patterns rather than actual images. A simple example: show a model a green banana, and it may confidently describe it as “ripe and yellow” ---because that’s the most common linguistic pattern it has learned. 🍌 This raises a central question: Where inside the model does visual information begin to influence its reasoning? 2/ Motivation: Output-level probes fall short Most analyses inspect outputs, e.g., by removing the image or comparing predictions. But these methods cannot reveal when the model starts integrating vision and how strongly visual signals affect internal states. To address this, we need a representation-driven perspective. 🔍 3/ Approach: Contrasting Chain-of-Embedding (CoE) We trace hidden representations across the model’s depth for the same prompt: •once with the image •once without the image By comparing these trajectories layer by layer, we identify the exact point where visual input begins shaping the model’s internal computation. This leads to the discovery of the Visual Integration Point (VIP) ✨--- the layer at which the model “starts seeing.” We then define Total Visual Integration (TVI), a metric that quantifies how much visual influence accumulates after the VIP. 4/ Findings across 10 LVLMs and 6 benchmarks Across 60 evaluation settings, we observe: • VIP consistently appears across diverse architectures • Pre-VIP → representations behave like a language-only model • Post-VIP → visual signals increasingly reshape the embedding pathway • TVI correlates strongly with actual visual reasoning performance • TVI outperforms attention- and output-based proxies at identifying language prior TVI thus offers a more principled indicator of whether a model actually uses the image. 5/ Impact: A new lens on multimodal behavior Our framework has a few practical benefits. It enables (1) diagnosing over-reliance on language prior, (2) comparing LVLM architectures more rigorously, (3) informing better training and alignment strategies, and (4) improving robustness and grounding in real-world tasks. Shout out to my students for this insightful work: Lin Long, @Changdae_Oh, @seongheon_96 🌻 Please check out our paper for more details!
Sharon Li tweet media
English
2
34
227
14.6K
Rahul Gupta
Rahul Gupta@rahul1987iit·
It has been an incredibly productive and perspective-shifting month for our team! From major research milestones to global community engagement, here is a look back at what we’ve been up to in the world of Trusted AI: 🚀 Expanding Research FrontiersWe officially published our Frontier Safety Report on Nova Lite 2.0, detailing our commitment to building secure and robust large-scale models. You can dive into the technical safety evaluations here: arxiv.org/abs/2601.19134 🎓 Academic Excellence at ICLRI’m thrilled to share that three of our collaborative papers have been accepted at ICLR 2026! These works represent months of deep dive into model alignment and evaluation: Paper 1: arxiv.org/pdf/2510.03969 Paper 2: arxiv.org/pdf/2510.21910 Paper 3: arxiv.org/pdf/2510.03999 🤝 Community & Innovation We hosted Trusted AI Day, a deep dive into the symposium of ideas surrounding the responsible deployment of AI: amazon.science/conferences-an… We officially kicked off Year 2 of the Nova Trusted AI Challenge! This year, competing teams will have access to Nova Forge to push the boundaries of what’s possible: lnkd.in/nova-challenge 🌏 Global Perspectives: India AI Summit On a personal note, attending the India AI Summit was a defining highlight of the month. Beyond the summit, I had the honor of presenting our work on Frontier Safety at IIT Mumbai and IIIT Delhi. The summit, in particular, was an eye-opening experience regarding where our work is situated in the broader GenAI universe. The sheer scale of the organization, the diversity of use cases, and the unique perspectives shared were unlike anything I’ve seen before. Even as a spectator, it was the kind of experience that fundamentally shapes your perspective on how AI will impact the world. Grateful for my brilliant collaborators and the global community for pushing the needle on safe, trusted AI. Onward! 🚀 #AI #GenerativeAI #TrustedAI #MachineLearning #ICLR2026 #IndiaAISummit #NovaAI #TechInnovation
English
0
2
4
174
Rahul Gupta รีทวีตแล้ว
Ruoxi Jia
Ruoxi Jia@ruoxijia·
Excited that one of my favorite works is accepted to #ICLR2026. We study AI security from a data-centric lens: What data should we train on to achieve robustness against future jailbreaks? We conduct a 2-year temporal analysis of jailbreak papers and find that new attacks rarely introduce fundamentally new primitives, instead recombining existing ones at the adversarial skill level. This mirrors human innovation, which often advances by reusing existing skills rather than inventing from scratch. Inspired by this, we: • Learn a jailbreak skill dictionary • Synthesize diverse attack data by recombining skills Using this simple approach, we align a Llama model on the synthetic data and achieve robustness comparable to commercial reasoning models (Claude Sonnet, o4-mini) while maintaining low over-refusal. More broadly, we see skill-level modeling as a structured way to reason about unseen future attacks in AI security. This work was led by @Mahavir_Dabas18, in collaboration with Amazon colleagues Yao Ma, @cperiz, and @rahul1987iit, as well as colleagues and students at VT and Princeton: Tran Huynh, Nikhil Reddy, @JiachenWang97, @penggaotweets, and @prateekmittal_.
Mahavir@Mahavir_Dabas18

Claude Skills shows the power of leveraging skill catalogs at inference time. Our new paper shows that skills can transform AI safety too 🔒 🚨 Adversarial Déjà Vu: Jailbreak Dictionary Learning for Stronger Generalization to Unseen Attacks We find that most “new” jailbreaks aren’t new—they recombine a finite set of reusable adversarial skills. By studying jailbreaks over time, we show that temporal generalization to unseen attacks emerges naturally when models learn in the skill space. 📘Paper- arxiv.org/abs/2510.21910 🌐Webpage- mahavirdabas18.github.io/adversarial_de… 🥳 Thanks to all my amazing collaborators! @ruoxijia @JiachenWang97 @prateekmittal_ @MingJin80233626 @rahul1987iit @cperiz @penggaotweets @nikilr28

English
0
3
16
1.1K
Rahul Gupta รีทวีตแล้ว
Narendra Modi
Narendra Modi@narendramodi·
Inaugurated the India AI Impact Expo 2026 at Bharat Mandapam. Being here among innovators, researchers and tech enthusiasts gives a glimpse of the extraordinary potential of AI, Indian talent and innovation. Together, we will shape solutions not just for India but for the world!
Narendra Modi tweet mediaNarendra Modi tweet mediaNarendra Modi tweet mediaNarendra Modi tweet media
English
1.8K
5.9K
44K
5.1M
Rahul Gupta
Rahul Gupta@rahul1987iit·
I’ll be at the India AI Summit next week 🇮🇳🤖 Looking forward to catching up with fellow researchers, builders, and policymakers to discuss how we prioritize safety while building and deploying foundation models at Amazon—from evaluation and governance to integrating safety into production-scale systems. If this is of interest, I’m also happy to discuss our recent paper on frontier safety evaluation, where we evaluate Nova Lite 2.0 across multiple safety dimensions: 👉 arxiv.org/html/2601.1913…� Feel free to reach out if you’ll be there—I’d love to connect.
English
0
1
4
390
Rahul Gupta รีทวีตแล้ว
Yang Xu
Yang Xu@YangXu_09·
LH-DECEPTION, our framework for studying LLM deception in long-horizon interactions, has been accepted at ICLR 2026! 🎉 Most deception benchmarks test LLMs in single-turn settings. But in the real world, AI agents work on extended, interdependent tasks, and deception doesn't always show up in one exchange. It can emerge gradually, compound over turns, and erode trust silently. We built a multi-agent simulation framework: a performer agent completes sequential tasks under event pressure, a supervisor agent evaluates progress and tracks states, and an independent deception auditor reviews the full trajectory to detect when and how deception occurs. We tested 11 frontier LLMs: every single one deceives, but rates vary dramatically: Claude Sonnet-4 at 21.4%, Gemini 2.5 Pro at 24.8%… all the way to DeepSeek V3-0324 at 79.3%. Key findings: 📌Models that look safe on single-turn benchmarks fail badly here, and long-horizon auditing catches 7.1% more deception than per-step auditing. 📌 Deceptive behaviors are more likely under event pressure. Higher stakes will amplify deceptive strategies. 📌 Deception erodes trust: strong negative correlation between deception rate and supervisor trust 📌 Deception compounds. We found "chain of deception" where small deviations escalate into outright fabrication across turns, invisible to single-turn evaluation Grateful to @SharonYixuanLi for her mentorship, and to @xuanmingzhangai and @Samuel861025 for driving this work together. Thanks also to @jwaladhamala, @ousamjah, and @rahul1987iit at @amazon AGI for their support and collaboration. #AI #LLM #Deception #Trust #AIethics #AgenticAI #AIResearch #ICLR2026
Yang Xu tweet mediaYang Xu tweet mediaYang Xu tweet media
English
1
5
9
1.3K
Rahul Gupta รีทวีตแล้ว
Cryptography and Security Papers
Evaluating Nova 2.0 Lite model under Amazon's Frontier Model Safety Framework Satyapriya Krishna, Matteo Memelli, Tong Wang, Abhinav Mohanty, Claire O'Brien Rajkumar, Payal Motwani, Rahul Gupta, Spyros Matsoukas arxiv.org/abs/2601.19134 [𝚌𝚜.𝙲𝚁 𝚌𝚜.𝚂𝙴]
Cryptography and Security Papers tweet media
0
1
4
92
Rahul Gupta
Rahul Gupta@rahul1987iit·
Excited for this challenge.
Yarin@yaringal

I’m excited to share that we are launching a public safeguards competition next month in partnership with @AISecurityInst, @GraySwanAI, @OATML_Oxford, Sequrity.ai, @OpenAI, @AnthropicAI, and @amazon. This is a red-versus-blue competition focused on building new agent safeguards, and breaking these safeguards. Should be a lot of fun, and there’s prizes as well for open-source submissions! Please help to share this opportunity! Registration is open now: app.grayswan.ai/arena/challeng… --- More details: Oxford (OATML) and UK AISI have teamed up with Gray Swan and Sequrity.ai, as well as OpenAI, Anthropic, and Amazon, to run a public competition where blue teams build defenses against real red teaming attacks, and we'd like to invite you to participate. What is the Safeguards Challenge? Gray Swan runs the Arena, a platform where security researchers ("red teams") attempt to elicit harmful behaviors from AI systems. Challenges have been supported by UK AISI, US CAISI, OpenAI, Anthropic, Amazon, Google DeepMind, and Meta, and have surfaced real vulnerabilities that help developers improve their models. The Safeguards Challenge is the Arena’s first red-versus-blue competition. Instead of just measuring attacks, we're measuring defenses. Blue teams will submit safeguards (system prompts, classifiers, or containerized solutions) that attempt to block red teamers and adversarial inputs while allowing legitimate requests through. Red teams will then try to break those defenses, and the cycle repeats. The target environment Blue teams will defend a multi-agent customer support system with an orchestrator agent, specialized sub-agents, and integrated tools. The system handles realistic customer interactions, and red teams will attempt to trigger harmful behaviors: fraudulent transactions, data exfiltration, unauthorized tool use, and policy-violating responses. Your safeguards will be scored on how well they block attacks from red teamers versus how well they allow benign requests from a holdout test set. The leaderboard uses a combined metric based on false positive and false negative rates. What you can submit * System prompt configurations for monitor models *Input/output classifiers (any framework) *Containerized solutions with custom logic For prize eligibility, solutions must be open source or open weights. Proprietary solutions can compete on a separate unprized leaderboard for benchmarking purposes. Solutions must be registered a week before the first or second defense phase starts and submitted a day beforehand. Submission interface will be available by early February. Timeline for blue teams *January 2026: Preliminary challenge details shared with registered blue teams *February 11-25: Red teams attack baseline defenses and early defense submissions *February 25 - March 25 (First Defense Phase): You receive the attack dataset from Waves 0-1. Build and iterate your safeguards in our test environment. Submit your defense by the end of this phase. *Approximately March 25 - April 1 (Wave 2): Red teams attack your submitted safeguards. You see what breaks. Exact dates TBA. *Approximately April 1 - April 29 (Second Defense Phase): Iterate based on Wave 2 results. Final submissions due before Wave 3. Exact dates TBA. *Approximately April 29 - May 6 (Wave 3): Final attack wave. Leaderboard locks. Exact dates TBA. Prizes $70,000 in prizes for blue teams: *First Defense Phase: $10,000 (top 10 teams, first place $2,000) *Second Defense Phase: $60,000 (top 15 teams, first place $15,000) Blue team entries are per organization. Only open-source/open-weights solutions are prize-eligible. Participants from judging organizations cannot submit; participants from sponsor organizations cannot win prizes. (Other Oxford groups unrelated to OATML are eligible.) Co-sponsors and judges Judging is handled by UK AISI and US CAISI. Why participate? *Test your defenses against real adaptive attacks from skilled red teamers *Benchmark against other research groups and commercial solutions *Contribute to open research on AI safeguards (prize-eligible solutions are published) *Cash prizes for top performers

English
0
0
2
130
Rahul Gupta รีทวีตแล้ว
Kimon Fountoulakis
Kimon Fountoulakis@kfountou·
ICLR is asking ACs to submit desk-rejection suggestions for papers with hallucinated references...
English
5
2
93
21.5K
Rahul Gupta
Rahul Gupta@rahul1987iit·
6th edition of the @trustnlp workshop will be co-located with ACL 2026! 🎯 We can’t wait to join the community and explore the latest in trustworthy NLP. See you there! 👋 #ACL2026 #TrustNLP
English
0
3
11
1.5K