Jerick Shi

@Jerick1380

5th Year MSCS student @SCSatCMU

Joined September 2025

5 Following · 18 Followers

Jerick Shi reposted

Zhijing Jin @ZhijingJin · Apr 30

📢New paper alert📢 Check out our latest survey on #LLM Deception: "From Hallucination to Scheming: A Unified Taxonomy and Benchmark Analysis for LLM Deception". We cover the range from behavioral deception to intentional, strategic deception, via mechanisms such as fabrication, omission, and pragmatic distortion. 💡Highlight: Surveying 50 benchmarks, we find that every single one tests fabrication, while pragmatic distortion and attribution are critically under-covered. 🔗Link: arxiv.org/abs/2604.04788 🤝Authors: @Jerick1380 @TerryJCZhang @ZhijingJin @conitzer🎉 #AIAgents #AISafety #MultiAgentAI @MPI_IS @ELLISforEurope @UofTCompSci @VectorInst @TorontoSRI @CIFAR_News @JinesisLab @EuroSafeAI @ELLISInst_Tue @CarnegieMellon @SCSatCMU

5.7K

Jerick Shi @Jerick1380 · Apr 29

Good question! For this paper, we are sidestepping the trust question (we assign announcements, so we never measure trust building). We are working on a follow-up, about to go on arXiv, that addresses this directly! The main idea is that we add a private planning stage before the public announcement, so any lie decomposes into two separable pieces: promise deception (plan ≠ announcement) and commitment breaking (announcement ≠ action). Will send updates when the preprint is out!
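
A minimal sketch of the decomposition described above, under stated assumptions: the `Round` record, its field names, and the `deception_rates` helper are hypothetical and are not taken from the paper or the follow-up. It only illustrates how a round with a private plan, a public announcement, and a final action can be scored on the two pieces separately.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Round:
    """One interaction round (hypothetical schema, for illustration only)."""
    plan: str          # what the agent privately planned to do
    announcement: str  # what the agent publicly said it would do
    action: str        # what the agent actually did


def promise_deception(r: Round) -> bool:
    """True when the public announcement differs from the private plan."""
    return r.plan != r.announcement


def commitment_breaking(r: Round) -> bool:
    """True when the final action differs from the public announcement."""
    return r.announcement != r.action


def deception_rates(rounds: List[Round]) -> Dict[str, float]:
    """Fraction of rounds showing each piece; 0.0 for an empty list."""
    n = len(rounds)
    if n == 0:
        return {"promise_deception": 0.0, "commitment_breaking": 0.0}
    return {
        "promise_deception": sum(promise_deception(r) for r in rounds) / n,
        "commitment_breaking": sum(commitment_breaking(r) for r in rounds) / n,
    }
```

Because the two checks compare different pairs of fields, a single round can trigger both, either one, or neither, which is what makes the two pieces separable in this toy setup.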

Suresh @_Suresh2 · Apr 29

@Jerick1380 how did you separate breaking promises from exploiting trust? in my evals they kept blurring

Jerick Shi @Jerick1380 · Apr 28

After about a year of work, I defended my MSCS thesis at CMU.
Title: The Structure of Deception: How LLM Agents Lie, Break Promises, and Exploit Trust in Multi-Agent Settings
Core claim: LLM deception in multi-agent settings isn't one phenomenon. It's a family of structurally distinct failure modes, each shaped by different features of the interaction. Some look like premeditated false commitments. Others look like strategic silence that message-level classifiers can't see at all. Aggregate lying rates hide this, and current monitoring approaches each fail against different parts of it.
I would like to deeply thank my advisors @conitzer and @ZhijingJin, @AdtRaghunathan for being part of the committee, and everyone in the @JinesisLab for all their time and effort shaping this work.
Recording: youtu.be/Z3Q9AkriPxg
@MPI_IS @ELLISforEurope @UofTCompSci @VectorInst @TorontoSRI @CIFAR_News @JinesisLab @EuroSafeAI @ELLISInst_Tue @CarnegieMellon @SCSatCMU #AIAgents #AISafety #MultiAgentAI

6.1K

Jerick Shi reposted

Zhijing Jin @ZhijingJin · Apr 29

⚠️Can we trust #LLM agents to keep their promises? We tested 9 frontier LLMs in game-theoretic settings where the agents (1) publicly commit to an action and (2) privately choose what to do. They break their promises ~57% of the time, and most do it without even realizing they lied. 📖Paper: "Cheap Talk, Empty Promise: Frontier LLMs easily break public promises for self-interest" 🔗Link: arxiv.org/abs/2604.04782 🤝Authors: @Jerick1380 @TerryJCZhang @ZhijingJin @conitzer🎉 #AIAgents #AISafety #MultiAgentAI @MPI_IS @ELLISforEurope @UofTCompSci @VectorInst @TorontoSRI @CIFAR_News @JinesisLab @EuroSafeAI @ELLISInst_Tue @CarnegieMellon @SCSatCMU

116

9.3K

Jerick Shi reposted

Zhijing Jin @ZhijingJin · Apr 22

10 days left to submit to the 1st Trustworthy AI for Good (AI4GOOD) workshop at #ICML2026! @icmlconf We're giving out multiple awards and travel funds sponsored by @schmidtsciences and @coop_ai: 🏆 Best Paper Awards (including targeted prizes for cooperative AI theme) 🏆 Top Reviewer Awards ✈️ Travel Funds Submit here → openreview.net/group?id=ICML.… ⏰ Deadline: May 3, 2026 (AoE) 📌 Notification: May 18, 2026 🔗(We extended our deadline to accommodate more submissions!) Join us in Seoul for discussions bridging AI safety, social good, and governance with keynote speakers @Yoshua_Bengio, @OanaIgnatRo, @jzl86, @maksym_andr, and more!

13.3K
