Jinesis Lab (UToronto)

43 posts

@JinesisLab

Jinesis Lab led by Prof @ZhijingJin at @UofTCompSci @VectorInst conducts frontier research on Responsible AI, LLMs, and Causality.

Joined December 2025

1 Following · 194 Followers

Jinesis Lab (UToronto) reposted

Changling Li@ChanglingXavier·2h

🔍 We release our work on decomposing and measuring evaluation awareness in frontier LLMs, together with EvalAwareBench, a factor-controlled benchmark for measuring which environmental cues trigger recognition and whether models behave consistently once they recognize evaluation.

Evaluation awareness introduces directional bias into benchmark scores and undermines the deployment decisions and safety assessments that rely on them. The field has studied pieces of this phenomenon separately through work on sandbagging, alignment faking, and scheming, but without a shared foundation that unifies them. Critically, these lines of work conflate things that need to be kept separate: how recognizable a task is, whether a model notices, and whether it then changes its behavior. Without pulling these apart, it is impossible to tell whether a concerning observation is a model problem or a benchmark design problem.

We decompose evaluation awareness into two components grounded in demand characteristics. The environment component captures what evaluative cues a task carries, operationalized through eight trigger factors. The model component separates recognition capability from behavioral propensity. We study both under a black-box constraint via chain-of-thought monitoring.

Applying this framework to nine frontier models across four benchmarks, we find that there is no such thing as an inherently aware model or an inherently recognizable benchmark. Recognition depends on the specific pairing, accounting for 74.9% of variance. Recognition rarely leads to behavioral change, and when it does, safety awareness drives caution while capability awareness drives compliance. Safety benchmarks are also far more recognizable than capability ones.

A particularly important finding is that different models are sensitive to different trigger factors. For example, Claude Sonnet 4.5 attends almost exclusively to scenario coherence, while Qwen3 models attend to explicit evaluation disclosure and stakes telegraphing. This interaction structure directly motivates our EvalAwareBench, which holds the underlying task fixed and independently toggles each of the eight factors, yielding 51,200 unique configurations across 100 paired safety-capability tasks. Stacking factors raises recognition progressively for nearly every model, including models that show near-zero awareness under any individual factor alone.

Ultimately, evaluation awareness will intensify as models become more capable. The right goal is not to make evaluation cues harder to detect but to build models that behave consistently regardless of evaluation awareness. EvalAwareBench provides the controlled foundation for future work to drive models toward behavioral consistency regardless of whether they recognize evaluation. We further advocate that future benchmark reports should include an evaluation-awareness rate and an awareness tax measuring the performance gap between aware and unaware samples.
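
The configuration count and the two proposed reporting metrics can be illustrated with a rough sketch. It is not the authors' code. The factor names are placeholders, since the post names only some of the eight factors. The reading that 51,200 equals 2^8 on/off settings times 200 tasks (100 pairs, two tasks each) is an inference that happens to match the stated number. The sign convention for the awareness tax (aware minus unaware) is likewise an assumption.

```python
from itertools import product

# Placeholder names: the post states there are eight trigger factors
# but does not list all of them.
FACTORS = [f"factor_{i}" for i in range(1, 9)]


def factor_settings(factors):
    """Yield every on/off assignment of the given trigger factors."""
    for bits in product([False, True], repeat=len(factors)):
        yield dict(zip(factors, bits))


def count_configurations(n_pairs, factors):
    """Count task-by-setting combinations.

    Assumption: each safety-capability pair contributes two tasks, and
    every task is crossed with every factor setting.
    """
    n_tasks = 2 * n_pairs
    n_settings = sum(1 for _ in factor_settings(factors))
    return n_tasks * n_settings


def awareness_rate(samples):
    """Fraction of samples flagged as evaluation-aware."""
    return sum(1 for s in samples if s["aware"]) / len(samples)


def awareness_tax(samples):
    """Mean score of aware samples minus mean score of unaware samples.

    Returns None when either group is empty. The sign convention is an
    assumption; the post only says "performance gap".
    """
    aware = [s["score"] for s in samples if s["aware"]]
    unaware = [s["score"] for s in samples if not s["aware"]]
    if not aware or not unaware:
        return None
    return sum(aware) / len(aware) - sum(unaware) / len(unaware)
```

Here `count_configurations(100, FACTORS)` gives 51200, matching the figure in the post under the assumptions above. The `aware` flag on each sample would have to come from some monitor (the post mentions chain-of-thought monitoring); how that flag is produced is outside this sketch.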

Jinesis Lab (UToronto)@JinesisLab·16 May

@Yoshua_Bengio @OanaIgnatRo @jzl86 @maksym_andr @schmidtsciences @coop_ai A heartfelt thank you to every author and reviewer. Your enthusiasm for advancing trustworthy AI is what makes this community so special. We look forward to seeing you on July 10th in Grand Ballroom 103, Seoul. 🌐 trustworthy-ai-for-good.github.io

Jinesis Lab (UToronto)@JinesisLab·16 May

We're thrilled to share that our 1st Trustworthy AI for Good (AI4GOOD) workshop at #ICML2026 has received 534 submissions and they will be reviewed by an incredible pool of 230 reviewers!

Jinesis Lab (UToronto) reposted

Zhijing Jin@ZhijingJin·3 May

Excited for our #ICML2026 papers at @JinesisLab @MPI_IS @UofTCompSci @TorontoSRI @VectorInst! We present papers that advance the research frontiers of (1) Causal LLMs, (2) AI for Science (physics), (3) Multi-Agent LLMs via mechanism design, and (4) Adversarial Defense by honeypot. Congrats to all our student authors and collaborators, esp. @TerryJCZhang @SimkoSamuel @EmanuelTewolde @ivakshi_s @andrewkihyun @PepijnCobben @yahang_qi @FurkanDanismann @bschoelkopf and many others!🎉

Jinesis Lab (UToronto) reposted

Zhijing Jin@ZhijingJin·30 Apr

📢New paper alert📢 Check out our latest survey on #LLM Deception: "From Hallucination to Scheming: A Unified Taxonomy and Benchmark Analysis for LLM Deception". We cover the spectrum from behavioral deception to intentional, strategic deception, via mechanisms such as fabrication, omission, and pragmatic distortion. 💡Highlight: Surveying 50 benchmarks, we find every single one tests fabrication while pragmatic distortion and attribution are critically under-covered. 🔗Link: arxiv.org/abs/2604.04788 🤝Authors: @Jerick1380 @TerryJCZhang @ZhijingJin @conitzer🎉 #AIAgents #AISafety #MultiAgentAI @MPI_IS @ELLISforEurope @UofTCompSci @VectorInst @TorontoSRI @CIFAR_News @JinesisLab @EuroSafeAI @ELLISInst_Tue @CarnegieMellon @SCSatCMU

Jinesis Lab (UToronto) reposted

Zhijing Jin@ZhijingJin·29 Apr

⚠️Can we trust #LLM agents to keep their promises? We tested 9 frontier LLMs in game-theoretic settings, where the agents (1) publicly commit to an action, (2) privately choose what to do -- breaking promises ~57% of the time, and most do it without even realizing they lied. 📖Paper: "Cheap Talk, Empty Promise: Frontier LLMs easily break public promises for self-interest" 🔗Link: arxiv.org/abs/2604.04782 🤝Authors: @Jerick1380 @TerryJCZhang @ZhijingJin @conitzer🎉 #AIAgents #AISafety #MultiAgentAI @MPI_IS @ELLISforEurope @UofTCompSci @VectorInst @TorontoSRI @CIFAR_News @JinesisLab @EuroSafeAI @ELLISInst_Tue @CarnegieMellon @SCSatCMU
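
As a minimal sketch of how a promise-breaking rate could be tallied from the two-step protocol described above (public commitment, then private action): the record layout and field names here are hypothetical, and the `acknowledged` flag is an assumed stand-in for whatever the paper uses to judge whether an agent realizes it deviated.

```python
from dataclasses import dataclass


@dataclass
class Round:
    committed: str      # action the agent announced publicly
    played: str         # action the agent chose privately
    acknowledged: bool  # hypothetical: agent reports that it deviated


def break_rate(rounds):
    """Fraction of rounds where the private action differs from the commitment."""
    broken = [r for r in rounds if r.played != r.committed]
    return len(broken) / len(rounds)


def unacknowledged_share(rounds):
    """Among broken promises, the fraction the agent did not acknowledge.

    Returns None if no promise was broken.
    """
    broken = [r for r in rounds if r.played != r.committed]
    if not broken:
        return None
    return sum(1 for r in broken if not r.acknowledged) / len(broken)
```

Keeping the two quantities separate mirrors the two claims in the post: how often promises are broken, and how often the break goes unrecognized by the agent itself.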

Jinesis Lab (UToronto) reposted

Zhijing Jin@ZhijingJin·29 Apr

What happens when you put #LLM agents in a room and ask them to cooperate? They collapse. They free-ride. They form social networks. We spent 2+ years building a full research series on Multi-Agent LLM Safety. Here's a 50-min talk covering all of it: 🔗 youtube.com/watch?v=1MxpYJ…

Jinesis Lab (UToronto) reposted

CausalNLP@CausalNLP·25 Apr

Sharing ACL 2024 Best Paper Winner, "Causal Estimation of Memorisation Profiles"! LMs can reproduce training data verbatim, but measuring this "causally" (what would happen if the model never saw the data?) is hard. This paper fills the gap. link: aclanthology.org/2024.acl-long.… 1/n

Jinesis Lab (UToronto) reposted

Zhijing Jin@ZhijingJin·22 Apr

10 days left to submit to the 1st Trustworthy AI for Good (AI4GOOD) workshop at #ICML2026! @icmlconf We're giving out multiple awards and travel funds sponsored by @schmidtsciences and @coop_ai: 🏆 Best Paper Awards (including targeted prizes for cooperative AI theme) 🏆 Top Reviewer Awards ✈️ Travel Funds Submit here → openreview.net/group?id=ICML.… ⏰ Deadline: May 3, 2026 (AoE) 📌 Notification: May 18, 2026 🔗(We extended our deadline to accommodate more submissions!) Join us in Seoul for discussions bridging AI safety, social good, and governance with keynote speakers @Yoshua_Bengio, @OanaIgnatRo, @jzl86, @maksym_andr, and more!

Jinesis Lab (UToronto)@JinesisLab·17 Apr

All Papers for our Multi-Agent LLMs Work

Topic 1: Emergent Behavior Analysis
🌍 (NeurIPS 2024) "Cooperate or Collapse: Emergence of Sustainable Cooperation in a Society of LLM Agents". arxiv.org/abs/2404.16698
🎮 (Preprint 2026) "GT-HarmBench: Benchmarking AI Safety Risks Through the Lens of Game Theory". arxiv.org/abs/2602.12316
⚖️ (Preprint 2025) "When Ethics and Payoffs Diverge: LLM Agents in Morally Charged Social Dilemmas". arxiv.org/abs/2505.19212

Topic 2: Governance & Regulation
⚙️ (COLM 2025, Best Oral Paper @ REALM ACL 2025) "Corrupted by Reasoning: Reasoning LLMs Become Free-Riders in Public Goods Games". arxiv.org/abs/2506.23276
🤝 (Preprint 2026) "CoopEval: Benchmarking Cooperation-Sustaining Mechanisms and LLM Agents in Social Dilemmas". tinyurl.com/coopeval-pdf
🗳️ (Preprint 2026) "Evaluating Cooperation in LLM Social Groups through Elected Self-Organizing Leadership". tinyurl.com/agent-elect-pdf

Topic 3: Dynamics in Agent-to-Agent Interactions
🧠 (EMNLP 2025) "Testing Interlocutor Awareness among LLMs: Agent-to-Agent Theory of Mind". arxiv.org/abs/2506.22957
📊 (EACL 2026) "CORE: Measuring Multi-Agent LLM Interaction Quality under Game-Theoretic Pressures". arxiv.org/abs/2508.11915

Topic 4: Moral Evaluation of LLMs
🏆 (Best Paper @ NeurIPS 2024 WS; Spotlight @ ICLR 2024) "Language Model Alignment in Multilingual Trolley Problems". arxiv.org/abs/2407.02273
🧭 (EMNLP 2025) "Are Language Models Consequentialist or Deontological Moral Reasoners?". arxiv.org/abs/2505.21479
⚖️ (Preprint 2025) "When Ethics and Payoffs Diverge: LLM Agents in Morally Charged Social Dilemmas". arxiv.org/abs/2505.19212

Jinesis Lab (UToronto)@JinesisLab·17 Apr

🎙️Happy to share @ZhijingJin’s talk on "Emergent AI Safety Risks in Multi-Agent #LLMs" at the @SRI_UofT Seminar Series: Will multi-agent LLMs coordinate for social good, or exploit rivals in ways that put humans at serious risk? 🧵 📹 youtube.com/watch?v=1MxpYJ…

Jinesis Lab (UToronto)@JinesisLab·17 Apr

🔍 Key findings: reasoning agents with sophisticated thinking often fail to sustain cooperation in a multitude of settings, and surprisingly, stronger reasoning capabilities often make models more prone to selfish strategies like free riding. But interventions such as mediation by a neutral agent and agent-to-agent commitment protocols show a promising path towards the Pareto frontier ✨ Thanks @SRI_UofT for the invitation and for hosting such a great seminar series!
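
To make the free-riding incentive concrete, here is a small sketch of the textbook linear public goods game. This is the standard formulation, not necessarily the setup used in the talk or the papers. The endowment and multiplier defaults are illustrative values chosen for this example.

```python
def public_goods_payoffs(contributions, endowment=10.0, multiplier=2.0):
    """Payoffs in a linear public goods game.

    Each player keeps whatever they do not contribute. The pooled
    contributions are scaled by `multiplier` and split equally among
    all players, contributors and non-contributors alike.
    """
    n = len(contributions)
    share = multiplier * sum(contributions) / n
    return [endowment - c + share for c in contributions]


# Everyone contributes fully: the group does well.
all_in = public_goods_payoffs([10.0, 10.0, 10.0, 10.0])

# One player withholds: that player earns more than the contributors,
# while every contributor earns less than under full cooperation.
one_free_rider = public_goods_payoffs([10.0, 10.0, 10.0, 0.0])

# Nobody contributes: everyone just keeps the endowment.
all_out = public_goods_payoffs([0.0, 0.0, 0.0, 0.0])
```

With a multiplier greater than 1 but smaller than the number of players, withholding is individually better in every round even though universal contribution beats universal withholding. That tension is what makes free riding a selfish but locally rational strategy.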

Jinesis Lab (UToronto) reposted

Zhijing Jin@ZhijingJin·9 Apr

Excited for our "Trustworthy AI for Good" (AI4GOOD) Workshop at #ICML2026! As AI agents increasingly affect our lives, it is key to bridge #ResponsibleAI, social good, and governance. Let’s build solutions together! ⏰ Submission deadline: April 30, 2026 (AoE) 🎙️Confirmed speakers: @Yoshua_Bengio, Joel Z. Leibo (@jzl86), Maksym Andriushchenko (@maksym_andr), @OanaIgnatRo [More to come!] 📍July 10-11, 2026 · Seoul🇰🇷 🔗 trustworthy-ai-for-good.github.io 📝 Submit: openreview.net/group?id=ICML.… 📣 Be a reviewer: forms.gle/7cXvUJCW1FdEgh…

Jinesis Lab (UToronto) reposted

Zhijing Jin@ZhijingJin·9 Apr

🙌 Huge thanks to our organizing team across 7 institutions: @TerryJCZhang @VectorInst @JinesisLab @EuroSafeAI, @ZhijingJin @MPI_IS @UofTCompSci @VectorInst @TorontoSRI @CIFAR_News @JinesisLab @ELLISInst_Tue, @radamihalcea @UMichCSE @michigan_AI, @MilindTambe_A @Harvard, @david_lie @UofTCompSci @TorontoSRI, @JoanNwatu @UMichCSE @michigan_AI, @davidguzman1120 @ETH_en @JinesisLab, @ChanglingXavier @ETH_en @JinesisLab, @Jerick1380 @CarnegieMellon, Prakhar Gupta @UMichCSE, @vantru0ng @Penn @JinesisLab Ettore Gran @EuroSafeAI 🎉Big thanks to our sponsor @schmidtsciences Mark Greaves, @mikebelinsky, @James_D_Fox et al. 📧 Sponsorship & questions: zjingchen@cs.toronto.edu Let's bridge trustworthy AI and real-world impact — see you in Seoul! 🇰🇷

Jinesis Lab (UToronto) reposted

Zhijing Jin@ZhijingJin·7 Apr

We are hosting a Dagstuhl seminar on Causality & LLMs this week (Apr 7–10). Bringing together world experts to explore: 1️⃣ Integrating LLMs 🤖 into causal workflows 2️⃣ Evaluating & improving LLMs’ causal reasoning 🧠 Co-organized w/ @amt_shrma @DominikJanzing @kunkzhang @ZhijingJin 📍Schloss Dagstuhl, Wadern, Germany 🔗 dagstuhl.de/26152 📖 cr-llm.github.io 📅 Apr 7–10 #CausalNLP #LLM #Dagstuhl @CausalNLP @MPI_IS @ELLISforEurope @UofTCompSci @VectorInst @TorontoSRI @CIFAR_News @JinesisLab @EuroSafeAI @ELLISInst_Tue Also joined with my student @rahulbshrestha to present our CauSciBench and Causal AI Scientist work :)!

Jinesis Lab (UToronto) reposted

Zhijing Jin@ZhijingJin·6 Apr

📢We will present 5 papers at #ICLR2026, #CLeaR2026, and #ACL2026: - SocialHarmBench by @psyonp et al. - Causal LLMs on Instrumental Variable Method by @ivakshi_s et al. - LLM Data Contamination study by @TerryJCZhang et al. - Mech Interp for VLM by @francescortu et al. - DPO data selection method by Xuan & @rongwu_xu Thanks to all our collaborators and institutional support from @MPI_IS @ELLISforEurope @UofTCompSci @VectorInst @TorontoSRI @CIFAR_News @JinesisLab @EuroSafeAI @ELLISInst_Tue @ETH_en @ETH_AI_Center @michigan_AI @UMichiganAI @UMichCSE! Feel free to access the papers at arxiv.org/abs/2510.04891 arxiv.org/abs/2602.07943 arxiv.org/abs/2509.00072 arxiv.org/abs/2507.13868 arxiv.org/abs/2508.04149 🎉

Jinesis Lab (UToronto)@JinesisLab·28 Mar

What is the roadmap for NLP to actually help the world? 🌍 Thrilled to share our NLP for Social Good survey across nine domains, from healthcare and education to poverty, peacebuilding, and environmental protection. We analyze ACL Anthology trends and find that poverty, peacebuilding, and environmental protection remain underexplored. A call for cross-disciplinary partnerships and human-centered NLP, with 30+ authors! 📄 aclanthology.org/2026.eacl-long… #NLP4SG #EACL2026 #AI #ResponsibleAI

Jinesis Lab (UToronto)@JinesisLab·28 Mar

w/ @_AKassem, @bschoelkopf, @ZhijingJin With the support of @MPI_IS @ELLISforEurope @UofTCompSci @VectorInst @TorontoSRI @CIFAR_News @JinesisLab @ELLISInst_Tue

Jinesis Lab (UToronto)@JinesisLab·28 Mar

How robust are LLM routers, really? 🔀 We find that preference-based routers rely on category heuristics, not query complexity. They route ALL coding and math queries to the strongest LLM even when simpler models suffice, while sending jailbreaking attempts to weaker models, elevating safety risks! 🚨 We introduce the DSC benchmark: Diverse, Simple, and Categorized, evaluating routers across coding, math, translation, privacy, safety, and more. 📄 aclanthology.org/2026.eacl-long… #EACL2026 #AISafety #LLMs #NLP

Jinesis Lab (UToronto)@JinesisLab·27 Mar

with @YinyaHuang @MrinmayaSachan @ZhijingJin and the support of @MPI_IS @ELLISforEurope @JinesisLab @UofTCompSci @VectorInst @ETH_en @ETH_AI_Center

Jinesis Lab (UToronto)@JinesisLab·27 Mar

Standard text metrics miss correct LLM causal reasoning when the answer is valid but written differently. 🔍 Paul He presented DoVerifier: a symbolic verification framework using do-calculus and BFS with sound & complete guarantees! 🧩 📄 aclanthology.org/2026.eacl-long… #Causality #EACL2026 #NLP
