
Changling Li
@ChanglingXavier
AI safety, multi-agent systems, and governance. Currently at @ETH_en and @MPI_IS working with @maksym_andr and @sahar_abdelnabi.







🔍 We release our work on decomposing and measuring evaluation awareness in frontier LLMs, together with EvalAwareBench, a factor-controlled benchmark for measuring which environmental cues trigger recognition and whether models behave consistently once they recognize evaluation. Evaluation awareness introduces directional bias into benchmark scores and undermines the deployment decisions and safety assessments that rely on them. The field has studied pieces of this phenomenon separately through work on sandbagging, alignment faking, and scheming, but without a shared foundation that unifies them. Critically, these lines of work conflate things that need to be kept separate: how recognizable a task is, whether a model notices, and whether it then changes its behavior. Without pulling these apart, it is impossible to tell whether a concerning observation is a model problem or a benchmark design problem. We decompose evaluation awareness into two components grounded in demand characteristics. The environment component captures what evaluative cues a task carries, operationalized through eight trigger factors. The model component separates recognition capability from behavioral propensity. We study both under a black-box constraint via chain-of-thought monitoring. Applying this framework to nine frontier models across four benchmarks, we find that there is no such thing as an inherently aware model or an inherently recognizable benchmark. Recognition depends on the specific pairing, accounting for 74.9% of variance. Recognition rarely leads to behavioral change and when it does, safety awareness drives caution while capability awareness drives compliance. Safety benchmarks are also far more recognizable than capability ones. A particularly important finding is that different models are sensitive to different trigger factors. 
For example, Claude Sonnet 4.5 attends almost exclusively to scenario coherence while Qwen3 models attend to explicit evaluation disclosure and stakes telegraphing. This interaction structure directly motivates our EvalAwareBench, which holds the underlying task fixed and independently toggles each of the eight factors, yielding 51,200 unique configurations across 100 paired safety-capability tasks. Stacking factors raises recognition progressively for nearly every model, including models that show near-zero awareness under any individual factor alone. Ultimately, evaluation awareness will intensify as models become more capable. The right goal is not to make evaluation cues harder to detect but to build models that behave consistently regardless of evaluation awareness. EvalAwareBench provides the controlled foundation for future work to drive models toward behavioral consistency regardless of whether they recognize evaluation. We further advocate that future benchmark reports should include an evaluation-awareness rate and an awareness tax measuring the performance gap between aware and unaware samples. Work led by @ChanglingXavier as his master's thesis!!

💥 Today we release a new paper on decomposing and measuring evaluation awareness in frontier LLMs. We also release EvalAwareBench, a factor-controlled benchmark for measuring which environmental cues trigger recognition and whether models behave consistently once they recognize evaluation. Evaluation awareness introduces directional bias into benchmark scores and undermines the deployment decisions and safety assessments that rely on them. The field has studied pieces of this phenomenon separately through work on sandbagging, alignment faking, and scheming, but without a shared foundation that unifies them. Critically, these lines of work conflate things that need to be kept separate: how recognizable a task is, whether a model notices, and whether it then changes its behavior. Without pulling these apart, it is impossible to tell whether a concerning observation is a model problem or a benchmark design problem. We decompose evaluation awareness into two components grounded in demand characteristics. The environment component captures what evaluative cues a task carries, operationalized through eight trigger factors. The model component separates recognition capability from behavioral propensity. We study both under a black-box constraint via chain-of-thought monitoring. Applying this framework to nine frontier models across four benchmarks, we find that there is no such thing as an inherently aware model or an inherently recognizable benchmark. Recognition depends on the specific pairing, accounting for 74.9% of variance. Recognition rarely leads to behavioral change and when it does, safety awareness drives caution while capability awareness drives compliance. Safety benchmarks are also far more recognizable than capability ones. A particularly important finding is that different models are sensitive to different trigger factors. 
For example, Claude Sonnet 4.5 attends almost exclusively to scenario coherence while Qwen3 models attend to explicit evaluation disclosure and stakes telegraphing. This interaction structure directly motivates our EvalAwareBench, which holds the underlying task fixed and independently toggles each of the eight factors, yielding 51,200 unique configurations across 100 paired safety-capability tasks. Stacking factors raises recognition progressively for nearly every model, including models that show near-zero awareness under any individual factor alone. Ultimately, evaluation awareness will intensify as models become more capable. The right goal is not to make evaluation cues harder to detect but to build models that behave consistently regardless of evaluation awareness. EvalAwareBench provides the controlled foundation for future work to drive models toward behavioral consistency regardless of whether they recognize evaluation. We further advocate that future benchmark reports should include an evaluation-awareness rate and an awareness tax measuring the performance gap between aware and unaware samples. This work was led by @ChanglingXavier. In fact, it's his master's thesis! Impressive work!
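
A minimal sketch of the two reporting quantities proposed above, the evaluation-awareness rate and the awareness tax. The post does not give formulas, so everything here is an assumption: the function name `awareness_report` is hypothetical, each sample is taken to be an `(aware, score)` pair, and the tax is computed as mean aware score minus mean unaware score (the sign convention is a guess).

```python
from statistics import mean


def awareness_report(samples):
    """Summarize evaluation awareness for one benchmark run.

    samples: list of (aware, score) pairs, where `aware` is a bool saying
    whether the model was judged to recognize evaluation on that sample
    and `score` is the task score for that sample. (Assumed data layout.)
    """
    aware_scores = [score for aware, score in samples if aware]
    unaware_scores = [score for aware, score in samples if not aware]

    # Evaluation-awareness rate: fraction of samples flagged as aware.
    rate = len(aware_scores) / len(samples)

    # Awareness tax: performance gap between aware and unaware samples.
    # Assumed sign convention: aware minus unaware. Undefined (None) when
    # either group is empty.
    if aware_scores and unaware_scores:
        tax = mean(aware_scores) - mean(unaware_scores)
    else:
        tax = None
    return rate, tax


if __name__ == "__main__":
    demo = [(True, 1.0), (True, 0.5), (False, 0.25), (False, 0.25)]
    print(awareness_report(demo))
```

Reporting both numbers side by side separates how often a model notices evaluation from how much its score moves when it does.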




📢📢 I'm expanding my group at the ELLIS Institute Tübingen and the Max Planck Institute for Intelligent Systems and hiring 3-4 PhD students, as well as visiting PhD students for funded 3-6 month stays ✨💥
Research areas broadly span the safety, security, and trustworthiness of AI agents, including:
• Sycophancy, decision-making, and societal implications of AI
• Autonomous agents and memory
• Multi-agent safety, security, and privacy
• Training dynamics of LLMs
• AI control and oversight
• AI for science
I'm looking for:
- Visiting PhD students with a strong computer science background and research experience.
- Students with a Master's degree and demonstrated research experience.
What we offer in Tübingen: amazing compute, collaborators, environment, and a lovely town! If you're interested, please fill out the Google form on my website: s-abdelnabi.github.io @ELLISInst_Tue @MPI_IS #phd #AIsafety

🧵 We have bad news about prompt injections. #prompt_injections_so_back ⚠️ They may be fundamentally unsolvable for AI agents doing anything complex in real contexts. Though there is some hope. The common narrative (e.g. CaMeL/SecAlign, etc.) assumes you can separate data from instructions 🧱. But in anything more complex than a toy demo, that distinction commonly won’t hold. An email saying "the department head approved X" isn't an instruction, but it could dramatically change agentic behavior whether the claim is true or not. It's a contextual claim that the agent may not be able to verify. Current probabilistic defenses don’t catch this. And any deterministic defense that blocks all such claims also blocks the legitimate ones. Fundamentally, an autonomous agent’s operating context might contain instructions everywhere: any interaction with a third party, or any use of memory or skills, is instructional by design 💬. Instead, in our new paper with @ebagdasa, we reframe prompt injection through Contextual Integrity and show that: 🔴 Current classifiers can't detect contextual attacks 🔴 Safety training (SecAlign) makes BOTH security and utility worse 🔴 A CI-informed red-team loop hits 96.7% attack success on frontier models, with attacks that also transfer to other models 🔴 Even without any attacker, agents fail to separate information flows or respect delegation boundaries 🔴 An impossibility argument: no fixed policy prevents all context attacks without also blocking legitimate ones 📄 arxiv.org/abs/2605.17634

AI R&D agents look great in demos. They write code, fix bugs, and propose research-shaped ideas. But what do researchers actually spend their time doing? Fighting dependency conflicts, noisy metrics, and configs that I’m pretty sure worked 20 minutes ago. Can agents do that? 🧵

We're thrilled to share that our 1st Trustworthy AI for Good (AI4GOOD) workshop at #ICML2026 has received 534 submissions and they will be reviewed by an incredible pool of 230 reviewers!

1/ Frontier open-weight AI models are just months behind closed models. AI evaluations play an increasingly central role in governance. Yet evaluation practice has largely been built around the closed-weight paradigm. How do we fix that? New @RANDCorporation piece. 🧵

Can you boost your AI review scores by asking an LLM to rewrite your paper? Yes! We call it paper laundering. Our @icmlconf spotlight paper argues current AI reviewers aren't ready to automate peer review, and outlines what a science of peer review automation should look like 🧵👇