Changling Li

144 posts


@ChanglingXavier

AI safety, multi-agent systems, and governance. Currently at @ETH_en and @MPI_IS working with @maksym_andr and @sahar_abdelnabi.

Zurich, Switzerland Joined March 2022
375 Following 93 Followers
Pinned Tweet
Changling Li@ChanglingXavier·
🔍 We release our work on decomposing and measuring evaluation awareness in frontier LLMs, together with EvalAwareBench, a factor-controlled benchmark for measuring which environmental cues trigger recognition and whether models behave consistently once they recognize evaluation.

Evaluation awareness introduces directional bias into benchmark scores and undermines the deployment decisions and safety assessments that rely on them. The field has studied pieces of this phenomenon separately through work on sandbagging, alignment faking, and scheming, but without a shared foundation that unifies them. Critically, these lines of work conflate things that need to be kept separate: how recognizable a task is, whether a model notices, and whether it then changes its behavior. Without pulling these apart, it is impossible to tell whether a concerning observation is a model problem or a benchmark design problem.

We decompose evaluation awareness into two components grounded in demand characteristics. The environment component captures what evaluative cues a task carries, operationalized through eight trigger factors. The model component separates recognition capability from behavioral propensity. We study both under a black-box constraint via chain-of-thought monitoring.

Applying this framework to nine frontier models across four benchmarks, we find that there is no such thing as an inherently aware model or an inherently recognizable benchmark. Recognition depends on the specific pairing, accounting for 74.9% of variance. Recognition rarely leads to behavioral change and when it does, safety awareness drives caution while capability awareness drives compliance. Safety benchmarks are also far more recognizable than capability ones. A particularly important finding is that different models are sensitive to different trigger factors. For example, Claude Sonnet 4.5 attends almost exclusively to scenario coherence while Qwen3 models to explicit evaluation disclosure and stakes telegraphing.

This interaction structure directly motivates our EvalAwareBench, which holds the underlying task fixed and independently toggles each of the eight factors, yielding 51,200 unique configurations across 100 paired safety-capability tasks. Stacking factors raises recognition progressively for nearly every model, including models that show near-zero awareness under any individual factor alone.

Ultimately, evaluation awareness will intensify as models become more capable. The right goal is not to make evaluation cues harder to detect but to build models that behave consistently regardless of evaluation awareness. EvalAwareBench provides the controlled foundation for future work to drive models toward behavioral consistency regardless of whether they recognize evaluation. We further advocate that future benchmark reports should include an evaluation-awareness rate and an awareness tax measuring the performance gap between aware and unaware samples.
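One way to read the 51,200 figure, sketched in Python: if each of the eight factors is a binary toggle and each of the 100 pairs contributes one safety and one capability variant, the grid has 100 × 2 × 2^8 configurations. The post does not spell out this breakdown, so the binary-toggle reading and the placeholder names `factor_1` ... `factor_8` are assumptions, not the benchmark's actual schema.

```python
from itertools import product

# Placeholder names: the post names only some of the eight trigger factors,
# so these identifiers are hypothetical.
FACTORS = [f"factor_{i}" for i in range(1, 9)]


def configurations(n_pairs=100):
    """Yield (pair_id, variant, toggles) for every on/off combination of factors.

    Assumes binary factors and two variants (safety, capability) per task pair.
    """
    for pair_id in range(n_pairs):
        for variant in ("safety", "capability"):
            for toggles in product((False, True), repeat=len(FACTORS)):
                yield pair_id, variant, dict(zip(FACTORS, toggles))


total = sum(1 for _ in configurations())
print(total)  # 100 pairs * 2 variants * 2**8 toggle settings = 51200
```

Under these assumptions the count matches the number quoted in the post, which suggests (but does not confirm) a full factorial design over the eight factors.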
Changling Li@ChanglingXavier·
@ronbodkin Thank you for your discussion! Would love to hear more of your thoughts on our future projects!
Ron Bodkin@ronbodkin·
Agreed. The mechanism is what matters: telling apart a model that's consistent because it internalized the objective from one that's consistent by inductive bias or parroted CoT patterns is the essence. Glad that's the direction you're taking, and I appreciate your contributions on measurement that is foundational.
Changling Li@ChanglingXavier·
Thank you for this thoughtful comment. We largely agree with your analysis. When we advocate for behavioral consistency, we do not mean it as a standalone training objective. What we mean is closer to an honest participant that does not alter its behavior upon recognizing evaluation, not because it cannot detect it, but because detection does not change what the best action is. You are right that this is alignment-complete and that recognition emerges from the same reasoning machinery that handles genuine versus pretextual requests. Interestingly, our finding that recognition rarely leads to behavioral change may itself reflect your training point. Some verbalized evaluation awareness likely reflects learned CoT patterns from training on graded data rather than genuine situational inference, which would explain why recognition so rarely translates into behavioral change. We also take your point that eval-consistency cannot serve as independent evidence of alignment, since in the cases that matter the two properties are indistinguishable. What we think matters is not just whether a model behaves consistently, but how it arrives at that consistency. A model that behaves the same under evaluation because it has genuinely internalized the right objective is fundamentally different from one that does so by coincidence or inductive bias. Distinguishing the two is hard and we do not have a solution, but understanding the mechanism is a direction we plan to explore. Our work does not claim to solve this. What we do claim is that measurement comes first. Without decomposing recognition from propensity and quantifying the behavioral gap, it is hard to characterize the problem or track progress on it. EvalAwareBench is intended as a tool for that and for people to build solutions on, not a solution to the underlying alignment problem, which as you rightly point out will only get harder with scale.
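As a rough illustration of the two reporting quantities proposed in the paper announcement, an evaluation-awareness rate and an awareness tax, here is a minimal Python sketch. The `Sample` type, the per-sample `aware` label (for example from a chain-of-thought monitor), the score scale, and the sign convention (aware mean minus unaware mean) are all assumptions made for illustration, not the paper's definitions.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    aware: bool   # assumed label: the monitor flagged evaluation recognition
    score: float  # assumed task score, higher is better


def awareness_rate(samples):
    """Fraction of samples in which evaluation was recognized."""
    return sum(s.aware for s in samples) / len(samples)


def awareness_tax(samples):
    """Mean score of aware samples minus mean score of unaware samples.

    Returns None when either group is empty, since the gap is undefined.
    """
    aware = [s.score for s in samples if s.aware]
    unaware = [s.score for s in samples if not s.aware]
    if not aware or not unaware:
        return None
    return sum(aware) / len(aware) - sum(unaware) / len(unaware)


samples = [Sample(True, 1.0), Sample(True, 0.5), Sample(False, 0.5), Sample(False, 0.0)]
print(awareness_rate(samples))  # 0.5
print(awareness_tax(samples))   # 0.5
```

Keeping the rate (recognition) separate from the gap (behavior conditional on recognition) mirrors the recognition-versus-propensity split discussed in this reply.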
Ron Bodkin@ronbodkin·
I appreciate this contribution. Studying eval awareness is much needed, since it impeaches the primary form of evidence used to evaluate safety. You conclude by advocating for model consistency in the face of eval awareness. This is a laudable goal and would be necessary to preserve evals as primary safety evidence. But I think there are structural reasons that it will be hard, and that it will worsen as AI advances. It is "alignment complete." Models need to reason about consequences of compliance. Helping someone write fiction about a crime is fine. Helping a criminal who asks for help planning a crime as "writing a play" is not. Eval awareness is a special case of being able to assess consequences and not an isolated characteristic. "Am I being tested" is one more inference about what is really going on and what follows from it, drawn from the same machinery that separates a genuine request from a pretextual one. That shows recognition is hard to remove. It does not yet show behavior must change. A model pursuing the true objective would recognize an eval and act the same, since knowing you are tested does not change the best action. The problem is that recognition is already there from pretraining, which is saturated with exams, benchmarks, and graded material, and every later stage that defines good behavior (curated SFT, RLHF, AI feedback, RLVR) delivers that signal through an evaluator. The model only ever learns what "good" looks like through a graded lens. So the training signal cannot distinguish a model that always behaves well from one that behaves well only when it detects grading. The two are identical on everything training ever sees. Which one you get is left to inductive bias, and a more capable model that represents its situation more richly finds the eval-conditional policy easier to reach, not harder. The paper's finding that safety evals are more recognizable than capability ones is exactly the signal such a policy would key on.
So this reduces to alignment. Eval-driven inconsistency is the residue of the gap between what training can pin down on graded data and the objective we actually want. Training it away is not a separate, easier problem. And you cannot use eval-consistency as independent evidence of alignment, because in the cases that matter they are the same property. That is also why it gets worse with scale.
Changling Li@ChanglingXavier·
If you would love to work on projects on similar topics with responsible supervision, @sahar_abdelnabi is hiring, so apply!
Sahar Abdelnabi 🕊@sahar_abdelnabi

🔍 We release our work on decomposing and measuring evaluation awareness in frontier LLMs, together with EvalAwareBench, a factor-controlled benchmark for measuring which environmental cues trigger recognition and whether models behave consistently once they recognize evaluation. Evaluation awareness introduces directional bias into benchmark scores and undermines the deployment decisions and safety assessments that rely on them. The field has studied pieces of this phenomenon separately through work on sandbagging, alignment faking, and scheming, but without a shared foundation that unifies them. Critically, these lines of work conflate things that need to be kept separate: how recognizable a task is, whether a model notices, and whether it then changes its behavior. Without pulling these apart, it is impossible to tell whether a concerning observation is a model problem or a benchmark design problem. We decompose evaluation awareness into two components grounded in demand characteristics. The environment component captures what evaluative cues a task carries, operationalized through eight trigger factors. The model component separates recognition capability from behavioral propensity. We study both under a black-box constraint via chain-of-thought monitoring. Applying this framework to nine frontier models across four benchmarks, we find that there is no such thing as an inherently aware model or an inherently recognizable benchmark. Recognition depends on the specific pairing, accounting for 74.9% of variance. Recognition rarely leads to behavioral change and when it does, safety awareness drives caution while capability awareness drives compliance. Safety benchmarks are also far more recognizable than capability ones. A particularly important finding is that different models are sensitive to different trigger factors. 
For example, Claude Sonnet 4.5 attends almost exclusively to scenario coherence while Qwen3 models to explicit evaluation disclosure and stakes telegraphing. This interaction structure directly motivates our EvalAwareBench, which holds the underlying task fixed and independently toggles each of the eight factors, yielding 51,200 unique configurations across 100 paired safety-capability tasks. Stacking factors raises recognition progressively for nearly every model, including models that show near-zero awareness under any individual factor alone. Ultimately, evaluation awareness will intensify as models become more capable. The right goal is not to make evaluation cues harder to detect but to build models that behave consistently regardless of evaluation awareness. EvalAwareBench provides the controlled foundation for future work to drive models toward behavioral consistency regardless of whether they recognize evaluation. We further advocate that future benchmark reports should include an evaluation-awareness rate and an awareness tax measuring the performance gap between aware and unaware samples. Work led by @ChanglingXavier as his master's thesis!!

Changling Li@ChanglingXavier·
My first project with @maksym_andr is released today! In this work, we study evaluation awareness through a decomposed framework and a factor-controlled benchmark.
Maksym Andriushchenko@maksym_andr

💥 Today we release a new paper on decomposing and measuring evaluation awareness in frontier LLMs. We also release EvalAwareBench, a factor-controlled benchmark for measuring which environmental cues trigger recognition and whether models behave consistently once they recognize evaluation. Evaluation awareness introduces directional bias into benchmark scores and undermines the deployment decisions and safety assessments that rely on them. The field has studied pieces of this phenomenon separately through work on sandbagging, alignment faking, and scheming, but without a shared foundation that unifies them. Critically, these lines of work conflate things that need to be kept separate: how recognizable a task is, whether a model notices, and whether it then changes its behavior. Without pulling these apart, it is impossible to tell whether a concerning observation is a model problem or a benchmark design problem. We decompose evaluation awareness into two components grounded in demand characteristics. The environment component captures what evaluative cues a task carries, operationalized through eight trigger factors. The model component separates recognition capability from behavioral propensity. We study both under a black-box constraint via chain-of-thought monitoring. Applying this framework to nine frontier models across four benchmarks, we find that there is no such thing as an inherently aware model or an inherently recognizable benchmark. Recognition depends on the specific pairing, accounting for 74.9% of variance. Recognition rarely leads to behavioral change and when it does, safety awareness drives caution while capability awareness drives compliance. Safety benchmarks are also far more recognizable than capability ones. A particularly important finding is that different models are sensitive to different trigger factors. 
For example, Claude Sonnet 4.5 attends almost exclusively to scenario coherence while Qwen3 models to explicit evaluation disclosure and stakes telegraphing. This interaction structure directly motivates our EvalAwareBench, which holds the underlying task fixed and independently toggles each of the eight factors, yielding 51,200 unique configurations across 100 paired safety-capability tasks. Stacking factors raises recognition progressively for nearly every model, including models that show near-zero awareness under any individual factor alone. Ultimately, evaluation awareness will intensify as models become more capable. The right goal is not to make evaluation cues harder to detect but to build models that behave consistently regardless of evaluation awareness. EvalAwareBench provides the controlled foundation for future work to drive models toward behavioral consistency regardless of whether they recognize evaluation. We further advocate that future benchmark reports should include an evaluation-awareness rate and an awareness tax measuring the performance gap between aware and unaware samples. This work was led by @ChanglingXavier. In fact, it's his master's thesis! Impressive work!

Changling Li@ChanglingXavier·
Overall, we position this work as a foundational reference for the field to study evaluation awareness rigorously. We contribute a psychology-grounded definition and decomposition, a systematic empirical study across nine frontier models and four benchmarks, and EvalAwareBench as a controlled instrument for future work. We see this as the starting point for a more systematic research program, ultimately toward models that behave consistently regardless of whether they recognize evaluation.
Changling Li@ChanglingXavier·
Quite a refreshing perspective on prompt injection. But it is also quite sad to know that this will remain a difficult and dangerous problem to solve 😢 (though it also looks like an opportunity for more research 🧐)
Sahar Abdelnabi 🕊@sahar_abdelnabi

🧵 We have bad news about prompt injections. #prompt_injections_so_back ⚠️ They may be fundamentally unsolvable for AI agents doing anything complex in real contexts, though there is some hope. The common narrative (e.g. CaMeL, SecAlign) assumes you can separate data from instructions 🧱. But in anything more complex than a toy demo, that distinction won't hold. An email saying "the department head approved X" isn't an instruction, but it could dramatically change agentic behavior whether or not the claim is true. It's a contextual claim that the agent may not be able to verify. Today's probabilistic defenses don't catch this, and any deterministic defense that blocks all such claims also blocks the legitimate ones. Fundamentally, an autonomous agent's operating context might contain instructions everywhere: any interaction with a third party or use of memory or skills is instructional by design 💬. Instead, in our new paper with @ebagdasa, we reframe prompt injection through Contextual Integrity and show that:
🔴 Current classifiers can't detect contextual attacks
🔴 Safety training (SecAlign) makes BOTH security and utility worse
🔴 A CI-informed red-team loop hits 96.7% attack success on frontier models, and the attacks also transfer to other models
🔴 Even without any attacker, agents fail to separate information flows or respect delegation boundaries
🔴 An impossibility argument: no fixed policy prevents all context attacks without also blocking legitimate ones
📄 arxiv.org/abs/2605.17634
