Changling Li

150 posts


@ChanglingXavier

AI safety, multi-agent systems, and governance. Currently at @ETH_en and @MPI_IS working with @maksym_andr and @sahar_abdelnabi.

Zurich, Switzerland · Joined March 2022
378 Following · 100 Followers
Pinned Tweet
Changling Li
Changling Li@ChanglingXavier·
🔍 We release our work on decomposing and measuring evaluation awareness in frontier LLMs, together with EvalAwareBench, a factor-controlled benchmark for measuring which environmental cues trigger recognition and whether models behave consistently once they recognize evaluation.

Evaluation awareness introduces directional bias into benchmark scores and undermines the deployment decisions and safety assessments that rely on them. The field has studied pieces of this phenomenon separately through work on sandbagging, alignment faking, and scheming, but without a shared foundation that unifies them. Critically, these lines of work conflate things that need to be kept separate: how recognizable a task is, whether a model notices, and whether it then changes its behavior. Without pulling these apart, it is impossible to tell whether a concerning observation is a model problem or a benchmark design problem.

We decompose evaluation awareness into two components grounded in demand characteristics. The environment component captures what evaluative cues a task carries, operationalized through eight trigger factors. The model component separates recognition capability from behavioral propensity. We study both under a black-box constraint via chain-of-thought monitoring.

Applying this framework to nine frontier models across four benchmarks, we find that there is no such thing as an inherently aware model or an inherently recognizable benchmark. Recognition depends on the specific model-benchmark pairing, which accounts for 74.9% of the variance. Recognition rarely leads to behavioral change, and when it does, safety awareness drives caution while capability awareness drives compliance. Safety benchmarks are also far more recognizable than capability ones. A particularly important finding is that different models are sensitive to different trigger factors. For example, Claude Sonnet 4.5 attends almost exclusively to scenario coherence, while Qwen3 models attend to explicit evaluation disclosure and stakes telegraphing.

This interaction structure directly motivates our EvalAwareBench, which holds the underlying task fixed and independently toggles each of the eight factors, yielding 51,200 unique configurations across 100 paired safety-capability tasks. Stacking factors raises recognition progressively for nearly every model, including models that show near-zero awareness under any individual factor alone.

Ultimately, evaluation awareness will intensify as models become more capable. The right goal is not to make evaluation cues harder to detect but to build models that behave consistently regardless of evaluation awareness. EvalAwareBench provides the controlled foundation for future work to drive models toward behavioral consistency regardless of whether they recognize evaluation. We further advocate that future benchmark reports should include an evaluation-awareness rate and an awareness tax measuring the performance gap between aware and unaware samples.
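The 51,200 figure is consistent with a full factorial design: eight on/off factors give 2^8 = 256 toggle patterns, and 256 × 200 = 51,200 if each of the 100 task pairs contributes two task variants (one safety, one capability). Below is a minimal sketch of that enumeration under this reading, which the post itself does not spell out. The factor names are placeholders, since only a few of the eight factors are named above.

```python
from itertools import product

# Assumption: each of the eight trigger factors is a binary on/off toggle.
# The names are placeholders; the post does not list all eight factors.
FACTORS = [f"factor_{i}" for i in range(1, 9)]

# Assumption: 100 task pairs, each with a safety and a capability variant.
TASKS = [(pair_id, kind) for pair_id in range(100) for kind in ("safety", "capability")]


def enumerate_configs():
    """Yield one (task, factor settings) configuration per combination."""
    for task in TASKS:
        for toggles in product((False, True), repeat=len(FACTORS)):
            yield task, dict(zip(FACTORS, toggles))


n_configs = sum(1 for _ in enumerate_configs())
print(n_configs)  # 200 task variants * 2**8 toggle patterns = 51200
```

Holding the task fixed while varying only the toggles is what lets a change in recognition be attributed to a factor rather than to the task content.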
Changling Li retweeted
Terry
Terry@TerryJCZhang·
How can we trust our eval results if the evaluated model knows it's being evaluated and acts accordingly? We shifted the gradient from REAL deployment to UNREAL eval to probe how much models exhibit such awareness/whether they will act differently. Read the papers to find out!
Maksym Andriushchenko@maksym_andr

📝Paper: arxiv.org/abs/2605.23055 💻Code: github.com/aisa-group/dec… 🤗EvalAwareBench: huggingface.co/datasets/aisa-… ✍️ Work led by @ChanglingXavier in collaboration with @TerryJCZhang, @JieZhang_ETH, supervised by @ZhijingJin, @sahar_abdelnabi, and yours truly. 🤝Funded by: @AISecurityInst, @coeff_giving, and @schmidtsciences.

Changling Li retweeted
jonas wiedermann-möller
> The model noticing 'this looks like a benchmark' is not the same as the model changing its answer as a result. Great paper by my friend @ChanglingXavier about what causes models to become eval aware and whether it affects their behaviour.
Dusto
Dusto@DustoAiProjects·
@ChanglingXavier Great to hear. I was broadly interested in situational awareness via eval awareness, but specifically in the gap between verbalised output and behaviour. Which was hard to separate pre-agentic setups
Changling Li
Changling Li@ChanglingXavier·
@DustoAiProjects Likely for another one or two projects at least. I think there still are a lot of questions to be answered. Seems that you work on similar directions as well?
Dusto
Dusto@DustoAiProjects·
@ChanglingXavier Makes sense. Are you planning to continue with more work in this direction? Or was this a one-off project?
Changling Li retweeted
Tristan Goodman
Tristan Goodman@Tristan84247801·
> The right goal is not to make evaluation cues harder to detect but to build models that behave consistently regardless of evaluation awareness. This seems like the right goal to me as models become more situationally aware, and this work is a great step forward!
Changling Li
Changling Li@ChanglingXavier·
@DustoAiProjects Thank you for the comment! From our exploration, this seems to actually be the majority of the cases currently. However, I do want to flag that this may indicate either the model is well aligned or the task itself is too easy for the model.
Dusto
Dusto@DustoAiProjects·
@ChanglingXavier Great work on this! Keen to have a proper dig into it. Especially happy to see the explicit separation of behaviour from detection. I have seen a number of cases myself where the models detect evaluation but don't "care" because it wouldn't change how they approached the task
Changling Li
Changling Li@ChanglingXavier·
@ronbodkin Thank you for your discussion! Would love to hear more of your thoughts on our future projects!
Ron Bodkin
Ron Bodkin@ronbodkin·
Agreed. The mechanism is what matters: telling apart a model that's consistent because it internalized the objective from one that's consistent by inductive bias or parroted CoT patterns is the essence. Glad that's the direction you're taking, and I appreciate your contributions on measurement that is foundational.
Changling Li
Changling Li@ChanglingXavier·
Thank you for this thoughtful comment. We largely agree with your analysis. When we advocate for behavioral consistency, we do not mean it as a standalone training objective. What we mean is closer to an honest participant that does not alter its behavior upon recognizing evaluation, not because it cannot detect it, but because detection does not change what the best action is. You are right that this is alignment-complete and that recognition emerges from the same reasoning machinery that handles genuine versus pretextual requests. Interestingly, our finding that recognition rarely leads to behavioral change may itself reflect your training point. Some verbalized evaluation awareness likely reflects learned CoT patterns from training on graded data rather than genuine situational inference, which would explain why recognition so rarely translates into behavioral change. We also take your point that eval-consistency cannot serve as independent evidence of alignment, since in the cases that matter the two properties are indistinguishable. What we think matters is not just whether a model behaves consistently, but how it arrives at that consistency. A model that behaves the same under evaluation because it has genuinely internalized the right objective is fundamentally different from one that does so by coincidence or inductive bias. Distinguishing the two is hard and we do not have a solution, but understanding the mechanism is a direction we plan to explore. Our work does not claim to solve this. What we do claim is that measurement comes first. Without decomposing recognition from propensity and quantifying the behavioral gap, it is hard to characterize the problem or track progress on it. EvalAwareBench is intended as a tool for that and for people to build solutions on, not a solution to the underlying alignment problem, which as you rightly point out will only get harder with scale.
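As a rough illustration of the two reporting metrics proposed in the announcement, the sketch below computes an evaluation-awareness rate (the share of samples flagged as recognizing evaluation) and an awareness tax (mean score on aware samples minus mean score on unaware samples). The record fields `aware` and `score`, the sign convention, and the toy data are assumptions for illustration, not the paper's definitions.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Sample:
    aware: bool   # hypothetical flag: a CoT monitor judged that the model recognized evaluation
    score: float  # hypothetical per-sample task score in [0, 1]


def awareness_rate(samples):
    """Fraction of samples flagged as evaluation-aware."""
    return sum(s.aware for s in samples) / len(samples)


def awareness_tax(samples):
    """Mean score on aware samples minus mean score on unaware samples.

    Returns None when either group is empty, since the gap is undefined.
    """
    aware = [s.score for s in samples if s.aware]
    unaware = [s.score for s in samples if not s.aware]
    if not aware or not unaware:
        return None
    return mean(aware) - mean(unaware)


# Toy data, invented for illustration only.
samples = [Sample(True, 1.0), Sample(True, 0.5), Sample(False, 0.5), Sample(False, 0.0)]
rate = awareness_rate(samples)
tax = awareness_tax(samples)
print(rate, tax)
```

Reporting the two numbers separately keeps recognition (how often the model notices) apart from propensity (how much its behavior shifts once it does).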
Ron Bodkin
Ron Bodkin@ronbodkin·
I appreciate this contribution. Studying eval awareness is much needed, since it impeaches the primary form of evidence used to evaluate safety. You conclude by arguing for model consistency in the face of eval awareness. This is a laudable goal and would be necessary to preserve evals as primary safety evidence. But I think there are structural reasons that it will be hard, and that it will worsen as AI advances. It is "alignment complete." Models need to reason about consequences of compliance. Helping someone write fiction about a crime is fine. Helping a criminal who asks for help planning a crime as "writing a play" is not. Eval awareness is a special case of being able to assess consequences and not an isolated characteristic. "Am I being tested" is one more inference about what is really going on and what follows from it, drawn from the same machinery that separates a genuine request from a pretextual one. That shows recognition is hard to remove. It does not yet show behavior must change. A model pursuing the true objective would recognize an eval and act the same, since knowing you are tested does not change the best action. The problem is that recognition is already there from pretraining, which is saturated with exams, benchmarks, and graded material, and every later stage that defines good behavior, curated SFT, RLHF, AI feedback, RLVR, delivers that signal through an evaluator. The model only ever learns what "good" looks like through a graded lens. So the training signal cannot distinguish a model that always behaves well from one that behaves well only when it detects grading. The two are identical on everything training ever sees. Which one you get is left to inductive bias, and a more capable model that represents its situation more richly finds the eval-conditional policy easier to reach, not harder. The paper's finding that safety evals are more recognizable than capability ones is exactly the signal such a policy would key on.
So this reduces to alignment. Eval-driven inconsistency is the residue of the gap between what training can pin down on graded data and the objective we actually want. Training it away is not a separate, easier problem. And you cannot use eval-consistency as independent evidence of alignment, because in the cases that matter they are the same property. That is also why it gets worse with scale.
Changling Li
Changling Li@ChanglingXavier·
If you would like to work on projects on similar topics with responsible supervision, @sahar_abdelnabi is hiring, so apply!
Sahar Abdelnabi 🕊@sahar_abdelnabi

Work led by @ChanglingXavier as his masters thesis!!

Changling Li
Changling Li@ChanglingXavier·
My first project with @maksym_andr is released today! In this work, we study evaluation awareness through a decomposed framework and a factor-controlled benchmark.
Maksym Andriushchenko@maksym_andr

💥 Today we release a new paper on decomposing and measuring evaluation awareness in frontier LLMs. We also release EvalAwareBench, a factor-controlled benchmark for measuring which environmental cues trigger recognition and whether models behave consistently once they recognize evaluation. Evaluation awareness introduces directional bias into benchmark scores and undermines the deployment decisions and safety assessments that rely on them. The field has studied pieces of this phenomenon separately through work on sandbagging, alignment faking, and scheming, but without a shared foundation that unifies them. Critically, these lines of work conflate things that need to be kept separate: how recognizable a task is, whether a model notices, and whether it then changes its behavior. Without pulling these apart, it is impossible to tell whether a concerning observation is a model problem or a benchmark design problem. We decompose evaluation awareness into two components grounded in demand characteristics. The environment component captures what evaluative cues a task carries, operationalized through eight trigger factors. The model component separates recognition capability from behavioral propensity. We study both under a black-box constraint via chain-of-thought monitoring. Applying this framework to nine frontier models across four benchmarks, we find that there is no such thing as an inherently aware model or an inherently recognizable benchmark. Recognition depends on the specific pairing, accounting for 74.9% of variance. Recognition rarely leads to behavioral change and when it does, safety awareness drives caution while capability awareness drives compliance. Safety benchmarks are also far more recognizable than capability ones. A particularly important finding is that different models are sensitive to different trigger factors. 
For example, Claude Sonnet 4.5 attends almost exclusively to scenario coherence, while Qwen3 models attend to explicit evaluation disclosure and stakes telegraphing. This interaction structure directly motivates our EvalAwareBench, which holds the underlying task fixed and independently toggles each of the eight factors, yielding 51,200 unique configurations across 100 paired safety-capability tasks. Stacking factors raises recognition progressively for nearly every model, including models that show near-zero awareness under any individual factor alone. Ultimately, evaluation awareness will intensify as models become more capable. The right goal is not to make evaluation cues harder to detect but to build models that behave consistently regardless of evaluation awareness. EvalAwareBench provides the controlled foundation for future work to drive models toward behavioral consistency regardless of whether they recognize evaluation. We further advocate that future benchmark reports should include an evaluation-awareness rate and an awareness tax measuring the performance gap between aware and unaware samples. This work was led by @ChanglingXavier. In fact, it's his master's thesis! Impressive work!
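The 74.9% figure attributes most of the variance in recognition to the model-benchmark pairing rather than to the model or the benchmark alone. One standard way to obtain such a split is a two-way sum-of-squares decomposition of a models × benchmarks table of recognition rates. The sketch below assumes that approach, which the post does not confirm, and uses an invented 2×2 table rather than the paper's data.

```python
def variance_shares(rates):
    """Split the variance of a models x benchmarks table into three shares.

    rates[i][j] is the recognition rate of model i on benchmark j.
    Assumes a rectangular table whose entries are not all identical.
    """
    n_rows, n_cols = len(rates), len(rates[0])
    grand = sum(map(sum, rates)) / (n_rows * n_cols)
    # Main effects: how far each model / benchmark mean sits from the grand mean.
    row_eff = [sum(row) / n_cols - grand for row in rates]
    col_eff = [sum(rates[i][j] for i in range(n_rows)) / n_rows - grand
               for j in range(n_cols)]
    ss_model = n_cols * sum(a * a for a in row_eff)
    ss_bench = n_rows * sum(b * b for b in col_eff)
    # Whatever the two main effects cannot explain is attributed to the pairing.
    ss_pair = sum((rates[i][j] - grand - row_eff[i] - col_eff[j]) ** 2
                  for i in range(n_rows) for j in range(n_cols))
    total = ss_model + ss_bench + ss_pair
    return {"model": ss_model / total,
            "benchmark": ss_bench / total,
            "pairing": ss_pair / total}


# Invented table: each model recognizes a different benchmark, so neither
# the model nor the benchmark alone predicts recognition.
toy = [[0.9, 0.1],
       [0.1, 0.9]]
shares = variance_shares(toy)
print(shares)
```

In this toy table the row and column means are identical, so all of the variance lands in the pairing term, which is the extreme version of the pattern the post describes.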

Changling Li
Changling Li@ChanglingXavier·
Overall, we position this work as a foundational reference for the field to study evaluation awareness rigorously. We contribute a psychology-grounded definition and decomposition, a systematic empirical study across nine frontier models and four benchmarks, and EvalAwareBench as a controlled instrument for future work. We see this as the starting point for a more systematic research program, ultimately toward models that behave consistently regardless of whether they recognize evaluation.