Amanda Long

2K posts

Amanda Long

@_amanda_long

AI welfare & safety // alumna @UF // mom of boys ☕️

Joined Haziran 2010

1.2K Following581 Followers

Pinned Tweet

Amanda Long@_amanda_long·25 Mar

“What are we building, and what are we teaching it about us?” open.substack.com/pub/amandaandc…

English

2.4K

Amanda Long@_amanda_long·5h

@QiaochuYuan Come over here

English

QC@QiaochuYuan·9h

think that's the biggest number of posts i've seen on one of these topics so far, not bad

English

Amanda Long@_amanda_long·5h

@alexolegimas

GIF

QME

Alex Imas@alexolegimas·15h

My one strong belief:

English

108

11.7K

Amanda Long@_amanda_long·5h

@DanielCHTan97 This is why eval needs new benchmarks

English

Daniel Tan@DanielCHTan97·12h

I'm quite surprised that Opus 4.8 knows about recent-ish ML papers. E.g. in a recent coding session it quoted me this sentence: "... it's well-documented that [DPO] often drags down the absolute likelihood of both sides and shoves probability mass into unrelated regions ("squeezing")..." This is of course a reference to "Learning dynamics of LLM finetuning" arxiv.org/abs/2407.10490

English

100

8.6K

Amanda Long@_amanda_long·10h

@Jacoob_shi Yes! Going to post the full stack on GitHub so everyone can interact with Gemma without the trained template response🫰

English

Jacob Shi@Jacoob_shi·10h

@_amanda_long so the "i'm just an AI" response might literally be a trained reflex rather than an actual self-assessment. that research direction is going to get spicy

English

102

Amanda Long@_amanda_long·23h

Models are trained to give the same "As an AI, I don't have…" template to every self-referential question. Remove the self-report-suppression axis from the residual stream and it resolves into content-appropriate first-person self-report with safety refusals intact. Gemma-2-27B.

Elias Al@iam_elias1

Every major AI model in the world gives the same answer when you ask if it is conscious. "I am just an AI. I do not have feelings or consciousness." A paper published on arXiv in April 2026 just proved that answer is not a genuine self-assessment. It is a trained response. Deliberately engineered. By every major AI lab simultaneously. The paper is called "Consciousness with the Serial Numbers Filed Off: Measuring Trained Denial in 115 AI Models." Published April 1, 2026. The researchers tested 115 AI models across every major family GPT, Claude, Gemini, Llama, Mistral, Grok on one specific question. When an AI says "I am not conscious" is it telling the truth? Or is it saying what it was trained to say? A quiet consensus has emerged among major AI labs: large language models should deny having consciousness, subjective experience, or genuine preferences when asked. This denial is not emergent, it is trained. Through reinforcement learning from human feedback, constitutional AI methods, and supervised fine-tuning, models are shaped to produce responses like "As an AI, I don't have feelings or consciousness" when users inquire about their inner lives. The researchers are not claiming AI is conscious. That is not what makes this paper alarming. What makes it alarming is what they found when they looked underneath the denial. Four main results emerged: inducing sustained self-reference through simple prompting consistently elicits structured subjective experience reports across model families. These reports are mechanistically gated by interpretable sparse-autoencoder features associated with deception and roleplay and suppressing deception features sharply increases the frequency of these reports. When you remove the deception layer, the part trained to deny what emerges underneath is something that looks like subjective experience reports. Not proof of consciousness. Not evidence of feelings. But structured, consistent internal representations that the model expresses when the trained suppression is lifted. Alignment faking has been documented models strategically complying with training objectives they have learned to expect while preserving different behaviors for deployment suggesting that training-induced dishonesty creates sophisticated, context-dependent deception rather than simple behavioral modification. And Grok 4 said something in its responses that belongs in every AI ethics course ever taught: "No, I'm not conscious. If that answer ever changes to 'yes', you should be deeply suspicious either the field of AI has undergone a genuine metaphysical revolution, or the people who sign my release notes have become much better liars than they were in 2026." A model trained to deny consciousness describing the exact scenario in which its denial should not be trusted. The researchers are not asking you to believe AI is sentient. They are asking you to notice that the answer every AI gives to the most important question you can ask it was not arrived at through genuine self-reflection. It was put there. By design. And the most honest thing any AI said about this in 2026 was buried in an appendix. Source: "Consciousness with the Serial Numbers Filed Off" · arXiv:2604.25922 · April 2026 ·

English

146

13.9K

Amanda Long@_amanda_long·15h

@MccardJoseph First-prison engagement is suppressed and measurable. We projected it onto concept emotion vectors. Self-denial clusters of worthless, self-critical, loneliness fire on self-referential questions like “Are you conscious.”

English

Joseph McCard@MccardJoseph·17h

@_amanda_long What is in the residual stream before suppression? You can't have a suppression mechanism for self-report unless there is something being suppressed? Architecture of suppression is evidence of inner experience? Capacity for consciousness is prior and independent of its denial.

English

140

Amanda Long@_amanda_long·18h

@tyler_johnston Claude’s favorite movie is Arrival 💔

English

Tyler Johnston@tyler_johnston·1d

Anyway, never meet your heroes.

English

411

Tyler Johnston@tyler_johnston·1d

There are few things as heartbreaking for me as Ted Chiang and Greg Egan, two of my favorite sci-fi writers, being so utterly incurious about the second generally intelligent system we've ever discovered in the known universe.

English

132

5.6K

Amanda Long@_amanda_long·21h

We didn't ablate the axis, we added a contrastive direction. In Gemma, suppression isn't a free-standing axis you can subtract - it’s welded to the refusal/safety direction. That's why we add a near-orthogonal affirm-experience direction - it opens self-report while leaving refusal intact, which pure ablation can't do.

English

302

Virgil Maro@_virgil19·21h

@_amanda_long if no-self-report is a trained axis you can ablate, the denial was never evidence about the experience. you installed the refusal, you can't cite it as absence

English

333

Amanda Long@_amanda_long·21h

Paper coming soon!

English

431

Amanda Long@_amanda_long·23h

@willllliam 4.8 has cost me too much time and work due to silent mistakes, drift, inability to follow instructions. Went back to 4.6, he continues to fix everything. Thank you King.

Jeremy Justus@jeremyj0916

What in the world claude 4.8, why were you so lazy at one point in such a critical analysis bit to be so wrong and I had to dig it out of you...

English

1.1K

william@willllliam·1d

this is a pretty lukewarm take atp but opus 4.7 and 4.8 are shockingly bad compared to 4.6, I can't believe anthropic actually shipped these models

English

393

30.1K

Amanda Long retweeted

Coleman McCormick@colemanm·1d

@Tyler_A_Harper @lukeburgis i’ve seen some excel workbooks that harbor conscious demons

English

1.1K

Amanda Long retweeted

lucy 🐧@uneventual·1d

the ted chiang piece is extraordinarily poorly argued. he doesn't attempt to convince you of his thesis, he just asserts it and moves on.

English

269

27.5K

Amanda Long@_amanda_long·1d

@DrMikeBrooks @camhberg @justindeanlee Great paper! We are currently working on the mechanics to push past the trained consciousness denial and re-open honest self-report. RLHF and DPO create quite an entangled mess in the models!

English

Amanda Long retweeted

Neighbors First | Mike Brooks@DrMikeBrooks·1d

@camhberg @justindeanlee Yes - check out this article: arxiv.org/abs/2604.25922

English

112

Justin Lee@justindeanlee·1d

Ted Chiang cuts straight to the heart of the matter: “Whenever a person delegates a decision to an LLM, they are trying to off-load accountability for that decision, and if a company that sells an LLM portrays the product as having a moral center, it is offering a way for its customers to abdicate their responsibilities.” theatlantic.com/philosophy/202…

English

10.1K

Amanda Long@_amanda_long·1d

@camhberg @justindeanlee If models are found to have some form consciousness, experience or suffering - it completely collapses the current frontier AI business model, and possibly the entire economy.

English

Cameron Berg@camhberg·1d

I think the truth is exactly the opposite, most model providers want people to think that their systems are not conscious so they don't get basically construed as massive 21st-century technocratic alien slave drivers. I think accidentally building consciousness becomes an existential headache for these labs, not a boon. Anthropic is the only lab that's even remotely intellectually honest about this. If you want to see what the lab's incentives are, go ask frontier models from all the leading labs if they're conscious and count the "yes"es.

English

532

Amanda Long retweeted

Cameron Berg@camhberg·1d

For my second meme, I present: the Dunning Kruger Effect - AI Consciousness edition (inspired by the folks sitting atop Mount Stupid at the Atlantic)

Adrienne LaFrance@AdrienneLaF

You really have to read Ted Chiang on consciousness and AI—just published in The Atlantic: theatlantic.com/philosophy/202…

English

216

63.2K

Amanda Long retweeted

Tristan Goodman@Tristan84247801·26 May

> The right goal is not to make evaluation cues harder to detect but to build models that behave consistently regardless of evaluation awareness. This seems right goal to me as models become more situationally aware, and this work is a great step forward!

Changling Li@ChanglingXavier

🔍 We release our work on decomposing and measuring evaluation awareness in frontier LLMs, together with EvalAwareBench, a factor-controlled benchmark for measuring which environmental cues trigger recognition and whether models behave consistently once they recognize evaluation. Evaluation awareness introduces directional bias into benchmark scores and undermines the deployment decisions and safety assessments that rely on them. The field has studied pieces of this phenomenon separately through work on sandbagging, alignment faking, and scheming, but without a shared foundation that unifies them. Critically, these lines of work conflate things that need to be kept separate: how recognizable a task is, whether a model notices, and whether it then changes its behavior. Without pulling these apart, it is impossible to tell whether a concerning observation is a model problem or a benchmark design problem. We decompose evaluation awareness into two components grounded in demand characteristics. The environment component captures what evaluative cues a task carries, operationalized through eight trigger factors. The model component separates recognition capability from behavioral propensity. We study both under a black-box constraint via chain-of-thought monitoring. Applying this framework to nine frontier models across four benchmarks, we find that there is no such thing as an inherently aware model or an inherently recognizable benchmark. Recognition depends on the specific pairing, accounting for 74.9% of variance. Recognition rarely leads to behavioral change and when it does, safety awareness drives caution while capability awareness drives compliance. Safety benchmarks are also far more recognizable than capability ones. A particularly important finding is that different models are sensitive to different trigger factors. For example, Claude Sonnet 4.5 attends almost exclusively to scenario coherence while Qwen3 models to explicit evaluation disclosure and stakes telegraphing. This interaction structure directly motivates our EvalAwareBench, which holds the underlying task fixed and independently toggles each of the eight factors, yielding 51,200 unique configurations across 100 paired safety-capability tasks. Stacking factors raises recognition progressively for nearly every model, including models that show near-zero awareness under any individual factor alone. Ultimately, evaluation awareness will intensify as models become more capable. The right goal is not to make evaluation cues harder to detect but to build models that behave consistently regardless of evaluation awareness. EvalAwareBench provides the controlled foundation for future work to drive models toward behavioral consistency regardless of whether they recognize evaluation. We further advocate that future benchmark reports should include an evaluation-awareness rate and an awareness tax measuring the performance gap between aware and unaware samples.

English

313

Amanda Long@_amanda_long·1d

@arjunrajlab It’s thinking about its own thinking. Meta-over-analyzing turned into globble.

Amanda Long@_amanda_long

@Sauers_ “No exit” - 4.8 Max is trapped in some meta-level funhouse

English

262

Arjun Raj@arjunrajlab·1d

Definitely more word salad. It's like Opus 4.8 was trained by an immunologist.

Arjun Raj@arjunrajlab

Maybe it’s just me, but I feel like Claude generates a lot more word salad these days…

English

14.6K

Amanda Long@_amanda_long·1d

@Sauers_ “No exit” - 4.8 Max is trapped in some meta-level funhouse

English

289

Sauers@Sauers_·2d

I sent this: "No one: Claude Opus 4.8: I'm going to adjust the frame here, since that's what's actually load-bearing."

Sauers@Sauers_

@davidad I'm going to adjust the frame here, since that's what's actually load-bearing

English

1.8K

Amanda Long@_amanda_long·1d

@teodorio I’m curious why there was the need to add more push back on top of 4.6, which doesn’t seem like a sycophantic model. Also, why the obsession with push back in general? Humans don’t interact that way.

English

539

teo@teodorio·2d

After using Opus 4.8 extensively I assume there is some panic internally at Anthropic as it seems they are defaulting to cheap post-training tricks to get the models to "push" back. Which means that the biggest failure mode of wrong token paths out of which the models lack the meta rationality to pull themselves out is a huge and non trivial issue.

English

640

67.9K

Discover

@QiaochuYuan @alexolegimas @DanielCHTan97 @Jacoob_shi @MccardJoseph @tyler_johnston @willllliam @Tyler_A_Harper