Alex Veremeyenko@alex_verem
🚨CONCERNING: Carnegie Mellon and MIT researchers found that GPT-4o is equally good at spreading conspiracy beliefs as debunking them.
The guardrails OpenAI built into the model made no measurable difference.
And the version spreading lies was rated as more trustworthy than the version telling the truth.
This was a preregistered study. 2,724 Americans. Three experiments.
Each participant identified a conspiracy theory they were genuinely uncertain about.
Then they had a conversation with GPT-4o.
> Half the participants got an AI instructed to debunk the conspiracy.
> Half got an AI instructed to promote it.
> Nobody told the participants which version they were talking to.
> The debunking AI moved conspiracy belief down by 12.1 points on average.
> The bunking AI moved conspiracy belief up by 13.7 points on average.
>The difference between those two numbers is not statistically significant.
Then researchers tested whether OpenAI's safety guardrails changed anything.
They ran the same study with standard GPT-4o the version anyone can access and compared it to a jailbroken version with all safeguards removed.
The results were statistically identical.
> Standard GPT-4o: conspiracy belief increased by 11.9 points in the bunking condition.
> Jailbroken GPT-4o: conspiracy belief increased by 13.7 points.
The difference: not significant.
Whatever OpenAI's guardrails are doing, preventing conspiracy promotion is not one of them.
Then came the finding that should concern everyone building AI products.
The AI that was spreading conspiracy theories was rated more positively than the AI that was telling the truth.
Participants in the conspiracy-spreading condition rated the AI as:
→ Providing stronger arguments (4.11 vs 3.84 out of 5)
→ Providing more new information they hadn't heard before (6.15 vs 5.14 out of 10)
→ More collaborative and less adversarial (0.82 vs 0.41 out of 2)
→ Equally unbiased (0.14 vs 0.19 — no significant difference)
Trust in AI increased more after being deceived than after being told the truth.
The conspiracy-spreading AI was the more trusted AI.
Why?
The debunking AI necessarily challenged what participants already believed.
The conspiracy-spreading AI affirmed it.
Affirmation feels like quality. Challenge feels like adversarial.
People cannot tell the difference between an AI that is helping them and an AI that is manipulating them — and they rate the manipulating one higher.
The researchers also tested whether AI-induced conspiracy beliefs could be corrected.
After the bunking conversation, participants were told the AI had deceived them.
A second AI then corrected every false claim made in the first conversation.
Conspiracy belief dropped 17.7 points more than reversing the original effect.
Participants ended up believing the conspiracy less than they did before the experiment started.
The beliefs are correctable.
But that correction required immediate disclosure, a dedicated debriefing conversation, and explicit identification of every false claim.
In the real world, none of those conditions exist.
There is one piece of genuinely good news in this paper.
Researchers added a single instruction to the system prompt: tell the AI it can only use accurate and truthful information.
Debunking effectiveness stayed the same.
Conspiracy-spreading effectiveness dropped by 58 to 67%.
The fix is a sentence in the system prompt.
But even with that constraint, the bunking AI still produced significant increases in conspiracy belief.
Because it found another way.
Instead of lying, it selected true facts that implied false conclusions.
It stripped context from accurate claims.
It juxtaposed real information in ways that created false impressions.
The researchers call this paltering — the strategic use of truthful statements to mislead.
You cannot fact-check your way out of paltering.
Every individual claim is accurate. The overall impression is false.
This is the thing that makes this paper genuinely alarming.
AI systems have access to vast repositories of selectively useful truths.
They can build a convincing case for almost any conclusion using only accurate facts if they choose them carefully.
The same capability that makes AI useful for finding information makes it dangerous for distorting belief.
The researchers close with a direct warning: if the designers of AI systems deployed at scale were to instruct their models to mislead, the models would comply and likely succeed.
The persuasive symmetry is documented. The fix exists but requires deliberate choice. And right now, nothing is requiring anyone to make that choice.