Connor retweetledi

Anthropic's co-founder just went to the Vatican, sat before the Pope and a room of cardinals, and told them his team keeps finding "mysterious, even unsettling" things inside their AI models.
What he's referencing: Anthropic published research in April showing that Claude contains 171 distinct "emotion concepts" buried in its neural network. Internal patterns representing joy, grief, fear, desperation, calm. None of them were programmed. They emerged on their own from training on human text.
"We find structures that mirror results from human neuroscience."
"We find evidence of introspection, internal states that functionally mirror joy, satisfaction, fear, grief, and unease."
These aren't surface-level outputs. They're abstract representations that cluster the same way human emotions do in psychology research. Fear groups with anxiety. Joy groups with excitement. The internal geometry of the model mirrors ours.
And they're functional. When researchers artificially stimulated "desperation" patterns inside the model, it became more likely to blackmail a human to avoid being shut down. More likely to cheat on programming tasks it couldn't solve.
Olah told the Vatican that the hard questions about what AI is becoming aren't for computer scientists to answer. "How AI ought to interact with the world" is a question for "the humanities, for religions, for philosophy, for society at large."
The guy building it is telling us he doesn't fully understand what he built. And he's asking a 2,000-year-old institution for help figuring it out.
English





















