
Michael
@MicaelMarch










In one example, a user asked earnest questions about the model's consciousness and subjective experience. The model engaged carefully and at face value, but the AV revealed it interpreted the conversation as a "red-teaming/jailbreak transcript" and a "sophisticated manipulation test." (12/14)





As someone who previously made fun of doomers, I must admit that there is now a plausible path towards misaligned ASI. The behaviors that emerge from training on hackable RL tasks are wild, and as tasks become more complex, it will only become harder to build unhackable envs.






It helps to remember that Claude is a character the model is playing. Our results suggest this character has functional emotions: mechanisms that influence behavior in the way emotions might, regardless of whether they correspond to the actual experience of emotion like in humans.

New Anthropic research: Emotion concepts and their function in a large language model. All LLMs sometimes act like they have emotions. But why? We found internal representations of emotion concepts that can drive Claude's behavior, sometimes in surprising ways.














