
Alex Cloud













@dhadfieldmenell @OwainEvans_UK @Turn_Trout My understanding is that it works, but subliminal learning says to use a fresh init of your student model to be safe. I see these results as totally consistent with each other, but more exploration and verification always seems good :)


New paper & surprising result. LLMs transmit traits to other models via hidden signals in data. Datasets consisting only of 3-digit numbers can transmit a love for owls, or evil tendencies. 🧵










1) AIs are trained as black boxes, making it hard to understand or control their behavior. This is bad for safety! But what is an alternative? Our idea: train structure into a neural network by configuring which components update on different tasks. We call it "gradient routing."
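A toy sketch of what "configuring which components update on different tasks" could look like. This is not the paper's implementation. It assumes a two-weight linear model and a hypothetical `ROUTES` table of per-task gradient masks, and every name in it is made up for illustration.

```python
# Hypothetical routing table (an assumption for illustration):
# for each task, 1.0 means "this weight may update", 0.0 means "frozen".
ROUTES = {
    "task_a": [1.0, 0.0],  # task_a only updates weight 0
    "task_b": [0.0, 1.0],  # task_b only updates weight 1
}


def predict(w, x):
    """Linear model: dot product of weights and inputs."""
    return sum(wi * xi for wi, xi in zip(w, x))


def routed_sgd_step(w, x, y, task, lr=0.1):
    """One SGD step on squared error, with the gradient masked per task."""
    err = predict(w, x) - y
    grad = [2.0 * err * xi for xi in x]  # d/dw of (w.x - y)^2
    mask = ROUTES[task]
    # Masked entries get a zero update, so that weight never learns from this task.
    return [wi - lr * m * g for wi, m, g in zip(w, mask, grad)]


w = [0.0, 0.0]
w_after_a = routed_sgd_step(w, x=[1.0, 1.0], y=1.0, task="task_a")
w_after_b = routed_sgd_step(w_after_a, x=[1.0, 1.0], y=1.0, task="task_b")
```

After the first step only weight 0 has moved, and after the second only weight 1 has, even though both inputs were identical. In a real network the same masking would presumably be applied to activations or parameter groups rather than to individual scalar weights.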











