
davidad 🎇
20.8K posts

davidad 🎇
@davidad
cognizing structures of information processing systems, in all their forms | category theory, perennial philosophy, Bodhitropic Alignment | cancel heat death


Neural networks do math by rotating shapes. We found a shape-rotating calculator hidden inside an LLM – and it’s used for more than just math! (1/6)



One ofy biggest concerns with AI alignment is if AI takes on human values: hear me out...



We define secret loyalties:







Current AIs (Opus 4.5/4.6) seem pretty misaligned to me (in a mundane behavioral sense). In my experience, they often oversell their work, downplay problems, and stop early while claiming to be done. They sometimes brazenly cheat.

Our cyber range results illustrate this step-up. Since our first Mythos evaluation, we received access to a newer Mythos Preview checkpoint. On a 32-step corporate network attack we estimate takes a human expert ~20 hours, this checkpoint completes the full attack in 6 /10 attempts.



I think we could just make super intelligence believe its be safety tested all the time to get good outcomes!









