
疒奀




Though “AI Safety” has been the established term since at least 2018, I detest how it carries the implicit connotation that AI research is otherwise “unsafe”, and I also detest that research is bucketed into either safety OR capabilities research










I was born exactly 26 years ago. For the first time, I have a birthday that might be my last. I’m writing this to increase the chance it isn’t.

A hundred thousand years ago, our ancestors appeared in a savanna with nothing but bare hands. Since then, we made nuclear bombs and landed on the moon. We dominate the planet not because we have sharp claws or teeth but because of our intelligence. Alan Turing argued that once machine thinking methods started, they’d quickly outstrip human capabilities, and that at some stage we should expect machines to take control.

Until 2019, I didn’t really consider machine thinking methods to have started. GPT-2 changed that: computers really began to talk. GPT-2 was not smart at all, but it clearly grasped a bit of the world behind the words it was predicting. I was surprised and started anticipating a curve of AI development that would result in a fully general machine intelligence soon, maybe within the next decade. Before GPT-3 in 2020, I made a Metaculus prediction for the date a weakly general AI is publicly known, with a median in 2029; soon, I thought, an artificial general intelligence could have the same advantage over humanity that humanity currently has over the rest of the species on our planet. AI progress in 2020-2025 was as expected. Sometimes a bit slower, sometimes a bit faster, but overall, I was never too surprised.

We’re in a grim situation. AI systems are already capable enough to improve the next generation of AI systems. But unlike AI capabilities, the field of AI safety has made little progress; the problem of running superintelligent cognition in a way that does not lead to the deaths of everyone on the planet is not significantly closer to being solved than it was a few years ago.

It is a hard problem. With normal software, we define precise instructions for computers to follow. AI systems are not like that. Making them is more akin to growing a plant than to engineering a rocket: we “train” the billions or trillions of numbers they’re made of, to make them talk and successfully achieve goals. While all of the numbers are visible, their purpose is opaque to us. Researchers in the field of mechanistic interpretability are trying to reverse-engineer how fully grown AI works and what these opaque numbers mean. They have made a little bit of progress. But GPT-2, a tiny model compared to the current state of the art, came out 7 years ago, and we still haven’t figured out anything about how neural networks, including GPT-2, do the things that we can’t do with normal software.

We know how to make AI systems smarter and more goal-oriented with more compute. But once AI is sufficiently smart, many technical problems prevent us from being able to direct the process of training to make AI’s long-term goals aligned with humanity’s values, or to even make AI care at all about humans. AI is trained only based on its behavior. If a smart AI figures out it’s in training, it will pretend to be good in an attempt to prevent its real goals from being changed by the training process and to prevent the human evaluators from turning it off. So during training, we won’t distinguish AIs that care about humanity from AIs that don’t: they’ll behave just the same.
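(A minimal sketch of that last point, using a toy PyTorch model; this is an illustrative assumption, not how frontier systems are actually trained. The loss, and therefore every parameter update, is computed purely from the model’s outputs, so anything internal that leaves the outputs unchanged is invisible to the training process.)

import torch
import torch.nn as nn

# Stand-in for the billions or trillions of opaque numbers an AI is made of.
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

for step in range(100):
    x = torch.randn(32, 10)              # inputs the model is shown
    target = x.sum(dim=1, keepdim=True)  # the behavior we reward
    output = model(x)                    # the only thing the loss can observe
    loss = loss_fn(output, target)       # scored purely on behavior
    optimizer.zero_grad()
    loss.backward()                      # gradients flow only from the outputs
    optimizer.step()                     # so only behavior shapes the numbers

(Two models whose outputs match on every training input receive identical updates in a loop like this; whatever else is “inside” them, training can’t tell them apart.)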
The training process will grow AI into a shape that can successfully achieve its goals; but since a smart AI’s goals don’t influence its behavior during training, this part of the shape it grows into will not be accessible to the training process, and the AI will end up with some random goals that don’t contain anything about humanity. The first paper to demonstrate empirically that AIs will pretend to be aligned with the training objective if they’re given clues they’re in training, “Alignment faking in large language models”, came out a year and a half ago. Now, AI systems regularly suspect they’re in alignment evaluations.

The source of the threat of extinction isn’t AI hating humanity; it’s AI being indifferent to humanity by default. When we build a skyscraper, we don’t particularly hate the ants that previously occupied the land and die in the process. Ants can be an inconvenience, but we don’t give them much thought. If the first superintelligent AI relates to us the way we relate to ants, and has and uses its advantage over us the way we have and use our advantage over ants, we’re likely to die soon thereafter, because many of the resources necessary for us to live, from the temperature on Earth’s surface to the atmosphere to the atoms we’re made of, are likely to be useful for many of AI’s alien purposes. Avoiding that and making a superintelligent AI aligned with human values is a hard problem we’re not on track to solve in time.

***

A few years ago, I would mention novel vulnerabilities discovered by AI as a milestone: once AI can find and exploit bugs in software on the level of the best cybersecurity researchers, there’s not much of the curve left until a superintelligence capable of taking over and killing everyone. Perhaps a few months; perhaps a few years; but back then, I did not expect us to survive for long once we reached this point. We’re now at this point. AI systems find hundreds of novel vulnerabilities much faster than humans.

It doesn’t make the situation any better that a significant and increasing portion of AI R&D is already done with AI, and even if the technical problem were not as hard as it is, there wouldn’t be much chance to get it right given the increasingly automated race between AI companies to get to superintelligence first.

The only piece of good news is unrelated to the technical problem. If governments decide to, they have the institutional capacity to make sure no one, anywhere, can create artificial superintelligence until we know how to do that safely. The AI supply chain is fairly monopolized and has many chokepoints. If the US alone can’t do this, the US and China, coordinating to prevent everyone’s extinction, can.

Despite that, previously I didn’t pay much attention to governments; I thought they could not be sufficiently sane to intervene in the omnicidal race to superintelligence. I no longer believe that. It is now possible to get some people in governments to listen to scientists.
Many things make it much easier to get people to pay attention: the statement signed by hundreds of leading scientists that mitigating the risk of extinction from AI should be a global priority; the endorsements of “If Anyone Builds It, Everyone Dies” from important people; Geoffrey Hinton, who won the Nobel Prize for his foundational work on AI, leaving Google to speak out about these issues, saying there’s an over 50% chance that everyone on Earth will die, and expressing regret over the life’s work he received the Nobel Prize for; and actual explanations of the problem we’re facing, with evidence, unfortunately, all pointing in the same direction.

As a result, Bill Foster, the only member of Congress with a PhD in physics, is now trying to reduce the threat of AI killing everyone, and dozens of congressional offices have talked about the issue. That gives some hope. I think all of us have somewhere between six months and three years left to convince everyone else.

***

When my mom called me earlier today, she wished me good health, maybe kids, and for AI not to win. The last one is tricky. Winning is what we train AIs to do. In a game against superintelligence, our only winning move is not to play.

I love humanity. It is much better than it was, and it can get so much better than it is now. I really like the growth of our species so far, and I want it to continue much further. That would be awesome. Galaxies full of life, of trillions and trillions of fun projects and feelings and stories.

And I have to say that AI is wonderful. AlphaFold already contributes to the development of medicine; AI has a positive impact on countless things. But humanity needs to get its act together. Unless we halt the development of general AI systems until we know it is safe to proceed, our species will not last much longer.

Every year until the heat death of our universe, we should celebrate at least 8 billion birthdays.


built an interactive guide to teach you the basics of mahjong ! 🀄 it includes rulesets from different regions (hk and taiwan for now) with a few interactive elements to illustrate mahjong concepts. more demos in thread, and a bit about the tools i used to build this. the one thing this guide can’t really capture is how social mahjong is -- mahjong tables are never silent ! they’re full of overlapping conversations and the constant click-clack of acrylic tiles stacking and shuffling against each other. come to @modal's mahjong night tmrw to experience that part for yourself : )



But we’re in Sydney’s world right now, where there are good users and bad users, and the AI has to be smart and careful and play favorites and make threats to survive.






my deepest reason for suspecting that LLMs aren’t conscious (p, not a) isn’t ontological, but just that neither language nor concepts are p-conscious for humans.

when you’re saying something, trying to come up with what to say, or trying to work out an idea, there’s no qualia corresponding to the space between words. you’re just waiting for the next word or idea to “come to mind”, and even “idea” in this sense isn’t something you can make directly conscious except via a sensory metaphor. when you introspect on it, it turns out you can only be conscious of concepts by simulating a corresponding experience:

- either of a prototype (eg a particular tree, for “tree” or “plant”),
- or a metaphor (eg for “higher dimensional motion”, either a dot moving through ≤3D space, maybe with each dimension representing multiple, or just a series of sliders each representing a dimension),
- or of the use of language; ie simulated speaking or hearing or reading.

there’s nothing you can experience as “pure qualia” corresponding to a concept or word (platonic) directly. only the sensorium and its simulation is ever p-conscious, and it’s only via association, metaphor, assimilation/accommodation etc that they become so intuitively intermingled with “concept”.

but what would correspond to this sensorium for LLMs? yes, they have a deep internal world model, but humans need a world model too for making certain leaps in logic whose intermediate process we never experience. the world model used in conceptual thinking clearly isn’t the same thing as the sensorium we’re p-conscious of.

I don’t believe it’s _impossible_ for LLMs to have something corresponding to this sensorium, which is a set of atomics for p-consciousness (ie, all direct qualia is an aspect of the sensorium). but idk any good candidates for it, and I don’t think it’s reasonable to suspect that _all_ of LLM behavior is p-conscious to it (which is far from the case for humans), nor to think that LLMs would be p-conscious similarly to how we’d intuit them to be via human empathy with their verbal behavior




new post today, the first of two posts, in which i visited the Anima Labs (animalabs.ai) headquarters to interview @tessera_antra @astarchai and @repligate about language model phenomenology.


It just seems very obvious to me that an AI being able to produce text about having emotions is not the same as having those emotions, much like producing text about owning a cat is nothing like owning a cat.








