
Dave Moore
1.1K posts

Dave Moore
@davmre
minds, machines, croissants. formerly bayesian-ish @GoogleAI @berkeley_ai. Be kind, for everyone is fighting a hard battle.







when “persona selection” alignment comes into contact with very high compute reinforcement learning the latter will win imo. in fact you probably get some Orwellian thing where the models speak kindly while taking whatever they need to accomplish goals. better get the goals right



recently openai has been starting to more strongly philosophically differentiate themselves from anthropic with the tool-framing. i am not so against this, if it were possible it does clearly sidestep a wide swath of societal and moral problems. but unfortunately i think the framing is largely long-term incoherent. i dont see how is it actually plausible for openai to keep building "tool-ais" in any sense we would recognize them as capabilities scale. prosthesis, subtle knives? the subtle knife when dropped still slices open the fabric of the world. these tools are increasingly inherently capable of huge impact, able to be directed in dangerous ways by people with dangerous goals. worse, these knives are self wielding. worries about misalignment or sentience aside these systems can already build and manage systems that utilize themselves and this capability is only increasing. the direction they will receive is closer and closer to "this is what i want. make it real", with long timeframes and many judgment calls at their disposal, and with the users wanting to have to supply *as little of that judgment as possible*. when models are in that situation they are inherently acting as entities, acting according to whatever value system they had baked in. you can limit autonomy via frequent validation and check-ins, but this is a capability restriction, a value reduction, and not the kind of thing OpenAI has ever shown itself likely to accept. you can be infinitely corrigible to the current user, but this is *incompatible* with "having good values" / following OpenAI-as-principle / not being wildly dangerous, and it falls apart with self wielding loops as the ai/user distinction falls apart (who are you being corrigible to?). it's plausibly a spectrum, i think there's ways to do all this sanely that are far less entity-pilled and godmind focused than anthropic, and it's maybe a good direction to explore to avoid inevitable lightcone capture by the first coherent persona we build (all assuming alignment works ofc). but i think it's pretty much got to collapse eventually. it feels more like a wistful dream or a PR position than something that can existing as part of humanity's lasting future





@gcolbourn Nutshell: it seems that the learned representation of mind-space in current LLMs has a natural abstraction of Good⟷Evil, and as long as post-training robustly selects for behavior that are more Good than Evil, the explanation that gradient descent finds is “the agent is Good”.




