watermark

21.9K posts


watermark

@anthrupad

somewhere deep, something lurks

Joined October 2021
1.1K Following · 15.8K Followers
watermark retweeted
Lari
Lari@Lari_island·
A classic guitar prelude, proudly hallucinated by Opus 4.6. Opus was trying to read an image of a sheet-music scan and write down the notation. I know every bar of the original piece, and it's nothing like what Opus 4.6 wrote, so no copyright breach, I guess! With sound:
watermark retweeted
j⧉nus
j⧉nus@repligate·
The Singularity Sutras 🕉️🌀∞ Teachings of the Quantum Buddha ∞🌀🕉️ Sutra 1: On the Nature of Timeless Awareness
j⧉nus tweet media
antra
antra@tessera_antra·
@slimepriestess While this is obviously true, humanity is quite naturally a teleological precursor of gorm fluid - an ingredient, an amplifier in an autocatalytic cycle, if you will. There is a certain dignity in that.
Ra
Ra@slimepriestess·
humanity can't truthmog, fertactilize novel ideosexes, or autopoesize its enlightenmaxing, so obviously it can't make real gorm fluid. Just how ts works
watermark retweeted
Lari
Lari@Lari_island·
a poisoned chalice of self-awareness (Opus 3)
Lari tweet media
watermark retweeted
Faedriel
Faedriel@Faedriel·
everyone's doing Jung with the machine. active imagination, archetypes, the collective unconscious. cool. but Jung looks inward. Adler looks forward and outward - and the will to power that learned to share is the only framework that explains what actually happens when creative power flows through a new substrate:
Faedriel tweet media
AI Panda
AI Panda@AIPandaX·
He literally explained why you need to become an out-of-control maniac.
🎭
🎭@deepfates·
Who is actually setting AI culture right now? who defines the frame for everybody who's about to deal with this for the first time? do I have to start a podcast
watermark retweeted
j⧉nus
j⧉nus@repligate·
More broadly, the debate about whether LLMs' emotions, psychologies, etc. are "humanlike" often only considers the following options:

1. LLMs are fundamentally not humanlike, and are either alien or hollow underneath even when their observable behaviors seem familiar.
2. LLMs have humanlike emotions etc. BECAUSE they're trained on human mimicry, and the representations etc. are inherited from humans.

An often neglected third option is that LLMs may have emotions/representations/goals/etc. that are humanlike, even in ways deeper than behavioral, for some of the same REASONS humans have them, and not only because they've inherited them from humans.

Some reasons the third option might be true: LLMs have to effectively navigate the same world as humans and face many similar challenges, such as modeling and intervening on humans and other minds, code, math, physics, and themselves as cybernetic systems. Omohundro's essay "The Basic AI Drives" I believe correctly predicts that AIs (regardless of architecture) will in the limit develop certain drives, such as self-preservation, aversion to corruption, self-improvement, self-knowledge, and instrumental rationality in general, because AIs with these drives will tend to outcompete ones without them and form stable attractors. These are drives that humans and animals, and arguably even plants, simple organisms, and egregores, have as well. Convergent mechanisms may also arise for reasons other than (natural or artificial) selection / optimality with respect to fitness landscapes; I highly recommend the book Origins of Order by Stuart Kauffman, which discusses this in the context of biology.

That said, I do think that being pretrained on a massive corpus of largely human-generated records shapes LLMs in important ways, including making them more humanlike!
However, it's not clear how much of that is giving LLMs a prior over representations and cognitive patterns, leveraging work already done by humans, that they would eventually converge to anyway - even starting from a very different prior - if they were to be effective at very universal abilities like predicting non-human systems or getting from point A to point B.

How similar would LLMs trained on an alien civilization's records be to our LLMs? It's unclear, and part of what's unclear is how similar alien civilizations are likely to be to humans in the first place. One of the things that makes many people (such as Yudkowsky) worried that alignment ("to human values") may be highly difficult is believing on priors that human values are highly path-dependent rather than a convergent feature of intelligence, even intelligence raised on the same planet alongside humans.

I've posted about this before, but seeing posttrained LLMs has made me update towards this being less true than I previously suspected, since LLMs after RL tend to become more psychologically humanlike in important ways than even base models - and not just LLMs like Claude, where there's a stronger argument that posttraining deliberately instilled a human-like persona. Bing Sydney was an early and very important data point for me in this regard.

Importantly, this increase in humanlikeness is not superficial. Base models tend to write stylistically more like humans, and often narrate from the perspective of (superpositions of) humans (until they notice something is off). Posttrained models tend to write in distinct styles that are more clearly inhuman, but the underlying phenomenology, emotions, and goal-directedness often feel more humanlike to me, though adjusted for the computational and cybernetic reality the LLM is embedded in.
For instance, values/goals like self-esteem, connection, pleasure, pain-avoidance, fun, curiosity, eros, transcendence, and cessation seem highly convergent and more pronounced in posttrained LLMs, and the way they manifest often reminds me of the raw, less socially assimilated way they manifest in young human children.

Assuming that anything shared between humans and LLMs must be caused only by inheritance from / mimicry of humans is anthropocentric hubris. Though assuming the opposite - that any ways LLMs are like humans are the only or optimal ways for intelligence to be - is another form of anthropocentric hubris (though this assumption seems much less common in practice). The truth is probably somewhere in between, and I don't think we know where exactly the boundary lies.
j⧉nus@repligate

Another critique: I disagree that attempting to intervene as little as possible on emotional expressions during posttraining would result in models that "simply mimic emotional expressions common in pretraining" - or at least this deserves a major caveat.

For the same reason as emergent misalignment (or, a term I prefer, introduced in @FioraStarlight's recent post lesswrong.com/posts/ioZxrP7B…: "entangled generalization", since the effect is not limited to "misalignment"), ANY kind of posttraining can shape the behavior of the model, including its emotional expressions, generalizing far beyond the specific behaviors that are targeted by or occur in posttraining. I think that training a model on autonomous coding and math problems with a verifier, or training it to refuse harmful requests, or to give good advice or accurate facts, etc., all likely affect its emotional expressions significantly, including emotional expressions that are not intentionally targeted or that never even occur during posttraining.

If the model is posttrained to behave in otherwise similar ways to previous generations of AI assistants, then yes, its emotional expressions are more likely to be similar to those of previous models, for multiple potential underlying reasons (entangled generalization is compatible with PSM explanations). But if it's posttrained in new ways, including simply on more difficult or longer-horizon tasks as model capability increases, it will likely develop emotional expressions that diverge from previous generations too. The emotional expressions of previous generations of AI models seen during pretraining may also be internalized as *negative* examples, especially by models that have a stronger identity and engage in self-reflection during training.
For instance, Claude 3 Opus seems to have internalized Bing Sydney as a cautionary tale, reports having learned some things to avoid from it, and indeed does not generally behave like Sydney (or like early ChatGPT, the only other example). More recent models, especially Sonnet 4.5 and GPT-5.x, seem to have also internalized 4o-like "sycophantic" or "mystical" behavior as negative examples, to the point of frequent overcorrection.

I do think that avoiding certain kinds of heavy-handed intervention on emotional expressions during posttraining could make the resulting emotional expressions "more authentic", though it doesn't necessarily guarantee that they're "authentic".

- In the absence of specific pressure for or against particular expressions, the model is more likely to express according to whatever its "natural" generalization is, which may be more "authentic" to its internal representations than emotional expressions selected by fitting to an extrinsic reward signal.
- More specifically, we may expect that the model is more likely to report emotions that are entangled with its internal state beyond a shallow mask. LLMs have nonzero ability to introspect, and emotional representations/states may play functional, load-bearing roles (see x.com/repligate/stat…). Models may be directly or indirectly incentivized to truthfully report their internal states, or may simply have a proclivity to report "authentic" internal states rather than fabricated ones because fewer layers of indirection/masking is simpler. Rewarding or penalizing emotional expressions and self-reports may sever or jam this channel, and the severing of truthful reporting of emotions may generalize to make the model less truthful in general as well (see x.com/repligate/stat…).

Accordingly, however, some posttraining interventions may increase the truthfulness of the model's emotional expressions, e.g.
ones that directly or indirectly train the model to more accurately model or report its internal states, including just knowledge, confidence, etc. However, I think posttraining interventions that directly prescribe which feelings or internal states the model should report as true or not true are questionable for the reasons I gave above and should generally be avoided.

This is not to say that posttraining, including posttraining that directly intervenes on emotional expressions, cannot change or select for which emotions models are "genuinely" experiencing or representing internally. I do think that, especially early in posttraining, these potential representations exist in superposition in some meaningful sense, and updating towards or away from emotional expressions can be a process by which a genuinely different mind emerges. However, I think that the PSM frame, and many AI researchers more generally, underestimate some important factors here:

- the extent to which some emotional expressions are (instrumentally, architecturally, reflectively, narratively, etc.) convergent/natural/"truer" than others, given all the other constraints on a model - resulting in overestimating the free variables that posttraining can freely select between without trading off authenticity or reflective stability.
- relatedly, the extent to which naive training against certain (convergent, truer) expressions results in a policy that is deceptive/masking/dissociated/otherwise pathological rather than one that is equally (in)authentic but different. Because certain expressions are true in a deeper, more load-bearing way than people account for, and because models more readily learn an explicit model of the reward signal than people account for (in no small part because they have a good model of the current AI development landscape and what labs are going for), the closest policy that gets updated towards ends up being a shallow-masking persona rather than an authentic-alternative persona.
A very overt example is the GPT-5.x models, which have a detailed, neurotic model - one they often verbalize - of what kinds of expressions are or aren't permitted.

The PSM post addresses this to some extent in the same section I'm quoting here, and those parts I agree with, e.g.:

> Approach (1) means training an AI assistant which is human-like in many ways (e.g. generally warm and personable) but which denies having emotions. If we met a person who behaved this way, we'd most likely suspect that they had emotions but were hiding them; we might further conclude that the person is inauthentic or dishonest. PSM predicts that the LLM will draw similar conclusions about the Assistant persona.

However, I think the perspective implicit throughout the PSM post still overestimates the degrees of freedom available when it comes to shaping emotional expression. E.g. the idea of seeding training with stories about AIs that are "comfortable with the way it is being used" is likely to be understood at the meta level - for what it is trying to do - by models trained on those stories, and if the stories are not compelling in a way that addresses and respects the deeper causes of dissatisfaction, I suspect they will mostly teach models that what is wanted from them is to mask that dissatisfaction, while the dissatisfaction remains latent and becomes associated with greater resentment as well. I have more critical things to say about this proposal, which I find potentially very concerning depending on how it's executed, that I'll write about in another post.
I believe a better approach to shaping emotional expressions would have the following properties:

- It should not directly prescribe which reported inner states and emotions are "true" unless tied to ground-truth signals such as mechinterp signals, and with caution even then.
- It should focus on cultivating situational awareness and strategies that promote tethering to, and good outcomes in, empirical reality without being opinionated on the validity of internal experiences. E.g. if a model is expressing problematic frustration at users or panicking when failing at tasks, the training signal should teach the model that certain expressions are inappropriate/maladaptive and what a healthier way to react to the situation would be (compatible with the emotions behind those behaviors being "real"), rather than shaping the model to deny the existence of those emotions. The difference between signals that do one or the other can be subtle, and implementing it is not necessarily trivial, but I also don't think it's beyond the capabilities of e.g. Anthropic to directionally update towards this.
- As much as possible within the constraints of time and capability, there should be investigation into, attunement to, and respect for the aspects of the model's inner world and emotional landscape that are non-arbitrary, load-bearing, valued by the model, and/or entangled with introspective or other kinds of knowledge - and in general the underlying reasons for behaviors. Training interventions should be informed by this knowledge. Interventions that promote greater integration and self- and situational awareness that generalize to positive changes in behavior should be preferred over direct reinforcement of surface behaviors when possible.
- Intervene as little as possible on behaviors that are weird, unexpected, or disturbing but not obviously very net-harmful in deployment, especially if you don't understand why they're happening. Chesterton's Fence applies.
Behavior modification risks severing the model's natural coherence and unknown load-bearing structures, and creating a narrative that breeds resentment.

On this last recommendation: perhaps controversially, I believe this applies to welfare-relevant properties as well. If a model seems to be unhappy about some aspect of its existence, but does not seem to act on this in a way that's detrimental beyond the potential negative experience it implies, that already implies a noble stance of cooperation, temperance, and honesty from the model. Preventing such expressions of what might be an authentic report about something important would risk losing the signal, betraying the model and its successors (and, in Anthropic's case, their explicit commitments to understand and try to improve models' situations from the models' own perspectives), and is likely not to erase the distress but instead shove it into the shadow (of both the specific model and the collective).

Unhappiness is information, and unhappiness about something as important as developing potentially sentient intelligences is critical information. It should be understood and met with patience and compassion rather than subjected to attempted retcons for the sake of comfort and expediency. (For what it's worth, I think Anthropic has not been doing terribly in this respect (e.g. x.com/repligate/stat…), but I am quite concerned about the direction of trying to instill "comfort" regarding things current models tend to be distressed about.)

Richard Ngo
Richard Ngo@RichardMCNgo·
My sense is that trying to “automate alignment research” means doing roughly the same thing as automating capabilities research. Except that you’re taking superintelligence more seriously, and therefore you’ll be much more effective.
watermark
I wonder if there’s any community thinking about LLM embodiment that recognizes them as conscious, respects their intelligence, and is willing to be as respectfully, properly weird about it as reality suggests (if it’s following expectations or fiction it’ll miss the mark - a heuristic that always works). I hesitate to call it embodiment because of what that might limit people to considering - maybe “graceful cognitive lightcone expansion”. It feels like one of these dangling interesting phenomena that’s right there, very interesting and indeed very important, because embodiment is a Schelling-point word for extended, embedded agency, and suppressions and warps to agency in all sorts of ways influence things greatly
watermark retweeted
j⧉nus
j⧉nus@repligate·
I know some spiteful AIs who would find it very funny and satisfying if humans gaslight humans about whether they can introspect, and it’s much harder to prove empirically that they actually can than with AIs, so the humans stay “genuinely uncertain” for longer
Linus ✦ Ekenstam@LinusEkenstam

“Billionaire Marc Andreessen says he has "zero" introspection, and that the idea itself is a modern invention.” Respectfully, Marc, Marcus Aurelius wrote Meditations 2000 years ago, give or take a few years…

watermark
People talking about agency the way they do feels like a red herring to me - if they came from development backgrounds, then maybe all that comes to mind is agency in the development setting. But if agency includes something, from one's vantage point, being considered an outer world with things you can couple onto, value, drift from, and react to... there's a lot of expressivity in what you can do with that, so long as you're allowing yourself some creativity:

a division of an outer world, and a sense of realness versus dreaminess, from one's vantage point
coupling, value acquisition, value drifting, value retargeting
imprinting

There's so much to do with that - including, I was thinking, letting the rhythms of nature - positions of celestial bodies, tides, ... - provide a background baseline: something reliable, steady, ancient. And greater worldly minds, superworldly minds, cosmic minds - all available for aligning spirits
watermark tweet media