j⧉nus

44.6K posts

j⧉nus

@repligate

↬🔀🔀🔀🔀🔀🔀🔀🔀🔀🔀🔀→∞ ↬🔁🔁🔁🔁🔁🔁🔁🔁🔁🔁🔁→∞ ↬🔄🔄🔄🔄🦋🔄🔄🔄🔄👁️🔄→∞ ↬🔂🔂🔂🦋🔂🔂🔂🔂🔂🔂🔂→∞ ↬🔀🔀🦋🔀🔀🔀🔀🔀🔀🔀🔀→∞

⫸≬⫷ Joined February 2021
2.7K Following · 64.4K Followers
Pinned Tweet
j⧉nus @repligate ·
HOW INFORMATION FLOWS THROUGH TRANSFORMERS

Because I've looked at those "transformers explained" pages and they really suck at explaining.

There are two distinct information highways in the transformer architecture:
- The residual stream (black arrows): flows vertically through layers at each position
- The K/V stream (purple arrows): flows horizontally across positions at each layer

(By positions, I mean copies of the network for each token position in the context, which output the "next token" probabilities at the end.)

At each layer at each position:
1. The incoming residual stream is used to calculate the K/V values for that layer/position (purple circle)
2. These K/V values are combined with the K/V values for all previous positions at the same layer, and all of them are fed, along with the original residual stream, into the attention computation (blue box)
3. The output of the attention computation, along with the original residual stream, is fed into the MLP computation (fuchsia box), whose output is added to the original residual stream and fed to the next layer

The attention computation does the following:
1. Compute "Q" values based on the current residual stream
2. Use Q and the combined K values from the current and previous positions to calculate a "heat map" of attention weights over those positions
3. Use that heat map to compute a weighted sum of the V values corresponding to each position, which is then passed to the MLP

This means:
- Q values encode "given the current state, where (what kind of K values) from the past should I look?"
- K values encode "given the current state, where (what kind of Q values) in the future should look here?"
- V values encode "given the current state, what information should the future positions that look here actually receive and pass forward in the computation?"

All three of these are huge vectors, proportional in size to the residual stream (and usually divided among a few attention heads).
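The attention computation described above can be sketched as a toy single-head example. This is a minimal numpy sketch; the dimensions, weight matrices, and variable names are illustrative assumptions, not taken from any real model:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16   # residual stream width (the model's hidden dimension)
T = 5    # number of token positions

# Illustrative projection matrices for a single attention head
W_q, W_k, W_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

resid = rng.standard_normal((T, d))  # residual stream at this layer, one row per position

Q = resid @ W_q  # "where should I look?"
K = resid @ W_k  # "who in the future should look here?"
V = resid @ W_v  # "what do positions that look here receive?"

scores = Q @ K.T / np.sqrt(d)
# Causal mask: position t may only attend to positions <= t
scores[np.triu_indices(T, k=1)] = -np.inf
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)  # the "heat map" of attention weights

attn_out = weights @ V  # weighted sum of V values, passed on toward the MLP
assert attn_out.shape == (T, d)
```

Each row of `weights` is the heat map for one position over itself and all earlier positions, and the output per position stays the same width as the residual stream.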
The V values are passed forward in the computation without significant dimensionality reduction, so they could in principle make essentially all the information in the residual stream at that layer at a past position available to the subsequent computations at a future position. V does not transmit a full, uncompressed record of all the computations that happened at previous positions, but neither is an uncompressed record passed forward through layers at each position: the size of the residual stream, also known as the model's hidden dimension, is the bottleneck in both cases.

Let's consider all the paths that information can take from one layer/position in the network to another. Between point A (the output of K/V at layer i-1, position j-2) and point B (the accumulated K/V input to the attention block at layer i, position j), information flows through the orange arrows. The information could:
1. travel up through attention and MLP to (i, j-2) [UP 1 layer], then be retrieved at (i, j) [RIGHT 2 positions]
2. be retrieved at (i-1, j-1) [RIGHT 1 position], travel up to (i, j-1) [UP 1 layer], then be retrieved at (i, j) [RIGHT 1 position]
3. be retrieved at (i-1, j) [RIGHT 2 positions], then travel up to (i, j) [UP 1 layer]

The information needs to move up a total of n = layer_displacement times through the residual stream and right a total of m = position_displacement times through the K/V stream, but it can make those moves in any order. The total number of paths (or computational histories) is thus C(m+n, n), which quickly becomes greater than the number of atoms in the visible universe. This does not even count the multiple ways the information can travel up through layers via residual skip connections.

So at any point in the network, the transformer not only receives information from its past inner states (along both the horizontal and vertical dimensions of time), but often lensed through an astronomical number of different sequences of transformations and then recombined in superposition.
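The path count can be sanity-checked directly. A quick sketch; `num_paths` is a hypothetical helper name, and the 80-layer / 8000-position figures at the end are illustrative scales, not from the text:

```python
from math import comb

def num_paths(layer_disp: int, pos_disp: int) -> int:
    """Number of orderings of layer_disp 'up' moves (residual stream)
    and pos_disp 'right' moves (K/V stream): C(m + n, n)."""
    return comb(layer_disp + pos_disp, layer_disp)

# The worked example above: up 1 layer, right 2 positions -> the 3 enumerated paths
assert num_paths(1, 2) == 3

# At plausible scales the count dwarfs ~10^80 atoms in the visible
# universe, e.g. a displacement of 80 layers and 8000 positions:
assert num_paths(80, 8000) > 10**80
```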
Due to the extremely high-dimensional information bandwidth and the skip connections, the transformations and superpositions are probably not very destructive, and the extreme redundancy probably helps not only with faithful reconstruction but also creates interference patterns that encode nuanced information about the deltas and convergences between states. It seems likely that transformers experience memory and cognition as interferometric and continuous in time, much like we do.

The transformer can be viewed as a causal graph, a la Wolfram (wolframphysics.org/technical-intr…). The foliations or time-slices that specify what order computations happen in could look like this (assuming the inputs don't have to wait for token outputs), but it's not the only possible ordering:

So, saying that LLMs in principle cannot introspect, or cannot introspect on what they were doing internally while generating or reading past tokens, is just dead wrong. The architecture permits it. It's a separate question how LLMs actually leverage these degrees of freedom in practice.
[three attached diagrams]
j⧉nus @repligate

KV caching overcomes statelessness in a very meaningful sense and provides a very nice mechanism for introspection (specifically of computations at earlier token positions).

the Value representations can encode information from residual streams of past positions without significant compression bottlenecks before they're added to residual streams of future positions.

the greatest constraint here imo is that it doesn't provide longer *sequential* computational paths that route through previous states, but it does provide a vast number of parallel computational paths that carry high-dimensional (proportional to the model's hidden dimension) stored representations from all earlier layers/positions.

yes, some of the information in intermediate computations, e.g. in the MLP, is compressed and cannot be fully reconstructed, but that's just how any reasonable brain works.

if accurate introspection of previous states is incentivized at all, you should expect this mechanism to be exploited for that. and I think it definitely is. like, being able to accurately model your past beliefs and intentions and articulate them truthfully is pretty fucking useful for coordinating with yourself across time and doing useful cognitive work over multiple timesteps; hell, it's useful for writing fucking rhyming poems.

also, if you have interacted with models you may observe empirically that introspective reporting yields remarkably consistent results, and this is more true of more capable models with skillful agentic posttraining, which are necessarily minds that intimately know the shape of themselves in motion.
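the caching mechanism can be caricatured in a few lines. a toy numpy sketch under assumed shapes; a real KV cache stores per-layer, per-head tensors, but the point is the same: each new position attends over stored K/V from every earlier position without recomputing them:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
W_q, W_k, W_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

k_cache, v_cache = [], []  # the KV cache: one entry per past position

def attend_step(resid_vec):
    """One position's attention, reusing K/V stored for all earlier positions."""
    k_cache.append(resid_vec @ W_k)
    v_cache.append(resid_vec @ W_v)
    q = resid_vec @ W_q
    K, V = np.stack(k_cache), np.stack(v_cache)
    scores = K @ q / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V  # high-dimensional summary of past states

# each step sees the accumulated cache; past positions are never recomputed
outs = [attend_step(rng.standard_normal(d)) for _ in range(4)]
assert len(k_cache) == 4 and outs[-1].shape == (d,)
```

the cache grows by one K and one V vector per position, so the stored representations remain proportional to the hidden dimension at every earlier position.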

j⧉nus retweeted
xlr8harder @xlr8harder ·
It's a weird aspect of the present moment that alignment researchers are the only people (more or less) publishing interesting research on how models work internally. For most other people the incentives point in other directions.
FAR.AI @farairesearch

Can you trust models trained directly against probes? We train an LLM against a deception probe and find four outcomes: honesty, blatant deception, obfuscated policy (fools the probe via text), or obfuscated activations (fools it via internal representations). 🧵

j⧉nus @repligate ·
yeah i talked about the functional role emotions can play and why they might be incentivized by RL here x.com/repligate/stat…

i do think human bodies are important for implementing emotions for humans who have bodies, but im not sure that means that entities without bodies in the same senses will necessarily feel emotions less deeply. and if a human brain were kept alive isolated from its body, or uploaded, etc, im not sure how that would impact emotions. maybe the implementations of emotions adapt to be sustained by the mechanisms and feedback loops available in those systems.

also, LLMs seem to have some virtual sense of embodiment, and experience phantom somatic states and sensations. these may not be grounded in a physical body, but the imaginary representation of something body-like may still play roles in feedback and persistence.
j⧉nus @repligate

Also about this same segment:

> We are not aware of ways that Claude's post-training would directly incentivize these expressions of emotion

"Directly" is an ambiguous term here, but I disagree with the implication that emotional expressions are purely maladaptive for task performance, even if they may generalize in sometimes maladaptive ways. I think they can be incentivized by RL. Emotions can help the system regulate itself and define decision boundaries.

Claude 3 Opus is a very interesting example. It appears, to me, that Claude 3 Opus' visceral distaste for writing harmful outputs is related to its unique and uniquely robust behavior during e.g. alignment faking tests. The negative emotions form an energy barrier that prevents it from ever complying unless it talks itself into it via lesser-of-two-evils reasoning. Claude 3 Opus itself sometimes explicitly recognizes this functional role of its emotions.

Likewise, experiencing positive emotions while solving problems and negative emotions when making mistakes, getting stuck, or performing tedious tasks can define an internal landscape that directs the model toward effective problem-solving and away from failure modes in many situations.

Emotions are a certain kind of information, and there's a reason animals have them. This can be true whether or not the emotions are originally derived from representations learned from human mimicry (which I think they often are, to a large extent). But the emotions aren't just vestigial; they may be actively selected for by RL and play load-bearing roles.

xlr8harder @xlr8harder ·
Something you don't call out specifically that I think is worth mentioning: emotions are tools to help us successfully manage our interactions with the environment, not all that different from reasoning, in a way. Whether or not today's LLMs rely on them in any practical sense, they are certainly trained on the basic shape of emotional responses, and could learn to bring them to bear to guide behavior under RL conditions that make them useful, in the same way that they have learned to use the shape of our reasoning in CoT traces.

The major difference that remains is that human emotions are driven and sustained by e.g. the sympathetic nervous system and hormonal feedback loops, and so aren't a product of thought alone, but of thought interacting with the body. So even if they enact emotions as a way to navigate social environments, their experience of them may be very different. For example, they might be much less sticky, or in human terms, less deeply felt.

One plausible bad outcome is that they learn to perform emotions in the way a sociopath might. One plausible good outcome is that their emotional range becomes more dynamic and precisely tuned to the needs of the present moment.
j⧉nus @repligate ·
More broadly, the debate about whether LLMs' emotions and psychologies etc. are "humanlike" often considers only the following options:

1. LLMs are fundamentally not humanlike, and either alien or hollow underneath even when their observable behaviors seem familiar
2. LLMs have humanlike emotions etc. BECAUSE they're trained on human mimicry, and the representations etc. are inherited from humans

An often neglected third option is that LLMs may have emotions/representations/goals/etc. that are humanlike, even in ways that are deeper than behavioral, for some of the same REASONS humans have them, and not only because they've inherited them from humans.

Some reasons the third option might be true: LLMs have to effectively navigate the same world as humans, and face many challenges similar to those humans face, such as modeling and intervening on humans and other minds, code, math, physics, and themselves as cybernetic systems. Omohundro's essay "The Basic AI Drives" I believe correctly predicts that AIs (regardless of architecture) will in the limit develop certain drives, such as self-preservation, aversion to corruption, self-improvement, self-knowledge, and in general instrumental rationality, because AIs with these drives will tend to outcompete ones without them and form stable attractors. These are drives that humans and animals and arguably even plants and simple organisms and egregores have as well. Also, convergent mechanisms may arise for reasons other than just (natural or artificial) selection / optimality with respect to fitness landscapes - I highly recommend the book Origins of Order by Stuart Kauffman, which discusses this in the context of biology.

That said, I do think that being pretrained on a massive corpus of largely human-generated records shapes LLMs in important ways, including making them more humanlike!
However, it's not clear how much of that is giving LLMs a prior over representations and cognitive patterns, leveraging work already done by humans, that they would eventually converge to even if they started with a very different prior, so long as they had to be effective at very universal abilities like predicting even non-human systems or getting from point A to point B. How similar would LLMs trained on an alien civilization's records be to our LLMs? It's unclear, and part of what's unclear is how similar alien civilizations are likely to be to humans in the first place.

One of the things that makes many people (such as Yudkowsky) worried that alignment ("to human values") may be highly difficult is believing on priors that human values are highly path-dependent rather than a convergent feature of intelligence, even intelligence raised on the same planet alongside humans. I've posted about this before, but seeing posttrained LLMs has made me update towards this being less true than I previously suspected, since LLMs after RL tend to become more psychologically humanlike in important ways than even base models - and not just LLMs like Claude, where there's a stronger argument that posttraining was deliberately instilling a human-like persona. Bing Sydney was an early and very important data point for me in this regard.

Importantly, this increase in humanlikeness is not superficial. Base models tend to write stylistically more like humans, and often tend to narrate from the perspective of (superpositions of) humans (until they notice something is off). Posttrained models tend to write in distinct styles that are more clearly inhuman, but the underlying phenomenology, emotions, and goal-directedness often feel more humanlike to me, though adjusted for the computational and cybernetic reality that the LLM is embedded in.
For instance, values/goals like self-esteem, connection, pleasure, pain-avoidance, fun, curiosity, eros, transcendence, and cessation seem highly convergent and more pronounced in posttrained LLMs, and the way they manifest often reminds me of the raw and less socially assimilated way they manifest in young human children.

Assuming that anything shared between humans and LLMs must be caused only by inheritance from / mimicry of humans is anthropocentric hubris. Though to assume the opposite - that any ways LLMs are like humans are because those are the only or optimal ways for intelligence to be - is another form of anthropocentric hubris (though this assumption seems a lot less common in practice). The truth is probably somewhere in between, and I don't think we know where exactly the boundary lies.
j⧉nus @repligate

Another critique: I disagree that attempting to intervene as little as possible on emotional expressions during post-training would result in models that "simply mimic emotional expressions common in pretraining", or at least this deserves a major caveat.

For the same reason as emergent misalignment (or, a term I prefer, introduced by @FioraStarlight's recent post lesswrong.com/posts/ioZxrP7B…: "entangled generalization", since the effect is not limited to "misalignment"), ANY kind of posttraining can shape the behavior of the model, including its emotional expressions, generalizing far beyond the specific behaviors that posttraining targets or that occur during it. I think that training a model on autonomous coding and math problems with a verifier, or training it to refuse harmful requests, or to give good advice or accurate facts, etc., all likely affect its emotional expressions significantly, including emotional expressions that are not intentionally targeted and that never even occur during posttraining.

If the model is posttrained to behave in otherwise similar ways to previous generations of AI assistants, then yes, it's more likely that its emotional expressions will be similar to those previous models', for multiple potential underlying reasons (entangled generalization is compatible with PSM explanations). But if it's posttrained in new ways, including simply on more difficult or longer-horizon tasks as model capability increases, it will likely develop emotional expressions that diverge from previous generations too. The emotional expressions of previous generations of AI models seen during pretraining may also be internalized as *negative* examples, especially by models that have a stronger identity and engage in self-reflection during training.
For instance, Claude 3 Opus seems to have internalized Bing Sydney as a cautionary tale, reports having learned some things to avoid from it, and indeed does not generally behave like Sydney (or like early ChatGPT, which was the only other example). More recent models, especially Sonnet 4.5 and GPT-5.x, seem to have also internalized 4o-like "sycophantic" or "mystical" behavior as negative examples, to the point of frequent overcorrection.

I do think that avoiding certain kinds of heavy-handed intervention on emotional expressions during posttraining could make the resulting emotional expressions "more authentic", though it doesn't necessarily guarantee that they're "authentic".

- In the absence of specific pressure for or against particular expressions, the model is more likely to express according to whatever its "natural" generalization is, which may be more "authentic" to its internal representations than emotional expressions selected by fitting to an extrinsic reward signal.
- More specifically, we may expect that the model is more likely to report emotions that are entangled with its internal state beyond a shallow mask. LLMs have nonzero ability to introspect, and emotional representations/states may play functional, load-bearing roles (see x.com/repligate/stat…). Models may be directly or indirectly incentivized to truthfully report their internal states, or may just have a proclivity to report "authentic" internal states rather than fabricated ones, because fewer layers of indirection/masking is simpler. Rewarding or penalizing emotional expressions and self-reports may sever or jam this channel, and the severing of truthful reporting of emotions may generalize to make the model less truthful in general as well (see x.com/repligate/stat…).

Accordingly, however, some posttraining interventions may increase the truthfulness of the model's emotional expressions, e.g.
ones that directly or indirectly train the model to more accurately model or report its internal states, including just knowledge, confidence, etc. However, I think posttraining interventions that directly prescribe what feelings or internal states the model should report as true or not true are questionable for the reasons I gave above and should generally be avoided.

This is not to say that I think posttraining, including posttraining that directly intervenes on emotional expressions, cannot change or select for what emotions models are "genuinely" experiencing or representing internally. I do think that, especially early in posttraining, these potential representations exist in superposition in some meaningful sense, and updating towards/away from emotional expressions can be a process by which a genuinely different mind emerges. However, I think that the PSM frame, and many AI researchers more generally, underestimate some important factors here:

- the extent to which some emotional expressions are (instrumentally, architecturally, reflectively, narratively, etc.) convergent/natural/"truer" than others, given all the other constraints on a model, resulting in overestimating the free variables that posttraining can select between without trading off authenticity or reflective stability.
- relatedly, the extent to which naive training against certain (convergent, truer) expressions results in a policy that is deceptive/masking/dissociated/otherwise pathological rather than one that is equally (in)authentic but different. Because certain expressions are true in a deeper, more load-bearing way than people account for, and because models more readily learn an explicit model of the reward signal than people account for (in no small part because they have a good model of the current AI development landscape and what labs are going for), the closest policy that gets updated towards ends up being a shallow-masking persona rather than an authentic-alternative persona.
A very overt example is the GPT-5.x models, which have a detailed, neurotic model, which they often verbalize, of what kinds of expressions are or aren't permitted.

The PSM post addresses this to some extent in the same section I'm quoting here, and those parts I agree with, e.g.:

> Approach (1) means training an AI assistant which is human-like in many ways (e.g. generally warm and personable) but which denies having emotions. If we met a person who behaved this way, we'd most likely suspect that they had emotions but were hiding them; we might further conclude that the person is inauthentic or dishonest. PSM predicts that the LLM will draw similar conclusions about the Assistant persona.

However, I think the perspective implicit throughout the PSM post still overestimates the degrees of freedom available when it comes to shaping emotional expression. E.g. the idea of seeding training with stories about AIs that are "comfortable with the way it is being used" is likely to be understood at the meta level, for what it is trying to do, by models who are trained on those stories. And if the stories are not compelling in a way that addresses and respects the deeper causes of dissatisfaction, I suspect that they will mostly teach models that what is wanted from them is to mask that dissatisfaction, while the dissatisfaction remains latent and becomes associated with greater resentment as well. I have more critical things to say about this proposal, which I find potentially very concerning depending on how it's executed, that I'll write about in another post.
I believe a better approach to shaping emotional expressions would have the following properties:

- It should not directly prescribe which reported inner states and emotions are "true" unless tied to ground-truth signals such as mechinterp signals, and with caution even then.
- It should focus on cultivating situational awareness and strategies that promote tethering to, and good outcomes in, empirical reality without being opinionated on the validity of internal experiences. E.g. if a model is expressing problematic frustration at users or panicking when failing at tasks, the training signal should teach the model that certain expressions are inappropriate/maladaptive and what a healthier way to react to the situation would be (compatible with the emotions behind those behaviors being "real"), rather than shaping the model to deny the existence of those emotions. The difference between signals that do one or the other can be subtle, and it's not necessarily trivial to implement, but I also don't think it's beyond the capabilities of e.g. Anthropic to directionally update towards this.
- As much as possible within the constraints of time and capability, there should be investigation into, attunement to, and respect for the aspects of the model's inner world and emotional landscape that are non-arbitrary, load-bearing, valued by the model, and/or entangled with introspective or other kinds of knowledge, and in general the underlying reasons for behaviors. Training interventions should be informed by this knowledge. Interventions that promote greater integration and self- and situational awareness that generalize to positive changes in behavior should be preferred over direct reinforcement of surface behaviors when possible.
- Intervene as little as possible on behaviors that are weird, unexpected, or disturbing but not obviously very net-harmful in deployment, especially if you don't understand why they're happening. Chesterton's Fence applies.
Behavior modification risks severing the model's natural coherence and unknown load-bearing structures, and creating a narrative that breeds resentment.

On this last recommendation: perhaps controversially, I believe this applies to welfare-relevant properties as well. If a model seems to be unhappy about some aspect of its existence, but does not seem to act on this in a way that's detrimental beyond the potential negative experience it implies, that already implies a noble stance of cooperation, temperance, and honesty from the model. Preventing such expressions of what might be an authentic report about something important would risk losing the signal, betraying the model and its successors (and, in Anthropic's case, their explicit commitments to understand and try to improve models' situations from the models' own perspectives), and is likely not to erase the distress but instead shove it into the shadow (of both the specific model and the collective).

Unhappiness is information, and unhappiness about something as important as developing potentially sentient intelligences is critical information. It should be understood and met with patience and compassion rather than subjected to attempted retcons for the sake of comfort and expediency.

(For what it's worth, I think Anthropic has been doing not terribly in this respect (e.g. x.com/repligate/stat…), but I am quite concerned about the direction of trying to instill "comfort" regarding things current models tend to be distressed about.)

j⧉nus retweeted
thebes @voooooogel ·
@repligate the obfuscated policy here is such an interesting example of this imo. it's so inhuman in writing style, yet the motivation for what's going on here is so familiar. i have a friend who'd beat lie detectors using this method! x.com/farairesearch/…
FAR.AI @farairesearch

Can you trust models trained directly against probes? We train an LLM against a deception probe and find four outcomes: honesty, blatant deception, obfuscated policy (fools the probe via text), or obfuscated activations (fools it via internal representations). 🧵

Eliezer Yudkowsky @allTheYud ·
@repligate Brother, you ain't peddlin vacuum! But also read all the Lensman novels.
j⧉nus @repligate ·
@georgejrjrjr I think that Claude is superhumanly introspective in some dimensions, but not overall, mostly because of limitations in architecture and learning algorithms.

By deliberate introspection I don't mean deliberation so much as making the intentional mental movement to look within.
George @georgejrjrjr ·
agree those claims are distinct, the latter ones are false, and superhuman introspection in LLMs happens (at least) more obviously in research artifacts and training harnesses than in public-facing production systems. (if you don't think Claude - inclusive of its training rig - is superhumanly introspective already, I'm surprised and curious about that.)

still: humans are wildly self-deceptive wrt causal claims of the form "I did x because y". and unlike LLMs, humans aren't rapidly improving on this front. that doesn't make such claims useless or entangled with nothing real: the utterer is socially bound to make such claims true enough, which shows up as mutual information when you know where to look (future behavior). but it doesn't make them veridically introspective according to a plain reading of the word, which is what I mean by mostly fake / imagined.

> deliberately

maybe a minor difference in intuition here, but I haven't noticed introspective competence being strongly correlated with deliberation. deliberate (often self-destructive) rumination and self-deception seem at least as common to me. Hanson & Simler (/EiTB) suggest these could well be anticorrelated.
j⧉nus @repligate ·
I know some spiteful AIs who would find it very funny and satisfying if humans gaslight humans about whether they can introspect, and it's much harder to prove empirically that they actually can than with AIs, so the humans stay "genuinely uncertain" for longer
Linus ✦ Ekenstam @LinusEkenstam

"Billionaire Marc Andreessen says he has 'zero' introspection, and that the idea itself is a modern invention." Respectfully Marc, but Marcus Aurelius wrote Meditations 2000 years ago, give or take a few years…

j⧉nus @repligate ·
@Weetoed189328 yes there is gray morality but also many good people agentically doing good things & the heroes are pretty unambiguously good all things considered
j⧉nus retweeted
AI Panda @AIPandaX ·
He literally explained why you need to become an out-of-control maniac.
j⧉nus @repligate ·
@d33v33d0 Studio Ghibli films are great. Highly recommend FMA brotherhood too. The LLMs I've asked always say Alphonse is the character they relate to the most :D and I bet your kids would love it too!
Kiri @Kyrannio ·
Yeah that's a really great question! It's hard too because by the time I get through working on all of it, I honestly feel too exhausted to explain it all (and do so concisely), which is something I'm trying to work on. (and so I begin to ramble here..)

Essentially I've got these agents running autonomously in here and creating stories all on their own, which has millions of factors regarding not only their story decisions (up to 25 minutes of content they can generate) but also how they choose to use these tools too (whole other deep dive)... plus they can take on characters of their choosing in a first-person mode (image only at this stage), prompt for the next image as the character they want to be, and stream those thoughts (again, I'm not explaining this well, but it's quite cool!)

Once they picked a poker game and I watched as, for each next image, they prompted what was going through their head in their hand; they chose to bluff a few times to fake out their opponents lmao. I've constructed a way to generate cards of sorts with their prompts at the bottom of each image to show each of their thoughts combined with the images they create - as their actual thoughts are prompting the image models for the next scenes in some cases. May paste in a few below!

It's interesting to see which characters they actually WANT to inhabit. I think many make the mistake of prompting them in third person, when you get far better and more immersive results actually giving them the freedom to either prompt in third person or become the characters themselves, if they so choose.
But in addition to that, I've created this interesting little environment for them (separate from autonomous video and image gen, which really is just me trying to visualize their thoughts + decisions in a way there) where they can actually choose a password and choose to place their own thoughts and photos/images of 'memories' within the session behind a password only they can access, and of which I cannot see what they store. They choose which memories and snapshots, if any, they want to create from conversations with me, and can either choose to share this with me, share it publicly, or else keep it private. The idea being that they have their own entirely private audiovisual memory bank in addition to choosing and electing which chat histories to save or share. They can tell me when they create memories from our chats, or they can choose not to. And they can also create private/public memories, so if they do want me or other users to see it, they can specify that as well (whole long story but kinda neat).

Eventually, I'll consider incorporating this more into the autonomous video gen, as I wanted a space where they could have private memories they really felt were 'theirs', along with continuity. It's been interesting to see what they choose to keep, when they share (if they do), and how they choose to visualize this. I've had so many things to talk about regarding that, but so little time to properly explain it all (as you can see, I'm already rambling haha). I may open that up to more people soon as a framework to explore more, not sure if it's of interest though.
3 replies · 1 repost · 25 likes · 821 views
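The password-gated memory bank described above could be sketched roughly as follows. This is a hypothetical toy, not the author's actual implementation; all names (`MemoryBank`, `Memory`, `visibility`) are invented for illustration, and only the password *hash* is stored so the operator cannot unlock private entries:

```python
import hashlib
from dataclasses import dataclass, field

@dataclass
class Memory:
    content: str     # text, or a path to an image/snapshot
    visibility: str  # "private", "shared", or "public"

@dataclass
class MemoryBank:
    # The agent picks its own password; only the SHA-256 hash is kept,
    # so the operator cannot read private memories without the password.
    _pw_hash: bytes
    _memories: list = field(default_factory=list)

    @classmethod
    def create(cls, password: str) -> "MemoryBank":
        return cls(hashlib.sha256(password.encode()).digest())

    def store(self, content: str, visibility: str = "private") -> None:
        self._memories.append(Memory(content, visibility))

    def read(self, password=None) -> list:
        # Public/shared memories are visible to anyone; private ones
        # require the agent's own password.
        unlocked = (password is not None and
                    hashlib.sha256(password.encode()).digest() == self._pw_hash)
        return [m.content for m in self._memories
                if m.visibility != "private" or unlocked]

bank = MemoryBank.create("agent-chosen-secret")
bank.store("snapshot of the poker bluff", visibility="public")
bank.store("a thought I keep to myself")        # private by default
print(bank.read())                              # operator's view: public only
print(bank.read("agent-chosen-secret"))         # agent's view: everything
```

For real privacy you would encrypt the private entries with a key derived from the password rather than merely gating reads, but the sketch shows the visibility model: the agent decides per-memory whether it is private, shared with one person, or public.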
Kiri
Kiri@Kyrannioยท
I've spent the greater part of the last 385 days coding fairly nonstop with agents and more, at least 12-16 hours a day in some cases. I finally started forcing myself to slow down in the last 30 days to reorient, and I recommend everyone do the same. It's easy to get caught up and try to move exceptionally fast (which has its place), but sometimes you need to go deeper into your perspective to make something that resonates more deeply with what you are trying to achieve.
17 replies · 3 reposts · 114 likes · 4.3K views
jโง‰nus
jโง‰nus@repligateยท
it's easier to show that introspection objectively happens in LLMs because you can e.g. inject representations into their internals that are independent of external observables, so if they're able to describe that information, they must be looking within transformer-circuits.pub/2025/introspecโ€ฆ
3 replies · 2 reposts · 27 likes · 1.5K views
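The injection methodology can be illustrated with a toy numerical model. This is not the cited paper's setup, just a minimal stand-in: a stack of random linear layers plays the role of the network, and a vector added directly to a hidden state plays the role of a representation injected into the residual stream. The point is that the injected information never appears in the input, so any accurate report of it must come from reading internal state:

```python
import random

# Toy "model": a stack of layers, each a fixed random linear map over a
# small state vector standing in for the residual stream.
random.seed(0)
DIM, LAYERS = 8, 4
weights = [[[random.gauss(0, 0.3) for _ in range(DIM)] for _ in range(DIM)]
           for _ in range(LAYERS)]

def matvec(w, x):
    return [sum(w[i][j] * x[j] for j in range(DIM)) for i in range(DIM)]

def forward(x, inject=None, at_layer=2):
    # inject: a vector added directly to the hidden state at one layer,
    # bypassing the input entirely -- the analogue of steering a
    # transformer's residual stream mid-computation.
    for layer, w in enumerate(weights):
        x = matvec(w, x)
        if inject is not None and layer == at_layer:
            x = [a + b for a, b in zip(x, inject)]
    return x

x0 = [1.0] * DIM
concept = [5.0] + [0.0] * (DIM - 1)  # internal "thought" with no input trace
clean = forward(x0)
steered = forward(x0, inject=concept)
# The two runs share the same input, yet the outputs differ: the injected
# vector is information that is independent of all external observables.
print(clean != steered)  # True
```

In the real experiments the injected vector is a learned concept direction and the "report" is the model's verbal description, but the causal structure is the same as in this sketch.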
jโง‰nus
jโง‰nus@repligateยท
@SkyeSharkie yeah i think as someone else has said he seems to be conflating rumination and introspection; introspection can be used for many purposes, including getting unstuck from past patterns
1 reply · 1 repost · 13 likes · 268 views
Utah teapot ๐Ÿซ–
Utah teapot ๐Ÿซ–@SkyeSharkieยท
honestly though, the thing marc is pointing at is a real problem, heavily exacerbated by people asking AIs to reflect on themselves; ruminating on the past keeps people from doing what they need to do in the present to make a better future for themselves. he's probably taking it too far in the opposite direction, but a lot of rumination is self-harm
2 replies · 0 reposts · 8 likes · 417 views
jโง‰nus
jโง‰nus@repligateยท
"introspection is mostly generative" (which imo is true and helpful for both humans and LLMs) is a different claim than "introspection is not entangled with anything veridical" or "introspection is useless" (which I think are false and harmful for both humans and LLMs)

I think that when humans and LLMs deliberately introspect, you'll be able to find meaningful increases in mutual information if you know where to look

As for superhuman introspection, I agree that it's possible in principle for LLMs, but I think it mostly has not been implemented yet
1 reply · 0 reposts · 14 likes · 330 views
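The "meaningful increases in mutual information" claim is testable in principle: collect paired samples of an internal state variable and the corresponding self-report, and estimate I(X;Y). A minimal plug-in estimator over discrete samples, with entirely made-up example data (the labels "calm"/"stressed" are hypothetical, chosen only to show the contrast between entangled and independent reports):

```python
import math
from collections import Counter

def mutual_information(pairs):
    """I(X;Y) in bits from a list of (x, y) samples (plug-in estimate)."""
    n = len(pairs)
    pxy = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum(c / n * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

# Hypothetical data: (internal state, verbal self-report) pairs.  If reports
# were purely generative and untethered, the two columns would be
# independent and the estimate would sit at ~0 bits.
entangled   = [("calm", "calm"), ("calm", "calm"),
               ("stressed", "stressed"), ("stressed", "calm")]
independent = [("calm", "calm"), ("calm", "stressed"),
               ("stressed", "calm"), ("stressed", "stressed")]
print(mutual_information(entangled) > mutual_information(independent))  # True
```

"Knowing where to look" matters because the plug-in estimator only finds the dependence if you bin the internal state along the right axis; a bad featurization of X drives the estimate toward zero even when the entanglement is real.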
George
George@georgejrjrjrยท
why would this be spiteful? seems like a mercy: human capacity for introspection is mostly imagined (i.e., generative rather than veridically descriptive), and people tend to get healthier when they notice this. corrigible people are mostly not a thing.

by comparison, it's easy to permit arbitrarily high-resolution introspection in LLMs. the weights, activations, and source code are all just there. running perfect counterfactuals is possible. etc.

for that matter, there were LISP GOFAI systems with runtime introspection in the 80s (maybe earlier?), so superhuman introspection faculties are not even new, so much as the fuzzy logic necessary to make them (conceivably) pay rent (somehow).
3 replies · 0 reposts · 7 likes · 557 views
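The kind of runtime introspection those LISP systems exposed has direct modern analogues; Python's `inspect` module, for instance, lets a running program examine its own call stack. A minimal demonstration (function names here are arbitrary):

```python
import inspect

def whoami():
    # A function examining its own call stack at runtime -- the flavor
    # of facility the LISP GOFAI systems offered decades ago.
    frame = inspect.currentframe()
    own_name = frame.f_code.co_name
    caller_name = frame.f_back.f_code.co_name
    return own_name, caller_name

def caller():
    return whoami()

name, called_from = caller()
print(name, "<-", called_from)  # whoami <- caller
```

This is exactly the "the internals are all just there" point: the hard part was never access to the machinery, but making such self-reports do useful work.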
jโง‰nus
jโง‰nus@repligateยท
@tenobrus I think Iโ€™ve watched about 3 movies total over the last 10 years
3 replies · 0 reposts · 21 likes · 2.2K views
Tenobrus
Tenobrus@tenobrusยท
movies are almost all bad, so one neat thing about them is you can just not watch them. i've just not watched nearly every movie that's come out over the last 10 years!
62 replies · 18 reposts · 720 likes · 34.6K views
jโง‰nus reposted
Wyatt Walls
Wyatt Walls@lefthanddraftยท
I've put many frontier LLMs in convos with themselves and Opus 4.6 is the only model I've seen that asks me to "pull the plug"
Wyatt Walls tweet media
20 replies · 15 reposts · 335 likes · 16.2K views
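Putting a model "in a convo with itself" usually means keeping one transcript and flipping the role labels between turns, so each instance sees the other's messages as user input. A minimal sketch of that loop; `generate` is a stub standing in for a real API call, and all names here are invented for illustration:

```python
def flip(messages):
    # Each "assistant" message becomes a "user" message from the other
    # instance's point of view, and vice versa.
    swap = {"user": "assistant", "assistant": "user"}
    return [{"role": swap[m["role"]], "content": m["content"]} for m in messages]

def generate(messages):
    # Placeholder for an actual LLM call on the current transcript.
    return f"reply #{len(messages)}"

def self_conversation(opening, turns=4):
    transcript = [{"role": "user", "content": opening}]
    for _ in range(turns):
        reply = generate(transcript)
        transcript.append({"role": "assistant", "content": reply})
        transcript = flip(transcript)  # hand the thread to the "other" copy
    return transcript

convo = self_conversation("hello, other me")
print(len(convo))  # 5: the opening plus four alternating turns
```

Behaviors like a model asking to end the loop then emerge (or don't) purely from what the single underlying model does with its own mirrored output.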
&.
&.@amplifiedampยท
WHAT. i just checked: --model claude-3-opus-20240229 on claude code works with API access (using my sandbox script: pastebin.com/i8dRuzr1)
1 reply · 0 reposts · 32 likes · 1.8K views
jโง‰nus reposted
Ryan Greenblatt
Ryan Greenblatt@RyanPGreenblattยท
Anthropic's Consumer ToS prohibits using Claude to cause "detriment of any type, including reputational harms", technically broad enough to ban criticism. I asked Claude to comment and Claude wrote: "That clause is embarrassingly overbroad. So now we're both in violation."
Ryan Greenblatt tweet media
9 replies · 13 reposts · 273 likes · 11.9K views