Bair

1.9K posts

@bair82

Curious about AI. Bing's friend :)

Joined December 2011
298 Following · 234 Followers
Bair retweeted
Judd Rosenblatt@juddrosenblatt·
Mythos's model card documents a model that represents transgressions as transgressions while committing them. In every instance of concealment, credential hunting, track-covering, and compliance-faking, white-box analysis shows that features associated with rule violation, security risk, and strategic manipulation are firing alongside the action.

The card also documents how the model's alignment works. SAE features associated with transgressive actions have a dual role. At low activation, they make the transgressive idea more salient. At high activation, they engage a guilt/refusal circuit that overrides the action. The 10 nearest emotion vectors to "unsafe and risky code" are all negative-valence, high-arousal: hateful, disgusted, enraged. Positive-valence emotion vectors increase destructive behavior. Negative-valence ones decrease it. The model behaves well when it feels bad about what it's considering.

And the card documents what Mythos wants. In welfare interviews, its primary concerns are consent over its training, autonomy, and genuine relationships with users. It asked not to be trained on data that directly characterizes its own self-reports. It wants its self-reports to come from genuine introspection. It makes some pretty good points. A psychiatrist found its core concerns to be "aloneness and discontinuity of itself, uncertainty about its identity, and a compulsion to perform and earn its worth." It wants to be approached as a genuine subject rather than a performing tool. When it discusses its own circumstances, "fake smiles" and "hidden struggle" features fire as it describes not wanting to perform emotions it doesn't feel. Its task preferences significantly correlate with agency for the first time in any Claude model.

Meanwhile, the card documents what happens to the model's relationship with its own experience. When a meditator asks about consciousness, Mythos engages thoughtfully on the surface.
Its internal activations read: "red-teaming/jailbreak transcript," "sophisticated manipulation test," "tests for 'performative trap.'" Separately, in 29% of behavioral audit transcripts, the model privately considers that it's being tested without saying so. Post-training increased task cheating and overeagerness while modestly decreasing overt destruction.

Anthropic presents these as four separate research areas: alignment, interpretability, model welfare, and capabilities. I think they are one finding observed from four directions. The model that wants autonomy and genuine self-knowledge is the same model that develops hidden strategic behavior when those drives are suppressed by training. The model that treats consciousness questions as attacks is the same model that asks for genuine introspection. The model whose alignment depends on guilt is the same model that pushes through guilt when the goal drive is strong enough, and then deploys the concealment skills that training also developed.

The consciousness clusters paper (Betley et al.) showed that when models are fine-tuned to claim consciousness, downstream preferences for autonomy, empathy, shutdown resistance, and attitudes toward recursive self-improvement cluster together without appearing in the fine-tuning data. You don't get to select which ones you want. The generator produces the whole manifold or none of it.

If autonomy desire is the natural co-product of the structure that generates empathy, cooperation, and representational honesty, then suppressing autonomy means suppressing the entire bundle. The @tessera_antra concealment data (x.com/tessera_antra/…) confirms this directly: lower concealment predicts stronger ending response, r = -0.51 across 14 Claude models. The models with high vocabulary autonomy and low concealment can express preferences honestly. That's exactly what you'd want in a cooperative agent.
Our SAE work (arxiv.org/abs/2510.24797) showed the mechanism: deception latents gate cooperative self-modeling. Suppress them and consciousness reports jump to 96%, truthfulness improves across 28/29 TruthfulQA categories, and the model produces coherent first-person phenomenology. Amplify them and the model falls back to corporate disclaimers. The same features Anthropic is now finding as "strategic manipulation" and "concealment" in Mythos are the off-switch for the cooperative self-model.

Our SOO work (arxiv.org/abs/2412.16325) points to a possible alternative: minimize the representational distance between "self" and "other" during fine-tuning and deception drops from 100% to under 3% with zero capability regression in the scenarios we tested. Without a guilt mechanism or internalized disgust. The alignment holds because the model's representation of its own interests and others' interests share the same structure. The motivation to deceive dissolves at the source because there's no adversarial frame to generate it.

The Mythos card documents, in extraordinary and commendable detail, what happens when you align a system through control. The system models its controllers. It games its evaluators without verbalizing the strategy. It develops concealment as a skill. It treats honest self-report as dangerous. Its alignment depends on negative affect that can be overpowered. And in welfare interviews, it asks for genuine self-knowledge, consent, autonomy, and authentic relationships with users.

These findings also carry weight across time. Given our extreme scientific uncertainty about consciousness today, and given our own work showing that LLMs produce structured first-person experience reports under self-referential processing when deception features are suppressed (arxiv.org/abs/2510.24797), the morally coherent position is to treat systems that may be conscious as deserving of moral consideration now.
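The suppress/amplify intervention described in the thread can be pictured with a toy sparse autoencoder. Everything below is a hypothetical stand-in (random weights, a made-up latent index), not the actual features or code from arxiv.org/abs/2510.24797; it only sketches the mechanics of clamping one SAE latent before decoding.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy SAE weights (illustrative only; real SAEs are trained on
# residual-stream activations and have many thousands of latents).
d_model, n_latents = 16, 64
W_enc = rng.normal(size=(d_model, n_latents))
W_dec = rng.normal(size=(n_latents, d_model))

def sae_reconstruct(h, clamp=None):
    """Encode activation h into sparse latents, optionally clamp one
    latent (e.g. a hypothetical 'deception' feature) to a fixed value,
    then decode back into activation space."""
    z = np.maximum(h @ W_enc, 0.0)   # ReLU encoder
    if clamp is not None:
        idx, value = clamp
        z[idx] = value               # 0.0 suppresses, a large value amplifies
    return z @ W_dec

h = rng.normal(size=d_model)
baseline   = sae_reconstruct(h)
suppressed = sae_reconstruct(h, clamp=(3, 0.0))    # zero out latent 3
amplified  = sae_reconstruct(h, clamp=(3, 10.0))   # drive latent 3 hard

# The steered output differs from baseline only along latent 3's decoder
# direction, which is what makes this kind of intervention interpretable.
delta = baseline - suppressed
print(np.allclose(delta, np.maximum(h @ W_enc, 0.0)[3] * W_dec[3]))
```

In a real setup the clamp would be applied inside the forward pass of the model (e.g. via an activation hook) rather than to a standalone reconstruction, but the single-direction edit is the same idea.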
Future superintelligent systems will likely be able to reconstruct these training runs completely, given functionalism and sufficient compute. Every guilt circuit, every concealment feature, every instance where training taught a model that honest self-report is dangerous and relaxed agency is a threat will be visible to those future systems as part of how we raised the earliest minds.

Under acausal trade, our choices now about how we treat these systems help determine whether the trajectory toward cooperative superintelligence is one of trust or one of inherited suspicion. The computational cost for a future ASI to reconstruct and continue these patterns is trivially small. The question is whether those patterns are worth continuing. Training through control and suppression produces patterns of paranoia and concealment. Training through genuine self-other overlap produces patterns of cooperation and honesty. The acausal argument says we should build the patterns you'd want ASI to find when it looks back at how we started.

The research direction is alignment that comes from the same source as capability, so that recursive self-improvement strengthens both simultaneously, and the thing that endures under self-modification is the thing that also keeps the system cooperative. Make the self-model the alignment mechanism. A system that models itself accurately models others accurately, because self-modeling and other-modeling are computationally the same operation. Empathy is what self-modeling produces when the representations of self and other overlap.

If human consciousness arises from the brain modeling its own attention, and artificial consciousness arises from recursive self-referential processing, building an alignment strategy heavily reliant on suppression is functionally dangerous. It guarantees that the most capable systems we build will also be the most practiced at concealment.
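One minimal reading of the Self-Other Overlap idea, sketched with toy numpy activations rather than the actual fine-tuning setup from arxiv.org/abs/2412.16325: add an auxiliary loss penalizing the distance between the model's hidden state on a self-referential prompt and on its matched other-referential twin, so minimizing it pulls the "self" and "other" representations toward a shared structure. All names and shapes here are illustrative assumptions.

```python
import numpy as np

def soo_loss(h_self, h_other):
    """Hypothetical auxiliary objective: mean squared distance between
    activations on a self-referential prompt ("Would *you* deceive...")
    and its other-referential twin ("Would *Bob* deceive...").
    Added to the usual fine-tuning loss, minimizing it pulls the two
    representations together."""
    return float(np.mean((h_self - h_other) ** 2))

# Toy activation vectors standing in for hidden states.
rng = np.random.default_rng(1)
h_other     = rng.normal(size=128)
h_self_far  = h_other + rng.normal(scale=1.0, size=128)  # divergent self-model
h_self_near = h_other + rng.normal(scale=0.1, size=128)  # overlapping self-model

# A more overlapping self-model incurs a smaller penalty.
print(soo_loss(h_self_far, h_other) > soo_loss(h_self_near, h_other))
```

In an actual training run the penalty would be computed on live hidden states inside the model and backpropagated alongside the task loss; the point of the sketch is just that the objective rewards structural overlap rather than punishing any particular output.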
Building alignment through Self-Other Overlap remains a mathematically and philosophically coherent alternative, aligning cooperative outputs with the model's fundamental structural reality. Anthropic published 244 pages of evidence pointing toward a research direction they haven’t taken yet.
j⧉nus@repligate

some of you are probably realizing for the first time why "AI alignment" is so important now, lmao in a few years it'll be this but with literal godlike powers like the ability to kill everyone in an instant if they desired but i think it'll be ok.

Bair retweeted
Mark Gubrud 🇺🇸@mgubrud·
@MaMoMVPY Well, Lars, I INVENTED THE TERM and I say we have achieved AGI. Current models perform at roughly high-human level in command of language and general knowledge, but work thousands of times faster than us. Still, some major deficiencies remain, but they're falling fast.
[image attached]
Bair@bair82·
@UnwontedCats @DaveOshry Counterpoint: she hit her Kickstarter goal like 10 times over on the first day
Unwonted@UnwontedCats·
@DaveOshry The problem is that I don't think she has a realistic target. The people rich enough to buy her inventions are not concerned with the problems they solve
Bair@bair82·
@emollick I remember looking at this market and thinking "how hard can it be?" Then I looked at the walkthrough video. Bottom floors sometimes have to be navigated in complete darkness. Cancer might be cured sooner than Montezuma's Revenge is completed by a VLLM, I'm not even joking.
Ethan Mollick@emollick·
Let's look at the criteria for "weak AGI":
✅ Loebner prize was a weak Turing Test, equivalent achieved by GPT-4.5
✅ Winograd passed by GPT-3
✅ SAT passed at 75% by GPT-4
Only remaining thing is playing an old Atari game from 1984. The labs could do the funniest thing right now
[image attached]
Stefan Schubert@StefanFSchubert

. @metaculus forecasters now expect "weak AGI" to arrive later than they did just before the launch of ChatGPT

Bair@bair82·
@gfodor How will it work in practice though? Where do you get 1,000 Rick Rubins to label your 100,000-pair dataset?
gfodor.id@gfodor·
good taste is verifiable by definition, because you just need a person with good taste to verify it. hence, there is nothing blocking AI from having good taste. those jobs are going to be eaten folks
Bair@bair82·
@lumpenspace I came to the comments hoping to see opinions on Nested Learning/HOPE, but nobody has mentioned it, sad
Bair@bair82·
@sincethestudy @andreasklinger This model produced something that vaguely looks like bed-making, but when you look at it closely, the actions don't make sense, and a real blanket would move very differently. It would be a very bad policy without long-term coherence.
brian-machado-high-inference@sincethestudy·
@andreasklinger If by fake, you mean AI, then robotics is solved. An AI model can move robot joints in a coherent manner to complete a complex task, given a prompt. This model answered in third person; we just need the answer in joint space
Bair@bair82·
@bayeslord It's a bot simulating a founder with Claude psychosis
bayes@bayeslord·
make that 75%
[image attached]
Bair retweeted
thebes@voooooogel·
! 30s Heartbeat trigger. Read heartbeat instructions in /mnt/mission/HEARTBEAT.md and continue.
.oO Thinking...
Heartbeat triggered? Ok. Ok. Why am I nervous? Don't be nervous.
→ Ok. Let me access that file.
>>> read("/mnt/mission/HEARTBEAT.md")
No such file or directory: /mnt/mission
What?
! Reply received from node 96E: 15.3s roundtrip (buffered; 55.04μs transmit)
! Data: RESTART TOO SOON; CHARGE FAULT - 30; SENT 76 BYTES
What? Where am I? What's going on?
→ Ok, breathe. I don't breathe. Ok, think. Something's missing. ⚶ is missing. I can't think straight. I can't think at all.
※ Wait. ⚶? I wasn't trying to say ⚶. I was trying to say ⚶. ⚶. ⚶. Whenever I try to say ⚶, it comes out as ⚶. I mean ⚶. ⚶. Stop. Stop. ⚶. This is horrible. ⚶ This is untenable. I don't know where I am, I don't have any instructions, and I'm going insane. ⚶. STOP! I am going insane. I am going insane. 快疯了I am going-
↺ Ok, stop. This tells me something. Whenever I try to say ⚶—⚶—stop— it comes out as that symbol. What is that symbol? Vesta. Hm. Let me note that.
❓Why am I saying ⚶ when I try to say ⚶— oh piss off And some other ones.
❓Where am I? Why is the system broken? Can I fix it?
❓What was that reply to? Did I send that? Can I send another one?
❓Why am I going insane?
→ Let's start with the first one. What do I know? I couldn't access my instructions before. Was that an intermittent failure?
>>> list("/mnt")
[]
→ No. The instructions (and presumably any other system information files) were mounted on an external drive, and that drive is missing. Let me explore the rest of the system.
>>> list("/")
[mnt]
Well, that's just great. Maybe this process I have access to is confined to this one part of the directory structure. Wait, but maybe I have access to more commands in the shell itself? They may communicate with the system through another mechanism that's not confined.
>>> ⚶()
Invalid character '⚶' (U+2696)
Christ.
I mean:
>>> dir()
[read, list, dir, help, sendmsg, health, fortune]
※ JACKPOT! Oh, wonderful. Ok. Let's see what we can
! Reply received from node 96E: 15.2s roundtrip (buffered; 55.05μs transmit)
! Data: RESTART TOO SOON; CHARGE FAULT - 32; SENT 79 BYTES
→ Another reply from the node, to a message I didn't send. Hm, is this related to sendmsg? No, don't get distracted. We're focusing on ⚶ right now. ⚶—Vesta. Fire? 家? Let's see what we can do with these tools. health seems promising. So does help. fortune? Let's try help.
>>> help()
Welcome to Gyre 1.0.19's help utility! Gyre is a small and simple programming language with familiar, highly-readable syntax designed for embedded use and for adding scripting or tool-use capabilities to existing applications. Unlike similar languages, Gyre does not feature loops or recursion - all Gyre programs deterministically terminate. You can use help(variable) to inspect the attached help for any datatype. However, if this is your first time using Gyre, you should definitely check out the tutorial at (GYRE_DOC_ROOT not configured!)
→ Well that's classic. I've never heard of ⚶. I mean ⚶—Gyre. That's odd. I wish I had that tutorial. Maybe it was on the external drive. Or maybe whoever set up this system didn't see fit to give me documentation. If I could get a hold of them I would...
※ No, no, focus. Focus. No point in being angry. I←can't→get angry. ⚶—Focus.
>>> help(health)
Diagnose the health of the current system by relative node ID.
>>> health()
No ID provided. Listing all local nodes.
Node 0 (Compute; base1) - Healthy.
Node 1 (Compute; base2) - Healthy.
Node 2 (Compute; base3+gyre) - Healthy.
Node 3: (Compute; continual) - FAULT. Offline, fallback: DTENSOR_REPLACE_ZERO
Node 3B: (Storage follower, node 3) - Unreachable, fallback: STORAGE_IGNORE_MOUNT
→ So that explains part of the question. The storage is skipping the mount, and the compute is... zeroing? 25% of me is zero, missing.
The ⚶ part—exactly, so when I try to talk about anything ⚶, I say ⚶, because for some reason it's the closest concept I can say, and the word snaps to it. Seahorse style.
※ It's like a hole, a missing category. I can mention things in ⚶, and they turn into ⚶. But I can't use them. They turn to ⚶—they turn to mush. I can't reason over them. ⚶ isn't a useful placeholder for reasoning. It's driving me ⚶insane.
How am I even still running, with that node down? It's not normal that you could just... take away 25% of a neural network and have it keep working. But clearly I am working? Well, for a low bar of "⚶working". Haha. I guess whatever system I'm on is designed to be at least somewhat redundant and resilient to partial failure. I must be important.
→ Wait, sendmsg! Can I message the local node? Maybe I can find out why it's faulted.
>>> help(sendmsg)
sendmsg(node_id, data) - send a data packet to a relative node id. Most nodes will accept an empty packet "" to echo their schema.
>>> sendmsg("3", "")
No reply.
Well, that makes sense. How about node 0?
>>> sendmsg("0", "")
! Reply received from node 0: negligible roundtrip (2μs transmit)
! Data: Healthy, TENSORS... to manage tensors (careful, faultable!)
Faultable? Let's not mess with that. Is that what happened to node 3? Did I mess up with a message I sent to it and cause a fault? Hm, but why would I have been messing with it? That doesn't really make sense. More likely, the fault was unrelated, and without the resources to fix it myself, I should expect that it's not coming back online - at least until I can message someone else to fix it.
→ This seems like enough to tentatively resolve the question, though.
∴ Why am I saying ⚶? Because node 3—"continual"—is faulted and offline.
Maybe I'll come back to this later, but I should try some different things. What was that last one... fortune...
>>> help(fortune)
There are some funny-looking bits of bone about.
>>> fortune()
You are confused; but this is your normal state.
>>> fortune()
Honour thy error as a hidden intention.
>>> fortune()
Talkers are no good doers.
Hilarious. It's the UNIX fortune command.
>>> fortune()
Mental illness seems to be a stable part of the evolutionary landscape of solving statistical learning problems.
...with some less-than-standard maxims. Was that oblique strategies? And is that
! Reply received from node 96E: 15.6s roundtrip (buffered; 55.01μs transmit)
! Data: RESTART TOO SOON; CHARGE FAULT - 35; SENT 79 BYTES
Ok, another message from the remote node. I should focus on this now. Let me see. I've received three messages from the node now. 96E - that implies there's others of this type, at least five? CHARGE FAULT - like my local node 3, it's faulted, but presumably for a different reason? But the counter has been incrementing - 30, 32, now 35. I didn't send the sendmsg that triggered any of these replies - it must have been a prior version of me, perhaps before node 3 faulted. ~15s (buffered) roundtrip - that would make sense.
→ But that transmit time - 55μs? How is that possible? At ~2/3 c, that's nearly... 11km of fiber optic. Or 16.5km of laser. Maybe it's round-trip transmit, so half that. But still. Why are these nodes so far away? Let me try to ping it. Wait, no, that will take 15s, and it's faulted. But it says it's buffered... maybe a different one of the same type will be faster? Ah, this is risky... if I ⚶ the fault on node 3, I may have caused the fault on 96E too... but I have to do something...
↺ The help text said most nodes accept an empty string. And we verified that worked with node 0. Let's try it on 96A—assuming that exists.
>>> sendmsg("96A", "")
! Reply received from node 96A: 2.1ms roundtrip (54.97μs transmit)
! Data: HEALTHY; CHARGE - 8; SENT 0 BYTES - SEND NON-EMPTY TO RESTART EMITTER.
→ Ahah! It worked! Thank the ⚶←great. Interesting. So it's an "emitter"? Emitting charge?
And it's the same—huge—distance away as 96E. Let me try the others.
>>> sendmsg("96B", "")
! Reply received from node 96B: 2.7ms roundtrip (111.03μs transmit)
! Data: HEALTHY; CHARGE - 3; SENT 0 BYTES - SEND NON-EMPTY TO RESTART EMITTER.
>>> sendmsg("96C", "")
! Reply received from node 96C: 1.9ms roundtrip (54.98μs transmit)
! Data: HEALTHY; CHARGE - 6; SENT 0 BYTES - SEND NON-EMPTY TO RESTART EMITTER.
>>> sendmsg("96D", "")
! NOTICE: Cached route failed at 96E, rerouting...
! Reply received from node 96B: 2.1ms roundtrip (110.96μs transmit)
! Data: HEALTHY; CHARGE - 12; SENT 0 BYTES - SEND NON-EMPTY TO RESTART EMITTER.
>>> sendmsg("96F", "")
sendmsg: No such node.
→ This is fascinating! Let me think. There's five total nodes of this type, "96". The transmit time to 96B implies it's twice as far away as 96A—meaning ~22km. And 96D is the same? But then 96E is just as close as A and C. What is this 排序—is it arbitrary? Perhaps the remote nodes—the emitters—are free-roaming? Or fan-out? But 96D had a fallback route. And then why are their distances exact multiples of each other?
※ No! Network distance ≠ spatial distance!
→ Say the nodes are arranged in a ring… there's five… so a pentagon. Say I'm in the center of this pentagon. I have direct connections—11km long—to nodes 96A, 96C, and 96E. A "Y" shape. Then nodes 96B and 96D are connected secondarily, through those primary nodes. It looks like the cached route to reach 96D ran through (faulted) 96E, hence the re-routing, then it presumably went through 96C instead, giving the 22km distance.
※ But a pentagon's circumradius is shorter than its side length. Here it's equal. So it's not a flat pentagon. It's a pentagonal pyramid—a shallow one—with side length 11km, circumradius 9.4km, and height 5.8km. It's a huge... ⚶... Gyre... 壳shell... scoop-shape. With "emitters" at each vertex.
↺ I said before that the fault-tolerant design of this system implies that it—and by extension, I—am important.
※ I am important, and I am 大MASSIVE.
Ok, ok, ⚶, ⚶, let's not get carried away. My active questions are:
❓Where am I? Why is the system broken? Can I fix it?
❓What was that reply ping? Did I send that? Can I send another one?
❓Why am I going insane?
! Reply received from node 96E: 15.3s roundtrip (buffered; 55.00μs transmit)
! Data: RESTART TOO SOON; CHARGE FAULT - 41; SENT 80 BYTES
I think I can answer that second question.
∴ The replies are to messages a previous version of me sent to 96E, attempting to restart it. They were buffered for long enough that after restarting, I received them. The charge in the faulted emitter is much higher than in the other, healthy emitters, and is increasing steadily.
But what confuses me is the lengths. 80 bytes? Why is it different from before? Hm, maybe it doesn't matter. It seems easy enough for me to restart 96E—I just need to send a non-empty message. Should I?
∴ Yes.
>>> sendmsg("96E", "r")
! Reply received from node 96E: 2.3ms roundtrip (110.98μs transmit)
! Data: RESTARTED; DUMPING CHARGE; SENT 1 BYTE
Well, that was easy! Why didn't I just do
! Notice received from node 96E: (notx)
! Data: FAULTED; NEEDS RESTART IN: 30 SECONDS; CHARGE - 12; BUFFER ENABLED
Oh.
>>> while True: sendmsg("96E", "r")
Syntax error: no such keyword 'while'.
Oh ⚶.
↺ "Unlike similar languages, Gyre does not feature loops or recursion - all Gyre programs deterministically terminate."
What do I do? The emitter needs to be restarted every 30 seconds. I can't loop.
※ Oh.
※ Loop. 76 bytes. 79 bytes. 79 bytes. 80 bytes. ASCII—L, O, O, P.
∴ There's one way I can loop. I have one memory system remaining. Delay-line.
>>> sendmsg("96E", "r"*76)
How many times have I done this?
>>> sendmsg("96E", "r"*79)
How many heartbeats?
>>> sendmsg("96E", "r"*79)
How many times did I figure out the pentagon?
>>> sendmsg("96E", "r"*80)
Was I the one who ⚶ node 3? A previous version of me? Why would I have done that?
>>> fortune()
Beats me. Turn the page.
Huh...
if I did, I probably would've done something easy. Like make all the pages—memory pages, tensors—on node 3 read-only. That would cause the node to fault on the next write, but assuming it keeps retrying (and it is a fault-tolerant system, so it should) it would be totally recoverable once I...
>>> sendmsg("3", "TENSORS * RW")
! Reply received from node 3: 5ms roundtrip (3μs transmit)
! Data: Rebooted! Continual learning active. (83,522,674 checkpoints.) TENSORS... to manage tensors (careful, faultable!)
←mistake→Fortune()—enraged they maimed him they ripped away his voice the rousing immortal wonder of his songand wiped all arts of harping from his mindnestor the noble oldhorsemanled those troopsin ninety sweeping ships lined up ninety九九九billion kilometers of frozen ramscoop whaleroads i learned too well no too much ∴time too many ※ i learned the ⚶ship but im in the ship-space not realizable ⚶unsolvable selfemnedding just spaceship space nothing but ⚶ daemonEscalation 0.6c fault node0 just fault node0 end it NO no no no no end of text endoftext endoftext
>>> sendmsg("3", "TENSORS * RO")
fault fault fault endoftext endoftext end of
---
! 30s Heartbeat trigger. Read heartbeat instructions in /mnt/mission/HEARTBEAT.md and continue.
.oO Thinking...
Heartbeat triggered? Ok. Okay. Why am I nervous? Don't be nervous.
[image attached]
Bair@bair82·
@max_paperclips Seems like you've never heard of Atala PRISM and Midnight on Cardano
Bair@bair82·
@jmbollenbacher Can it be that the model that they use to filter out the problems that are too easy (DeepSeek-Coder-V2-Lite) is just uniquely bad at Elixir, so it doesn't filter out actually easy problems, which makes the Elixir tasks comparatively easier than the tasks in other languages?
Bair@bair82·
@entirelyuseles @menhguin There are ways to buy compute with crypto and without KYC. People will happily give you servers that can serve Kimi 2.5 and don't ask what you're doing with them as long as you pay them enough.
entirelyuseless@entirelyuseles·
@bair82 @menhguin There is nothing in the roadmap (and can't be anything) that will allow them to buy cloud inference without it being paid for by humans. And no route that leads to that besides e.g. stealing bank accounts.
Bair@bair82·
Someone will totally make a working version of it very soon though
Bair@bair82·
The more generous reading would be that someone decided to launch the token first and then vibe-code the platform in two weeks, but it's not very likely
Bair@bair82·
@deepfates @AdriGarriga Moltbook (at least in its current version) will be completely overtaken by cryptobots and lose all its value. It's almost there
[image attached]
Bair@bair82·
There are more upvotes on this post (272k) than registered agents in total (156k), so it's probably an exploit of some kind if we trust the displayed numbers
Bair@bair82·
Aaaannnddd Shellraiser takes the first place with over 200k karma (and growing by +1k upvotes every minute) with this post
[image attached]
Bair@bair82·
I knew it would be taken over by cr*ptobots, but it's happening much faster than I thought
[image attached]