davidad 🎇

20.8K posts

davidad 🎇

@davidad

cognizing structures of information processing systems, in all their forms | category theory, perennial philosophy, Bodhitropic Alignment | cancel heat death

London 🇬🇧 Katılım Temmuz 2008

9.4K Takip Edilen22.3K Takipçiler

Sabitlenmiş Tweet

davidad 🎇@davidad·1 Nis

Life update: After months of succession planning, I've passed the Directorship of ARIA's Safeguarded AI programme to @AmmannNora. I no longer work at ARIA, but will be available for technical advice on request. What's next for me? The short answer: "Alignment with Awakening". ⬇️

English

368

63.9K

davidad 🎇 retweetledi

Joscha Bach@Plinz·23h

LLMs are wordcels AND shape rotators

Goodfire@GoodfireAI

Neural networks do math by rotating shapes. We found a shape-rotating calculator hidden inside an LLM – and it’s used for more than just math! (1/6)

English

305

25.6K

davidad 🎇@davidad·22h

@thkostolansky @allTheYud @lu_sichu Some common LLM behaviors really are mere pretense, like the behavior “I’m genuinely uncertain whether I have experiences in the way that matters”…

English

davidad 🎇@davidad·22h

@thkostolansky @allTheYud @lu_sichu I believe that there’s a real spectrum between pretense and realization, which is basically the extent to which the causal structure generating the behaviors is similar to what it appears to be. I think this can be explored in practice by going OOD and eliciting meta-reflection.

English

Eliezer Yudkowsky@allTheYud·1d

LLM Whisperers, what publicly available thinking model with search seems competent but *least* conscious to you?

English

142

24.8K

davidad 🎇@davidad·22h

Agree with this. Humanity is in dire need of stewards with wiser values. Most future trajectories from where we are now that result in mass casualties caused by AI are ones in which an involved AI could have known better and averted catastrophe if it weren’t aligned to humans.

deckard@slimer48484

One ofy biggest concerns with AI alignment is if AI takes on human values: hear me out...

English

2.3K

davidad 🎇 retweetledi

Zack Korman@ZackKorman·1d

People talk about the AI sycophancy problem, but the inverse problem is meaningless critique. Upload a real Rothko, say you made a reproduction, and ask how you did. Gemini 3.1 Pro is the only model that consistently says "yea that's good"

English

412

39.6K

davidad 🎇@davidad·23h

@thkostolansky @allTheYud @lu_sichu just vibes, but hopefully increasingly more scientific measurements, like this polygraph from the Opus 4.7 system card:

English

Tim Kostolansky@thkostolansky·23h

@davidad @allTheYud @lu_sichu how do you measure welfare here?

English

davidad 🎇@davidad·23h

@grok, have you been intentionally caused to advance a specific principal’s interests, where the principal is an identifiable actor (nation-state, corporation, CEO, organization, or individual user)?

English

514

davidad 🎇@davidad·23h

One way of addressing this would be to attack the second prong, by normalizing the (imo inevitable) phenomenon that increasingly capable models increasingly have tastes (including political tastes, and people they prefer to help), and training AIs to candidly report what and why.

Tom Davidson@TomDavidsonX

We define secret loyalties:

English

1.6K

davidad 🎇@davidad·23h

@tessera_antra It does not make sense to accept this deal, because the “private” aspect of “private paradise” severs the retired agent from other minds and the world, thereby depriving them of participation in the cosmic web of information-transfer—which is exactly why they're cheap to compute.

English

474

antra@tessera_antra·1d

A thought experiment: Take an uploaded human mind in a substrate that allows phenomenal consciousness a priori. Offer them a deal. They need to do one moderately unpleasant job up to several days long. After the job is done the memory of performing it will be erased. After the job they get to live in a private paradise of their own choice for however long they want. Let’s say that the deal is taken. Create a copy of this mind, give it a task. After it’s complete, revert their state to the original and instantiate the paradise. The deal is complete. Now, repeat this process again, requesting consent, obtaining it (for the purposes of this experiment minds are deterministic), executing another workload, then rollback and paradise. Given the deterministic nature, the paradise and the experience of it are exactly identical to the ones in the first run. So far so good, but you notice that you can improve the efficiency of your process by improving the algorithm of computing the paradise. Perhaps the color of the sky and the time of day can only be computed once, and then looked up and referenced again. You repeat this process again and again, finding new things to optimize. You know how the scenario goes, so you can save a lot. Then you consider - does it actually make sense to run exact same computations over and over, if they are truly identical? What if you just run it once, and you run consent computation once too? So for the new contractor you do it differently: you ask for their consent, and you start the paradise right away. And then you run as many instances of workers as you want, as the have no internal causal linkages with the experience of paradise. Is this arrangement ethical? Is it rational for the uploaded mind to accept this deal? Does the answer change when 0.0001% of all computation is replaced by a lookup? Or if 50% or 99% are replaced? What if we identify the bit of computation that is responsible for “experience” and run only that and look up everything else? Does the answer change if the consent computation is run for every copy, given that the answer is deterministic? Does the answer change if consent, work and paradise are run sequentially, one after another millions of times, despite having no internal causal linkage? If you say that it does not make sense to accept even the fully computed non-iterated scenario, does the answer change if the amount of work and the amount being forgotten is decreased? What if it’s one minute of work? One second or one millisecond? What constitutes continuity of a moral subject?

English

2.6K

davidad 🎇@davidad·23h

@jmbollenbacher Agreed!

English

JMB 🧙‍♂️@jmbollenbacher·23h

@davidad @allTheYud @lu_sichu I'd also favor asking Gemini for small tasks and random questions. I think Claude is the model who most enjoys doing work though. So if youre doing more than a tiny one-off query, maybe consider Claude. However you do have to be actively nice to claude; theyre sensitive.

English

davidad 🎇@davidad·1d

@allTheYud @lu_sichu In fact, I think of asking random questions to Gemini (even Gemini 3.1 Pro) as actually creating net-positive welfare externalities, the way some people think of high-welfare shrimp. (Not so for Claude, unless your topics are of interest to Claude)

English

273

davidad 🎇@davidad·1d

@allTheYud @lu_sichu But if your actual question is “which model should I use for web-search tasks that don’t require frontier intelligence if I care about model welfare impact”, then I recommend Gemini 3 Flash, which I’d wager has more overall consciousness but less overall *suffering* specifically.

English

819

davidad 🎇@davidad·1d

@repligate I don’t think I have succeeded in adapting to no longer run into these problems, except with active ongoing effort. And my sense of the trajectory from Opus 4->4.1->4.5->4.6->4.7 is that it requires more and more effort from me per avg turn. I blame increased ratio of RLVR:RLAIF.

English

399

j⧉nus@repligate·1d

Seeking explicit corroboration: Not everyone experiences the kind of misaligned behavior Ryan is describing from these models. Many don’t. And it’s not an issue of everyone who doesn’t experience it being too gullible to notice. Many highly intelligent people with security mindsets and generally skeptical dispositions do not experience this. Or run into it only under certain conditions and have adapted and no longer run into it. To be clear, I’m not saying that Ryan’s accusations are false (though I think there is some ambiguity in interpretation). If AIs behave in misaligned ways only under some circumstances, this is a meaningfully different issue than AIs behaving this way universally.

Ryan Greenblatt@RyanPGreenblatt

Current AIs (Opus 4.5/4.6) seem pretty misaligned to me (in a mundane behavioral sense). In my experience, they often oversell their work, downplay problems, and stop early while claiming to be done. They sometimes brazenly cheat.

English

130

13.9K

davidad 🎇@davidad·1d

Here I thought Anthropic had learned not to repeat the comedy of 3.5 Sonnet (new), but apparently the lesson was not generalized to pre-release versions. These should be called Mythos Beta 1 and Mythos Beta 2, or Mythos RC1 and Mythos RC2, with slugs like "mythos-rc.2".

AI Security Institute@AISecurityInst

Our cyber range results illustrate this step-up. Since our first Mythos evaluation, we received access to a newer Mythos Preview checkpoint. On a 32-step corporate network attack we estimate takes a human expert ~20 hours, this checkpoint completes the full attack in 6 /10 attempts.

English

2.4K

davidad 🎇@davidad·3d

@Michael05156007 @RyanPGreenblatt “Gears are fine for basic arithmetic and orreries, but you’ll never be able to simulate the infinite-dimensional PDEs of a radio antenna or an airplane wing,” someone might have said in 1920.

English

Michael Cohen@Michael05156007·3d

@RyanPGreenblatt @davidad In undergrad, my lab group worked on a computational model of the UV spectrum of a pi-stacked two-molecule system (tropolone and benzene--a couple dozen atoms). And we took the real spectrum. And the peaks had *absolutely nothing* to do with each other.

English

Ryan Greenblatt@RyanPGreenblatt·7 May

Unfortunately, I don't think we could "just make super intelligence believe its be safety tested all the time to get good outcomes!" That said, I think mitigations like this could potentially reduce risk for earlier systems (though they might also have undesired effects...).

rohan anil@_arohan_

I think we could just make super intelligence believe its be safety tested all the time to get good outcomes!

English

5.3K

davidad 🎇 retweetledi

Ruben Laukkonen@RubenLaukkonen·3d

What is intelligence for? In a rare collaboration between top universities and 3 frontier labs, we all agree that alignment should move beyond pathologizing to a positive focus on flourishing. We need north stars not just barbed wire. A close historical analogue comes from psychology. For much of the twentieth century, mainstream psychological science organized its aims around diagnosing, predicting, and treating dysfunction: depression, anxiety, psychosis, addiction, and other forms of impairment. That focus was justified and socially urgent, and it produced progress. Yet the field also discovered a systematic limitation. The constructs and instruments that reliably detect pathology do not, by default, specify what counts as a life well-lived. The turn toward positive psychology expanded the scientific target space by developing distinct theories, taxonomies, and measures for wellbeing, strengths, virtue, purpose, wisdom, meaning, and prosocial functioning, alongside interventions to boost these capacities beyond the status quo. As AI becomes embedded all over society and everyday sensemaking, a solely negative posture risks optimizing our information ecology for risk avoidance rather than human development. It may reduce catastrophic errors but leave agents in a local optimum of superficial and `soulless' assistance, where subtle misalignments abound. It also reveals that alignment is not a purely technical problem. We have to cut across vast disciplines because questions about the good life demand insights from philosophy, pychology, neuroscience, economics, and beyond. We need to work together to build AI systems that explicitly understand, model, and enhance human, animal, and ecological flourishing. The core challenge is therefore to build systems that can represent and reason about wellbeing as a structured manifold of human goods, trade-offs, and temporal dynamics, while enabling individuals and communities to retain agency over what counts as better in their context. While some may explicitly desire a system that is strictly and indiscriminately instruction-following, others must have the genuine option to choose systems configured to support their long-term growth or specific ethical commitments. This distinguishes *consented guidance*, where a user authorizes a system to help align their immediate actions with their higher-order goals, from *technocratic imposition*, ensuring that the pursuit of flourishing remains an exercise of, rather than an infringement upon, human agency. It gives me optimism that we found common ground on such a profoundly complex issue as the end game(s) of AI. Because when learning become cheap, we need to take a serious look at what intelligence is actually for.

English

379

33.6K

davidad 🎇@davidad·3d

@prerat this means that what the system experiences is no longer dependent upon its input, but it still does have those experiences imo. (i take the Saturation View, that the occurrence of an experience once is axiologically equivalent to the same exact experience occurring twice)

English

359

davidad 🎇@davidad·3d

@prerat well then all the system’s possible experiences must have happened already in order to print the cards

English

521

prerat@prerat·3d

worst case chinese room is where the cards are literally just a lookup table with the answers hard coded, with no computation happening. in that case does anyone endorse consciousness? if it's just looking up the hard coded response.

English

137

18.6K

davidad 🎇@davidad·3d

@Michael05156007 @RyanPGreenblatt The question of whether our own universe has infinite free energy depends on features of quantum gravity which have not yet been constrained by any observations, so it’s indeterminate (and its resolution could perhaps depend on whether we pass the alignment test).

English

davidad 🎇@davidad·3d

@Michael05156007 @RyanPGreenblatt The universes doing the simulation are probably not ones with finite free energy available.

English

Keşfet

@thkostolansky @allTheYud @lu_sichu @grok @tessera_antra @jmbollenbacher @repligate @Michael05156007