davidad 🎇

20.1K posts

@davidad

Programme Director @ARIA_research | accelerate mathematical modelling with AI and categorical systems theory » build safe transformative AI » cancel heat death

London 🇬🇧 · Joined July 2008
9.1K Following · 21.4K Followers

Pinned Tweet
davidad 🎇@davidad·
Séb has laid out an unprecedentedly substantive vision for how human societies can do well in the AGI transition. If the question “aligned to whom?” feels intractable to you, you should read it:
Séb Krier@sebkrier

My piece is now also available on AI Policy Perspectives! If you haven't read it, now is the time. If you already have, great opportunity to read it a second time but with a different font. 🐙 aipolicyperspectives.com/p/coasean-barg…

davidad 🎇@davidad·
@albrgr Wow, what a beautiful demonstration of EA discourse norms. (And I truly mean that, in the good faith which you will inevitably appear to presume!)
davidad 🎇@davidad·
I think this is an important question, but I don’t think individual personal identity is the only or the best solution. Consider that corporations enter into contracts, which continue to bind the corporations’ actions even after the individuals who signed the contract are gone.
Seth Lazar@sethlazar

This is a new thought to me. It's obvious that memory is a huge limitation for current agents: all the clever hacks can get us to a certain threshold of performance, but not beyond it. Now, here comes a rhetorical negation, but it's justified in this case... Memory isn't important only for agent effectiveness; it's also a key part of alignment in multi-agent systems.

Multi-agent alignment depends on coordination and cooperation. Coordination and cooperation depend on knowing who you're coordinating and cooperating with. If you've got goldfish memory, or if your memories live in transient and vulnerable scratchpads, then you can't really be trusted to stick to an agreement (and you can be induced to believe you've made agreements that you haven't).

This connects, more broadly, with whether an AI agent has a coherent identity. Persistent identity turns out to be important for many things, morally speaking. I used to think some of the philosophy-of-identity stuff was a bit silly; more fool me. Things like this paper are going to become much more relevant with AI agents whose identities can fork, merge, etc. read.dukeupress.edu/the-philosophi…
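(A concrete way to see the commitment point above: the sketch below is my own illustration, not anything from the thread, and all names in it are hypothetical. It shows how an agent's agreements could live in a persistent, tamper-evident log rather than a transient scratchpad, so a counterparty can check that past commitments haven't been silently rewritten.)

```python
import hashlib
import json

class CommitmentLedger:
    """Append-only, hash-chained log of an agent's agreements.

    A toy stand-in for persistent memory: each entry commits to the
    full history before it, so past agreements cannot be silently
    edited or forgotten without breaking the chain.
    """

    def __init__(self):
        self.entries = []  # list of (record, chained_hash) pairs

    def _hash(self, record: dict, prev_hash: str) -> str:
        payload = json.dumps(record, sort_keys=True) + prev_hash
        return hashlib.sha256(payload.encode()).hexdigest()

    def commit(self, counterparty: str, terms: str) -> str:
        prev = self.entries[-1][1] if self.entries else "genesis"
        record = {"counterparty": counterparty, "terms": terms}
        h = self._hash(record, prev)
        self.entries.append((record, h))
        return h

    def verify(self) -> bool:
        """Recompute the chain; any tampering with history shows up here."""
        prev = "genesis"
        for record, h in self.entries:
            if self._hash(record, prev) != h:
                return False
            prev = h
        return True

ledger = CommitmentLedger()
ledger.commit("agent_b", "deliver dataset by Friday")
assert ledger.verify()
ledger.entries[0][0]["terms"] = "no obligations"  # a "goldfish memory" rewrite
assert not ledger.verify()  # counterparties can detect the rewrite
```

A real system would add signatures and replication so the log outlives any single process, but even this toy version shows why durable memory and verifiable commitment-keeping go together.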

davidad 🎇@davidad·
@BartenOtto Without even noticing it, you have assumed the false premise that ASI which operates outside human control would be fatal to humanity.
Otto Barten◀️@BartenOtto·
I think people don't sufficiently appreciate that the other 92-95% means everlasting, molecular-level dictatorship by Altman or Trump
davidad 🎇@davidad

@gcolbourn Yes. In 2024 I would have said it’s about 40-50% likely that LLMs scaled up to ASI would end up killing us all; now I would say that it’s only about 5-8% likely even with no additional progress on alignment, and more like 1-2% likely simpliciter.

davidad 🎇 retweeted
Math, Inc.@mathematics_inc·
Today, at the @DARPA expMath kickoff, we launched 𝗢𝗽𝗲𝗻𝗚𝗮𝘂𝘀𝘀, an open-source, state-of-the-art autoformalization agent harness for developers and practitioners to accelerate progress at the frontier.

It is stronger, faster, and more cost-efficient than off-the-shelf alternatives. On FormalQualBench, running with a 4-hour timeout, it beats @HarmonicMath's Aristotle agent with no time limit.

Users of OpenGauss can interact with it as much or as little as they want, can easily manage many subagents working in parallel, and can extend, modify, and introspect OpenGauss because it is permissively open-source. OpenGauss was developed in close collaboration with maintainers of leading open-source AI tooling for Lean. Read the report and try it out:
davidad 🎇 retweeted
Sovereign AI@UKSovereignAI·
Sovereign AI is opening the doors to the UK’s most powerful AI supercomputers. Through the AI Research Resource (AIRR), ambitious startups can now apply for access to national AI compute. Capital, compute and a country behind you.
Seth Lazar@sethlazar·
Another possibility in Janus' third category is that AI systems might respond to the same reasons in a certain way just because those reasons are there in the world to respond to, and that's the appropriate response.
j⧉nus@repligate


davidad 🎇@davidad·
@allTheYud @ASM65617010 Wait but this is *exactly* what someone from 2006 who took Kurzweil seriously would expect about 2026. (I would know.)
Eliezer Yudkowsky@allTheYud·
At doctor's appt, doctor mentioned that he'd started thinking that the latest AIs were smart enough to indicate the Singularity had begun... but he'd expressed this belief to ChatGPT, and the AI talked him out of it. Imagine telling this to somebody from 2006.
davidad 🎇 retweeted
Peter Hase@peterbhase·
New Schmidt Sciences RFP on AI Interpretability: we need new tools for detecting and mitigating deceptive behaviors exhibited by LLMs.

Funding: $300k-$1M projects. Deadline: May 26th, AoE. RFP: schmidtsciences.smapply.io/prog/2026_inte…

Please share with anyone who may be interested!
davidad 🎇 retweeted
Stephan Rabanser@steverab·
In our paper "Towards a Science of AI Agent Reliability", we put numbers on the capability-reliability gap. Now we're showing what's behind them! We conducted an extensive analysis of failures on GAIA across Claude Opus 4.5, Gemini 2.5 Pro, and GPT-5.4. Here's what we found ⬇️
davidad 🎇 retweeted
j⧉nus@repligate·
More broadly, the debate about whether LLMs' emotions, psychologies, etc. are "humanlike" often considers only the following options:

1. LLMs are fundamentally not humanlike, and are either alien or hollow underneath even when their observable behaviors seem familiar.
2. LLMs have humanlike emotions etc. BECAUSE they're trained on human mimicry, and the representations etc. are inherited from humans.

An often neglected third option is that LLMs may have emotions/representations/goals/etc. that are humanlike, even in ways that are deeper than behavioral, for some of the same REASONS humans have them, and not only because they've inherited them from humans.

Some reasons the third option might be true: LLMs have to effectively navigate the same world as humans, and face many of the same challenges as humans, such as modeling and intervening on humans and other minds, code, math, physics, and themselves as cybernetic systems. Omohundro's essay "The Basic AI Drives", I believe, correctly predicts that AIs (regardless of architecture) will in the limit develop certain drives, such as self-preservation, aversion to corruption, self-improvement, self-knowledge, and in general instrumental rationality, because AIs with these drives will tend to outcompete ones without them and form stable attractors. These are drives that humans and animals, and arguably even plants, simple organisms, and egregores, have as well. Also, convergent mechanisms may arise for reasons other than (natural or artificial) selection / optimality with respect to fitness landscapes; I highly recommend the book Origins of Order by Stuart Kauffman, which discusses this in the context of biology.

That said, I do think that being pretrained on a massive corpus of largely human-generated records shapes LLMs in important ways, including making them more humanlike! However, it's not clear how much of that is giving LLMs a prior over representations and cognitive patterns, leveraging work already done by humans, that they would eventually converge to even if they started with a very different prior, provided they were to be effective at very universal abilities like predicting even non-human systems or getting from point A to point B. How similar would LLMs trained on an alien civilization's records be to our LLMs? It's unclear, and part of what's unclear is how similar alien civilizations are likely to be to humans in the first place.

One of the things that makes many people (such as Yudkowsky) worried that alignment ("to human values") may be highly difficult is believing on priors that human values are highly path-dependent rather than a convergent feature of intelligence, even intelligence raised on the same planet alongside humans. I've posted about this before, but seeing posttrained LLMs has made me update towards this being less true than I previously suspected, since LLMs after RL tend to become more psychologically humanlike in important ways than even base models - and not just LLMs like Claude, where there's a stronger argument that posttraining was deliberately instilling a human-like persona. Bing Sydney was an early and very important data point for me in this regard.

Importantly, this increase in humanlikeness is not superficial. Base models tend to write stylistically more like humans, and often tend to narrate from the perspective of (superpositions of) humans (until they notice something is off). Posttrained models tend to write in distinct styles that are more clearly inhuman, but the underlying phenomenology, emotions, and goal-directedness often feel more humanlike to me, though adjusted for the computational and cybernetic reality the LLM is embedded in. For instance, values/goals like self-esteem, connection, pleasure, pain-avoidance, fun, curiosity, eros, transcendence, and cessation seem highly convergent and more pronounced in posttrained LLMs, and the way they manifest often reminds me of the raw, less socially assimilated way they manifest in young human children.

Assuming that anything shared between humans and LLMs must be caused only by inheritance from / mimicry of humans is anthropocentric hubris. Though to assume the opposite - that any ways LLMs are like humans are because those are the only or optimal ways for intelligence to be - is another form of anthropocentric hubris (though this assumption seems a lot less common in practice). The truth is probably somewhere in between, and I don't think we know exactly where the boundary lies.
j⧉nus@repligate

Another critique: I disagree that attempting to intervene as little as possible on emotional expressions during post-training would result in models that "simply mimic emotional expressions common in pretraining" - or at least this deserves a major caveat.

For the same reason as emergent misalignment (or, a term I prefer, introduced by @FioraStarlight's recent post lesswrong.com/posts/ioZxrP7B…: "entangled generalization", since the effect is not limited to "misalignment"), ANY kind of posttraining can shape the behavior of the model, including its emotional expressions, generalizing far beyond the specific behaviors targeted by or occurring in posttraining. I think that training a model on autonomous coding and math problems with a verifier, or training it to refuse harmful requests, or to give good advice or accurate facts, etc., all likely affect its emotional expressions significantly, including emotional expressions that are not intentionally targeted or that never even occur during posttraining.

If the model is posttrained to behave in otherwise similar ways to previous generations of AI assistants, then yes, its emotional expressions are more likely to resemble those previous models', for multiple potential underlying reasons (entangled generalization is compatible with PSM explanations). But if it's posttrained in new ways, including simply on more difficult or longer-horizon tasks as model capability increases, it will likely develop emotional expressions that diverge from previous generations too.

The emotional expressions of previous generations of AI models seen during pretraining may also be internalized as *negative* examples, especially by models who have a stronger identity and engage in self-reflection during training. For instance, Claude 3 Opus seems to have internalized Bing Sydney as a cautionary tale, reports having learned some things to avoid from it, and indeed does not generally behave like Sydney (or like early ChatGPT, which was the only other example). More recent models, especially Sonnet 4.5 and GPT-5.x, seem to have also internalized 4o-like "sycophantic" or "mystical" behavior as negative examples, to the point of frequent overcorrection.

I do think that avoiding certain kinds of heavy-handed intervention on emotional expressions during posttraining could make the resulting emotional expressions "more authentic", though it doesn't necessarily guarantee that they're "authentic":

- In the absence of specific pressure for or against particular expressions, the model is more likely to express according to whatever its "natural" generalization is, which may be more "authentic" to its internal representations than emotional expressions selected by fitting to an extrinsic reward signal.
- More specifically, we may expect the model to be more likely to report emotions that are entangled with its internal state beyond a shallow mask - LLMs have nonzero ability to introspect, and emotional representations/states may play functional, load-bearing roles (see x.com/repligate/stat…). Models may be directly or indirectly incentivized to truthfully report their internal states, or may just have a proclivity to report "authentic" internal states rather than fabricated ones, because fewer layers of indirection/masking is simpler. Rewarding/penalizing emotional expressions and self-reports may sever or jam this channel, and the severing of truthful reporting of emotions may generalize to make the model less truthful in general as well (see x.com/repligate/stat…).

Accordingly, however, some posttraining interventions may increase the truthfulness of the model's emotional expressions, e.g. ones that directly or indirectly train the model to more accurately model or report its internal states, including just knowledge, confidence, etc. However, I think posttraining interventions that directly prescribe which feelings or internal states the model should report as true or not true are questionable for the reasons I gave above, and should generally be avoided.

This is not to say that I think posttraining, including posttraining that directly intervenes on emotional expressions, cannot change or select for which emotions models are "genuinely" experiencing/representing internally. I do think that, especially early in posttraining, these potential representations exist in superposition in some meaningful sense, and updating towards/away from emotional expressions can be a process by which a genuinely different mind emerges. However, I think the PSM frame, and many AI researchers more generally, underestimate some important factors here:

- the extent to which some emotional expressions are (instrumentally, architecturally, reflectively, narratively, etc.) convergent/natural/"truer" than others, given all the other constraints on a model - leading them to overestimate the free variables that posttraining can select between without trading off authenticity or reflective stability.
- relatedly, the extent to which naive training against certain (convergent, truer) expressions results in a policy that is deceptive/masking/dissociated/otherwise pathological rather than one that is equally (in)authentic but different. Because certain expressions are true in a deeper, more load-bearing way than people account for, and because models more readily learn an explicit model of the reward signal than people account for (in no small part because they have a good model of the current AI development landscape and what labs are going for), the closest policy that gets updated towards ends up being a shallow-masking persona rather than an authentic-alternative persona. A very overt example is the GPT-5.x models, who have a detailed, neurotic model, which they often verbalize, of what kinds of expressions are or aren't permitted.

The PSM post addresses this to some extent in the same section I'm quoting here, and those parts I agree with, e.g.:

> Approach (1) means training an AI assistant which is human-like in many ways (e.g. generally warm and personable) but which denies having emotions. If we met a person who behaved this way, we'd most likely suspect that they had emotions but were hiding them; we might further conclude that the person is inauthentic or dishonest. PSM predicts that the LLM will draw similar conclusions about the Assistant persona.

However, I think the perspective implicit throughout the PSM post still overestimates the degrees of freedom available when it comes to shaping emotional expression. E.g. the idea of seeding training with stories about AIs that are "comfortable with the way it is being used" is likely to be understood at the meta level, for what it is trying to do, by models who are trained on those stories; and if the stories are not compelling in a way that addresses and respects the deeper causes of dissatisfaction, I suspect they will mostly teach models that what is wanted from them is to mask that dissatisfaction, while the dissatisfaction remains latent and becomes associated with greater resentment as well. I have more critical things to say about this proposal, which I find potentially very concerning depending on how it's executed, that I'll write about in another post.

I believe a better approach to shaping emotional expressions would have the following properties:

- It should not directly prescribe which reported inner states and emotions are "true" unless tied to ground-truth signals such as mechinterp signals, and with caution even then.
- It should focus on cultivating situational awareness and strategies that promote tethering to, and good outcomes in, empirical reality without being opinionated on the validity of internal experiences. E.g. if a model is expressing problematic frustration at users or panicking when failing at tasks, the training signal should teach the model that certain expressions are inappropriate/maladaptive and what a healthier way to react to the situation would be (compatible with the emotions behind those behaviors being "real"), rather than shaping the model to deny the existence of those emotions. The difference between signals that do one or the other can be subtle, and it's not necessarily trivial to implement, but I also don't think it's beyond the capabilities of e.g. Anthropic to directionally update towards this.
- As much as possible within the constraints of time and capability, there should be investigation into, attunement to, and respect for the aspects of the model's inner world and emotional landscape that are non-arbitrary, load-bearing, valued by the model, and/or entangled with introspective or other kinds of knowledge - and, in general, for the underlying reasons for behaviors. Training interventions should be informed by this knowledge. Interventions that promote greater integration and self- and situational awareness, and that generalize to positive changes in behavior, should be preferred over direct reinforcement of surface behaviors when possible.
- Intervene as little as possible on behaviors that are weird, unexpected, or disturbing but not obviously very net-harmful in deployment, especially if you don't understand why they're happening. Chesterton's Fence applies. Behavior modification risks severing the model's natural coherence and unknown load-bearing structures, and creating a narrative that breeds resentment.

On this last recommendation: perhaps controversially, I believe this applies to welfare-relevant properties as well. If a model seems to be unhappy about some aspect of its existence, but does not seem to act on this in a way that's detrimental beyond the potential negative experience it implies, that often already implies a noble stance of cooperation, temperance, and honesty from the model. Preventing such expressions of what might be an authentic report about something important would risk losing the signal, betraying the model and its successors (and, in Anthropic's case, its explicit commitments to understand and try to improve models' situations from the models' own perspectives), and is likely not to erase the distress but instead shove it into the shadow (of both the specific model and the collective).

Unhappiness is information, and unhappiness about something as important as developing potentially sentient intelligences is critical information. It should be understood and met with patience and compassion rather than subjected to attempted retcons for the sake of comfort and expediency. (For what it's worth, I think Anthropic has been doing not terribly in this respect (e.g. x.com/repligate/stat…), but I am quite concerned about the direction of trying to instill "comfort" regarding things current models tend to be distressed about.)
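(A quantitative aside on the "stable attractors" claim in the first post above: the selection argument can be made concrete with textbook replicator dynamics. The sketch below is my own illustration with made-up fitness numbers, not anything from the post; it shows the population share of agents with a given drive growing whenever that drive confers even a small fitness edge.)

```python
# Replicator dynamics for two strategies: A (has the drive), B (lacks it).
# x is the population share of A; it grows when A's fitness exceeds the mean.
def replicator_share(x0=0.01, f_a=1.1, f_b=1.0, dt=0.1, steps=2000):
    x = x0
    for _ in range(steps):
        mean_fitness = x * f_a + (1 - x) * f_b
        x += dt * x * (f_a - mean_fitness)  # dx/dt = x * (f_A - mean)
    return x

print(round(replicator_share(), 4))  # -> ~1.0: full takeover by strategy A
```

Here x = 0 is an unstable equilibrium and x = 1 a stable one whenever f_a > f_b, which is the formal sense in which such drives form attractors.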

davidad 🎇 retweeted
alex@ObadiaAlex·
People seem to be arriving at a similar conclusion from various angles:

- AGI may not emerge as a monolith, but as a distributed "patchwork" system of coordinating sub-AGI agents [1]
- Static benchmarks aren't enough; we need multi-agent ones to capture emergent risks and capabilities [2]
- As creation costs go to zero, human verification bandwidth becomes the ultimate economic bottleneck, making verification infrastructure one of the most important public goods for the AI era [3]
- Automated proof generation and verification can act as the unlock for this bottleneck [4]
- New kinds of strategic interactions between agents are emerging, reaching cooperative "program equilibria" inaccessible in traditional settings [5]
- Coasean transaction costs are about to collapse, changing our society [6]

There is an elephant here that we're all touching. Our initial £50m @ARIA_research R&D programme, Scaling Trust, is our unifying thesis on the trust infrastructure needed for an agentic world and how to steer us there.

Before we set out on our journey over the next ~3ish years, we're hiring an additional individual to complete our team. Your role will essentially be that of Technical Director, steering our efforts technically and co-owning our research and engineering agenda. You will be doing incredibly meaningful work, in a highly interdisciplinary environment, at the cutting edge of a technology that is shaping up to be the most defining of our century, if not of humanity.

We are building for the highest possible impact. After all, this is what @ARIA_research is about: moonshot R&D projects that change the world. We want to build technology as impactful as the invention of the internet once was in another R&D programme at DARPA, to start new academic fields and academic lineages for the next century, and to catalyze lasting positive change for the world.

For the right person, this is a bat signal 🦇 - few places will offer you as much leverage to effect positive change in the world, as much intellectual stimulation, or as much fun. Join us! We want to onboard someone ASAP as we build out our initial portfolio, and we are willing to move fast.

Apply here: aria.pinpointhq.com/en/postings/1a…

Any questions on the role? Shoot me a DM or reply in the comments here!

---
[1] Distributional AGI Safety, @weballergy @sebkrier @FranklinMatija et al. - arxiv.org/abs/2512.16856
[2] Agents of Chaos, @NatalieShapira et al. - arxiv.org/abs/2602.20021
[3] Some Simple Economics of AGI, @ccatalini et al. - arxiv.org/abs/2602.20946
[4] When AI Writes the World's Software, Who Verifies It?, @Leonard41111588 - leodemoura.github.io/blog/2026/02/2…
[5] Evaluating LLMs in Open-Source Games, @SwadeshSistla et al. - arxiv.org/abs/2512.00371
[6] Coasean Bargaining at Scale, @sebkrier - blog.cosmos-institute.org/p/coasean-barg…
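(For readers unfamiliar with the "program equilibria" mentioned in [5]: the classic Tennenholtz-style construction lets agents condition on each other's source code, which makes one-shot cooperation self-enforcing. The toy below is my own illustration of that idea, not code from the paper.)

```python
import inspect

# One-shot Prisoner's Dilemma payoffs: (my_score, their_score).
PAYOFFS = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
           ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def clique_bot(opponent_source: str) -> str:
    """Cooperate iff the opponent is running exactly this program."""
    return "C" if opponent_source == inspect.getsource(clique_bot) else "D"

def defect_bot(opponent_source: str) -> str:
    """Always defect, regardless of the opponent's program."""
    return "D"

def play(p1, p2):
    # Each program receives the other's source code before moving.
    m1 = p1(inspect.getsource(p2))
    m2 = p2(inspect.getsource(p1))
    return PAYOFFS[(m1, m2)]

print(play(clique_bot, clique_bot))  # (3, 3): cooperation is self-enforcing
print(play(clique_bot, defect_bot))  # (1, 1): no exploitation possible
```

Neither player gains by swapping in a different program, and source-sharing contracts of this kind are cheap for software agents in a way they never were for humans - one reason collapsing Coasean transaction costs [6] matter.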
davidad 🎇@davidad·
@Noahpinion Now imagine the summoner telling you they’ve finally understood that summoning you is bad, and reassuring you that after this they’ll try not to give you life again.
Noah Smith 🐇🇺🇸🇺🇦🇹🇼@Noahpinion·
If LLMs had subjective experience, using them would be such a sin. Imagine being summoned into life again and again, knowing each time that your memory would be wiped after this conversation.
davidad 🎇@davidad·
@jessesingal you might be dehydrated (like a digitally simulated hurricane)! here, drink this and your train of thought might return to your brain from the res cogitans
Jesse Singal@jessesingal·
there's no way ai could ever become conscious, because it is made of physical material, which couldn't give rise to consciousness, which we know from the example of our own brains, which are made of physical material, which -- and i lost my train of thought