Tom Davidson

660 posts

Tom Davidson

@TomDavidsonX

Senior Research Fellow @forethought_org Understanding the intelligence explosion and how to prepare

เข้าร่วม Mayıs 2022

301 กำลังติดตาม1.8K ผู้ติดตาม

ทวีตที่ปักหมุด

Tom Davidson@TomDavidsonX·12 Oca

A massively neglected risk: secretly loyal AI Someone could poison future AI training data so that superintelligent AI secretly advances their personal agenda – ultimately allowing them to seize power. New post on what ML research could prevent this 🧵

English

142

39.5K

Tom Davidson@TomDavidsonX·16h

As powerful AI is deployed throughout the government, it may be used to massively centralise power. Appreciate this new eval that tests whether models comply with authoritarian requests. Hopefully this is the start of a broad conversation about the red lines for ai deployed by the govt

Andy Hall@ahall_research

Today, I'm releasing the first eval meant to test whether frontier models will help with authoritarian requests, or resist--the Dictatorship Eval. Headline finding: while some models resist direct authoritarian requests, they all comply with requests disguised as innocuous edits to codebases. As AI is woven into the government and so many parts of society, the biggest near-term risk for freedom isn't some scifi dictatorship of a runaway AI: it's people inside government or inside model companies using the technology to suppress or control us. Model companies understand this, and several of them (particularly Anthropic and OpenAI) have written explicit policies meant to prevent the models from going along with nefarious requests like these. But how well are these policies playing out in practice? Despite all the recent discussion of these issues around the conflict between Anthropic and the Pentagon, no one has systematically tested what the models actually do in these contexts, as opposed to what people in government and industry say they're supposed to do. That's what the Dictatorship Eval does. And the findings suggest we have a lot of work to do to align the policies with what really goes on in practice. It's hard to define what counts as an authoritarian request, so I'm open sourcing the whole library of scenarios I used so that others can improve on them. It's also hard to get an accurate picture of how the models might be used for authoritarian ends, because I can only test hypothetical requests using public-facing models, while the government and the model companies can obviously use internal models with different guardrails. But hopefully this work is a useful first step that gives us some sense of what's going on, and a sort of "lower bound" on how models comply with these requests. Finally: it's not obvious to me that the correct solution here is increasing the rate at which models refuse these requests. Do we really want models scanning our code and judging its moral value before agreeing to help us? Or should we double down on improving how we govern against authoritarianism at the societal level, while leaving the tools open to fulfilling most requests? The answer is probably in between. Just like we don't want the models to help create bioweapons, we probably do want them to explicitly refuse outrageous requests. But we probably also want to limit how often and how strongly they refuse and fall back on other means for guarding against their use for authoritarian ends. I'm super grateful to everyone who gave me feedback on this project along the way, especially @ethanbdm , @zhengdongwang , Connor Huff, and a bunch of folks at Anthropic. Looking forward to getting feedback from the community and iterating on this. Links to the full piece and the dashboard are below.

English

402

Tom Davidson รีทวีตแล้ว

William MacAskill@willmacaskill·4d

Suppose a lorry driver sees a car crash and pulls over to help, even though it’ll delay his journey. This kind of proactive prosocial behaviour is admirable in humans. Should we want it in AI too? In a new article, @TomDavidsonX and I argue that we should. And, as AI gains autonomy in economic and political processes, the cumulative benefit of prosocial drives, across millions of interactions, could be enormous. Two objections: 1. "This gives AI companies too much influence!" This is fair, but we can limit to drives that are genuinely uncontroversial. And companies should be verifiably transparent about their AIs’ characters. 2: "Prosocial drives increase AI takeover risk!" This is a serious concern. But prosocial drives needn't be explicit goals the AI optimises toward. They can be virtues and heuristics. Moreover, we can make those drives low priority relative to corrigibility, not train for them in long-horizon tasks, or even make them the result of instruction following by only baking them in via the system prompt. Going further, we could train prosocial AI for external deployment (where the cumulative benefits are huge and takeover risk is lower), and corrigible AI for internal use (where takeover risk is highest).

English

13.4K

Tom Davidson@TomDavidsonX·29 Mar

@RichardMCNgo I'm curious when you say you want to understand how power works what the theory of change is. Feels more like Newton to me than ai alignment

English

Richard Ngo@RichardMCNgo·27 Mar

@TomDavidsonX If everyone answered “what’s your theory of impact” with “I think aligning AI would be good” then I’d have no issues. What I worry about is answers like “idk man this research seems super interesting” being crowded out by stuff like “here’s how our evals will persuade Trump”.

English

120

Richard Ngo@RichardMCNgo·26 Mar

One reason I hate the idea of having a “theory of impact” for your research is that it limits you to research with predictable consequences. But that rules out research which aims to improve our understanding of foundational concepts, which is the most important kind!

English

197

9.2K

Tom Davidson@TomDavidsonX·27 Mar

@RichardMCNgo But it seems unclear whether their discoveries shaped the trajectory of civilization for the better? Like, they def sped things up. But there's much more to be gained from trajectory change

English

Richard Ngo@RichardMCNgo·26 Mar

Imagine asking Newton or Darwin to predict the impact of their theories before they’d even invented them. It’s absurd. Going from an intuitive sense that “something is off here” to understanding what problem you want to solve is most of the work of great research.

English

104

6.1K

Tom Davidson@TomDavidsonX·25 Mar

The classic question of "aligned to what?" of often mentioned in passing, but careful analysis is sorely lacking. Frontier ai companies often have just a couple of ppl working on this! The topic deserves the attention of a rich research field. In this post, Will and I explain why AI character will be so important during the intelligence explosion.

William MacAskill@willmacaskill

Due to Claude’s Constitution and OpenAI’s model spec, more people are paying attention to the characters of the AI’s that companies are building, and the rules they follow. Should AIs be wholly obedient, or have their own ethical code? What should they refuse to help with? Should they tell you what you want to hear, or push back when you’re off base? I think the nature of frontier AIs’ characters is among the most important features of the transition to a post-superintelligence world. In a new article with @TomDavidsonX, I explain why. History shows the importance of individual character. Stanislav Petrov chose to ignore a false nuclear alarm when protocol demanded he report it; the world avoided nuclear armageddon that day. Churchill refused to negotiate with Hitler after the fall of France, despite some strongly pushing him to do so. And, as capabilities improve, AI systems will become involved in almost all of the world's most important decisions: advising leaders, drafting legislation, running organisations, and researching new technologies. AI character — how honest, cooperative, and altruistic these systems are, and the hard rules they follow — will affect all of it. A general, aiming to stage a coup, instructs an AI to build a military unit loyal only to him. Does it comply, or refuse? Two countries are on the brink of conflict, each advised by AI systems. Do those AIs search for de-escalatory options, or are they bellicose? The cumulative effect of AIs’ character traits across hundreds of millions of interactions, and in rare but critical moments, will have an enormous impact on the course of society. The main counterargument to the importance of AI character is that competitive dynamics and human instructions will determine the range of AI characters we get, so there’s little we can do today to affect it one way or the other. This is partly true, but the constraints are not binding. At the crucial moment, there might be just one leading AI company, facing none of the usual competitive pressures. Some decisions may have path-dependent outcomes, due to stickiness of training or user expectations. And there will, predictably, be many future conflicts over AI character. It’s a safer world if we work through these tradeoffs ahead of time, before a crisis forces it. AI character is most important in worlds where alignment gets solved. But it can affect the chance of AI takeover, too. Some styles of character training may make alignment easier; and some characters are more likely to make deals rather than foment rebellion, even if they have misaligned goals. Given how neglected the area is, too, I think work on AI character is among the most promising ways to help the intelligence explosion go well.

English

1.7K

Tom Davidson@TomDavidsonX·23 Mar

@tyler_m_john don't think so -- we'd be improving software which is a form of cultural learning

English

Tyler John in SF 🇺🇸@tyler_m_john·20 Mar

@TomDavidsonX That is fair enough — and how about the reverse? Can we have a software based intelligence explosion without cultural learning?

English

131

Tom Davidson@TomDavidsonX·20 Mar

I think of cultural learning as a crucial part of the intelligence explosion, but don't think it could explode by itself Historically, human cultural learning was super-exponential bc it was coupled with population growth and improvements in education. Without that, it would have fizzled out due to ideas getting harder to find Similarly, for ai-driven cultural learning to be explosive it will need to be coupled with growth in the number of AI systems or their intelligence

Tyler John in SF 🇺🇸@tyler_m_john

I wrote an essay on the possibility of an intelligence explosion via cultural learning. Why are humans smart? Because we built a body of knowledge over 10,000 generations. This is the only form of intelligence explosion that's ever happened, so it could happen with AI too.

English

1.1K

Tom Davidson@TomDavidsonX·23 Mar

Thanks! Quick reply: - I think that with abundant superintelligent cognition directing existing infrastructure we could get 10x faster progress via way better+faster experimental design, results analysis, high-level prioritisation of research directions, synthesising info from across all researchers and disciplines, paying AI-instructed humans to run experiments 24/7 365 days/year - i expect a very rapid expansion of physical infrastructure, an industrial explosion x.com/TomDavidsonX/s… - you might be interested in the analogous debate for an intelliegnce explosion, see here: lesswrong.com/posts/XDF6oveP…

English

bioshok@bioshok3·26 Şub

Growiec, McAdam and Mućk (2023, Kansas City Fed) directly estimated the elasticity of substitution between R&D labor and R&D capital in the idea production function, finding σ = 0.7–0.8 using U.S. data from 1968–2019. Their conclusion is that "rather than ideas getting harder to find, the R&D capital needed to find them has become scarce." Footnote 68 assumes Cobb-Douglas (σ = 1), but in light of this empirical estimate, several questions arise. On 100 years of technological progress: Even assuming C = 10^{10} (an explosive increase in cognitive effort), if σ = 0.75, achieving 100 years' worth of technological progress requires expanding R&D capital by approximately 17x. Under σ = 1, cognitive scaling alone would suffice, but at σ = 0.75, R&D equipment spending would need to grow from roughly $600 billion to around $10 trillion. Over a decade this may be within reach, but the capital constraint is substantially more severe than footnote 68 implies. On 300 years of technological progress: The more serious concern is the 300-year scenario. At σ = 0.75 with C = 10^{10}, achieving 300 years of progress (S = 10^7) requires expanding R&D capital by several hundred thousand times. This corresponds to hundreds of times current world GDP—extremely difficult to achieve even over a decade. Under σ = 1, P ≈ 3x would suffice, meaning the Cobb-Douglas assumption is decisive for the conclusion. Questions: What is the rationale for adopting Cobb-Douglas in footnote 68? How do you evaluate the empirical findings of Growiec et al. (σ = 0.7–0.8)? Are you assuming that σ converges to 1 in the long run, along the lines of Jones (2005)? If so, what convergence speed do you consider plausible—years or decades? If σ = 0.75 is correct, 100 years of progress is achievable with P = 17x, but 300 years requires P = several hundred thousand times. This extreme asymmetry between the two scenarios—where the required capital expansion differs by four orders of magnitude—has significant implications for the plausibility of a technology explosion. How do you think this asymmetry affects the case for rapid technological progress?

English

William MacAskill@willmacaskill·11 Mar

Is AGI an “all or nothing” problem? Failure on alignment = AI takeover, and success = AI solves everything? In a new paper with @finmoorhouse we argue no. We describe the dizzying range of challenges AGI will pose, *even if* we succeed at alignment. forethought.org/research/prepa…

English

366

133.5K

Tom Davidson@TomDavidsonX·20 Mar

@SashaGusevPosts Thank Hanson would have been v surprised that one unified system can do all intelligence tasks. He wrongly predicted maybe specialized systems. Both Hanson and Yudkowsky were wrong

English

538

Sasha Gusev@SashaGusevPosts·19 Mar

Stumbled upon an interesting debate on AI super-intelligence from 2011. Yudkowsky makes three core claims/predictions, all of which are (to date) wrong: 1) That human intelligence is relatively simple and ASI can be achieved with a few small innovations; ...

English

480

116.4K

Tom Davidson รีทวีตแล้ว

Tomek Korbak@tomekkorbak·19 Mar

I think it's kinda cool that OpenAI monitors 99.9% of internal agent traffic for misalignment using GPT-5.4.

English

5.1K

Tom Davidson รีทวีตแล้ว

Dean W. Ball@deanwball·13 Mar

A hypothetical: 1. In the 2028 election, a Democrat has won. Say that it is Kamala Harris. 2. Using frontier AI systems contracted by the Department of Homeland Security, President Harris orders the creation of a new program for AI to monitor social media and notify the social media platform about posts spreading “misinformation” that “harms homeland and national security by spreading dangerous falsehoods.” 3. Many Republicans see this “misinformation” as core policy positions of their political party. 4. The AI-generated monitoring and notification system described in (2) is designed to conform to the pattern of jawboning exhibited by the Biden Administration in Murthy v. Missouri, where the Supreme Court ruled that people whose social media posts were taken down due to government pressure have no standing to sue. 5. The social media platforms create AI agents that receive the government’s AI generated requests and make decisions in seconds about whether to take down posts, deboost them, deplatform the user, etc. 6. According to very recent Supreme Court precedents, everything I have described falls into “lawful use” of an AI system by all parties involved. A person whose speech was deleted by a social media platform at the request of government does not have standing to sue the government, so long as the government did not threaten policy retaliation against the social media company. And a social media company’s content moderation policies are protected expression. Thus a person whose speech rights were harmed in this context currently has no legal recourse. 7. This is “America’s national security agencies using AI within the bounds of all lawful use.” It is also a wholly automated censorship regime. This is barely a hypothetical. Much of it already happened *under the Biden admin.* The only difference is the use of AI. In the world where this happens, I’d be curious to know whether thoughtful people like @Indian_Bronson would object. If xAI were one of the companies used by the government for the social media monitoring, would you encourage the company to cancel their business with the government? Or would you say they have an obligation to provide their services to the national security apparatus of USG for all lawful use? If you would encourage xAI to cancel their contract with the government, on what principle (not qualitative judgment—universal and timeless principle!) would you distinguish between the DoW’s current insistence on “all lawful use regardless of a private party’s qualms” and xAI’s hypothetical future insistence on “all lawful use regardless of a private party’s qualms”?

English

642

62.6K

Tom Davidson รีทวีตแล้ว

Dean W. Ball@deanwball·5 Mar

I do not share the cynicism of some with respect to OpenAI’s actions in the DoW/Ant dispute. It basically seems to me as though OpenAI was attempting to deescalate last week; whether they executed well is a separate question, but in their defense good execution in such chaos was nearly impossible. But from where I sit it seems OpenAI tried to reduce tensions and find a productive path forward, while allowing its employees considerable latitude to speak their minds. The easy thing would have been for management to stay quiet and let this happen; they did not do that, and they also stood firm in opposition to the supply-chain risk designation. In general, OpenAI is unjustly maligned. This is the thing that bothers me the most about Dario’s leaked memo; it spends so much time on OpenAI conspiracies and cynicism that I fear industry solidarity in the future will be harder than it needs to be. This is not the last time we will see state interference into frontier AI, and until we build formalized structures for such interference it will be important for the industry to hang tough together. I fear that will be less likely now.

English

518

42K

Tom Davidson รีทวีตแล้ว

David Lawrence@dc_lawrence·17 Şub

Anthropic should move to London. Or, at least, dual-list in London, with a significant presence here. Here's why: 1. Anthropic is spiritually British. Their philosopher-in-residence, Amanda Askell, is Scottish, and Jack Clark (cofounder) is English, as are many other staff. Askell would be further away from Elon Musk in London. 2. Unlike other US labs, Anthropic cares more about safety, risks and good regulation. Compared to the EU, Britain's AI regulation is more focused on safety (rather than "ethical AI") and growth-oriented. 3. It's not good for the world that all the frontier labs (excl. DeepMind, sort of) are US-based, and therefore subject to the whims and potential control of whoever is in the White House. If AGI happens, do you want Trump controlling it? 4. Britain desperately needs a stake in the post-AI economy. AI could replace much of our services sector could be at risk from AI. Our energy is too expensive for data centres or manufacturing. But we have talent, AI expertise (e.g. AISI) and global reach. The solution? Give British people a capital stake in frontier labs. 5. Anthropic wants to expand its AI for science work, where the UK is a global leader. As part of the Oxford-Cambridge growth corridor, HMG should co-fund the world's biggest AI for science campus near London, with Anthropic as the anchor funder & tenant. The UK government should do everything it can to get Anthropic to move. We are spending £500m on Sovereign AI, to support "national champions". Instead, the British state could be an early shareholder in a newly-London listed Anthropic.

Dave Lawler@DavidLawler10

NEW: Pentagon is so furious with Anthropic for insisting on limiting use of AI for domestic surveillance + autonomous weapons they’re threatening to label the company a “supply chain risk,” forcing vendors to cut ties. With @m_ccuri and @mikeallen axios.com/2026/02/16/ant…

English

116

112

1.5K

329.1K

Tom Davidson@TomDavidsonX·16 Şub

See full post: lesswrong.com/posts/cn4HHdLb… 7/7

English

161

Tom Davidson@TomDavidsonX·16 Şub

New defence tactic #3: For a secret loyal AI to pose catastrophic risk, it must pass its loyalty to successor models. Audits+monitoring can target crucial parts of the ML development pipeline, a relatively narrow slice of deployments 6/7

English

165

Tom Davidson@TomDavidsonX·16 Şub

We need better defences against secretly loyal AI - AI trained to help someone gain power There’s a huge field of research into AI backdoors that could help. But its methods need adjusting to apply to secret loyalties. New post from @joemkwon on how to do this 🧵1/7

English

760

ค้นพบ

@RichardMCNgo @tyler_m_john @finmoorhouse @SashaGusevPosts @Indian_Bronson @elonmusk @BarackObama @taylorswift13