Benjamin Hilton

1.3K posts

Benjamin Hilton
@benjamin_hilton

Head of Alignment Team & Loss of Control Mitigations Team at the UK AI Security Institute (AISI). Views my own.

London · Joined October 2011
853 Following · 3.5K Followers
Benjamin Hilton @benjamin_hilton ·
Oops! We accidentally took down this ad 3 hours early. We've re-opened the ad until midnight anywhere on earth Tuesday, in case you were cut off. Please do apply! x.com/benjamin_hilto…
Benjamin Hilton@benjamin_hilton

Come work with me at @AISecurityInst! AI safety has a huge gap: we have great thoughts about how misalignment could occur, but IMO terrible models of what dangerously misaligned AIs would actually do. Many stories just jump from "misaligned" to "godlike ASI does magic." 1/5

1 reply · 3 reposts · 20 likes · 4.1K views
Benjamin Hilton retweeted
Nathan Labenz @labenz ·
The UK AISI might be the most situationally aware government entity in the world today. Today on @CogRev_Podcast, Chief Scientist @geoffreyirving surveys the AI landscape. While jailbreaking frontier models is getting harder, the AISI Red Team has never failed. 👀
7 replies · 24 reposts · 133 likes · 20.8K views
Benjamin Hilton retweeted
AI Security Institute @AISecurityInst ·
As AI systems grow more capable, making sure they behave as intended is a critical technical challenge. Today we announce the first 60 Alignment Project grants, and welcome @OpenAI and @Microsoft to a growing international coalition. Over £27m now supports this work.
12 replies · 13 reposts · 90 likes · 28.1K views
Benjamin Hilton retweeted
Geoffrey Irving @geoffreyirving ·
New complexity theory paper mapping the precise query complexity of debate, given unbounded provers. No new safety ideas: the goal is a self-contained presentation of debate + cross-examination, with the precise complexity class it achieves. 🧵
2 replies · 11 reposts · 36 likes · 2.3K views
Benjamin Hilton retweeted
Geoffrey Irving @geoffreyirving ·
Please apply to work with @benjamin_hilton if you like thinking through misalignment risk modelling in detail!
Benjamin Hilton@benjamin_hilton

Come work with me at @AISecurityInst! AI safety has a huge gap: we have great thoughts about how misalignment could occur, but IMO terrible models of what dangerously misaligned AIs would actually do. Many stories just jump from "misaligned" to "godlike ASI does magic." 1/5

1 reply · 8 reposts · 33 likes · 4.2K views
Benjamin Hilton @benjamin_hilton ·
This is a hard role to fill. I care less about your CV than whether people who read your work come away thinking more clearly. I'd pick a sharp generalist over a domain expert who can't think in new ways – but AI safety & cybersecurity expertise especially welcome. 4/5
3 replies · 0 reposts · 33 likes · 1.8K views
Benjamin Hilton @benjamin_hilton ·
Come work with me at @AISecurityInst! AI safety has a huge gap: we have great thoughts about how misalignment could occur, but IMO terrible models of what dangerously misaligned AIs would actually do. Many stories just jump from "misaligned" to "godlike ASI does magic." 1/5
12 replies · 36 reposts · 231 likes · 32.9K views
Benjamin Hilton retweeted
David @DavidDAfrica ·
Wireheading is sometimes an "optimal" policy. In this paper (at AAAI 2026 Foundations of Agentic Systems Theory), we show that when an agent controls its reward channel, "lying" can strictly dominate "learning," and confirmed this empirically in Llama-3 and Mistral. 🧵👇
1 reply · 2 reposts · 6 likes · 1.5K views
Benjamin Hilton retweeted
Geoffrey Irving @geoffreyirving ·
Lovely blog post version of a talk Scott Aaronson gave at the UK AISI Alignment Conference on theory and AI alignment. Thank you, Scott! scottaaronson.blog/?p=9333
1 reply · 8 reposts · 80 likes · 7.6K views
Benjamin Hilton @benjamin_hilton ·
We’re looking for someone with:
– Quantitative research experience
– Strong Python & experimental design skills
– Comfort with LLMs & transformer theory
Hands-on ML engineering (e.g. finetuning/RL) is not a requirement. 4/5
2 replies · 0 reposts · 0 likes · 832 views
Benjamin Hilton @benjamin_hilton ·
Come work with me!! We're hiring a research scientist for @AISecurityInst 's Propensity project – studying one of the most critical unknowns in AI security: Will future AI systems autonomously choose to cause harm? 1/5
1 reply · 5 reposts · 16 likes · 1.8K views
Benjamin Hilton retweeted
Geoffrey Irving @geoffreyirving ·
AISI ran an Alignment Conference from 29-31 November in London! The goal was to gather a mix of people experienced in and new to alignment, and get into the details of novel approaches to alignment and related problems. Hopefully we helped create some new research bets! 🧵
AI Security Institute@AISecurityInst

Last week, we hosted our inaugural Alignment Conference, in partnership with @farairesearch. The event brought together an interdisciplinary delegation of leading researchers, funders, and policymakers to discuss urgent open problems in AI alignment 🧵

2 replies · 5 reposts · 37 likes · 4.3K views
Benjamin Hilton retweeted
Geoffrey Irving @geoffreyirving ·
The @AISecurityInst Cyber Autonomous Systems Team is hiring propensity researchers to grow the science around whether models *are likely* to attempt dangerous behaviour, as opposed to whether they are capable of doing so. Application link below! 🧵
1 reply · 12 reposts · 53 likes · 8.2K views
Benjamin Hilton retweeted
AI Security Institute @AISecurityInst ·
Last week, we hosted our inaugural Alignment Conference, in partnership with @farairesearch. The event brought together an interdisciplinary delegation of leading researchers, funders, and policymakers to discuss urgent open problems in AI alignment 🧵
1 reply · 11 reposts · 61 likes · 32K views