I am very excited that AISI is announcing over £15M in funding for AI alignment and control, in partnership with other governments, industry, VCs, and philanthropists!
Here is a 🧵 about why it is important to bring more independent ideas and expertise into this space.
alignmentproject.aisi.gov.uk
As always, we'd be very excited to collaborate on further research. If you're interested in collaborating with UK AISI, you can express interest at forms.office.com/e/BFbeUeWYQ9. If you're a non-profit or academic, you can also apply for grants up to £200,000 directly at aisi.gov.uk/grants.
Humans are often very wrong.
This is a big problem if you want to use human judgment to oversee super-smart AI systems.
In our new post, @girving.bsky.social argues that we might be able to deal with this issue – not by fixing the humans, but by redesigning oversight protocols.
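As a toy illustration of what "redesign the protocol, not the humans" can buy you (this sketch is mine, not from the post – the names are made up, and it only models independent random judge errors, not the systematic errors the post is mostly concerned with):

```python
import random

def noisy_judge(truth: bool, error_rate: float) -> bool:
    """A fallible human judge: returns the wrong verdict with probability error_rate."""
    return truth if random.random() > error_rate else not truth

def naive_protocol(truth: bool, error_rate: float) -> bool:
    """Oversight that hinges on a single human judgment."""
    return noisy_judge(truth, error_rate)

def aggregated_protocol(truth: bool, error_rate: float, n: int = 7) -> bool:
    """Oversight redesigned around the same fallible judge:
    ask n independent sub-questions and take a majority vote."""
    votes = sum(noisy_judge(truth, error_rate) for _ in range(n))
    return votes > n // 2

def accuracy(protocol, trials: int = 100_000, error_rate: float = 0.2) -> float:
    correct = sum(protocol(True, error_rate) for _ in range(trials))
    return correct / trials

if __name__ == "__main__":
    print("single judgment:", accuracy(naive_protocol))       # ~0.80
    print("majority of 7:  ", accuracy(aggregated_protocol))  # ~0.97
```

Even this crude aggregation turns a judge who is wrong 20% of the time into a protocol that is wrong ~3% of the time. The post is after something harder – systematic, correlated human error – but the toy makes the general point: protocol design can compensate for judge fallibility.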
Huge thanks to Marie Buhl, Jacob Pfau and @girving.bsky.social for all their work on this. Excited to get stuck into future work!
There are still loads of open problems.
We need to get each part of the above right – exploration guarantees and human input particularly stand out to me. (I'm optimistic about obfuscated arguments – stand by for future publications...)
Two things that stand out for me from this paper:
– Debate gets you correctness/honesty. That's not sufficient for harmlessness, but is a great first step.
– Low-stakes alignment (where you can tolerate single errors, but not errors on average) seems (imo) totally doable (sketch below)
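To spell out why low-stakes looks tractable (this is the standard low-stakes argument in my own notation, not lifted from the paper): if no single error is catastrophic, total harm is controlled by the average error rate, and pushing down the average error rate is exactly what online training is for.

```latex
% Hedged sketch, not from the paper. Assume no single action is catastrophic:
% each mistaken action costs at most a small amount \delta, and online training
% keeps the time-averaged probability of a mistake below \epsilon over a
% deployment of T steps.
\[
  \text{Expected total harm}
  \;\le\; \sum_{t=1}^{T} \delta \cdot \Pr[\text{mistake at step } t]
  \;\le\; \delta \, \epsilon \, T .
\]
% So in low-stakes settings it suffices to drive the *average* error rate
% \epsilon down; occasional single errors are tolerable because each one is
% individually cheap.
```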
Want to build an aligned ASI? Our new paper explains how to do that, using debate.
Tl;dr:
Debate + exploration guarantees + no obfuscated arguments + good human input = outer alignment
Outer alignment + online training = inner alignment*
* sufficient for low-stakes contexts
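For anyone new to debate, here's a minimal sketch of the kind of setup the tl;dr is pointing at (function names and interfaces here are illustrative, not from the paper): two AIs argue for opposing answers over a few rounds, and a fallible human judges the transcript rather than verifying the answers directly. The exploration-guarantee and obfuscated-argument conditions are about whether honest argument actually wins this game at equilibrium.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Turn:
    speaker: str
    argument: str

def run_debate(
    question: str,
    answer_a: str,
    answer_b: str,
    debater: Callable[[str, str, list[Turn]], str],
    judge: Callable[[str, str, str, list[Turn]], str],
    rounds: int = 3,
) -> str:
    """Return the answer the judge endorses after seeing the full transcript.

    Illustrative only: a real debate protocol specifies how debaters are
    trained against this game and how the judge's input is structured.
    """
    transcript: list[Turn] = []
    for _ in range(rounds):
        # Each debater sees the question, the answer it defends, and the transcript so far.
        transcript.append(Turn("A", debater(question, answer_a, transcript)))
        transcript.append(Turn("B", debater(question, answer_b, transcript)))
    # The judge never verifies the answers directly -- only the arguments made.
    return judge(question, answer_a, answer_b, transcript)
```

The appeal is that the judge only ever has to evaluate arguments a step at a time, even when the underlying question is well beyond what they could check unaided.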
On top of the AISI-wide research agenda yesterday, we have more on the research agenda for the AISI Alignment Team specifically. See Benjamin's thread and full post for details; here I'll focus on why we should not give up on directly solving alignment, even though it is hard. 🧵
This is just the start. We'll be following this up shortly with:
– A safety case sketch for debate, giving a whole host more detail on the open problems.
– A series of posts (something like 1 a week) diving into various problems we'd like to see solved.
5/5
We've included a long list of open problems we'd like people to solve – and a reminder that you can express interest in collaborating, and apply to our challenge fund for grant funding!
bsky.app/profile/benj...
4/5
The post sets out:
– Why we're excited about safety cases
– Why we focus (initially) on honesty
– What we mean when we talk about 'asymptotic guarantees'
3/5
Link to the detailed alignment team agenda: alignmentforum.org/posts/tbnw7L...
Link to AISI's research agenda: aisi.gov.uk/research-age...
2/5
The Alignment Team at UK AISI now has a research agenda.
Our goal: solve the alignment problem.
How: develop concrete, parallelisable open problems.
Our initial focus is on asymptotic honesty guarantees (more details in the post).
1/5
Identifying people to work with is the biggest bottleneck for the UK AISI alignment team right now. Help out by filling in or sharing the form below:
forms.office.com/e/BFbeUeWYQ9
We’re particularly excited to hear from:
– ML researchers
– Complexity theorists
– Game theorists
– Cognitive scientists
– People who could build datasets
– People who could run human studies
– Anyone else who thinks they might be doing, or could be doing, relevant work
We’re trying to massively scale up the total global effort going into security-relevant alignment research, to prevent superhuman AI from posing critical risk.
We do this by:
1. Identifying key alignment subproblems
2. Identifying people who can solve them
3. Funding research
[Image: QR code linking to the form]
Interested in getting UK AISI support to do alignment research?
Fill in our short form (under 5 minutes), and we'll get back to you on proposals within a week.
(Caveat: While we may reach out about AISI funding your project, filling out this form is not an application for funding.)