At a very high level, building performant AI systems requires two ingredients:
Capability: the AI system can do the intended task.
Alignment: the AI system actually does the intended task, as well as it can.
Thus if the system doesn’t do the intended task, this is always due to a capability problem, an alignment problem, or both.
Usually we talk about alignment with human intentions. In this case the intended task is whatever the human wanted the system to do.
Examples
We’re all familiar with incapable systems; in fact, most problems deep learning has had so far have been capability problems, because the technology was still fairly immature.
We’re all also familiar with alignment problems, even though we don’t always call them that. Misaligned systems are ones that “don’t play on your team.” They might be playing against you, but more often they are neither with you nor against you; they are just playing a different game. For example:
A company sends you promotional emails that you never signed up for and don’t want
Somebody cuts in line in front of you
Your computer restarts without saving a file that you wanted to save
You have to watch an ad before you can see the music video you wanted to watch
…
In each of these cases, we can be pretty confident that the problem wasn’t a capability problem: clearly the humans or systems in question are capable of doing what you wanted them to do; they just decided not to.
Disentangling alignment and capability problems
In practice it can be very hard to disentangle alignment problems from capability problems:[1] if a system doesn’t perform the task, we need to prove that it could do the task to show that it’s misaligned.[2] However, in general it’s very hard to prove that a system could do something if it doesn’t do it.
For example, if a customer service representative doesn’t answer my question, is this because they don’t know the answer (capability problem) or because they have been instructed by their employer not to tell me the answer (alignment problem)? Without snooping around the customer service center, it can be really hard for me to tell.
Today’s alignment problems
Today the most obvious misalignment problems in AI are exhibited by large language models: there are a lot of ways in which today’s large language models don’t act in accordance with our intentions. We can separate these into explicit and implicit intentions: explicit intentions could be specified by natural language instructions (“write a summary for this text” / “list some ideas on X”), while implicit intentions are usually numerous and not stated explicitly: don’t use toxic language, don’t give harmful advice, don’t make stuff up, etc.
There is a lot of progress that we can make on these problems today by finetuning models with specially curated datasets; see, for example, our work on InstructGPT. Ultimately this will be an excellent testing ground for alignment research: can we train these models so that they never do anything obviously bad? If we can’t get even today’s models to be very aligned, this would mean our alignment methods are pretty fundamentally flawed.
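To make the finetuning step concrete, here is a minimal sketch of what supervised finetuning on a curated instruction dataset can look like. It is an illustration under assumed choices rather than the actual InstructGPT recipe: the base model ("gpt2"), the curated_data examples, and the hyperparameters are all placeholders.

```python
# Minimal sketch: supervised finetuning on curated (instruction, demonstration)
# pairs, assuming a small pretrained causal language model. Everything here is
# a placeholder: "gpt2" stands in for whatever base model you'd actually use,
# and curated_data for a real dataset of human-written demonstrations.
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = AdamW(model.parameters(), lr=1e-5)

# Hypothetical curated examples: each pairs an instruction with a human-written
# demonstration of the intended behavior.
curated_data = [
    ("Write a summary for this text: ...", "The text argues that ..."),
    ("List some ideas on X.", "1. ... 2. ... 3. ..."),
]

model.train()
for instruction, demonstration in curated_data:
    batch = tokenizer(instruction + "\n" + demonstration,
                      return_tensors="pt", truncation=True, max_length=512)
    # Standard causal language-modeling loss on the concatenated text.
    outputs = model(**batch, labels=batch["input_ids"])
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

In a setup like this, much of the alignment-relevant work is in curating the demonstrations themselves, since they are what encodes the explicit and implicit intentions described above.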
The hard problem of alignment
However, today’s problems are pretty different from the problems we ultimately have to face when we have AI systems that are smarter than us. This “hard problem of alignment” is the version of the problem I’m most interested in:
How do we align systems on tasks that are difficult for humans to evaluate?
As AI progress continues, we’ll get smarter and smarter models that can be applied to harder and harder tasks. However, AI progress doesn’t change the range of tasks that humans understand. As tasks get harder, it also becomes harder for humans to evaluate whether a given behavior captures their intent.
For tasks that are difficult to evaluate, many straightforward solutions such as RL from human feedback don’t apply: humans can’t reliably check everything the system does, and the system might try to fool us in ways that are hard for us to detect.
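To spell out where the human bottleneck sits, here is a minimal sketch of the reward-modeling step in RL from human feedback, under assumed choices: the "gpt2" backbone, the single hand-written comparison, and the pairwise ranking loss are illustrative placeholders rather than any particular production setup. The point is that the only training signal is a human’s judgment of which response is better, so if the task is too hard for humans to evaluate, that signal becomes unavailable or unreliable.

```python
# Illustrative reward-model training step for RL from human feedback.
# The supervision comes entirely from a human comparison of two responses;
# the model name, data, and hyperparameters are placeholders.
import torch
import torch.nn.functional as F
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
# A single scalar output serves as the reward score.
reward_model = AutoModelForSequenceClassification.from_pretrained("gpt2", num_labels=1)
reward_model.config.pad_token_id = tokenizer.eos_token_id
optimizer = torch.optim.AdamW(reward_model.parameters(), lr=1e-5)

# Hypothetical human comparison: a labeler read both responses and preferred
# the first. This judgment is the entire supervision signal.
prompt = "Explain why the sky is blue."
preferred = prompt + " Sunlight scatters off air molecules ..."
rejected = prompt + " The sky reflects the color of the ocean ..."

def score(text: str) -> torch.Tensor:
    batch = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    return reward_model(**batch).logits.squeeze()

reward_model.train()
# Pairwise ranking loss: push the preferred response's score above the rejected one's.
loss = -F.logsigmoid(score(preferred) - score(rejected))
loss.backward()
optimizer.step()
```

A policy would then be optimized against this reward model; the sketch stops at the reward model because that is exactly where human evaluation enters the loop.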
Moreover, the hard problem of alignment is the version of the problem for which the stakes are highest. Once we know how to build AI systems that can do hard tasks better than us, there’ll be a lot of economic pressure to put them in charge of all kinds of economically valuable tasks. But if they are misaligned, they won’t actually perform those tasks as we intend, and thus we face unintended consequences.
Thanks to John Schulman, Steven Bills, and Dan Mossing for feedback on this post. Some of the content was inspired by conversations with Richard Ngo and Allan Dafoe.
[1] This is why I’ve lumped alignment and capabilities together in the past.
[2] See for example the analysis done in the Codex paper.
It appears to me that your definition of alignment is slightly askew. In all your examples the real issue is who the system is aligned TO.
It appears to me that the systems in question are performing in perfect alignment with a party OTHER than your observer, which implies that alignment, in the sense you seem to mean, is relative.
The first question to ask then, IMO, is whose desires and intentions the system was really designed to serve.
I do understand the intention and the main point of the post, but this part is contradictory:
“[… examples …]
In each of these cases, we can be pretty confident that the problem wasn’t a capability problem: clearly the humans or systems in question are capable of doing what you wanted them to do; they just decided not to.”
What does it mean that the systems are capable of doing the task correctly and just decide not to? How can we evaluate this and be sure it has nothing to do with the system’s capability? The narrative can lead some non-researchers to overestimate the self-awareness capabilities of, for example, a system of equations.