What is the alignment problem?
My attempt at clarifying a confusing topic
At a very high level, building performant AI systems requires two ingredients:
Capability: The AI system could do the intended task.
Alignment: The AI system does the intended task as well as it could.
Thus if the system doesn’t do the intended task, then this is always due to a capability problem, an alignment problem, or both.
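One way to make this decomposition concrete is the minimal sketch below. It is only an illustration under assumptions that are mine, not part of the definitions above: task performance is summarized by a single score, "doing the intended task" means reaching a required score, and the system's actual score never exceeds the best score it could achieve. The names TaskOutcome, diagnose, and task_failed are hypothetical.

```python
from dataclasses import dataclass
from typing import Set


@dataclass(frozen=True)
class TaskOutcome:
    """Hypothetical scalar summary of one system on one intended task."""
    required: float         # score needed to count as doing the intended task
    best_achievable: float  # best score the system could reach (assumed known here)
    actual: float           # score the system actually reached


def diagnose(outcome: TaskOutcome) -> Set[str]:
    """Return the kinds of problem present; an empty set means none."""
    problems = set()
    if outcome.best_achievable < outcome.required:
        problems.add("capability")  # the system could not do the task
    if outcome.actual < outcome.best_achievable:
        problems.add("alignment")   # the system does worse than it could
    return problems


def task_failed(outcome: TaskOutcome) -> bool:
    """The system did not do the intended task."""
    return outcome.actual < outcome.required


# Illustrative, made-up numbers: every failure comes with at least one diagnosis.
examples = [
    TaskOutcome(required=0.8, best_achievable=0.9, actual=0.5),
    TaskOutcome(required=0.8, best_achievable=0.6, actual=0.6),
    TaskOutcome(required=0.8, best_achievable=0.6, actual=0.3),
    TaskOutcome(required=0.8, best_achievable=0.9, actual=0.9),
]
for outcome in examples:
    if task_failed(outcome):
        assert diagnose(outcome), "a failure without a diagnosis should be impossible"
```

Under these assumptions the claim follows directly: if the actual score is below the required score and there is no capability problem, then the actual score is also below the best achievable score, which is an alignment problem. In practice the best achievable score is the quantity we usually cannot observe.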
Usually we talk about alignment with human intentions. In this case the intended task is whatever the human wanted the system to do.
We’re all familiar with incapable systems; in fact, most problems that deep learning has had so far have been capability problems, because the technology was still pretty immature.
We’re all also familiar with alignment problems, even though we don’t always call them that. Misaligned systems are ones that “don’t play on your team.” They might be playing against you, but most of the time they are neither with you nor against you; they are just playing a different game. For example:
A company sends you promotional emails that you never signed up for and don’t want
Somebody cuts in line in front of you
Your computer restarts without saving a file that you wanted to save
You have to watch an ad before you can see the music video you wanted to watch
In each of these cases, we can be pretty confident that the problem wasn’t a capability problem: clearly the humans or systems in question are capable of doing what you wanted them to do; they just decided not to.
Disentangling alignment and capability problems
In practice, it can be very hard to disentangle alignment problems from capability problems: if a system doesn’t perform the task, we need to prove that it could do the task in order to show that it’s misaligned. However, in general it’s very hard to prove that a system could do something if it doesn’t do it.
For example, if a customer service representative doesn’t answer my question, is this because they don’t know the answer (capability problem) or because they have been instructed by their employer not to tell me the answer (alignment problem)? Without snooping around the customer service center, it can be really hard for me to tell.
Today’s alignment problems
Today the most obvious misalignment problems in AI are exhibited by large language models: there are a lot of ways in which today’s large language models don’t act in accordance with our intentions. We can separate these into explicit and implicit intentions: explicit intentions could be specified by natural language instructions (“write a summary for this text” / “list some ideas on X”), while implicit intentions are usually numerous and not stated explicitly: don’t use toxic language, don’t give harmful advice, don’t make stuff up, etc.
There is a lot of progress that we can make on these problems today by finetuning models with specially curated datasets; see, for example, our work on InstructGPT. Ultimately this will be an excellent testing ground for alignment research: can we train these models so that they never do anything obviously bad? If we can’t get even today’s models to be very aligned, this would mean our alignment methods are pretty fundamentally flawed.
The hard problem of alignment
However, today’s problems are pretty different from the problems we ultimately have to face when we have AI systems that are smarter than us. This “hard problem of alignment” is the version of the problem I’m most interested in:
How do we align systems on tasks that are difficult for humans to evaluate?
As AI progress continues, we’ll get smarter and smarter models that can be applied to harder and harder tasks. However, AI progress doesn’t change the range of tasks that humans understand. As tasks get harder, it also becomes harder for humans to evaluate whether a given behavior captures their intent.
For tasks that are difficult to evaluate, many straightforward solutions such as RL from human feedback don’t apply: Humans can’t check everything the system does because the system might try to fool us in ways that are hard for us to detect.
Moreover, the hard problem of alignment is the version of the problem for which the stakes are highest. Once we know how to build AI systems that can do hard tasks better than us, there’ll be a lot of economic pressure to put them in charge of all kinds of economically valuable tasks. But if they are misaligned, they won’t actually perform those tasks as we intend, and thus we face unintended consequences.
Thanks to John Schulman, Steven Bills, and Dan Mossing for feedback on this post. Some of the content was inspired by conversations with Richard Ngo and Allan Dafoe.