What could a solution to the alignment problem look like?
A high-level view on the elusive once-and-for-all solution
My currently favored approach to alignment research is to build a system that does alignment research better than us. But what would that system actually do?
The obvious answer is “whatever we’re doing right now.” This is unsatisfactory because we’re not actually trying to solve the whole alignment problem; we’re just trying to build a better alignment researcher. At some point we need to switch focus to the grand goal of aligning all future AI systems.
There are two general paths for ensuring that all future AI systems are aligned:
(A) Alignment stays perpetually ahead of AI capabilities: Alignment research progresses fast enough to ensure that the most capable AI systems are always sufficiently aligned and never overpower us. To succeed on this path, we need to either slow down capabilities research enough for alignment research to keep up (which I expect is prohibitively difficult) or spend enough compute on automated alignment research to derive techniques that are sufficient for the next generation of AI systems.
(B) We find a once-and-for-all solution: This is a comprehensive solution to the alignment problem that scales indefinitely. Once we have this solution, “all we need to do” is ensure that it gets implemented everywhere.
By default we’ll keep pushing on A until we discover B. But we currently don’t know if B (or even A) is possible. Nevertheless, I want to try to give a high-level sketch of what B could look like. It has four parts:
A formal theory for alignment
An adequate process to elicit values
Techniques to train AI systems such that they are fully aligned
Formal verification tools for cutting-edge AI systems
What follows are largely questions and high-level desiderata rather than answers and solutions.
1. A formal theory for alignment
We develop a formal theory for alignment that captures what it means for a system to be aligned with a principal (the human user). This formal theory needs to be grounded in mathematics and allow us to make precise statements about any system that are either true or false. It leaves no room for vagueness or ambiguity, and its statements can be automatically checked by a theorem prover.
We don’t have anything like this right now, and I’m not sure how to approach it. Some loose desiderata on this formal theory:
It needs to give a precise definition of the alignment problem that researchers generally agree with.
It needs to capture the key difficulties of the alignment problem, i.e. how to handle tasks that the principal can’t understand.
It needs to be able to deal with inconsistencies and biases that occur when humans express their preferences.
It needs to be extendable to multiple principals and multiple agents.
It needs to either answer or circumvent the question of which parts of a complex system constitute an agent.
It needs to capture the robustness of AI systems and deal with probabilistic input distributions.
The closest existing work is probably cooperative inverse reinforcement learning, but unfortunately that work doesn't satisfy most of the desiderata above.
2. An adequate process to elicit values
The question we always come back to when training AI systems on human preferences is “whose preferences?” Right now we use roughly the following process: we hire a bunch of people on the internet and ask them to rank our models’ responses. For sensitive topics (e.g. toxic responses) we use demographic information provided by our labelers to reweigh the labels.
Clearly this is very unsatisfactory, and just slightly better than the laziest thing we could do. What would an actually acceptable process look like? Some desiderata:
Inclusivity: The process needs to be inclusive to humanity as a whole. Humanity is very diverse, and different groups need to be able to provide meaningful input into the process. It has to work across cultures, languages, income levels, ages, etc. It can’t disregard minority views that are very important to that minority.
Fairness: The process needs to be fair; it can’t favor elites or individuals over the rest of humanity.
Representation: The process needs to aggregate values in a way that gives every human equal power to shape the outcome, and decide how to trade off conflicting values with each other.
Incentive-alignment: The process needs to be external to any tech company. Whenever a company is in charge of this process, there is always a risk that the company’s incentives might interfere with the process. The same holds if the process is housed in any single country.
Legitimacy: The process needs to operate within existing rules and institutions and not circumvent them.
Adaptability: Human values change over time. Locking in humanity’s values of the early 21st century and preventing moral progress would likely be catastrophic, just as we now find despicable some values and norms that were widespread centuries ago (e.g. slavery).
Transparency: Anyone should be able to look at the process and see how it works.
Simplicity: The process should be simple enough that most humans can understand it well.
Practicality: The process needs to be practical enough that it doesn’t take decades to implement in case AI progresses fast.
It might be impossible to fully satisfy all of these desiderata in theory, akin to Arrow’s impossibility result in social choice theory. However, this doesn’t mean it can’t work in practice: voting is still meaningful despite Arrow’s impossibility result.
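To make the flavor of such impossibility results concrete, here is a minimal sketch of the classic Condorcet cycle, one of the simplest ways preference aggregation can fail. It is an illustration only and not a proof of Arrow’s theorem; the three ballots and all function names are hypothetical and chosen for this example.

```python
from itertools import combinations

# Hypothetical ballots: each voter ranks three options, most preferred first.
ballots = [
    ["A", "B", "C"],
    ["B", "C", "A"],
    ["C", "A", "B"],
]


def majority_prefers(ballots, x, y):
    """Return True if a strict majority of ballots ranks x above y."""
    votes_for_x = sum(1 for b in ballots if b.index(x) < b.index(y))
    return votes_for_x > len(ballots) / 2


def pairwise_winners(ballots):
    """Map each pair of options to its head-to-head majority winner (None on a tie)."""
    options = sorted(ballots[0])
    result = {}
    for x, y in combinations(options, 2):
        if majority_prefers(ballots, x, y):
            result[(x, y)] = x
        elif majority_prefers(ballots, y, x):
            result[(x, y)] = y
        else:
            result[(x, y)] = None
    return result


def condorcet_winner(ballots):
    """Return the option that beats every other option head-to-head, or None."""
    options = sorted(ballots[0])
    for x in options:
        if all(majority_prefers(ballots, x, y) for y in options if y != x):
            return x
    return None


winners = pairwise_winners(ballots)
print(winners)                    # A beats B, B beats C, and C beats A
print(condorcet_winner(ballots))  # None: the majority preference is cyclic
```

Every individual ranking here is perfectly consistent, yet the majority preference forms a cycle, so no option can claim to be the collective favorite. Real elections still produce usable outcomes because profiles like this one are handled by a tie-breaking convention rather than treated as fatal.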
One possible path to achieve the outcome of an idealized process with significantly less effort than actually running it is to build a sufficiently capable and aligned AI system and have it figure out what the outcome would be. However, I expect that most people would not regard this substitute process as legitimate.
Thus talking to humans from every subgroup of humanity will be a critical component of such a process. For example, we could make a chatbot that talks to people in their native language about their values and then writes them down. In theory the internet provides the infrastructure to do this, but in practice large parts of humanity are cut off from the internet.
3. Techniques to train AI systems such that they are fully aligned
This is the main piece we’re working on today, except that our standards are much lower: we’re only trying to build a system that is sufficiently aligned such that we can use it to do more alignment research without causing harm or grabbing power. We don't even know what exactly it means for a system to be fully aligned.
Right now we’re approaching this part iteratively and based on a few conceptual motivations (e.g. “evaluation is easier than generation”) rather than any formal theory. Quite unsatisfactory, but we’re still making real progress.
How to do this long-term will hopefully be informed by our solution to part 1: once we have a formal notion of what it means to solve the alignment problem, in theory we could automatically search the space of algorithms for one that makes progress according to this definition. Moreover, with our automated alignment researcher we don’t need to restrict the search space to alignment techniques humans could devise.
4. Formal verification tools for cutting-edge AI systems
Given the system we trained according to part 3 and a set of values elicited according to part 2, we can use the theory from part 1 to express the formal theorem “this system is fully aligned” in mathematics. Now “all we need to do” is prove this theorem. This is incredibly difficult for a number of reasons:
The theorem is likely incredibly large. If we want to prove something about a GPT-3-sized model with 175 billion parameters, our theorem’s size is going to be at least 175 GB. The input and output spaces are incredibly large as well: ~10¹⁰⁰⁰⁰ possible inputs for GPT-3.
The specification of our system that we’re verifying is itself fuzzy (the values from part 2). Therefore we need to verify relative to a learned specification (another neural network?) which itself is faulty. How do we ensure this actually solves the problem or even makes progress?
Our inputs are distributional but verification needs to cover all the edge cases. Most of the input space is just random noise. How do we deal with that?
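The size estimates in the first point can be reproduced with a quick back-of-the-envelope calculation. The sketch below assumes at least one byte per parameter, a vocabulary of 50,257 tokens, and a context window of 2,048 tokens; none of these three figures is stated in the text, so treat them as assumptions for illustration.

```python
import math

# Figure from the text.
NUM_PARAMETERS = 175e9  # GPT-3-sized model

# Assumptions for this sketch (not stated in the text).
BYTES_PER_PARAMETER = 1  # lower bound; 16-bit weights would double the size
VOCAB_SIZE = 50257       # assumed number of distinct tokens
CONTEXT_LENGTH = 2048    # assumed maximum input length in tokens


def model_size_gb(num_parameters, bytes_per_parameter):
    """Size in GB needed just to write down the weights."""
    return num_parameters * bytes_per_parameter / 1e9


def log10_num_inputs(vocab_size, context_length):
    """log10 of the number of distinct full-length token sequences."""
    return context_length * math.log10(vocab_size)


size_gb = model_size_gb(NUM_PARAMETERS, BYTES_PER_PARAMETER)
exponent = log10_num_inputs(VOCAB_SIZE, CONTEXT_LENGTH)

print(f"weights alone: at least {size_gb:.0f} GB")
print(f"possible inputs: about 10^{exponent:.0f}")
```

Under these assumptions the exponent comes out a little below 10,000, consistent with the rough figure above. Counting shorter sequences as well barely changes the total, because the full-length sequences dominate the sum.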
Today we don’t know at all how to do formal verification at this scale: the state-of-the-art methods verify local adversarial robustness (imperceptible perturbations) of MNIST and CIFAR image classifiers, which are tiny networks compared to the largest language models. There has been good progress on scalable verification in recent years, but we’re still very far away from anything practical for the largest neural networks we have today.
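For readers unfamiliar with what “verifying local adversarial robustness” means, here is a minimal sketch using interval bound propagation, one simple technique from that literature and not necessarily what the state-of-the-art methods use. The two-layer network, its weights, and all function names are hypothetical and chosen only to keep the example small.

```python
def affine_bounds(lower, upper, weights, bias):
    """Propagate an axis-aligned box through y = W x + b."""
    out_lower, out_upper = [], []
    for row, b in zip(weights, bias):
        lo = hi = b
        for w, l, u in zip(row, lower, upper):
            if w >= 0:
                lo += w * l
                hi += w * u
            else:
                lo += w * u
                hi += w * l
        out_lower.append(lo)
        out_upper.append(hi)
    return out_lower, out_upper


def relu_bounds(lower, upper):
    """ReLU is monotone, so it can be applied to both bounds directly."""
    return [max(0.0, l) for l in lower], [max(0.0, u) for u in upper]


def certify(x0, eps, W1, b1, W2, b2, true_class):
    """Return True if every input within L-infinity distance eps of x0
    is provably assigned true_class. False only means "not certified"."""
    lower = [x - eps for x in x0]
    upper = [x + eps for x in x0]
    lower, upper = relu_bounds(*affine_bounds(lower, upper, W1, b1))
    for other in range(len(W2)):
        if other == true_class:
            continue
        # Lower-bound the margin logit[true_class] - logit[other].
        diff_w = [a - b for a, b in zip(W2[true_class], W2[other])]
        diff_b = b2[true_class] - b2[other]
        margin_lower, _ = affine_bounds(lower, upper, [diff_w], [diff_b])
        if margin_lower[0] <= 0:
            return False
    return True


# Hypothetical toy network: 2 inputs -> 2 ReLU units -> 2 logits.
W1 = [[1.0, -1.0], [0.5, 1.0]]
b1 = [0.0, 0.0]
W2 = [[1.0, 0.0], [0.0, 1.0]]
b2 = [0.0, 0.0]
x0 = [2.0, 0.0]  # classified as class 0 by this network

small_ball = certify(x0, 0.1, W1, b1, W2, b2, true_class=0)
large_ball = certify(x0, 0.5, W1, b1, W2, b2, true_class=0)
print(small_ball, large_ball)
```

The check is sound but incomplete: a True result is a proof for the whole perturbation ball, while a False result may just mean the interval bounds were too loose. Even this simple property concerns a small neighborhood of a single input, which is a long way from a statement about a model’s behavior on all inputs.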
In practice this formal verification might end up looking more like interpretability: the way we actually prove the theorem is to gain a full understanding of every neuron in the model and then use this knowledge to write a much more compact proof.
The parts listed here are very high-level, and it’s currently unclear how to actually make progress on them. The hardest part is probably either part 1 or part 4. Part 4 is definitely very hard, but my uncertainty over the difficulty of part 1 spans many more orders of magnitude. My understanding is that people who claim that we’re not making meaningful progress on the alignment problem mostly point to a lack of progress on part 1.
A lot of the work on parts 1, 2, 4 and ultimately also 3 will look very different from the work we do today, and I expect that it’s only feasible to do using significant automation. But if we succeed, we’ll truly have provably beneficial AI.
Thanks to Hendrik Kirchner, William Saunders, Jeff Wu, Leo Gao, and John Schulman for feedback and thanks to Andrew Trask for a discussion that prompted this post.
This desideratum was added later (2022-11-16) upon additional reflection.