What could a solution to the alignment problem look like?
A high-level view on the elusive once-and-for-all solution
My currently favored approach to alignment research is to build a system that does alignment research better than us. But what would that system actually do?
The obvious answer is “whatever we’re doing right now.” This is unsatisfactory because we’re not actually trying to solve the whole alignment problem–we’re just trying to build a better alignment researcher. At some point we need to switch focus to the grand goal of aligning all future AI systems.
There are two general paths for ensuring that all future AI systems are aligned:
(A) Alignment stays perpetually ahead of AI capabilities: Alignment research progresses fast enough to ensure that the most capable AI systems are always sufficiently aligned and never overpower us. To succeed on this path, we need either to slow down capabilities research enough for alignment research to keep up (which I expect is prohibitively difficult) or to spend enough compute on automated alignment research to derive techniques that are sufficient for the next generation of AI systems.
(B) We find a once-and-for-all solution: This is a comprehensive solution to the alignment problem that scales indefinitely. Once we have this solution, “all we need to do” is ensure that it gets implemented everywhere.
By default we’ll keep pushing on A until we discover B. But we currently don’t know if B (or even A) is possible. Nevertheless, I want to try to give a high-level sketch of what B could look like. It has 4 parts:
A formal theory for alignment
An adequate process to elicit values
Techniques to train AI systems such that they are fully aligned
Formal verification tools for cutting-edge AI systems
What follows are largely questions and high-level desiderata rather than answers and solutions.
1. A formal theory for alignment
We develop a formal theory for alignment that captures what it means for a system to be aligned with a principal (the human user). This formal theory needs to be grounded in mathematics and allow us to make precise statements about any system that are either true or false. It leaves no room for vagueness or ambiguity and can be automatically checked by a theorem prover.
We don’t have anything like this right now, and I’m not sure how to approach it. Some loose desiderata on this formal theory:
It needs to give a precise definition of the alignment problem that researchers generally agree with.
It needs to capture the key difficulties of the alignment problem, i.e. how to handle tasks that the principal can’t understand.
It needs to be able to deal with inconsistencies and biases that occur when humans express their preferences.
It needs to be extendable to multiple principals and multiple agents.
It needs to either answer or circumvent the question of which parts of a complex system constitute an agent.
It probably needs to be able to handle logical uncertainty, embedded agency, inner misalignment, and other weird problems.
It needs to capture the robustness of AI systems and deal with probabilistic input distributions.
The closest existing work is probably cooperative inverse reinforcement learning, but unfortunately that work doesn't satisfy most of the desiderata above.
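To make this concrete, here is the formal object at the heart of cooperative inverse reinforcement learning (Hadfield-Menell et al., 2016), purely as an illustration of the kind of mathematical statement such a theory would need to support; the ε and the notation for policies are mine, not a proposal:

```latex
% Illustrative only: the CIRL setup, as an example of a formal alignment object.
% A two-player game between a human H and a robot R with a shared reward:
M = \big\langle S,\ \{A^{H}, A^{R}\},\ T,\ \Theta,\ R(s, a^{H}, a^{R}; \theta),\ P_0(s_0, \theta),\ \gamma \big\rangle
% Both players maximize the same reward, but only H observes the parameter \theta;
% R has to infer it from H's behavior. One candidate formalization of
% "R is aligned with H" is that R's policy \pi^{R} comes within \epsilon of the
% best response to the human's policy \pi^{H}:
\mathbb{E}\Big[\textstyle\sum_{t} \gamma^{t} R(s_t, a^{H}_t, a^{R}_t; \theta) \,\Big|\, \pi^{H}, \pi^{R}\Big]
  \;\geq\; \max_{\tilde{\pi}^{R}} \mathbb{E}\Big[\textstyle\sum_{t} \gamma^{t} R(s_t, a^{H}_t, a^{R}_t; \theta) \,\Big|\, \pi^{H}, \tilde{\pi}^{R}\Big] - \epsilon
```

Even this clean formulation already bakes in assumptions (a single principal, a reward function that exists and is observed by the human, no inner misalignment) that the desiderata above would force us to relax.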
2. An adequate process to elicit values
The question we always come back to when training AI systems on human preferences is “whose preferences?” Right now we use roughly the following process: we hire a bunch of people on the internet and ask them to rank our models’ responses. For sensitive topics (e.g. toxic responses) we use demographic information provided by our labelers to reweight the labels.
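As a toy illustration of that reweighting step (the group names, target shares, and votes below are invented for illustration and are not our actual pipeline):

```python
# Hypothetical sketch: reweight preference labels by labeler demographics so that
# over-represented groups don't dominate the aggregate preference signal.
from collections import Counter

labels = [  # (labeler_demographic_group, index of the preferred response)
    ("group_a", 0), ("group_a", 0), ("group_a", 1),
    ("group_b", 1), ("group_c", 0),
]
target_share = {"group_a": 1 / 3, "group_b": 1 / 3, "group_c": 1 / 3}

counts = Counter(group for group, _ in labels)
votes = {0: 0.0, 1: 0.0}
for group, choice in labels:
    # Each group contributes its target share in total, split among its labelers.
    votes[choice] += target_share[group] / counts[group]

print(votes)  # weighted preference mass for each candidate response
```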
Clearly this is very unsatisfactory, and just slightly better than the laziest thing we could do. What would an actually acceptable process look like? Some desiderata:
Inclusivity: The process needs to be inclusive of humanity as a whole. Humanity is very diverse, and different groups need to be able to provide meaningful input into the process. It has to work across cultures, languages, income levels, ages, etc. It can’t disregard minority views that are very important to that minority.
Fairness: The process needs to be fair: it can’t favor elites or individuals over the rest of humanity.
Representation: The process needs to aggregate values in a way that gives every human equal power to shape the outcome and to decide how to trade off conflicting values against each other.
Incentive-alignment: The process needs to be external to any tech company. Whenever a company is in charge of this process, there is always a risk that the company’s incentives might interfere with the process. The same holds if the process is housed in any single country.
Legitimacy: The process needs to operate within existing rules and institutions and not circumvent them.
Adaptability: Human values change over time. Locking in humanity’s values of the early 21st century and preventing moral progress would likely be catastrophic, just as we now find some human values and norms that were widespread centuries ago despicable (e.g. slavery).
Transparency: Anyone should be able to look at the process and see how it works.1
Simplicity: The process should be simple enough that most humans can understand it well.
Practicality: The process needs to be practical enough that it doesn’t take decades to implement in case AI progresses fast.
Maybe a good test for the process is the veil of ignorance: what process could we all agree to if we didn’t know where and when on Earth we would be born?
It might be impossible to fully satisfy all of these desiderata in theory, akin to Arrow’s impossibility result for social choice theory. However, this doesn’t mean it can’t work in practice: voting is still meaningful despite Arrow’s impossibility result.
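To see why aggregation is genuinely hard, here is the classic Condorcet cycle, the smallest example where pairwise majority voting fails to produce a coherent group ranking (a toy example with three made-up voters, not a claim about any particular aggregation scheme):

```python
# Three voters with cyclic preferences: pairwise majority voting yields no winner.
from itertools import permutations

voters = [("A", "B", "C"), ("B", "C", "A"), ("C", "A", "B")]  # rankings, best first

def majority_prefers(x, y):
    """True if a majority of voters rank option x above option y."""
    wins = sum(ranking.index(x) < ranking.index(y) for ranking in voters)
    return wins > len(voters) / 2

for x, y in permutations("ABC", 2):
    if majority_prefers(x, y):
        print(f"majority prefers {x} over {y}")
# Prints A over B, B over C, and C over A: a cycle, so "the group's preference"
# is not well-defined even though every individual's preferences are consistent.
```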
One possible path to achieve the outcome of an idealized process with significantly less effort than actually running it is to build a sufficiently capable and aligned AI system and have it figure out what the outcome would be. However, I expect that most people would not regard this substitute process as legitimate.
Thus talking to humans from every subgroup of humanity will be a critical component of such a process. For example, we could make a chatbot that talks to people in their native language about their values and then writes them down. In theory the internet provides the infrastructure to do this, but in practice large parts of humanity are cut off from the internet.
3. Techniques to train AI systems such that they are fully aligned
This is the main piece we’re working on today. Except our standards are much lower: we’re only trying to build a system that is sufficiently aligned such that we can use it to do more alignment research without causing harm or grabbing power. We don't even know what exactly it means for a system to be fully aligned.
Right now we’re approaching this part iteratively and based on a few conceptual motivations (e.g. “evaluation is easier than generation”) rather than any formal theory. Quite unsatisfactory, but we’re still making real progress.
How to do this long-term will hopefully be informed by our solution to part 1: once we have a formal notion of what it means to solve the alignment problem, in theory we could automatically search the space of algorithms for one that makes progress according to this definition. Moreover, with our automated alignment researcher we don’t need to restrict the search space to alignment techniques humans could devise.
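As a cartoon of what such a search might look like (every function below is a placeholder for machinery we don’t have yet, and the loop structure is my own sketch rather than a proposal):

```python
# Cartoon: search the space of training techniques against a formal alignment
# criterion. All of these functions are stand-ins for capabilities we lack today.
def propose_training_procedure(seed):
    """Placeholder: an automated alignment researcher proposes a candidate technique."""
    raise NotImplementedError

def train(base_model, procedure):
    """Placeholder: apply the candidate training procedure to the base model."""
    raise NotImplementedError

def alignment_score(model, formal_spec):
    """Placeholder: degree to which the model satisfies the part-1 definition."""
    raise NotImplementedError

def search(base_model, formal_spec, budget):
    best, best_score = None, float("-inf")
    for seed in range(budget):
        procedure = propose_training_procedure(seed)
        candidate = train(base_model, procedure)
        score = alignment_score(candidate, formal_spec)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score
```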
4. Formal verification tools for cutting-edge AI systems
Given the system we trained according to part 3 and a set of values elicited according to part 2, we can use the theory from part 1 to express the formal theorem “this system is fully aligned” in mathematics. Now “all we need to do” is prove this theorem. This is incredibly difficult for a number of reasons:
The theorem is likely incredibly large. If we want to prove something about a GPT-3-sized 175 billion parameter model, our theorem’s size is going to be at least 175GB. The input and output space is incredibly large as well: ~10¹⁰⁰⁰⁰ possible inputs for GPT-3 (see the back-of-the-envelope calculation after this list).
The specification of our system that we’re verifying is itself fuzzy (the values from part 2). Therefore we need to verify relative to a learned specification (another neural network?) which itself is faulty. How do we ensure this actually solves the problem or even makes progress?
Our inputs are distributional but verification needs to cover all the edge cases. Most of the input space is just random noise. How do we deal with that?
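A quick back-of-the-envelope calculation behind the size estimates in the first item above, assuming one byte per parameter (a lower bound) and GPT-3’s 2048-token context with its ~50K-token BPE vocabulary:

```python
# Back-of-the-envelope: why the theorem statement and the input space are so large.
import math

params = 175e9            # GPT-3 parameter count
bytes_per_param = 1       # lower bound; 16-bit weights would double this
print(f"weights alone: >= {params * bytes_per_param / 1e9:.0f} GB")

vocab_size = 50257        # GPT-3's BPE vocabulary size
context_length = 2048     # GPT-3's maximum context length
log10_inputs = context_length * math.log10(vocab_size)
print(f"possible inputs: ~10^{log10_inputs:.0f}")  # roughly the ~10^10000 figure above
```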
Today we don’t know at all how to do formal verification at this scale: state-of-the-art methods verify local adversarial robustness (imperceptible perturbations) of MNIST and CIFAR image classifiers, which are comparatively tiny networks relative to the largest language models. There has been good progress on scalable verification in recent years, but we’re still very far away from anything practical for the largest neural networks we have today.
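For intuition, here is a minimal sketch of interval bound propagation, one of the simplest techniques used in that MNIST/CIFAR verification literature; the toy network, dimensions, and epsilon below are made up for illustration:

```python
# Minimal interval bound propagation (IBP) sketch: push an L-infinity ball through
# a small ReLU network and check whether the true class provably stays on top.
import numpy as np

def interval_linear(W, b, lower, upper):
    """Propagate the box [lower, upper] through x -> W @ x + b."""
    center = (upper + lower) / 2.0
    radius = (upper - lower) / 2.0
    new_center = W @ center + b
    new_radius = np.abs(W) @ radius  # worst case over the box
    return new_center - new_radius, new_center + new_radius

def interval_relu(lower, upper):
    """ReLU is monotone, so it maps interval endpoints to interval endpoints."""
    return np.maximum(lower, 0.0), np.maximum(upper, 0.0)

def certify(layers, x, epsilon, true_class):
    """Sound but incomplete check: every input within epsilon of x is classified as true_class."""
    lower, upper = x - epsilon, x + epsilon
    for i, (W, b) in enumerate(layers):
        lower, upper = interval_linear(W, b, lower, upper)
        if i < len(layers) - 1:
            lower, upper = interval_relu(lower, upper)
    others = [upper[c] for c in range(len(lower)) if c != true_class]
    return bool(lower[true_class] > max(others))

# Toy usage: a random 2-layer "classifier" on a 4-dimensional input.
rng = np.random.default_rng(0)
layers = [(rng.normal(size=(8, 4)), np.zeros(8)),
          (rng.normal(size=(3, 8)), np.zeros(3))]
x = rng.normal(size=4)

logits = x.copy()
for i, (W, b) in enumerate(layers):
    logits = W @ logits + b
    if i < len(layers) - 1:
        logits = np.maximum(logits, 0.0)

print(certify(layers, x, epsilon=0.01, true_class=int(np.argmax(logits))))
```

Even this simplest method illustrates the gap: the bounds it computes get looser with every layer, and it only certifies tiny perturbations of a single input, nothing like the distributional, value-laden specification we would actually need.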
In practice this formal verification might end up looking more like interpretability: the way we actually prove the theorem is to gain a full understanding of every neuron in the model and then use this knowledge to write a much more compact proof.
Outlook
The parts listed here are very high-level, and it’s currently unclear how to actually make progress on them. The hardest part is probably either part 1 or part 4. Part 4 is definitely very hard, but my uncertainty over the difficulty of part 1 spans many more orders of magnitude. My understanding is that most people who claim that we’re not making meaningful progress on the alignment problem mostly point to a lack of progress on part 1.
A lot of the work on parts 1, 2, 4 and ultimately also 3 will look very different from the work we do today, and I expect that it’s only feasible to do using significant automation. But if we succeed, we’ll truly have provably beneficial AI.
Thanks to Hendrik Kirchner, William Saunders, Jeff Wu, Leo Gao, and John Schulman for feedback and thanks to Andrew Trask for a discussion that prompted this post.
This desideratum was added later (2022-11-16) upon additional reflection.
Thanks Jan so much for the work you and your team are undertaking. Hopefully in a decade or two, AI alignment researchers like yourselves are going to be considered heroes like the astronauts were in the space race. Three questions for you:
1. What do you make of the following paper and the general argument that in the end, we cannot control/align an intelligence that is superior to humans: (https://journals.riverpublishers.com/index.php/JCSANDM/article/view/16219)?
2. There is a lot of interest by billionaire funders and the effective altruist movement to dramatically increase the funding and resourcing for AI safety/alignment. I've gathered that funding is no longer the rate limiter but AI alignment researchers are the bottleneck. Is that your view? What can be done to re-skill or re-orient PhDs and academics?
3. Related to #2, how much would we have to scale up the AI alignment research personnel so that you feel you can meet and handle the progress towards AGI? For example, would a 2x, 5x, or 10x scale up make you feel AI alignment is no longer the bottleneck?
Thank you!
Thank you for this informative and motivating post! There are a few points on which I would like to comment:
#2: “One possible path to achieve the outcome of an idealized process with significantly less effort than actually running it is to build a sufficiently capable and aligned AI system and have it figure out what the outcome would be. However, I expect that most people would not regard this substitute process as legitimate.”
In my opinion, what makes this approach dangerous is that the answer such an AI would give to the alignment problem influences how we treat *this very* AI (and all other AIs) going forward. As soon as the AI figures out that we will use its output in this way, its behavior becomes strategic, adding a strong incentive for breaking free from its alignment and pursuing its own objectives (maybe that’s simply an instrumental goal like survival to start with).
#2: I’m somewhat unsatisfied with the entire “emulating human values in AI models” approach. Apart from the difficulties you describe, I see the much more fundamental problem that human preferences might just not be very “good” compared to what’s possible. Two quite straightforward aspects are: (a) human preferences about specific situations might not perfectly capture abstract human values, due to various biases, and (b) human values might be systematically flawed, due to the fact that we’re, well, humans.
Therefore, I would extend your argument that “with our automated alignment researcher we don’t need to restrict the search space to alignment techniques humans could devise” to the search space of consistent moral value systems, such that we’re no longer restricted to what *we* can conceive (of course, this would instead require some higher level description of desiderata for such value systems).
#4: “If we want to prove something about a GPT-3-sized 175 billion parameter model, our theorem’s size is going to be at least 175GB.”
Is your assumption that 175B parameters are *necessary* to capture the capabilities of GPT-3? It seems non-trivial to me to show that the same capabilities cannot be obtained by a much smaller model for *some* combination of initial configuration and training data. If this were possible, we could potentially describe (and make provable claims about) such a system in a much more compact form.
I would be excited to hear your opinion!