A technique that has been successful at making models more aligned is reinforcement learning from human feedback (RLHF). Recently we used RLHF to align GPT-3 with human intent, such as following instructions. The gist of this method is pretty simple: we show a bunch of samples to a human, and the human says which one is closer to what they intended. Then we use reinforcement learning to optimize the model (in this case GPT-3) on this signal; essentially, this optimizes it for human preferences.
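The comparison signal is typically turned into a training objective for a reward model via a Bradley-Terry style loss over pairs. A minimal sketch in plain Python (here `reward_fn` is just a stand-in for the learned reward model; none of these names come from an actual API):

```python
import math

def preference_loss(reward_fn, preferred, rejected):
    """Negative log-likelihood that the human-preferred sample beats the
    rejected one under a Bradley-Terry model of human comparisons."""
    margin = reward_fn(preferred) - reward_fn(rejected)
    # sigmoid(margin) = P(preferred wins); minimizing this loss pushes
    # the reward model to score preferred samples above rejected ones.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

The reward model trained this way then provides the scalar signal that the RL step optimizes against.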
Can we use the same method to solve the hard problem of alignment, training AI systems on tasks that are difficult for humans to evaluate? Not directly.
Reinforcement learning from human feedback won’t scale
One approach to the hard problem of alignment could be to train a really robust reward model from human feedback and then leverage its ability to generalize to difficult tasks to supervise very capable agents.
The main motivation for this approach is the trend with larger models towards better generalization ability (see for example the scaling laws for transfer). Concretely, we’ve seen some impressive generalization in InstructGPT: the model follows instructions in languages other than English and for code, despite not being supervised on those tasks. However, this is probably not driven by reward model generalization alone; otherwise we wouldn’t also see this behavior in models trained only to clone human demonstrations (the “SFT models”). While we haven’t measured the difference between the generalization ability of SFT and RLHF, conceptually RLHF should generalize better, since evaluation is usually “easier” than generation for most tasks we care about.
On a high level, a lot of the most important properties we want out of alignment might be relatively easy for models to grok if they are already familiar with humans. So it’s plausible that a reward model fine-tuned from a large enough language model pretrained on the internet ends up generalizing “do what humans want” really well.
However, there is a big problem with this approach: for tasks that humans struggle to evaluate, we won’t know whether the reward model has actually generalized “correctly” (in a way that’s actually aligned with human intentions), since we don’t have an evaluation procedure to check. All we could do would be to argue by analogy: the reward model generalized well from easier to harder tasks in other cases, so perhaps it did here too.1
In the past, we’ve seen problems from over-optimizing against a fixed reward function: when trained long enough, the policy learns to exploit “loopholes” in the reward model. Maybe we can succeed in making a reward model robust enough to withstand a lot of optimization, but I doubt it. However, the important point is:
If we can’t evaluate what the AI system is doing, we don’t know if it's behaving according to our intentions.
So a big problem with relying on generalization of reward models alone is that we’d be fumbling in the dark about whether it actually worked, instead of relying on empirical data. Our bar for alignment needs to be higher than that.
Enter: AI-assisted human feedback
As AI progress continues, our models will be able to do harder and harder tasks. At some scale humans won’t be able to evaluate what our models are doing anymore, because the task is just too complex. For example, if the model writes a large codebase or produces an answer to a complicated scientific question, then it is very hard for humans to find all the flaws in that response.
So what if we found some way to evaluate the model’s behavior anyway? If we had such an evaluation procedure, we could also leverage it to train the system to be more aligned.
In general, if we have an AI system that is “smart enough” to perform such a hard task, then we should also be able to leverage this system’s capabilities to help humans make better sense of the task. This means we could train the same AI systems to assist humans with evaluation.
This is the basic idea behind recursive reward modeling (RRM). RRM is a natural extension of RLHF to harder tasks using a recursive procedure: for each task, either the task is simple enough for a human to evaluate directly (then we use RLHF), or we create new tasks whose goal is to help humans evaluate responses on the original task. These “evaluation assistance tasks” are simpler tasks in a more narrow domain (they can focus on just one aspect of the evaluation), thus they are often easier to evaluate. We solve these tasks recursively with RRM.
In particular, for lots of tasks, evaluation assistance should be an easier task in a more narrow domain, so we can reduce the problem of aligning our system on the harder task to the problem of aligning it on evaluation assistance.
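The recursion described above can be sketched as follows, with hypothetical `human` and `assistant` interfaces (these names and methods are illustrative assumptions, not an actual API):

```python
def rrm_evaluate(task, response, human, assistant, max_depth=3):
    """Recursive reward modeling sketch: if the human can judge the task
    directly, this reduces to plain RLHF; otherwise, spawn narrower
    evaluation-assistance tasks, handle those recursively, and let the
    human judge the original response with that assistance."""
    if max_depth == 0 or human.can_evaluate(task):
        return human.judge(task, response)  # base case: direct RLHF
    assistance = []
    for sub_task in assistant.propose_assistance_tasks(task, response):
        sub_response = assistant.solve(sub_task)
        # Each assistance task is evaluated (and hence aligned) the same way.
        rrm_evaluate(sub_task, sub_response, human, assistant, max_depth - 1)
        assistance.append(sub_response)
    return human.judge(task, response, assistance=assistance)
```

The key property is that every branch of the recursion either bottoms out in a task a human can judge directly, or produces assistance that makes the parent task judgeable.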
We have previously explored this idea in the context of book summarization: evaluating a summary of an entire book takes a long time for humans who are unfamiliar with the book, because they need to read it first. However, if we can trust the summaries of each book chapter, evaluating the book summary becomes much easier.
We should figure out how far we can push this idea. What is the largest set of tasks we can align our models on by (recursively) training evaluation assistants? I expect there are plenty of tasks that are really hard or impossible to solve with this technique (e.g. how do you break down the evaluation of a book on some hard philosophy question?), but I am optimistic that it’ll get us pretty far.
Generalization might help as well. Ideally we can leverage our model’s ability to generalize to make expensive evaluation a lot cheaper. In a sense the reward models we use for RLHF already do this: instead of providing a comparison for each episode during RL training, we only provide comparisons for a subset of them and let our reward models generalize to the rest of them.
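That subset-labeling pattern can be sketched like this; `human_label` and `fit_reward_model` are hypothetical stand-ins for the expensive human comparison step and the reward model training step (neither is a real API):

```python
import random

def score_episodes(episodes, human_label, fit_reward_model,
                   label_fraction=0.1, seed=0):
    """Label only a small fraction of episodes with expensive human
    judgments, fit a reward model on those labels, and let the model
    generalize its scores to all remaining episodes."""
    rng = random.Random(seed)
    n_labeled = max(1, int(label_fraction * len(episodes)))
    labeled = rng.sample(episodes, n_labeled)
    human_scores = {ep: human_label(ep) for ep in labeled}
    reward_model = fit_reward_model(human_scores)  # cheap to query, reusable
    # Human scores where we have them; model scores everywhere else.
    return {ep: human_scores.get(ep, reward_model(ep)) for ep in episodes}
```

The economics only work out if the reward model's scores on the unlabeled episodes are actually trustworthy, which is exactly the generalization question at issue.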
When training models to do evaluation assistance, we would hope to leverage a lot of generalization between the assistance tasks (for example, from one assistance task to another, or from the top-level task). But we need to be able to reach a ground-truth evaluation using human judgments. RRM is one way to do this: humans can evaluate each assistance task separately, then evaluate the higher-level tasks with assistance, and so on. Everything “grounds out” in human evaluation, but the work involved in actually performing this full evaluation might grow exponentially with the size of the tree, so we can’t do it very often.
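To make the cost concrete: if every task spawns a fixed number of assistance tasks, a full grounding pass costs a number of human judgments that grows geometrically with depth (a back-of-the-envelope sketch; the branching factor and depth are made-up parameters):

```python
def full_grounding_cost(branching, depth):
    """Human judgments needed to ground out one full RRM tree in which
    every task spawns `branching` assistance tasks, down to `depth` levels
    below the top-level task (level 0)."""
    return sum(branching ** level for level in range(depth + 1))
```

With 5 assistance tasks per task and 4 levels of recursion, that is `full_grounding_cost(5, 4) == 781` human judgments to fully ground a single top-level episode, which is why full grounding passes can only be run rarely.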
Cognitive labor and preference input
In the long run, we should automate the cognitive labor required for evaluation, such as reading and cross-checking a lot of material, reasoning and math, summarizing, and answering questions. With automated cognitive labor, human evaluators can focus entirely on the preference input: resolving ambiguities in statements about their intent, providing more detailed preferences, and providing directive guidance on the solution space.
For example, let’s say I want an AI system that develops a video game for me. I might have a lot of preferences about genre, graphics, multiplayer, etc. This is the preference input. But I don’t really want to have to look at the code at all and think about how many players the game can scale to, how difficult it would be to port to a different platform, etc. Answering these questions requires a lot of cognitive labor, but no knowledge about my preferences. The answers to these “cognitive labor” questions are still very relevant to me, but I don’t need to answer them myself as long as I can trust the answers, because the answers are independent of my preferences about video games.
If we succeed at automating (most of) the cognitive labor involved in human feedback, and we can make techniques like RRM work, then we should be able to align models on a pretty large range of tasks (everything that can be recursively evaluated with assistance) by “only”2 having humans state their preferences.
Thanks to Steven Bills and Katarina Slama for feedback on this post.
Other alignment work has attempted to formalize “generalize the way a human would,” but it’s not clear to me how that would solve the problem. It seems to me that human generalization performance, while much better than generalization by today's neural networks, is pretty poor overall (as evidenced by how poorly we reacted to a distributional change like COVID-19). I wouldn’t trust it to work on problems humans don’t understand.
There certainly are a lot of difficult questions related to what these preferences are/should be and how to elicit them.