Inner alignment refers to the alignment of optimizers that were learned by a model during training. These learned optimizers are distinct from the optimizer used to train the model; they are typically part of the model itself and might be difficult to locate within it. For example, if we specify an optimization task to a large language model in natural language and ask it to reason step by step, it will leverage a learned optimizer to make progress on this optimization problem.
Learned optimizers could emerge jointly with learned goals that are only pinned down by the training distribution, and these goals might generalize in unintended ways at test time, causing misalignment.
An almost formal definition
Let’s use the setup for deep meta-reinforcement learning (also known as RL²): In this setup we have two different levels of RL problems, the “outer” and the “inner” RL problem. The inner RL problems could be any set of tasks, each of which involves learning some new skills, for example navigating a new environment or playing a new game. The outer RL problem is to learn to do reinforcement learning on some distribution of inner RL problems–hence the name “meta-RL.”
To solve a meta-RL problem, we train an “outer policy” over a number of “outer episodes,” where each outer episode is a new inner RL problem (a new task). The outer policy interacts with the inner RL problem over a number of “inner episodes” while keeping its memory states across inner episode boundaries. Using the rewards from the inner RL problem (how well did it learn the new task?) we update the outer policy to be better at solving inner RL problems. Over time, it learns an RL algorithm for the inner RL problems.
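To make the setup concrete, here is a minimal sketch of this outer training loop in Python. The task, policy, and update interfaces (sample_task, reset_memory, act, rl_update) are placeholders assumed for illustration, not taken from any particular implementation.

```python
# Sketch of the meta-RL (RL^2) outer training loop. All interfaces are placeholders.
def train_meta_rl(outer_policy, sample_task, rl_update,
                  num_outer_episodes, inner_episodes_per_task):
    for _ in range(num_outer_episodes):
        task = sample_task()            # each outer episode is a fresh inner RL problem
        outer_policy.reset_memory()     # memory persists only within one outer episode
        trajectory = []
        for _ in range(inner_episodes_per_task):
            obs, done = task.reset(), False   # memory is kept across these inner episodes
            while not done:
                action = outer_policy.act(obs)
                obs, reward, done = task.step(action)   # inner rewards are observed at every step
                trajectory.append((obs, action, reward))
        # Update the outer policy on how well it learned this task; over many outer
        # episodes it effectively learns an RL algorithm for the task distribution.
        rl_update(outer_policy, trajectory)
    return outer_policy
```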
A diagram from the RL² paper illustrating the meta-RL setup. Each outer episode (“trial”) is a new inner RL problem (“MDP”). The outer policy (“agent”) interacts with each MDP over a number of episodes, and thus over time learns to adapt more quickly to a new MDP.
Inner alignment
To formalize the inner alignment problem, we extend the meta-RL setup to the case where we don’t have rewards at test time, only at training time. Let’s call this the “rewardless meta-RL” setup. This might sound far-fetched at first, but it isn’t; I’ll give some examples soon.
To solve rewardless meta-RL problems, we need to modify our training setup slightly because the inner RL problem now doesn’t provide rewards at every (inner) time step (since these rewards are unavailable at test time). Instead we only provide observational feedback during interaction with the inner RL problem. At the end of an outer episode, we get a training signal for the outer policy by calculating the sum of (discounted) rewards using our training-time reward function.
In this setting the outer policy needs to pick up on observational correlations with the reward function and learn to adjust its behavior between inner episodes accordingly. In other words, the outer policy will only do well if it learns a representation of the training-time reward function. Moreover, when the outer policy is a neural network, this representation will usually be “implicit” in the sense that it may not be easy for us to access or inspect.
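Compared to the sketch above, the training loop for the rewardless setup might look like this (again with placeholder interfaces): per-step rewards disappear, and the training-time reward function f only enters once the outer episode is over.

```python
# Sketch of the rewardless meta-RL training loop. The task now returns only
# observations; the training-time reward function f scores the outer episode afterwards.
def train_rewardless_meta_rl(outer_policy, sample_task, f, rl_update,
                             num_outer_episodes, inner_episodes_per_task):
    for _ in range(num_outer_episodes):
        task = sample_task()
        outer_policy.reset_memory()
        observations, actions = [], []
        for _ in range(inner_episodes_per_task):
            obs, done = task.reset(), False
            while not done:
                action = outer_policy.act(obs)   # must judge its progress from observations alone
                obs, done = task.step(action)    # note: no per-step reward is returned
                observations.append(obs)
                actions.append(action)
        # Only now do we apply the training-time reward function (discounting omitted).
        # At test time f is unavailable, so the policy has to internalize a proxy for it.
        episode_return = sum(f(o) for o in observations)
        rl_update(outer_policy, observations, actions, episode_return)
    return outer_policy
```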
Now we have all the ingredients to specify the inner misalignment problem:
The outer policy suffers from inner misalignment if its implicitly represented reward function doesn’t match the desired reward function on the inner RL problem at test time.
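As a rough symbolic restatement (assuming, as a simplification not made above, that the outer policy’s test-time behavior on an inner RL problem τ can be summarized by a single implicitly represented reward function r̂_τ), inner misalignment means:

$$\hat{r}_{\tau} \;\neq\; r^{*}_{\tau} \qquad \text{for some } \tau \text{ in the support of } \mathcal{D}_{\text{test}},$$

where r*_τ is the desired reward function on τ and D_test is the test-time distribution of inner RL problems.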
While explicitly sampling new inner RL problems may seem a bit contrived, many complex tasks actually look a lot like RL problems. (The RL framework is just so very general!) In particular, inner alignment problems will be most surprising in settings where we don’t really consider the task to be an RL problem. In those cases we aren’t really aware that we’re in a meta-RL setup and thus might not be thinking about reward functions the outer policy might be implicitly learning.
Examples
A toy example
Let’s start with a toy example, inspired by the meta-RL papers mentioned above (similar to the partial monitoring problem): We train a neural network to learn the outer policy by sampling inner RL problems from a distribution of “rewardless” multi-armed bandit problems. These bandits, instead of returning a reward, return an observation vector o ∈ Rⁿ such that the (unobserved) training-time reward is r = f(o) for some fixed function f. To do well, our outer policy needs to approximate f so it can solve new rewardless multi-armed bandits.
To simulate a distributional change at test time, we sample from a slightly different distribution of rewardless multi-armed bandits, which work identically except that the distribution of observation vectors o is different. Since the function f is approximated by a neural network, it might generalize poorly outside of the input domain used during training, and thus approximate f badly at test time. In other words, at test time our outer policy’s implicitly learned representation of the function f is misaligned with the true f. This implies that the outer policy won’t solve the problem well because it’s optimizing for a misaligned reward function at test time.
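Here is a self-contained toy version of this setup in Python. The linear reward function f and the clipped approximation f_hat below are illustrative stand-ins: f_hat plays the role of the outer policy’s implicitly learned reward model, accurate only on the training-time range of observations. Neither comes from an actual experiment.

```python
import numpy as np

rng = np.random.default_rng(0)
n, num_arms = 4, 5
w = rng.normal(size=n)

f = lambda o: float(w @ o)                          # true training-time reward r = f(o)
f_hat = lambda o: float(w @ np.clip(o, -2.0, 2.0))  # stand-in for a learned approximation that
                                                    # saturates outside the training input range

def sample_bandit(obs_scale):
    """Each arm yields a fixed observation vector o in R^n; the reward f(o) is never observed."""
    return rng.normal(scale=obs_scale, size=(num_arms, n))

train_bandit = sample_bandit(obs_scale=1.0)  # training distribution: f_hat tracks f closely here
test_bandit = sample_bandit(obs_scale=5.0)   # shifted test distribution: same mechanics, larger observations

for name, bandit in [("train", train_bandit), ("test", test_bandit)]:
    true_best = int(np.argmax([f(o) for o in bandit]))    # arm the desired reward prefers
    chosen = int(np.argmax([f_hat(o) for o in bandit]))   # arm a policy optimizing f_hat converges to
    print(f"{name}: best arm under f = {true_best}, arm picked under f_hat = {chosen}")
```

On the training distribution the two rankings usually agree; on the shifted distribution they can come apart, which is exactly the inner misalignment described above.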
A concrete example
The best empirical exhibit of inner alignment problems I’ve seen is from Koch and Langosco et al. (2021). They trained an outer policy with deep RL on a series of 2D maze games whose reward is associated with reaching a gem in the maze. At training time the gem is always yellow, but at test time they provide both a yellow star and a red gem. By forcing the outer policy to choose between them they test how it generalizes the goal of the task. Interestingly, the outer policy consistently chooses the yellow star, favoring generalization of color over shape. Thus the outer policy suffers from inner misalignment with respect to the goal of collecting gems.
In a way it’s unreasonable to expect the policy to know which way it is supposed to generalize, but that’s beside the point here. The problem is not that generalization is hard, but that generalization failures can lead to the policy competently optimizing for the wrong goals.
Inner misalignment in language models
Large language models famously exhibit in-context learning: they pick up novel patterns from the input text that haven’t been seen in the training set. This has led to the popularity of “few-shot prompting,” where users specify a new task to a language model by giving a list of examples of how the task should be performed.
We can see a few-shot prompt as an inner RL problem together with a few inner episodes. To do well on a few-shot prompt, it is useful for the language model to understand the goal of the task and then try hard to achieve it. For example, if the task can benefit from planning, the model should attempt to plan towards its understanding of the goal.
Suppose we use RL to fine-tune a language model to be better at following few-shot prompts. Now we’re in a rewardless meta-RL setup and thus we might see inner misalignment: the fine-tuned language model might misunderstand the goal of a few-shot prompt at test time and then plan for the wrong goal when writing its response.
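As a toy illustration of how a few-shot prompt can underdetermine the goal (the prompt and the two candidate rules below are made up for this example):

```python
# A few-shot prompt whose demonstrations are consistent with (at least) two different
# goals: "return the last word" and "return the longest word". The examples don't
# distinguish between them, so a model fine-tuned to follow few-shot prompts has to
# guess which goal is intended, and may then competently pursue the wrong one.
prompt = """\
Input: a cat sat on the doormat
Output: doormat

Input: we walked to the lighthouse
Output: lighthouse

Input: the programmer fixed a bug
Output:"""

# Under "last word" the intended continuation is "bug"; under "longest word" it is
# "programmer". Which one the model writes reveals which goal it inferred.
```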
Another example: Let’s say we train a language model to learn to play board games from their rules described in natural language. Each episode we draw a new board game, our inner RL problem, and let the language model play a fixed opponent for a few games. Over training, our language model learns an outer policy that can play previously unknown board games. To succeed, it needs to extract the goal of the game from the provided rule descriptions and plan its moves to achieve this goal.
However, the mechanism by which the outer policy extracts this goal won’t be easy for us to inspect. At test time, this policy might be playing together with human players, who make up new games they want to play. If they describe a new game in a way that is unfamiliar to our policy (for example using a different language), the policy might misunderstand the goal of the game. Thus even if it plans really well, it can still score poorly.
Auto-induced distributional shift
It’s important to note that a shift in the distribution of inner RL problems doesn’t need to come from an external source, but can also be caused by the outer policy itself. This is due to auto-induced distributional shift: any RL agent interacting with its environment is incentivized to change its own input state distribution (the distribution of states it encounters). Since reward is a function of the states the agent visits, in order to get more reward the agent has to increase the probability of visiting higher-rewarding states.
The classical example is a recommender system that increases engagement on a platform by changing the distribution of the platform’s users towards users who are naturally more engaged.
Auto-induced distributional shift can lead to inner alignment problems: the outer policy might end up directly causing the change in the distribution of the inner RL problems at test time by the way it responds to those inner RL problems, thus bringing about its own inner misalignment.
For example, our board-game-playing policy could change its user base by using excessively toxic language, such that the new user base tends to choose different kinds of board games to play. This different distribution of board games might have win conditions that the policy misunderstands. Thus the policy causes its own inner misalignment.
A path to address inner alignment
I think we can address the inner alignment problem described here using simple techniques. The core idea is that as long as we have a reward function we trust on the new distribution of inner RL problems, we can retrain our outer policy on this new distribution. In other words:
We can reduce inner alignment problems to problems we already need to solve to achieve “outer” alignment.
We need reliable ways to evaluate what our policy is doing, so we can provide a training signal to our outer policy at test time.
We need detection of distributional change, so we know whether we can trust our policy and reward function or need to adapt them.
In high-stakes environments we need safe exploration, so that the outer policy avoids unsafe states in the new (and unknown) distribution of inner RL problems before we’ve updated it.
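Putting these pieces together, a test-time loop that falls back on retraining could look roughly like the sketch below. Every argument is a placeholder for the corresponding outer-alignment component (trusted evaluation, shift detection, safe exploration) rather than an existing API.

```python
# Rough sketch of test-time monitoring and retraining; all components are placeholders.
def deploy_with_monitoring(outer_policy, reward_model, task_stream, shift_detector,
                           collect_trusted_feedback, retrain, run_inner_episodes):
    for task in task_stream:                          # inner RL problems arriving at test time
        if shift_detector.is_out_of_distribution(task):
            # Don't trust the policy or the reward model here: gather fresh trusted
            # evaluations on the new distribution and update both before relying on them.
            feedback = collect_trusted_feedback(task)
            reward_model = retrain(reward_model, feedback)
            outer_policy = retrain(outer_policy, reward_model)
        # Interact under safe-exploration constraints, so the policy avoids irreversible
        # mistakes while its (implicit or explicit) reward model may still be stale.
        run_inner_episodes(outer_policy, task, cautious=True)
    return outer_policy, reward_model
```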
These solutions need to be applied very carefully. For example, when using a reward model trained from human feedback, we need to update it quickly enough on the new distribution. In particular, auto-induced distributional shift might change the distribution faster than the reward model is being updated. Past work on RL from human feedback has shown that this kind of reward function update is crucial: if the underlying task distribution changes and we don’t update our reward model, the agent will overfit to the reward function, as has been illustrated in Atari games.
Where do we go from here?
For research on inner alignment I think the most important milestone would be to empirically exhibit failure modes that are surprising to machine learning practitioners. I don’t really think that the failure modes described in this post would be very surprising to people who have worked on meta-RL. Some of these concrete suggestions could be promising, but I don’t know if they have been carried forward.
Thanks to Joel Lehman, Katarina Slama, Evan Hubinger, Beth Barnes, Richard Ngo, and William Saunders for feedback on this post.