Distinguishing three alignment taxes
The impact of different alignment taxes depends on the context
In the general sense, an alignment tax is any additional cost incurred in the process of aligning an AI system. Let’s distinguish three different types of alignment taxes:
Performance taxes: Performance regressions caused by alignment, compared to an unaligned baseline.
Development taxes: Effort or expenses incurred for aligning the model: researcher time, compute costs, compensation for human feedback, etc.
Time-to-deployment taxes: Wall-clock time taken to produce a sufficiently aligned model from a pretrained model.[1]
Alignment taxes are undesirable because they hinder adoption of alignment techniques. In a highly competitive market, companies can’t afford to pay significant alignment taxes if there is no enforcement of universal alignment standards. However, even in the absence of any competition there are incentives against adopting highly taxed alignment techniques: less performant models are less valuable to customers, high development taxes discourage the investment, and each day of delay incurs a commercial opportunity cost if you have customers who’d be willing to pay to use the unaligned model. Therefore we’d like alignment techniques whose tax is as low as possible.
Let’s discuss each of these taxes in turn.
Three alignment taxes
Performance taxes
If our unaligned pretrained model has performance Z on capability X and our more aligned model has performance Z’ < Z on capability X, then we say there is a performance tax on capability X.
In the past this performance tax has been measured by how much the model’s score drops on standard benchmarks after fine-tuning. While training the first version of InstructGPT, OpenAI observed performance regressions on some standard question answering and translation benchmarks. These were mostly, but not entirely, mitigated by mixing pretraining data into the fine-tuning process. Anthropic, DeepMind, and Google have also studied alignment taxes as part of their alignment efforts, and sometimes alignment fine-tuning can even increase performance on several benchmarks, corresponding to a negative performance tax.
However, there is a more natural way to quantify this tax that translates it more directly into monetary terms: measure how much extra compute we need to spend at inference time to compensate for the performance regression. If our more aligned model needs to spend T% more inference-time compute to get from performance Z’ back to performance Z on capability X, then we say there is a T% alignment tax. For example, if we always need to run best-of-2, this corresponds to a 100% alignment tax. If we need to run best-of-4 for 10% of all tasks (and a single sample for the rest), this corresponds to a (4 − 1) × 10% = 30% alignment tax.
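To make this concrete, here is a minimal sketch (purely illustrative, not code from any production system) of how one could compute this inference-compute tax from a breakdown of how much best-of-n sampling different fractions of tasks require:

```python
def inference_tax(task_fractions: dict[int, float]) -> float:
    """Alignment tax as extra inference compute relative to the unaligned baseline.

    task_fractions maps a best-of-n sample count to the fraction of tasks
    that need it (fractions should sum to 1; n=1 means no extra sampling).
    """
    avg_compute = sum(n * frac for n, frac in task_fractions.items())
    return avg_compute - 1.0  # extra compute on top of one sample per task


# Always running best-of-2 doubles inference compute: a 100% tax.
print(inference_tax({2: 1.0}))          # 1.0

# Best-of-4 on 10% of tasks, a single sample on the rest: a 30% tax.
print(inference_tax({4: 0.1, 1: 0.9}))  # ~0.3
```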
Development taxes
Today’s development taxes include building an RLHF codebase, hiring and managing human labelers, compute, and researcher effort. My (pretty rough) guess is that the total development costs of InstructGPT sum up to about 5-20% of GPT-3’s development cost. However, most of this development cost is independent of the size of the model: improving the alignment of a 10x smaller or larger language model would likely have taken a similar amount of effort. If anything, it is probably the other way around: the higher development cost of larger language models justifies a larger effort on making them more aligned, such as having a larger team working on it.
We could also see the general effort of the alignment research community as part of the development tax on AGI. If there exists an indefinitely scalable solution to the alignment problem, the total cost of finding this solution would be a one-time development cost. However, this solution isn’t required to make today’s AI more aligned, and thus shouldn’t be budgeted to those efforts.
Time-to-deployment taxes
Similar considerations apply to time-to-deployment taxes as to performance taxes. Today, alignment training as done for InstructGPT, ChatGPT, Sparrow, and Anthropic’s assistant takes several sequential steps: collecting prompts, collecting demonstrations, supervised fine-tuning, collecting comparisons, training reward models, RL fine-tuning, and human evaluations. Each of these steps typically requires some iteration and debugging, which can easily add to the overall timeline. For GPT-3 this pipeline took us about 9 months, while today we have enough infrastructure to produce a pretty good model within 3 months, since we can reuse a lot of the existing data and code.
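As a rough illustration of why these steps add up, here is a toy sketch of the pipeline as a sequence of stages; the stage names follow the list above, but the durations are made-up placeholders rather than actual figures:

```python
from dataclasses import dataclass


@dataclass
class Stage:
    name: str
    weeks: float  # placeholder duration, including iteration and debugging


# The stages mirror the pipeline described above; the numbers are invented.
PIPELINE = [
    Stage("collect prompts", 2),
    Stage("collect demonstrations", 4),
    Stage("supervised fine-tuning", 2),
    Stage("collect comparisons", 4),
    Stage("train reward models", 2),
    Stage("RL fine-tuning", 3),
    Stage("human evaluations", 2),
]

# Because each stage depends on the previous one, their durations add up rather
# than overlap; reusing data and code shortens stages but not the dependency chain.
print(f"time-to-deployment: ~{sum(s.weeks for s in PIPELINE)} weeks")
```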
This calculus is flawed for an important reason: at some point more capable models can’t be aligned with the same techniques. Therefore simply optimizing our existing training loop won’t help reduce the time-to-deployment of future models. In particular, once models get capable enough to do hard tasks that humans struggle to evaluate, we’d want to use AI-assisted evaluation to train them. However, the infrastructure to do this well is still being developed.
When do these taxes matter?
Competitive markets demand low alignment taxes
Several companies are competing on large language models. On a level playing field, everyone would have roughly equally capable pretrained language models. If you train more aligned models, but they suffer a performance tax on capability X, then customers who care about capability X are incentivized to move to a competitor deploying similarly capable but less aligned models that do better on capability X because they aren’t subject to this tax. Therefore a performance tax can cause aligned models to lose market share and thus hinder the adoption of alignment techniques.
For example, OpenAI’s DALL·E 2 models launched with more conservative safety mitigations that made them harder to use for some legitimate use cases. This is a performance tax because the model effectively performed worse on some use cases than it could have without those mitigations. DALL·E 2’s competitors Stable Diffusion and Midjourney launched with fewer safety mitigations and saw wider adoption (though this is only a correlation, since several other aspects were different).
In these kinds of competitive markets even a 10% performance tax might be prohibitive, because being 10% more expensive than your competitors could mean losing a lot of customers in the long run. Switching costs for API models are particularly low, so these kinds of products are highly sensitive to performance taxes.
In practice there are also performance taxes on language models corresponding to the “usability” of the model that aren’t well captured by today’s standard benchmarks. For most use cases these taxes have been quite significantly negative relative to the pretrained models: pretrained language models are difficult to wrangle because they aren’t trying to help you. The majority of OpenAI’s customers prefer InstructGPT over similarly sized base models, and probably would even if we spent a lot of inference-time compute on the base models. For example, in human evaluations on prompts from OpenAI’s playground, even the much smaller 1.3B-parameter InstructGPT is on average significantly preferred to the few-shot 175B-parameter GPT-3 base model. However, this statistic doesn’t account for content restrictions and other safety mitigations, which may incur an additional performance tax (for example if the model refuses legitimate use cases).
Performance taxes are lower priority for automated alignment research
While aligned AGIs might have to compete in some market, making progress on the alignment problem shouldn’t be a competition. We all benefit from AI being more aligned with humanity, and so we ought to share alignment research progress freely.
When using AI systems to do automated alignment research, these AI systems will also be subject to alignment taxes. However, in this case our AI system is not directly competing with other AI systems in a market, and thus the performance taxes won’t matter as much. Yet the time-to-deployment tax still matters: if alignment progress can’t keep up with AI capabilities, we’d have to slow down or pause AI progress, which would be a very difficult coordination problem.
The performance tax that could be sustained for automated alignment research depends a lot on the total amount of work the system needs to do. In these cases, the development taxes will be the dominating factor. Consider two possible scenarios:
1. We need to do a fixed amount of alignment work for each new generation of AI systems, for example when going from GPT-2 to GPT-3. In this case the alignment tax that we can sustain depends on how much work needs to be done. For example, if the “pre-tax” compute cost of the automated alignment work is 1% of the development cost of the new AI system, then a 1,000% tax only brings the total alignment cost to 11% of the overall cost of the AI system. However, this only works if the (object level) performance tax on the next generation isn’t much higher than the performance tax on the current generation, otherwise performance taxes will end up compounding from generation to generation.
2. We need to invest a fixed amount of alignment work to discover an indefinitely scalable solution to the alignment problem. In this case the critical question is not the performance tax paid on the discovery of this solution, but just the total post-tax dollar cost X of the discovery. If humanity could raise up to Y dollars to invest in discovering the indefinitely scalable solution before it’s too late, then what matters is that Y > X. This is more likely to be the case if X is lower (for example because of a lower tax). However, unless the pre-tax cost is actually very close to Y, the alignment tax doesn’t matter as much for the outcome; it’s mostly a cost-saving exercise.
Thus depending on how the numbers shake out, a 10x or even 100x performance tax could be acceptable in this case.
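Here is a rough sketch of the arithmetic behind both scenarios; all numbers are illustrative placeholders rather than estimates from this post:

```python
def total_alignment_cost(pre_tax_cost: float, tax: float) -> float:
    """Post-tax alignment cost, where tax is a fraction (10.0 means a 1,000% tax)."""
    return pre_tax_cost * (1.0 + tax)


# Scenario 1: per-generation alignment work. If the pre-tax automated alignment
# work costs 1% of the next system's development cost, a 1,000% tax still only
# brings it to 11% of that development cost.
dev_cost = 1.0                  # normalize the new system's development cost
print(total_alignment_cost(0.01 * dev_cost, tax=10.0))  # 0.11

# Scenario 2: one-time discovery of an indefinitely scalable solution. What
# matters is whether the post-tax cost X stays below the budget Y humanity can
# raise in time, not the tax rate itself.
X = total_alignment_cost(pre_tax_cost=2e9, tax=9.0)  # hypothetical 10x tax
Y = 5e10                                             # hypothetical budget
print("affordable" if X < Y else "not affordable")   # affordable
```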
Conclusion
This post discussed three main types of alignment taxes: performance, development, and time-to-deployment taxes. As the commercial competition on deployed language models heats up, there will be more and more pressure to reduce alignment taxes. However, for automated alignment research performance taxes matter less as the primary goal is not to compete in a market but to make progress on alignment research. For this line of research our focus should be on minimizing development and time-to-deployment taxes, so we need to start this work as early as possible.
Thanks to Jeff Wu, Richard Ngo, Daniel Kokotajlo, and Reimar Leike for feedback on this post.
[1] Credit to Richard Ngo for framing this type of alignment tax.