5 Comments
Feb 14, 2023 · Liked by Jan Leike

I didn't understand this part; can you elaborate or try to explain differently?

"We need to do a fixed amount of alignment work for each new generation of AI systems, for example when going from GPT-2 to GPT-3. In this case the alignment tax that we can sustain depends on how much work needs to be done. For example, if the “pre-tax” compute cost of the automated alignment work is 1% of the development cost of the new AI system, then a 1,000% tax only brings the total alignment cost to 11% of the overall cost of the AI system. However, this only works if the (object level) performance tax on the next generation isn’t much higher than the performance tax on the current generation, otherwise performance taxes will end up compounding from generation to generation."

Author · Feb 14, 2023 (edited Feb 14, 2023)

Yes, let me try to rephrase:

"You need to do a certain amount of alignment work to make GPT-(N+1) sufficiently aligned given that you've made GPT-N sufficiently aligned. For simplicity, let's assume you can automate all of it, and a perfectly aligned automated 0%-performance-tax alignment researcher (in this case based on GPT-N) would take Y$ of compute cost to do this work. But you don't have the perfectly aligned version; instead you have a sufficiently aligned GPT-N model (sufficient in the sense that it'll do the work and not try to backstab you), but it suffers a large 1000% performance tax in the sense that it would take 11*Y$ of compute cost to do this work with the most suitable model you have. If the total cost to pretrain the new GPT-(N+1) system is much larger, say 100*Y, then the cost of doing the alignment work for GPT-(N+1), 11*Y, is only 10% of the total development cost, which is probably acceptable (even though the performance tax on the GPT-N system used to do this was 1000%). This might not work if the performance tax on the resulting more aligned GPT-(N+1) system is >1000% because it doesn't let you keep iterating, but this also depends on the ratio of pretraining to alignment research costs. All the numbers are made up here for illustrative purposes."

Does that make sense?


I like this taxonomy! And I like your point about the performance tax not mattering so much for automated alignment research. I don't have anything substantive to add, but I'll contribute by linking to these other posts that also attempt to taxonomize alignment taxes / kinds of competitiveness: https://www.alignmentforum.org/posts/sD6KuprcS3PFym2eM/three-kinds-of-competitiveness and https://www.alignmentforum.org/posts/fRsjBseRuvRhMPPE5/an-overview-of-11-proposals-for-building-safe-advanced-ai

Author

Oh thanks for sharing! I wish I had known about your post earlier; it seems like we arrived at a very similar taxonomy.


On the bright side, we both converged on a similar taxonomy while thinking independently, which is some evidence that it's a good one.

I think yours is slightly better. The difference is that inference cost is bundled into performance for you, whereas for me it's part of Cost, and I think it's a bit more natural for it to be part of performance. (Thinking from the perspective of decision-makers deciding what kinds of training runs to do, whether to release models, and where to allocate engineer hours, "but this version will cost some % more at inference time" seems to go into the same mental bucket as "but this version will have slightly higher loss / be slightly less preferred by our customers", rather than the same bucket as "but this version will cost us lots of money and researcher-hours".)
