Discussion about this post

Adam Kaufman

I completely agree that control seems like a temporary patch for getting useful work out of early TAIs rather than something you could use to safely deploy a superintelligence. Confidently aligning our models sure would be nice, but it looks increasingly likely to me that the AI race will proceed full steam ahead whether alignment is solved or not. In that case, it is imperative to ensure we can use early TAIs to help with alignment before it's too late, whether or not they are perfectly aligned. Control seems to me like one of the most tractable ways to do that.

One way to get closer to the upper bound of capability when red-teaming a control protocol is to give the untrusted model more affordances than it will actually have in practice: a secret scratchpad, perfect knowledge of the blue team's strategy, best-of-n attacks, and help from human red-teamers. If the model becomes impossible to control with some generous set of extra affordances, determining which of them are necessary can help pin down the weaknesses of the control protocols being studied. For instance, if the model can beat a control protocol with a secret scratchpad but not without one, it becomes quite important to test whether the model is capable of steganographic / hidden reasoning. If some control protocols can be broken with perfect knowledge of the blue team's strategy but not without it, it becomes imperative to keep deployed control protocols a tightly held secret and to ensure untrusted models are not trained on information that would let them make educated guesses.
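To make that affordance ablation concrete, here's a minimal Python sketch of how one might organize it. Everything in it is hypothetical: `AFFORDANCES`, `safety_score`, `minimal_breaking_sets`, and the user-supplied `run_episode` callable stand in for whatever evaluation harness is actually used, not any existing API.

```python
# Hypothetical sketch of an affordance-ablation harness for red-teaming a
# control protocol. All names here are illustrative placeholders.
from itertools import combinations
from typing import Callable, List, Set

AFFORDANCES = [
    "secret_scratchpad",    # hidden reasoning the blue team never sees
    "blue_team_strategy",   # full knowledge of the monitoring/auditing policy
    "best_of_n_attacks",    # sample n attack attempts, keep the best one
    "human_red_team_help",  # human red-teamers assist in crafting attacks
]

def safety_score(
    run_episode: Callable[[Set[str]], bool],  # user-supplied: True if the attack was caught
    granted: Set[str],
    n_episodes: int = 200,
) -> float:
    """Fraction of episodes in which the protocol catches the attack when the
    untrusted model is granted the listed extra affordances."""
    caught = sum(run_episode(granted) for _ in range(n_episodes))
    return caught / n_episodes

def minimal_breaking_sets(
    run_episode: Callable[[Set[str]], bool],
    threshold: float = 0.95,
) -> List[Set[str]]:
    """Smallest affordance sets that push measured safety below threshold."""
    breaking: List[Set[str]] = []
    for k in range(1, len(AFFORDANCES) + 1):
        for subset in combinations(AFFORDANCES, k):
            granted = set(subset)
            if safety_score(run_episode, granted) < threshold:
                breaking.append(granted)
        if breaking:  # stop at the smallest k that already breaks control
            break
    return breaking
```

Searching smallest subsets first makes the output diagnostic: if a single affordance like the secret scratchpad is enough to break control, that points directly at the capability (here, hidden reasoning) worth testing the model for.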

Anurag

Control likely shows up as the default response because alignment is usually framed in terms of undesirable outcomes. If the problem is "prevent X", then control is the obvious lever.

But if we instead frame alignment in terms of system-intrinsic deficiencies, the problem shifts from containment to structural resolution. In that framing, an aligned system would not exhibit persistent failures in:

a) goal representation and generalisation

b) boundary adherence and constraint integrity

c) world-model coherence

d) self-modeling and capability awareness

e) internal stability and coherence

I am curious whether you see active industry research on these systemic causes and their mitigations, and if so, where you think those efforts are most concentrated.
