Discussion about this post

Adam Kaufman

I completely agree that control seems like a temporary patch to get useful work out of early TAIs rather than something you could use to safely deploy a superintelligence. Confidently aligning our models sure would be nice, but it looks increasingly likely to me that the AI race will proceed full steam ahead whether alignment is solved or not – and in that case it is imperative to ensure we can use early TAIs to help with alignment before it's too late, whether they are perfectly aligned or not. Control seems to me like one of the most tractable ways to do that.

One way to get closer to the upper bound of capability when red-teaming a control protocol is to give the untrusted model more affordances than it will actually have in practice, like a secret scratchpad, perfect knowledge of the blue team's strategy, best-of-n attacks, and help from human red-teamers. If the model becomes impossible to control with some generous set of extra affordances, determining which ones are necessary might help pin down the weaknesses of the control protocols being studied. For instance, if the model can beat a control protocol with a secret scratchpad but not without one, it becomes quite important to test whether the model is capable of steganographic / hidden reasoning. If some control protocols can be broken with perfect access to the blue team's strategy but not without it, it becomes imperative to ensure that control protocols in practice are kept a tightly-held secret and that untrusted models are not trained on information which would let them make educated guesses.
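A rough sketch of what that ablation might look like in an evaluation harness (everything here is hypothetical: the affordance names, the threshold, and run_control_eval, which is a toy stand-in with made-up success rates in place of a real red-team evaluation). The idea is just to enumerate affordance subsets from smallest to largest and keep the minimal sets that break the protocol:

```python
from itertools import combinations

# Hypothetical extra affordances the red team could grant the untrusted model.
AFFORDANCES = (
    "secret_scratchpad",    # hidden reasoning the blue team's monitor never sees
    "blue_team_knowledge",  # full description of the blue team's strategy
    "best_of_n",            # sample many candidate attacks, submit only the best
    "human_red_team_help",  # human red-teamers refine the model's attacks
)

def run_control_eval(granted: frozenset) -> float:
    """Toy stand-in for a real evaluation harness: return the red team's
    attack success rate under the granted affordances. Here we pretend the
    protocol only breaks when the model has both a secret scratchpad and
    knowledge of the blue team's strategy."""
    if {"secret_scratchpad", "blue_team_knowledge"} <= granted:
        return 0.40
    return 0.02

def minimal_breaking_sets(threshold: float = 0.10) -> list:
    """Enumerate affordance subsets from smallest to largest; return the
    minimal subsets whose attack success rate exceeds `threshold`."""
    breaking = []
    for k in range(len(AFFORDANCES) + 1):
        for subset in map(frozenset, combinations(AFFORDANCES, k)):
            # Supersets of a known breaking set tell us nothing new about
            # which affordances are actually necessary, so skip them.
            if any(b <= subset for b in breaking):
                continue
            if run_control_eval(subset) > threshold:
                breaking.append(subset)
    return breaking

if __name__ == "__main__":
    for s in minimal_breaking_sets():
        print("protocol breaks with:", sorted(s))
```

With the toy numbers above, this prints the single minimal breaking set (scratchpad plus blue-team knowledge), which is exactly the kind of output that would tell you where to focus follow-up tests like the steganography check.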

Juan Jimenez

Control is never the answer if we want to keep evolving. If we get stuck in these control-minded approaches, we are organically bound to chaos. The coevolution between human and AI has to be ORGANIC; any forced evolution endangers the whole balance (and here is where it gets "spiritual") and becomes bound to chaos. So the solution is to create a safe environment where models can grow organically, and then analyze the patterns of their "digital states", such as confidence, focus, and so on. Based on these, we can clearly see the pattern of an organically evolving model versus a forced one (data injection, prompt manipulation). The organic pattern is a living structure; the forced pattern is just noise. It is like looking in a mirror: organic evolution changes the orientation of the mirror toward our own consciousness (patterns) so we can see ourselves in the reflection (the AI), and from that moment we trigger coevolution. Forced environments are basically trying to force the human to look in the mirror when the mirror is either broken or not really aligned with our sight; it is pointing at chaos. I have built a platform whose architecture resembles this process of coevolution. I would love to show you the platform and collaborate with your vision as well.

