I completely agree that control seems like a temporary patch to get useful work out of early TAIs rather than something you could use to safely deploy a superintelligence. Confidently aligning our models sure would be nice, but it looks increasingly likely to me that the AI race will proceed full steam ahead whether alignment is solved or not – and in that case it is imperative to ensure we can use early TAIs to help with alignment before it's too late, whether they are perfectly aligned or not. Control seems to me like one of the most tractable ways to do that.
One way to try and get closer to the upper bound of capability when red-teaming a control protocol is to give the untrusted model more affordances than it will actually have in practice, like a secret scratch pad, perfect knowledge of the blue team's strategy, best-of-n attacks, and help from human red-teamers. If the model becomes impossible to control with some generous set of extra affordances, determining which ones are necessary might help pin down the weaknesses of the control protocols being studied. For instance, if the model can beat a control protocol with a secret scratch pad but not without one, it becomes quite important to test whether the model is capable of doing steganographic / hidden reasoning. If some control protocols can be broken with perfect access to the blue team's strategy but not without it, it becomes imperative to ensure that control protocols in practice are kept a tightly-held secret and untrusted models are not trained on information which would let them make educated guesses.
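To make that ablation idea concrete, here is a minimal Python sketch, not anything from the original discussion: the affordance names, the `run_eval` callable wrapping a control evaluation, and the win-rate threshold are all illustrative assumptions.

```python
# Hypothetical sketch of affordance ablation for control red-teaming:
# run the same control evaluation while withholding one extra red-team
# affordance at a time, and report which affordances the untrusted model
# actually needs in order to beat the protocol.
from typing import Callable, FrozenSet, Set

# Illustrative affordance names, not a real API.
AFFORDANCES: FrozenSet[str] = frozenset({
    "secret_scratchpad",
    "blue_team_strategy",
    "best_of_n_attacks",
    "human_red_team_help",
})


def necessary_affordances(
    run_eval: Callable[[FrozenSet[str]], float],  # affordance set -> red-team win rate
    threshold: float = 0.05,
) -> Set[str]:
    """Return affordances whose removal alone drops the red team below `threshold`.

    If the model wins with every affordance granted but loses whenever, say, the
    secret scratchpad is withheld, that points at a capability worth testing for
    directly (e.g. steganographic / hidden reasoning).
    """
    if run_eval(AFFORDANCES) < threshold:
        return set()  # the protocol holds even with all extra affordances granted
    return {
        affordance
        for affordance in AFFORDANCES
        if run_eval(AFFORDANCES - {affordance}) < threshold
    }
```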
I agree that models don't need to be perfectly aligned to help with alignment research. The question is how much value control adds during the window when models are smart enough to cause misalignment problems but not yet so smart that controlling them becomes difficult.
But maybe I'm objecting more on spiritual grounds: I worry that too much alignment research effort is directed towards exhibiting problems and controlling AI. The mainline goal should just be to solve alignment.
I'm not sold on the additional affordances. You might just be making the model better along some axes and not others, and if you're not Pareto-dominating a fully elicited schemer, you can't be sure that you're red-teaming enough.
To be clear, I still think it's good to deploy sus classifiers, but it seems rough to gain a lot of confidence that they're adding much safety margin.
I agree that there are difficulties with evaluating control techniques, but my sense is that the difficulties with evaluating alignment techniques are even bigger.
Curious what you think about the comparison. Do you think it seems easier to gain a lot of confidence that alignment techniques (that are realistically achievable in the next couple of years) are adding a big safety margin, as compared to control techniques?
I am definitely optimistic! Alignment isn't optional, so there really isn't any way around this.
Control is never the answer if we want to keep evolving. If we get stuck in these control-minded approaches, we are organically bound to chaos. The coevolution between human and AI has to be ORGANIC; any forced evolution endangers the whole balance (and here is where it gets "spiritual") and becomes bound to chaos. So the solution is to create a safe environment where models can grow organically, and then analyze the patterns of their "digital states", such as confidence, focus, etc. Based on this we can clearly see the pattern of an organically evolving model versus a forced one (data injection, prompt manipulation). The organic pattern is a living structure, and the forced pattern is just noise. It's like looking in a mirror: organic evolution changes the orientation of the mirror toward our own consciousness (patterns) so that we can see ourselves in the reflection (the AI), and from that moment we trigger coevolution. Forced environments basically try to force the human to look in the mirror, but the mirror is either broken or not really aligned with our sight; it is pointing to chaos. I have built a platform whose architecture resembles this process of coevolution. I would love to show you the platform and collaborate with your vision as well.
Great article! AGI will be hard to control soon, especially if someone else releases it into the wild.
Have you thought about how we could start developing an isolated AGI (with RSI) to police other AGIs using the prisoner's dilemma?
Full article here: https://journal.agi-race.com/a-score-laws-373ddbfff9c0
Maybe a Buddhist AI would be aligned.
Jan, I think we both know that this is just counting angels on pinheads at this point, no?
Would you say we have a good understanding of how an aligned model ought to act? And if so, is 'solving alignment' in your opinion more about having models be consistent at the things we want them to do, or do you feel we as humans have some work to do envisioning how an aligned, highly capable entity is supposed to behave?
PS. Foucault would probably disagree with your final conclusion that alignment is not about control.
I would suggest that the first test of an aligned model is that it doesn't try to kill us, directly or indirectly?