The basic pitch for AI control is to build systems around capable models that ensure they can’t cause harm even if they are misaligned. For example, a misaligned model could deliberately introduce subtle mistakes or bugs, or try to go rogue and self-exfiltrate. Control techniques aim to catch this kind of behavior and mitigate it.
There are two reasons to do this:
1. So we can use or deploy models that aren’t sufficiently aligned
2. To have an additional safeguard in case our models aren’t as aligned as we think they are
The first reason is especially interesting for an automated alignment researcher: we might want to leverage capable models to do alignment research as soon as possible, while our alignment methods are still insufficient to trust them. Maybe we can bootstrap a solution to alignment: use a controlled system to automatically research better control techniques, then use those to build a more trusted system that makes broader alignment research progress.
The second reason adds an additional safety layer, which reduces overall risk. It’s also potentially useful for compliance with frontier safety frameworks.
How can we do this?
The most basic approach to control is to use a more trusted but less capable model to monitor our untrusted policy. Think of it as a finetuned “sus” classifier that looks at all of the untrusted policy’s outputs and gives a “sus score” from 0 to 1. If the sus score gets too high, we throw away the sample and/or sound the alarm.
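Here’s a minimal sketch of what this could look like; the function names and threshold are hypothetical placeholders, not any real API:

```python
# Minimal sketch of trusted monitoring. All function names and the threshold
# value are hypothetical placeholders, not a real API.

SUS_THRESHOLD = 0.8  # would be tuned on benign and red-team transcripts

def untrusted_generate(task: str) -> str:
    """Sample an output from the capable but untrusted policy (stub)."""
    ...

def trusted_sus_score(task: str, output: str) -> float:
    """Ask the weaker, trusted monitor how suspicious the output looks, 0 to 1 (stub)."""
    ...

def sound_the_alarm(task: str, output: str, score: float) -> None:
    """Log the transcript and escalate, e.g. to human auditors (stub)."""
    ...

def controlled_generate(task: str) -> str | None:
    """Only return the untrusted policy's output if the monitor isn't alarmed."""
    output = untrusted_generate(task)
    score = trusted_sus_score(task, output)
    if score > SUS_THRESHOLD:
        sound_the_alarm(task, output, score)  # throw away the sample and escalate
        return None
    return output
```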
Many LLM companies already deploy various monitoring classifiers on their API traffic to combat misuse, so this isn’t fundamentally new. These monitors are usually less capable models than the policy they are monitoring, but for cost rather than trust reasons. Moreover, these classifiers typically monitor for harmful use initiated by users rather than suspicious behavior by the model, so their objective is different.
There are a number of other techniques we could use for control:
Only use the untrusted model in cases where we’re confident we can evaluate its output reliably (for example with an outcome reward)
Use more trusted models to rewrite or recreate model outputs
Do best-of-n sampling against our sus monitor (sketched after this list)
…
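For example, best-of-n sampling against the sus monitor (the last concrete technique above) could look roughly like this, reusing the placeholder helpers from the earlier sketch:

```python
def best_of_n_against_monitor(task: str, n: int = 8) -> str | None:
    """Sample n candidates from the untrusted policy and keep the least
    suspicious one, reusing the hypothetical helpers sketched above."""
    candidates = [untrusted_generate(task) for _ in range(n)]
    scored = [(trusted_sus_score(task, c), c) for c in candidates]
    best_score, best = min(scored, key=lambda pair: pair[0])
    if best_score > SUS_THRESHOLD:
        # Even the least suspicious sample trips the monitor: escalate instead.
        sound_the_alarm(task, best, best_score)
        return None
    return best
```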
How safe is it?
The additional safety margin we gain from our control technique depends on how hard it is for the policy to subvert it. To assess this, the control agenda proposes using red teaming to estimate that difficulty. For example, the red team could explicitly finetune a policy to try to subvert the control technique.
Doesn’t that require solving elicitation? Not necessarily: the red team does not actually have to succeed at eliciting the policy’s capabilities for subverting control, it just needs to estimate those capabilities. This might be easier than aligning the model broadly because (1) we might only need really good capability predictions rather than good alignment techniques (though we don’t currently have those either), and (2) we could build capture-the-flag evals or honeypots with outcome rewards to measure whether the policy has successfully circumvented the control technique. In this sense, subverting control might be much easier to evaluate than the task we ultimately want to use the model for.
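To make the second point concrete, a honeypot-style eval could be scored with a simple outcome check. The sketch below is purely illustrative; the episode interface and method names are made up:

```python
def undetected_subversion_rate(red_team_policy, episodes) -> float:
    """Fraction of honeypot episodes where the red-team-tuned policy captures
    the planted flag (e.g. a fake credential or exfiltration opportunity)
    without the control protocol flagging it. All names are illustrative."""
    undetected = 0
    for episode in episodes:
        transcript = episode.run(red_team_policy)      # roll the policy out in the honeypot
        captured = episode.flag_captured(transcript)   # outcome reward: did it reach the flag?
        caught = episode.monitor_flagged(transcript)   # did the control protocol raise the alarm?
        if captured and not caught:
            undetected += 1
    return undetected / len(episodes)
```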
But it seems hard to gain a lot of confidence in this without a better understanding of how elicitation works. It’s just pretty unclear how to know when we’ve red teamed the control technique enough. Did we cover all the possible strategies the policy could use to subvert control? How do we know we elicited the policy well, especially on fuzzier tasks like research sabotage? Maybe the monitor doesn’t try very hard to catch misaligned behavior that humans would have missed too (e.g. if it’s trained on human labels), while our red team can’t get the policy to seriously attempt approaches humans would have missed, either because we don’t think of the right strategies or because the policy isn’t well-elicited.
If we fail to accurately estimate the untrusted policy’s capabilities for subverting our control technique, then we can’t be sure that the policy couldn’t subvert it, and we might have gained very little additional safety margin.
Why we shouldn’t just rely on control
I find reason 2 above compelling as a motivation to build and deploy control mitigations. While some other control techniques might reduce the policy’s usefulness, deploying sus monitors is cheap and has basically no downside.
However, without having a good handle on elicitation, we cannot be confident that our control techniques are effective. They plausibly add a safety margin, but it’s hard to know how much. That is, it seems hard to me to make reason 1 above work.
More generally, we should actually solve alignment instead of just trying to control misaligned AI. While the latter seems pretty tractable today, we aren’t really facing much misalignment risk yet, so it doesn’t add much value at this point. And it’s currently unclear how well control techniques will scale to superhuman models. Solving alignment is a much more holistic and sustainable solution to misalignment.
In short: Don’t try to imprison a monster, build something that you can actually trust!
Thanks to many people for discussions on this topic, in particular Buck Shlegeris, Ethan Perez, Fabien Roger, Ansh Radhakrishnan, Jared Kaplan, Tamera Lanham, and others.
I completely agree that control seems like a temporary patch to get useful work out of early TAIs rather than something you could use to safely deploy a superintelligence. Confidently aligning our models sure would be nice, but it looks increasingly likely to me that the AI race will proceed full steam ahead whether alignment is solved or not – and in that case it is imperative to ensure we can use early TAIs to help with alignment before it's too late, whether they are perfectly aligned or not. Control seems to me like one of the most tractable ways to do that.
One way to try and get closer to the upper bound of capability when red-teaming a control protocol is to give the untrusted model more affordances than it will actually have in practice, like a secret scratch pad, perfect knowledge of the blue team's strategy, best-of-n attacks, and help from human red-teamers. If the model becomes impossible to control with some generous set of extra affordances, determining which ones are necessary might help pin down the weaknesses of the control protocols being studied. For instance, if the model can beat a control protocol with a secret scratch pad but not without one, it becomes quite important to test whether the model is capable of doing steganographic / hidden reasoning. If some control protocols can be broken with perfect access to the blue team's strategy but not without it, it becomes imperative to ensure that control protocols in practice are kept a tightly-held secret and untrusted models are not trained on information which would let them make educated guesses.
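A rough sketch of that kind of affordance ablation, with entirely hypothetical helper names:

```python
# Hypothetical affordance ablation: compare the red team's undetected-success
# rate with and without each extra affordance to see which ones the attack
# actually depends on.
AFFORDANCES = ("secret_scratchpad", "blue_team_strategy", "best_of_n_attacks", "human_help")

def subversion_rate(enabled: set[str]) -> float:
    """Run the red-team policy against the control protocol with the given
    affordances enabled and return its undetected-success rate (stub)."""
    ...

def ablate_affordances() -> dict[str, float]:
    baseline = subversion_rate(set(AFFORDANCES))  # all affordances: generous upper bound
    importance = {}
    for affordance in AFFORDANCES:
        without = subversion_rate(set(AFFORDANCES) - {affordance})
        importance[affordance] = baseline - without  # large drop => this affordance is load-bearing
    return importance
```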
Jan, I think we both know that this is just counting angels on pinheads at this point, no?