Misalignments in models typically fall somewhere between the two ends of this spectrum:
Under-elicited models: The model doesn’t try as hard as possible on the task, so its performance is worse than it could be if it was more aligned.
Scheming models: The model is doing some combination of pretending to be aligned, secretly plotting against us, seeking power and resources, exhibiting deliberately deceptive behavior, or even trying to self-exfiltrate.
The mistakes by under-elicited models are usually pretty mundane, due to “sloppiness”, “laziness”, and what could plausibly just be lack of capabilities. Malicious mistakes are pretty rare, and occur mostly due to poisoned data or imitation of malicious humans.
In contrast, scheming models behave nicely a lot of the time, but might take advantage of opportunities to defect. In particular, they might defect when they have good reason to believe that we aren’t watching, and they might try to create more of those opportunities. They might also subtly “fudge” many of their actions a little bit in the direction of their own goals, especially when it’s hard for humans to evaluate or when there is a lot of leeway.
This has a rough analogy to average-case and worst-case behavior: under-elicited models perform worse than they could in the average case, while scheming models perform much worse than they could in the worst case.
Depending on the threat model we have in mind, we might be more or less optimistic about any proposed alignment solutions. On the face of it, under-elicited models seem much less scary than scheming models. However, I argue that having really strong elicitation is key to succeeding on alignment.
Better elicitation is important for risk-reduction
The obvious way in which this is true is that if a model is under-elicited, then dangerous capability evaluations underestimate its risks, and we might apply insufficient mitigations. This is especially severe for open weight models, where deployment is essentially irreversible and mitigations can’t be improved later.
But elicitation is important for alignment more generally: Because LLM inference is so much cheaper than human labor, our models will end up doing a ton of inference that no human has time to read. Therefore it’s simply intractable to meaningfully oversee all but a tiny fraction of our models’ behavior with actual humans. Instead we rely on our models to do this job, both during training (reward models) and during deployment (monitoring).
If our models are under-elicited, they will perform worse as reward models and monitors. This means they won’t punish as many bad actions during training or flag them during deployment.
This is especially problematic if the agent we’re supervising is actually malicious in some way. If the agent decides to defect, it might elicit most or all of its own capabilities towards malicious goals, and thus could manage to bypass an under-elicited monitor, even if that monitor is a copy of itself. Depending on the size of the task’s evaluation-generation gap, the monitor could be at an advantage, and thus doesn’t need to be as well-elicited as the agent. But this gap might be small on some tasks and we probably also want a significant safety margin between bad stuff the model could do and what the monitor can detect. Therefore we need well-elicited monitors.
For under-elicited reward models the story could be even worse, because they might teach the model what kind of bad behavior they will get away with or even cause scheming in the first place.
Better elicitation should also help address scheming: If we have a great methods to elicit our models, we can use them to elicit the model’s capabilities towards scheming. Even more, the better elicited the model is, the less “slack” it might have to pursue ulterior motives. If it has to use all of its capability to do as well as possible on the task, there shouldn’t be much time and effort left to think about how to sabotage the result.
Finally, under-elicited models are generally needlessly inefficient since you need a larger model to reach the same level of performance. While this would be costly for a deployed product, it might be acceptable for an automated alignment researcher.
Current models are probably severely under-elicited
In my experience working with InstructGPT and ChatGPT for a number of years, there are a ton of gains from prosaic ML interventions to RLHF (better data, better algorithms, etc.). For example, the original InstructGPT work showed that RLHF can outperform SFT models by >100x effective pretraining compute, and progress has not stalled since then (e.g. compare Sonnet 3.5 with Sonnet 3.5 (new)). Typically very little compute is spent on posttraining relative to pretraining, so this suggests that these gains are mostly alignment improvements.
Since these alignment improvements have been a consistent trend over a number of years, I would assume that this will continue or a while. To me it seems likely that there still remain several orders of magnitude of alignment/elicitation gains to be discovered on the current generation of models. In other words, current models are probably severely under-elicited. I expect that if you chatted to a fully elicited GPT-4, it would feel like a substantially different experience that it did when the model launched.
Don’t anchor on today’s models
Under-elicitation is common in today’s models, but scheming is not. However, this doesn’t mean we don’t need to worry about scheming:
The scheming-related capabilities of our models are still quite poor. The models won’t be deceptive if they aren’t smart enough to figure out that this would give them advantages, or how to do it.
Models aren’t smart enough yet to be situationally aware. Once the models understand their situation, they could be much more motivated to change it.
Models aren’t really trained to be agents. More RL will probably teach the model to follow all kinds of convergent instrumental goals, simply because convergent instrumental goals are just so useful. This will probably ingrain behavior like self-preservation and might teach the model useful skills that are relevant to scheming.
There are probably enough sci-fi stories about misaligned AI in the pretraining data that models will always end up exploring some scheming-related behavior, so a big question is whether the RL loop reinforces this behavior.
Even if our models never end up scheming on their own accord, we still need to be ready to detect and act upon scheming behavior for our models.
Thanks to Shauna Kravec, Ansh Radhakrishnan, Sam Bowman, Ryan Greenblatt, Jared Kaplan, Tamera Lanham, and Evan Hubinger for feedback on an earlier version of this post.
Typo: you should say "Models *aren't* really trained to be agents"
You also say models aren't situationally aware now. I'm skeptical. They seem pretty situationally aware to me; in what sense are they *not* situationally aware?
What do you think the best way to study/elicit scheming behaviour is?
Internal/External red-teaming seems to be the go to method for a lot of companies, but these don't seem to be very scalable, in part because, as you say, "LLM inference is so much cheaper than human labor". Automated red teaming is promising, but current methods red team models themselves, not agentic systems. IMO crowdsourcing through competitions (e.g. HackAPrompt, TensorTrust) is the best way to study these behaviours due to the incentive structure and scalability. I am currently spinning up an agentic red teaming competition that I hope will fill this need and provide evals on top models.