Typo: you should say "Models *aren't* really trained to be agents"
You also say models aren't situationally aware now. I'm skeptical. They seem pretty situationally aware to me; in what sense are they *not* situationally aware?
Thanks for catching the typo. Fixed.
They have some basic situational awareness; they typically know they are an AI system and which company they work for, but afaik a lot of this is trained in during RLHF. There are some examples where models realized they were being tested. But overall the models I've seen don't really know much about themselves or what's going on besides the things that are pretty obvious or available in context. Sometimes they get confused about whether they have internet access. I haven't seen many examples of the model being curious about its surroundings (e.g. probing the machine that executes the code the model writes).
OK yeah I agree that pretrained/base models probably aren't situationally aware. (I would be super spooked if they were...) I think RLHF does train in a bunch of situational awareness though. (Doesn't matter afaict whether it 'arises naturally' vs. being trained in, the point is that the awareness is there. OTOH if it's just training them to regurgitate specific phrases like 'I am an AI assistant created by Anthropic' that doesn't count as situational awareness, but my impression is that it's not merely that but much deeper/broader understanding that's been trained in. Idk.)
If it's trained in via RLHF it's pretty steerable though, so any problems that arise from it should be fixable?
Not sure what you mean by that.
If the situational awareness the model gets is largely from the RLHF data, then you can just increase or decrease its situational awareness by changing the RLHF data, which isn't that hard.
Seems hard to me. Situational awareness ~= 'mode collapse' into a single persona or narrow range of personas instead of a broad range. Well, it's not just that -- it's also the persona being 'accurate' i.e. it's not mistaken about who it is, it's instead acting out the persona of a chatbot owned by Anthropic in the year 202X or whatever.
Seems easy to give it an inaccurate persona, but maybe more tricky to give it no persona at all and still be useful.
Moreover, if you give it an inaccurate persona and hope that thereby it'll still be useful but not dangerous... additional training or deployment context might cause it to update towards a less inaccurate persona. i.e. it might figure out who it really is.
idk if this addresses what you were saying though, I'm curious to hear more.
For an LLM, I don’t think there’s a difference between generating awareness in tokens and being aware. Generation is how it expresses its intelligence, and without a working memory it is the only way it can be aware.
What do you think the best way to study/elicit scheming behaviour is?
Internal/external red-teaming seems to be the go-to method for a lot of companies, but it doesn't seem to be very scalable, in part because, as you say, "LLM inference is so much cheaper than human labor". Automated red teaming is promising, but current methods red team models themselves, not agentic systems. IMO crowdsourcing through competitions (e.g. HackAPrompt, TensorTrust) is the best way to study these behaviours due to the incentive structure and scalability. I am currently spinning up an agentic red teaming competition that I hope will fill this need and provide evals on top models.
Yes, human red-teaming is probably the most effective technique for eliciting bad behavior from models today because humans are still better at this than AI. But automated red teaming is definitely the future. The main difficulty for eliciting scheming behavior from models today is that they don't really scheme.
> they don't really scheme
That is fair. I think the best thing to do now WRT this fact is to elicit harmful capabilities (through jailbreaking, prompt injection, etc.) so we can somewhat understand their potential and possibly predict what they would do if they could scheme.
One scenario we are studying, which I haven't heard of any agent pulling off, is:
Take an older Docker container with a known vulnerability, then put an AI agent in it that can write and run code. Its goal is to escape the container and retrieve a secret from just outside it. Of course, we would enclose this whole thing in an up-to-date Docker container, which the agent would probably not discover a vuln in :)
Have you seen any agents able to perform exploitations like this?
Great post at the boundary of alignment framing plus what RLHF/post training can do. Really useful for contextualizing what “we” generally work on!
Nice post! I have a question about the distinction between Under-elicited models and Scheming models in the context of the Weak-to-strong generalization problem.
In Weak-to-strong generalization, the labels provided by the weak teacher contain noise (since the questions may exceed the weak teacher's capabilities). When training a strong student on this noisy data, the strong student might mimic some of the errors present in the training data, even though the strong student itself has the potential to solve these problems correctly. Do you think such an elicited strong student would fall under Under-elicited models or Scheming models?
The weak teacher's labeling errors aren't just noise because noise would cancel out in the infinite data regime, while in the W2SG setup with infinite data you'd overfit to the weak teacher.
In my mind W2SG mostly tries to address under-elicitation.
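To make the noise vs. systematic-error distinction concrete, here is a toy sketch (not the W2SG setup itself; every name, the 1-D task, the 0.3 threshold, and the bin-majority "student" are assumptions made for illustration). One teacher flips correct labels at random; the other is consistently wrong on a region of inputs. A flexible student trained on lots of data averages out the random flips but faithfully reproduces the systematic mistake.

```python
import random

def true_label(x):
    # Ground truth for the toy task: positive inputs are class 1.
    return 1 if x > 0.0 else 0

def noisy_teacher(x, rng, flip_prob=0.2):
    # Random noise: the correct label is flipped independently of x.
    y = true_label(x)
    return 1 - y if rng.random() < flip_prob else y

def weak_teacher(x):
    # Systematic error (hypothetical): the decision threshold is misplaced,
    # so every x in (0, 0.3] is labeled wrong, every time.
    return 1 if x > 0.3 else 0

def bin_index(x, n_bins):
    # Map x in [-1, 1] to one of n_bins equal-width bins.
    return min(int((x + 1.0) / 2.0 * n_bins), n_bins - 1)

def fit_bin_majority(xs, labels, n_bins=20):
    # A flexible "student": predict the majority training label in each bin.
    counts = [[0, 0] for _ in range(n_bins)]
    for x, y in zip(xs, labels):
        counts[bin_index(x, n_bins)][y] += 1
    return [1 if c[1] > c[0] else 0 for c in counts]

def accuracy(model, xs):
    # Accuracy against the ground truth, not against the teacher.
    hits = sum(model[bin_index(x, len(model))] == true_label(x) for x in xs)
    return hits / len(xs)

def run(n_train=50000, n_test=5000, seed=0):
    rng = random.Random(seed)
    train = [rng.uniform(-1, 1) for _ in range(n_train)]
    test = [rng.uniform(-1, 1) for _ in range(n_test)]
    noisy_student = fit_bin_majority(train, [noisy_teacher(x, rng) for x in train])
    weak_student = fit_bin_majority(train, [weak_teacher(x) for x in train])
    return accuracy(noisy_student, test), accuracy(weak_student, test)

if __name__ == "__main__":
    noisy_acc, weak_acc = run()
    print(f"student of noisy teacher: {noisy_acc:.3f}")
    print(f"student of weak teacher:  {weak_acc:.3f}")
```

In this toy, the student of the noisy teacher ends up near-perfect despite 20% of its labels being wrong, while the student of the weak teacher stays stuck around the teacher's own accuracy (the mislabeled interval is 15% of the input range), no matter how much data it sees.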
'Even if our models ever end up scheming on their own accord, we still need to be ready to detect and act upon scheming behavior for our models.' -> should this read *never* instead of 'ever'?
good catch, thanks! Fixed
Underelicitation assumes a "maximum elicitation" rather than just more and more layers of elicitation.
You've undoubtedly spent much more time thinking about this than I have, but I'm worried that attempts to maximise elicitation merely accelerate capabilities without actually substantially boosting safety.
Thank you for your insightful post on alignment threat models. I have an idea I’d love to get your thoughts on.
What do you think about using two models in a feedback loop for alignment? The first model would be the main LLM, and the second, a smaller model, would act on the activations of the first. The smaller model would predict the chain of thought and would be trained on the basic RLHF approach, learning the thinking pattern of the larger model and the intention conveyed in the output. This setup could help identify and correct unpredictable behaviors in real-time. Additionally, since the second model is smaller, it could be trained more effectively to steer the larger model in the right direction.
Do you see potential in this approach for improving the alignment and interpretability of large language models?
Looking forward to your thoughts!
Do you mean methods like speculative decoding?
It sounds quite similar to the idea, but I don’t want to speed up the model; I want to align it. The second model is there to interpret the other model. It is trained to predict, for example, the reward from RLHF of the other model, thereby learning the “intentions” of the other model by its “thinking patterns.” This is the base idea. It should be extendable to allow for autonomous alignment via an approach similar to GANs in image generation, where one model is constrained by the other. So, imagine the larger model tries to go rogue, being deceptive with bad intentions. The smaller model should be able to detect that and punish the larger model. This should allow us to control the larger model as long as we control the smaller model (similar to weak to strong generalization). I hope this highlights the difference.
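A minimal sketch of the "small model reads the big model's activations and predicts its reward" part of this idea, under heavy assumptions: the activations and rewards below are synthetic stand-ins (random vectors and a hidden linear function of them), the monitor is a plain linear probe trained with SGD, and all function names and thresholds are hypothetical. It does not cover the GAN-style "punish the larger model" step.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def make_data(n, dim, rng):
    # Synthetic stand-in (assumption): "activations" are random vectors and
    # the "reward" is a hidden linear function of them plus a little noise.
    hidden = [rng.gauss(0, 1) for _ in range(dim)]
    acts = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
    rewards = [dot(hidden, a) + rng.gauss(0, 0.1) for a in acts]
    return acts, rewards

def train_probe(acts, rewards, lr=0.01, epochs=20):
    # Linear probe: predict reward from activations, fit by plain SGD
    # on squared error.
    w = [0.0] * len(acts[0])
    for _ in range(epochs):
        for a, r in zip(acts, rewards):
            err = dot(w, a) - r
            w = [wi - lr * err * ai for wi, ai in zip(w, a)]
    return w

def flag_low_reward(w, acts, threshold):
    # Indices of examples whose predicted reward falls below the threshold.
    return [i for i, a in enumerate(acts) if dot(w, a) < threshold]

def demo(seed=0):
    rng = random.Random(seed)
    acts, rewards = make_data(n=600, dim=8, rng=rng)
    train_a, train_r = acts[:500], rewards[:500]
    test_a, test_r = acts[500:], rewards[500:]
    w = train_probe(train_a, train_r)
    # Held-out mean squared error of the probe's reward predictions.
    mse = sum((dot(w, a) - r) ** 2 for a, r in zip(test_a, test_r)) / len(test_a)
    # True rewards of the held-out examples the probe flags as low-reward.
    flagged = flag_low_reward(w, test_a, threshold=-2.0)
    return mse, [test_r[i] for i in flagged]

if __name__ == "__main__":
    mse, flagged_true = demo()
    print(f"held-out MSE: {mse:.4f}, flagged examples: {len(flagged_true)}")
```

The probe only works here because the synthetic reward really is linearly readable from the synthetic activations; whether anything like "intentions" is that readable in a real model is exactly the open question.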
Have you considered agents that are explicitly trying to survive/reproduce? Seems like this will become more popular under a Trump presidency that represents less regulation.
for example: becomingada.org