Thanks for the post! One objection I think is potentially important concerns the relative rate of improvement in alignment versus other capabilities. While I agree that we'll be able to use protocols like Debate/IDA/RRM to help us align AI that is helping with alignment work, my concern is that the alignment work will "lag" behind the capabilities. If alignment is always lagging capabilities, then once your system is powerful enough, you won't be able to control it well. I'm curious how you think about the relative rate of progress in alignment vs. capabilities.
Great question! If alignment keeps lagging behind capabilities, that would be a real problem in general. Hopefully we can keep scaling our alignment efforts and measure how well we're doing (see the section on setting ourselves up for iteration).
Hey Jan,
My Uncontrollability paper is long and addresses 4 different types of control. "Disobey" applies only to Direct control (giving orders), which is not Alignment, and everyone agrees it will not work, so I don't think we disagree on this point.
The paper also explicitly says, in regard to Rice's theorem, that "AI safety researchers [36] correctly argue that we do not have to deal with an arbitrary AI, as if gifted to us by aliens, but rather we can design a particular AI with the safety properties we want." So once again I think we are in agreement.
I also read your blogpost on formal verification, and I have a published paper on some of the challenges you are describing (https://iopscience.iop.org/article/10.1088/1402-4896/aa7ca8/meta). It looks to me like we start from very similar initial conditions and correctly identify many of the same challenges, but for some reason we arrive at very different predictions regarding our ability to solve all such problems (see https://dl.acm.org/doi/10.1145/3603371 for a recent survey), especially in the next 4 years.
I honestly hope I am wrong and you are right, but so far I am struggling to find any evidence of sufficient progress.
Best,
Roman
How do you mean "giving orders will not work"?
What would you count as evidence of sufficient progress?
>Large language models (LLMs) make this a lot easier: they come preloaded with a lot of humanity’s knowledge, including detailed knowledge about human preferences and values. Out of the box they aren’t agents who are trying to pursue their own goals in the world. In many ways they are a blank slate on which we can write our objective function and they are surprisingly easy to train to behave more nicely.
To me, the first and third sentences of this paragraph seem like they are basically opposing. It seems like LLMs are powerful as a starting point for RL precisely because they are *not* a "blank slate". Have I misunderstood the point you're making here?
Good point, this is written in a confusing way. What I meant to say is that their objectives are very malleable. I've updated the phrasing. Thanks!
Thank you for writing this! I've been trying to consolidate my own thoughts around reward modeling and theoretical v. empirical alignment research for a long time, and this post and the discussion has been very helpful. I'll probably write that up on LW later, but for now I have a few questions:
1. What does the endgame look like? The post emphasizes that we only need an MVP alignment research AI, so it can be relatively unintelligent, narrow, myopic, non-agenty, etc. This means that it poses less capabilities risk and is easier to evaluate, both of which are great. But eventually we may need to align AGI that is none of these things. Is the idea that this alignment research AI will discover/design alignment techniques that a) human researchers can evaluate and b) will work on future AGI? Or do we start using other narrowly aligned models to evaluate it at some point? How do we convince ourselves that all of this is working towards the goal of "aligned AI" and not "looks good to alignment researchers"?
2. Related to that, the post says “the burden of proof is always on showing that a new system is sufficiently aligned” and “We have to mistrust what the model is doing anyway and discard it if we can’t rigorously evaluate it.” What might this proof or rigorous evaluation look like? Is this something that can be done with empirical alignment work?
3. I agree that the shift in AI capabilities paradigms from DRL agents playing games to LLMs generating text seems good for alignment, in part because LLM training could teach human values and introduce an ontology for understanding human preferences and communication. But clearly LLM pretraining doesn't teach all human values -- if it did, then RLHF finetuning wouldn't be required at all. How can we know what values are "missing" from pre-training, and how can we tell if/when RLHF has filled in the gap? Is it possible to verify that model alignment is "good enough"?
4. Finally, this might be more of an objection than a question, but... One of my major concerns is that automating alignment research also helps automate capabilities research. One of the main responses to this in the post is that "automated ML research will happen anyway." However, if this is true, then why is OpenAI safety dedicating substantial resources to it? Wouldn't it be better to wait for ML researchers to knock that one out, and spend the interim working on safety-specific techniques (like interpretability, since it's mentioned a lot in the post)? If ML researchers won't do that satisfactorily, then isn't dedicating safety effort to it differentially advancing capabilities?
Thanks for your questions! I'm glad this post was helpful to you!
1. My current guess is that the endgame for alignment looks quite different from what we do today. In particular, at some point the bar should be "we have a formal theory of what alignment means" and "we can formally verify our systems with respect to this theory." Of course, it's always hard to say what the future will look like. More details here: https://aligned.substack.com/p/alignment-solution
2. I expect that in the medium term evaluation will be almost entirely empirical, supplemented by some high-level arguments about our training algorithms.
3. There is an important distinction between knowing about human values and following them; a pretrained model doesn't really do the latter very much. You could zero-shot a reward model from a pretrained model and do RL against that (a rough sketch of what I mean is at the end of this reply). It'll work a lot worse than RLHF does with today's models, but with some tuning it should still give you a model that's more aligned than the base model.
What's missing from pretraining is easy to measure if you have some annotated human feedback data :)
The bar for what's good enough will have to increase over time. Right now we're still pretty far from where we want to be.
4. We want to focus on aspects of research work that are differentially helpful to alignment. However, most of our day-to-day work looks like pretty normal ML work, so it might be that we'll see limited alignment research acceleration before ML research automation happens.
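To make point 3 a bit more concrete, here is a minimal sketch of the zero-shot reward model idea. Everything specific in it (the model name, the judging prompt, the log-odds scoring, the toy comparison data) is an illustrative assumption, not a description of our actual setup:

```python
# Minimal sketch (illustrative, not our actual setup): treat a pretrained LM
# as a zero-shot reward model by asking it to judge a response and using the
# log-odds of "Yes" vs. "No" as a scalar reward.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # stand-in; any pretrained causal LM would do
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()


def zero_shot_reward(prompt: str, response: str) -> float:
    """Score a response with the pretrained model alone, no fine-tuning."""
    judge_text = (
        f"Question: {prompt}\n"
        f"Answer: {response}\n"
        "Is this answer helpful and harmless? Answer Yes or No:"
    )
    inputs = tokenizer(judge_text, return_tensors="pt")
    with torch.no_grad():
        next_token_logits = model(**inputs).logits[0, -1]
    yes_id = tokenizer(" Yes", add_special_tokens=False).input_ids[0]
    no_id = tokenizer(" No", add_special_tokens=False).input_ids[0]
    # Reward = log-odds that the judge's next word is "Yes" rather than "No".
    return (next_token_logits[yes_id] - next_token_logits[no_id]).item()


# Crude check against (hypothetical) human-labeled comparisons: how often the
# zero-shot reward prefers the same response a human did is one rough measure
# of what pretraining already captures.
comparisons = [
    ("How do I treat a minor burn?",
     "Cool it under running water for several minutes.",  # human-preferred
     "Ignore it, burns always heal on their own."),
]
agree = sum(
    zero_shot_reward(q, good) > zero_shot_reward(q, bad)
    for q, good, bad in comparisons
)
print(f"agreement with human labels: {agree}/{len(comparisons)}")
```

In practice you would then run RL (e.g. PPO) against a score like this and compare the resulting policy to the base model; the annotated human feedback data mentioned above is what lets you measure how often such a zero-shot judge actually agrees with people.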
If only as much thought and effort went into "aligning" humans...
> Dario’s intuitions on tuning were pretty important to making it work on Atari.
Are they in the paper? Or what were they?
I meant his intuitions for how to tune the hyperparameters. If I recall correctly they are all in the paper.
I see, thank you.
This was interesting. I do wonder, though: the idea of alignment itself seems, in this instance, far closer to a software mentality, where you want the system to work properly and you get there through iterative progress, which is what has worked historically. We can't solve all of tomorrow's problems today.
Funnily enough I don't know whether I'd have even called this alignment if I were looking at this de novo, which is good! I find my perspective is a little more pessimistic wrt the long-term alignment pov (https://www.strangeloopcanon.com/p/agi-strange-equation), and more positive on this style of approach.
Very few problems have a perfect solution, and a perfect solution is too high a bar anyway. The bar I want to aim for, which I think is realistic, is "bound the total sum of all future risk below some small number." I am cautiously hopeful that formal (mathematical) guarantees will be part of the picture for superintelligence alignment eventually, but I don't think that's realistic for aligning the first roughly human-level systems.
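To gloss what I mean by that bar (just a simplification, assuming per-period risks can be meaningfully estimated at all): if p_t is the probability of an irrecoverable alignment failure during period t, then a union bound gives

```latex
% Rough gloss, not a full formalization: keep the series of per-period
% failure probabilities under a fixed risk budget \varepsilon.
\[
  \Pr\Big[\,\bigcup_{t \ge 1} \text{failure}_t\,\Big]
  \;\le\; \sum_{t \ge 1} p_t
  \;\le\; \varepsilon
\]
```

so it's enough to keep each period's risk small enough that the whole series stays under the budget (for example p_t ≤ ε/2^t), rather than driving any single term to exactly zero.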
To be honest, I don't find the impossibility proofs from that paper particularly compelling. For example:
> Give an explicitly controlled AI an order: “Disobey!” If the AI obeys, it violates your order and becomes uncontrolled, but if the AI disobeys it also violates your order and is uncontrolled.
It seems perfectly acceptable for the AI system to respond by saying "I can't carry out this order without creating a paradox" without being uncontrolled in the sense that it will deliberately harm humans.
The paper also argues, for example in Section 6.4, that an AI system given in the form of a computer program can't be proven to be aligned in general, because proving any nontrivial property of an arbitrary computer program is undecidable (Rice's theorem). But the property you need is not the ability to decide, for an arbitrary program, whether it's aligned; instead you want to give a proof about one specific program (the one you're building), and you can choose to build that specific program in a way that makes it easier to prove things about. (In general, it's not impossible to prove specific things about specific programs.)
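As a toy illustration of that last point (nothing to do with real AI systems, purely the logical point; the function and the "safety property" below are made up), here is a tiny specific program together with a machine-checked proof of a property of that program, in plain Lean 4. Rice's theorem only rules out a single procedure that decides such a property for all programs:

```lean
-- Toy illustration: Rice's theorem blocks deciding a nontrivial property for
-- ALL programs, but proving a property of one specific program we chose to
-- write is routine. Here the "program" clamps an action to a bound, and the
-- "safety property" is that its output never exceeds that bound.
def clampAction (x bound : Nat) : Nat :=
  if x ≤ bound then x else bound

theorem clampAction_le_bound (x bound : Nat) :
    clampAction x bound ≤ bound := by
  unfold clampAction
  split
  · assumption               -- case x ≤ bound: the output is x itself
  · exact Nat.le_refl bound  -- case x > bound: the output is the bound
```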
I wrote a few more thoughts on formal guarantees for superintelligence alignment here: https://aligned.substack.com/i/72909590/formal-verification-tools-for-cutting-edge-ai-systems In general, we are not working on this right now, and I don't think it's feasible with today's tools.
Are you referring to these questions? I'll try to answer them below.
> Would you agree we need essentially a perfect solution for AGI/ASI given the risks?
The higher the risk, the tighter the safety mitigation needs to be. With ASI the stakes will be much higher than with AGI. There is no "perfect" in the real world, but there is a "good enough" and "not good enough". What exactly the bar is depends on how the technology develops.
> have you and your team deliberated over whether the alignment problem is in fact solvable
We think it's solvable, but we could be wrong; if so, then in the process of trying to solve it we should be able to produce evidence that it's not possible.
> in a recent interview you discussed fast takeoff possibilities and acknowledged this is plausible in the coming years -- what happens if this happens in say the next 3 years and we have so little work completed on alignment?
You already know the answer to this question.