Some arguments in favor and responses to common objections
Thanks for the post! One objection I think is potentially important concerns the relative rate of improvement in alignment versus other capabilities. While I agree that we'll be able to use protocols like Debate/IDA/RRM to help us align AI that is helping with alignment work, my concern is that the alignment work will "lag" behind the capabilities. If alignment is always lagging capabilities, then once your system is powerful enough, you won't be able to control it well. Curious how you think about the relative rate of progress in alignment vs. capabilities.
My Uncontrollability paper is long and addresses 4 different types of control. “Disobey” applies only to Direct control (giving orders), which is not Alignment and everyone agrees it will not work, so I don’t think we disagree on this point.
The paper also explicitly says, in regard to Rice’s theorem, that “AI safety researchers correctly argue that we do not have to deal with an arbitrary AI, as if gifted to us by aliens, but rather we can design a particular AI with the safety properties we want.” So once again I think we are in agreement.
I also read your blogpost on formal verification and have a published paper on some of the challenges you are describing: https://iopscience.iop.org/article/10.1088/1402-4896/aa7ca8/meta. It looks to me like we are looking at very similar initial conditions and correctly identifying numerous challenges, but for some reason we arrive at very different predictions regarding our ability to solve all such problems (see https://dl.acm.org/doi/10.1145/3603371 for a recent survey), especially in the next 4 years.
I honestly hope I am wrong, and you are right, but so far, I am struggling to find any evidence of sufficient progress.
Nice essay and I appreciate your taking the time to write and share these important ideas. I have serious concerns, however, with your approach and the general field of alignment research. I'm currently finishing an essay with Roman Yampolskiy discussing the strong likelihood that there is no solution to the alignment problem.
I imagine you would agree that tackling any serious issue, particularly one that may literally be the most dangerous problem humanity has ever faced, should start with an assessment of possible solution spaces. All problems can be categorized as solvable, partially solvable, unsolvable, or undecidable. I haven't seen this initial determination in your essays or OpenAI's published work thus far.
We argue that the alignment problem is fundamentally unsolvable because it will require perfect solutions that will last over centuries and millennia (not just years or decades), in a realm (alignment is a species of computer security) that has increasingly provided only probabilistic certainty in recent decades. And, as we frequently see, there are numerous significant computer security breaches each year due to human error.
With AI on the verge of AGI and shortly thereafter (by definition) ASI, we have basically only one chance to get it right. We cannot provide any certainty at all of long-term alignment with ASI in the world, only probabilistic hope.
Another way of putting the problem is this: in the set of all possible morphospaces of the future universe, there is a very small subset where ASI and humanity can coexist. Numerically it is all but certain that we will not find ourselves in that subset because it is literally trillions to one in terms of morphospaces with ASI and no humanity vs. AGI+humanity coexisting.
>Large language models (LLMs) make this a lot easier: they come preloaded with a lot of humanity’s knowledge, including detailed knowledge about human preferences and values. Out of the box they aren’t agents who are trying to pursue their own goals in the world. In many ways they are a blank slate on which we can write our objective function and they are surprisingly easy to train to behave more nicely.
To me, the first and third sentences of this paragraph seem to be basically in opposition. It seems like LLMs are powerful as a starting point for RL precisely because they are *not* a "blank slate". Have I misunderstood the point you're making here?
Thank you for writing this! I've been trying to consolidate my own thoughts around reward modeling and theoretical vs. empirical alignment research for a long time, and this post and the discussion have been very helpful. I'll probably write that up on LW later, but for now I have a few questions:
1. What does the endgame look like? The post emphasizes that we only need an MVP alignment research AI, so it can be relatively unintelligent, narrow, myopic, non-agenty, etc. This means that it poses less capabilities risk and is easier to evaluate, both of which are great. But eventually we may need to align AGI that is none of these things. Is the idea that this alignment research AI will discover/design alignment techniques that a) human researchers can evaluate and b) will work on future AGI? Or do we start using other narrowly aligned models to evaluate it at some point? How do we convince ourselves that all of this is working towards the goal of "aligned AI" and not "looks good to alignment researchers"?
2. Related to that, the post says “the burden of proof is always on showing that a new system is sufficiently aligned” and “We have to mistrust what the model is doing anyway and discard it if we can’t rigorously evaluate it.” What might this proof or rigorous evaluation look like? Is this something that can be done with empirical alignment work?
3. I agree that the shift in AI capabilities paradigms from DRL agents playing games to LLMs generating text seems good for alignment, in part because LLM training could teach human values and introduce an ontology for understanding human preferences and communication. But clearly LLM pretraining doesn't teach all human values -- if it did, then RLHF finetuning wouldn't be required at all. How can we know what values are "missing" from pre-training, and how can we tell if/when RLHF has filled in the gap? Is it possible to verify that model alignment is "good enough"?
4. Finally, this might be more of an objection than a question, but... One of my major concerns is that automating alignment research also helps automate capabilities research. One of the main responses to this in the post is that "automated ML research will happen anyway." However, if this is true, then why is OpenAI safety dedicating substantial resources to it? Wouldn't it be better to wait for ML researchers to knock that one out, and spend the interim working on safety-specific techniques (like interpretability, since it's mentioned a lot in the post)? If ML researchers won't do that satisfactorily, then isn't dedicating safety effort to it differentially advancing capabilities?
If only this much thought and effort were put into "aligning" humans...
> Dario’s intuitions on tuning were pretty important to making it work on Atari.
Are they in the paper? Or what were they?
This was interesting. I do wonder, though, whether the idea of alignment here is far closer to a software mentality, where you want the system to work properly and iterative progress is what has worked historically. We can't solve all of tomorrow's problems today.
Funnily enough I don't know whether I'd have even called this alignment if I were looking at this de novo, which is good! I find my perspective is a little more pessimistic wrt the long-term alignment pov (https://www.strangeloopcanon.com/p/agi-strange-equation), and more positive on this style of approach.