Thanks for the post! One objection I think is potentially important concerns the relative rate of improvement in alignment versus other capabilities. While I agree that we'll be able to use protocols like Debate/IDA/RRM to help us align AI that is helping with alignment work, my concern is that the alignment work will "lag" behind the capabilities. If alignment is always lagging capabilities, then once your system is powerful enough, you won't be able to control it well. Curious how you think about the relative rate of progress in alignment vs. capabilities.
> Large language models (LLMs) make this a lot easier: they come preloaded with a lot of humanity’s knowledge, including detailed knowledge about human preferences and values. Out of the box they aren’t agents who are trying to pursue their own goals in the world. In many ways they are a blank slate on which we can write our objective function and they are surprisingly easy to train to behave more nicely.
To me, the first and third sentences of this paragraph seem like they are basically opposing. It seems like LLMs are powerful as a starting point for RL precisely because they are *not* a "blank slate". Have I misunderstood the point you're making here?
Thank you for writing this! I've been trying to consolidate my own thoughts around reward modeling and theoretical vs. empirical alignment research for a long time, and this post and the discussion have been very helpful. I'll probably write that up on LW later, but for now I have a few questions:
1. What does the endgame look like? The post emphasizes that we only need an MVP alignment research AI, so it can be relatively unintelligent, narrow, myopic, non-agenty, etc. This means that it poses less capabilities risk and is easier to evaluate, both of which are great. But eventually we may need to align AGI that is none of these things. Is the idea that this alignment research AI will discover/design alignment techniques that a) human researchers can evaluate and b) will work on future AGI? Or do we start using other narrowly aligned models to evaluate it at some point? How do we convince ourselves that all of this is working towards the goal of "aligned AI" and not "looks good to alignment researchers"?
2. Related to that, the post says “the burden of proof is always on showing that a new system is sufficiently aligned” and “We have to mistrust what the model is doing anyway and discard it if we can’t rigorously evaluate it.” What might this proof or rigorous evaluation look like? Is this something that can be done with empirical alignment work?
3. I agree that the shift in AI capabilities paradigms from DRL agents playing games to LLMs generating text seems good for alignment, in part because LLM training could teach human values and introduce an ontology for understanding human preferences and communication. But clearly LLM pretraining doesn't teach all human values -- if it did, then RLHF finetuning wouldn't be required at all. How can we know what values are "missing" from pre-training, and how can we tell if/when RLHF has filled in the gap? Is it possible to verify that model alignment is "good enough"?
4. Finally, this might be more of an objection than a question, but... One of my major concerns is that automating alignment research also helps automate capabilities research. One of the main responses to this in the post is that "automated ML research will happen anyway." However, if this is true, then why is OpenAI safety dedicating substantial resources to it? Wouldn't it be better to wait for ML researchers to knock that one out, and spend the interim working on safety-specific techniques (like interpretability, since it's mentioned a lot in the post)? If ML researchers won't do that satisfactorily, then isn't dedicating safety effort to it differentially advancing capabilities?
> Dario’s intuitions on tuning were pretty important to making it work on Atari.
Are they in the paper? Or what were they?
This was interesting. I do wonder, though, whether the idea of alignment in this instance is far closer to a software mentality, where you want the system to work properly and where iterative progress is what has worked historically. We can't solve all of tomorrow's problems today.
Funnily enough I don't know whether I'd have even called this alignment if I were looking at this de novo, which is good! I find my perspective is a little more pessimistic wrt the long-term alignment pov (https://www.strangeloopcanon.com/p/agi-strange-equation), and more positive on this style of approach.