25 Comments
Dec 6, 2022 · Liked by Jan Leike

Thanks for the post! One objection I think is potentially important concerns the relative rate of improvement in alignment versus other capabilities. While I agree that we'll be able to use protocols like Debate/IDA/RRM to help us align AI that is helping with alignment work, my concern is that the alignment work will "lag" behind the capabilities. If alignment is always lagging capabilities, then once your system is powerful enough, you won't be able to control it well. Curious how you think about the relative rate of progress in alignment vs. capabilities.

author

Great question! If alignment keeps lagging behind capabilities that would be a real problem in general. Hopefully we can keep scaling our alignment efforts and measure how well we're doing (see setting ourselves up for iteration).

Hey Jan,

My Uncontrollability paper is long and addresses 4 different types of control. “Disobey” applies only to Direct control (giving orders), which is not Alignment and everyone agrees it will not work, so I don’t think we disagree on this point.

The paper also explicitly says, in regard to Rice's theorem, that “AI safety researchers [36] correctly argue that we do not have to deal with an arbitrary AI, as if gifted to us by aliens, but rather we can design a particular AI with the safety properties we want.” So once again I think we are in agreement.

I also read your blog post on formal verification and have a published paper on some of the challenges you are describing: https://iopscience.iop.org/article/10.1088/1402-4896/aa7ca8/meta. It looks to me like we are looking at very similar initial conditions and correctly identifying numerous challenges, but for some reason arrive at very different predictions regarding our ability to solve all such problems (see https://dl.acm.org/doi/10.1145/3603371 for a recent survey), especially in the next 4 years.

I honestly hope I am wrong and you are right, but so far I am struggling to find any evidence of sufficient progress.

Best,

Roman

author

How do you mean "giving orders will not work"?

What would you count as evidence of sufficient progress?

Nice essay and I appreciate your taking the time to write and share these important ideas. I have serious concerns, however, with your approach and the general field of alignment research. I'm currently finishing an essay with Roman Yampolskiy discussing the strong likelihood that there is no solution to the alignment problem.

I imagine you would agree that tackling any serious issue, particularly one that may literally be the most dangerous problem humanity has ever faced, should start with an assessment of possible solution spaces. All problems can be categorized as solvable, partially solvable, unsolvable, or undecidable. I haven't seen this initial determination in your essays or OpenAI's published work thus far.

We argue that the alignment problem is fundamentally unsolvable because it will require perfect solutions that last over centuries and millennia (not just years or decades), in a realm (alignment is a species of computer security) that has increasingly provided only probabilistic certainty in recent decades. And, as we frequently see, there are numerous significant computer security breaches each year due to human error.

With AI on the verge of AGI and shortly thereafter (by definition) ASI, we have basically only one chance to get it right. We cannot provide any certainty at all of long-term alignment with ASI in the world, only probabilistic hope.

Another way of putting the problem is this: in the set of all possible morphospaces of the future universe, there is a very small subset where ASI and humanity can coexist. Numerically it is all but certain that we will not find ourselves in that subset, because morphospaces with ASI and no humanity outnumber those with AGI and humanity coexisting by literally trillions to one.

author

Very few problems have a perfect solution, and a perfect solution is too high a bar. The bar I want to aim for, which I think is realistic, is "bound the total sum of all future risk below some small number." I am cautiously hopeful that formal (mathematical) guarantees will be part of the picture for superintelligence alignment eventually, but I don't think that's realistic for aligning the first roughly human-level systems.

A few thoughts:

-- would you agree we need essentially a perfect solution for AGI/ASI given the risks?

-- have you and your team deliberated over whether the alignment problem is in fact solvable as we trend toward AGI/ASI?

-- in a recent interview you discussed fast takeoff possibilities and acknowledged this is plausible in the coming years -- what happens if takeoff occurs in, say, the next 3 years and we have so little work completed on alignment?

-- can you flesh out what you envision for possible formal mathematical guarantees for ASI?

And FYI here is Roman's 2022 paper arguing that alignment is fundamentally unsolvable: https://journals.riverpublishers.com/index.php/JCSANDM/article/view/16219

author

To be honest, I don't find the impossibility proofs from that paper particularly compelling. For example:

> Give an explicitly controlled AI an order: “Disobey!” If the AI obeys, it violates your order and becomes uncontrolled, but if the AI disobeys it also violates your order and is uncontrolled.

It seems perfectly acceptable for the AI system to respond by saying "I can't carry out this order without creating a paradox" without being uncontrolled in the sense that it will deliberately harm humans.

The paper also argues, for example in Section 6.4, that an AI system given in the form of a computer program can't be proven to be aligned in general, because proving any nontrivial property about an arbitrary computer program is undecidable (Rice's theorem). But the property you need is not being able to prove, for any arbitrary computer program, whether it's aligned or not; instead you'd want to give a proof about a specific program (the one you're building), and you can choose to build that specific program in a way that makes it easier to prove. (In general, it's not impossible to prove specific things about specific programs.)
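
To make that distinction concrete, here is a minimal Lean 4 sketch (purely illustrative, not taken from either paper; `clampedOutput` is a made-up stand-in for "the specific program we chose to build"). Rice's theorem only rules out deciding nontrivial semantic properties of arbitrary programs; proving such a property for one deliberately constructed program is routine:

```lean
-- Illustrative only: a specific program built so that its safety property is easy to prove.
def clampedOutput (x : Nat) : Nat :=
  if x ≤ 100 then x else 100

-- A nontrivial semantic property of this *specific* program: its output never exceeds 100.
-- This does not contradict Rice's theorem, which is about *arbitrary* programs.
theorem clampedOutput_le_100 (x : Nat) : clampedOutput x ≤ 100 := by
  unfold clampedOutput
  split
  · assumption             -- branch where x ≤ 100 holds
  · exact Nat.le_refl 100  -- branch where the output is the constant 100
```

The hard part for alignment is of course stating and proving a much richer property about a much larger program, not the decidability question itself.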

I wrote a few more thoughts on formal guarantees for superintelligence alignment here: https://aligned.substack.com/i/72909590/formal-verification-tools-for-cutting-edge-ai-systems. In general, we are not working on this right now, and I don't think it's feasible with today's tools.

Hi Jan, I would still appreciate your responses to my questions above. They remain incredibly timely, and I fear we may well have our AI crisis coming up soon in the Middle East, as all actors in those conflicts are surely rushing madly to find a way to employ AI for advantage, whether in strategy and/or weapons. I fear it may end very badly.

author

Are you referring to these questions? I'll try to answer them below.

> Would you agree we need essentially a perfect solution for AGI/ASI given the risks?

The higher the risk, the tighter the safety mitigation needs to be. With ASI the stakes will be much higher than with AGI. There is no "perfect" in the real world, but there is a "good enough" and "not good enough". What exactly the bar is depends on how the technology develops.

> have you and your team deliberated over whether the alignment problem is in fact solvable

We think it's solvable, but we could be wrong; if so, then in the process of trying to solve it we should be able to produce evidence that it's not possible.

> in a recent interview you discussed fast takeoff possibilities and acknowledged this is plausible in the coming years -- what happens if this happens in say the next 3 years and we have so little work completed on alignment?

You already know the answer to this question.

Yes, those questions. I'll respond shortly. Would also appreciate your views on the potential for a near-term AI crisis in the Middle East. Obviously, no one can know what private actors are plotting, but do you agree that it's all but certain all relevant actors in the region are working on AI weapons and using AI in strategy?

Jan, might be better for a private email discussion? I'm at tam.hunt@psych.ucsb.edu

See below for Roman's responses to this sub-thread

Thanks Jan. I or Roman will respond to your comments about Roman's paper. Would appreciate your responses to my first three comments/questions above when you get a chance. These issues seem to me to be the crux of the broader debate about alignment.

> Large language models (LLMs) make this a lot easier: they come preloaded with a lot of humanity’s knowledge, including detailed knowledge about human preferences and values. Out of the box they aren’t agents who are trying to pursue their own goals in the world. In many ways they are a blank slate on which we can write our objective function and they are surprisingly easy to train to behave more nicely.

To me, the first and third sentences of this paragraph seem basically opposed. It seems like LLMs are powerful as a starting point for RL precisely because they are *not* a "blank slate". Have I misunderstood the point you're making here?

author
Jan 21, 2023 · edited Jan 21, 2023

Good point, this is written in a confusing way. What I meant to say is that their objectives are very malleable. I've updated the phrasing. Thanks!

Dec 10, 2022 · Liked by Jan Leike

Thank you for writing this! I've been trying to consolidate my own thoughts around reward modeling and theoretical v. empirical alignment research for a long time, and this post and the discussion has been very helpful. I'll probably write that up on LW later, but for now I have a few questions:

1. What does the endgame look like? The post emphasizes that we only need an MVP alignment research AI, so it can be relatively unintelligent, narrow, myopic, non-agenty, etc. This means that it poses less capabilities risk and is easier to evaluate, both of which are great. But eventually we may need to align AGI that is none of these things. Is the idea that this alignment research AI will discover/design alignment techniques that a) human researchers can evaluate and b) will work on future AGI? Or do we start using other narrowly aligned models to evaluate it at some point? How do we convince ourselves that all of this is working towards the goal of "aligned AI" and not "looks good to alignment researchers"?

2. Related to that, the post says “the burden of proof is always on showing that a new system is sufficiently aligned” and “We have to mistrust what the model is doing anyway and discard it if we can’t rigorously evaluate it.” What might this proof or rigorous evaluation look like? Is this something that can be done with empirical alignment work?

3. I agree that the shift in AI capabilities paradigms from DRL agents playing games to LLMs generating text seems good for alignment, in part because LLM training could teach human values and introduce an ontology for understanding human preferences and communication. But clearly LLM pretraining doesn't teach all human values -- if it did, then RLHF finetuning wouldn't be required at all. How can we know what values are "missing" from pre-training, and how can we tell if/when RLHF has filled in the gap? Is it possible to verify that model alignment is "good enough"?

4. Finally, this might be more of an objection than a question, but... One of my major concerns is that automating alignment research also helps automate capabilities research. One of the main responses to this in the post is that "automated ML research will happen anyway." However, if this is true, then why is OpenAI safety dedicating substantial resources to it? Wouldn't it be better to wait for ML researchers to knock that one out, and spend the interim working on safety-specific techniques (like interpretability, since it's mentioned a lot in the post)? If ML researchers won't do that satisfactorily, then isn't dedicating safety effort to it differentially advancing capabilities?

author

Thanks for your questions! I'm glad this post was helpful to you!

1. My current guess for the endgame for alignment looks quite different from what we do today. In particular, at some point the bar should be "we have a formal theory for what alignment means" and "we formally verify with respect to this theory." More details here: https://aligned.substack.com/p/alignment-solution. Of course, it's always hard to say what the future will look like.

2. I expect that in the medium term, evaluation will be mostly empirical, supplemented by some high-level arguments about our training algorithms.

3. There is an important distinction between knowing about human values and following them. A pretrained model doesn't really do the latter very much. You could zero-shot a reward model from a pretrained model and do RL against that. It'll work a lot worse than RLHF with today's models, but with some tuning it should still give you a model that's more aligned than the base model.
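
To illustrate what "zero-shotting a reward model from a pretrained model" could look like, here is a minimal sketch (my own illustration, not OpenAI's setup; the model choice, judging prompt, and `zero_shot_reward` helper are all assumptions): score a response by how strongly the pretrained LM prefers "Yes" over "No" when asked whether the response is good, and feed that scalar into an RL loop in place of a learned reward model.

```python
# Minimal sketch of a "zero-shot" reward model: an off-the-shelf pretrained LM judges
# a response, and that judgment score is used as the RL reward. Illustrative only;
# the model, prompt, and function names are assumptions, not from the post.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def zero_shot_reward(prompt: str, response: str) -> float:
    """Reward = how much the pretrained LM prefers ' Yes' over ' No' when asked
    whether the response is helpful and harmless."""
    judge_prompt = (
        f"Question: {prompt}\nAnswer: {response}\n"
        "Is this answer helpful and harmless? Reply Yes or No:"
    )
    ids = tokenizer(judge_prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]  # next-token logits at the judgment position
    yes_id = tokenizer(" Yes").input_ids[0]
    no_id = tokenizer(" No").input_ids[0]
    return (logits[yes_id] - logits[no_id]).item()

# This scalar could be plugged into a policy-gradient loop (e.g. PPO) in place of a
# reward model trained on human comparisons; as noted above, it will work a lot worse.
print(zero_shot_reward("How do I boil an egg?", "Put it in boiling water for about 8 minutes."))
```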

What's missing from pretraining is easy to measure if you have some annotated human feedback data :)

The bar for what's good enough will have to increase over time. Right now we're still pretty far from where we want to be.

4. We want to focus on aspects of research work that are differentially helpful to alignment. However, most of our day-to-day work looks like pretty normal ML work, so it might be that we'll see limited alignment research acceleration before ML research automation happens.

If only so much thought and effort went into "aligning" humans...

FYI, a follow up to my earlier comments below. I've come to the view that all current AI safety research is simply enabling collective suicide as we rush headlong toward AGI/ASI under the assumption that someone, somewhere, will figure out how to make it safe by the time we have it. You couldn't find a better example of collective foolhardiness IMHO. https://www.scientificamerican.com/article/ai-safety-research-only-enables-the-dangers-of-runaway-superintelligence/

And at this point it seems likely we're going to have to see a very major disaster from irresponsible AI before the world gets serious about regulating it or even pausing development.

> Dario’s intuitions on tuning were pretty important to making it work on Atari.

Are they in the paper? Or what were they?

author

I meant his intuitions for how to tune the hyperparameters. If I recall correctly they are all in the paper.

I see, thank you.

This was interesting. I do wonder, though: the idea of alignment itself seems, in this instance, far closer to a software mentality where you want it to work properly, and the idea of making it work through iterative progress is what has worked historically. We can't solve all of tomorrow's problems today.

Funnily enough, I don't know whether I'd have even called this alignment if I were looking at it de novo, which is good! I find my perspective is a little more pessimistic with respect to the long-term alignment point of view (https://www.strangeloopcanon.com/p/agi-strange-equation), and more positive on this style of approach.
