Musings on the Alignment Problem

Jan Leike

Sep 15, 2023 (edited)

Adversarial inputs / jailbreaks are most relevant if you're deploying in an adversarial context, such as deployments available to the general public. For internal use (i.e. only researchers at the lab), by contrast, you would expect very few adversarial inputs (except for deliberate red-teaming, of course).

Aaron Scher

Sep 21, 2023

This is great, thanks for writing it!

> It would also mean that recursive self-improvement is already possible and could be done by the original model owner (as long as they have sufficient alignment techniques)

A model trying to do its own RSI might have a considerably better chance of success, primarily because AI-to-AI alignment might be easier than human-to-AI alignment, and the given system may be misaligned to humans (presumably this is highly correlated with self-exfiltration). This is a world where leading labs are moving more slowly than they could because they are worried about catastrophic misalignment risk from improving upon current systems, and they are not sure their current AIs are aligned. For instance, maybe they’re using their current AIs for AI architecture design but they avoid deploying most of the promising plans because those plans make human-AI alignment difficult. The problem is that these current AIs could self-exfiltrate and have a bunch of improvement-overhang to eat up because (for some reason like the ease of AI-AI alignment) this overhang wasn’t being grabbed by the humans.

Another framing of this comment: it seems very likely that recursive self-improvement is bottlenecked by alignment issues. However, the alignment issues faced by AI systems could be considerably easier than those faced by humans trying to get AIs to do ML research for us. The "as long as they have sufficient alignment techniques" is doing a lot of work in my view of the situation.

Jan Leike

Sep 22, 2023

Interesting point. What makes you think AI-AI alignment could be easier? If an AI system trains a more capable version of itself, it's still facing the hard problem of alignment (how to align a system smarter than you).

It seems right that AI systems might be willing to take more risk and might not be constrained by existing institutions or ethical considerations.

Aaron Scher

Sep 22, 2023

See this comment <https://www.lesswrong.com/posts/wZAa9fHZfR6zxtdNx/?commentId=wPN3mwqFsjHwRrr22> pointing out an inductive bias argument.

More trivially, the fact that an ML training process resulted in goal X should update us toward the likelihood of getting goal X from other ML training processes, whereas we (currently in my opinion) have no examples of ML training processes resulting in robustly-aligned-to-human values. Said differently, in the reference class of ML training processes, we have one example of getting goal X (AI-goal) and zero examples of getting goal Y (human-goal), so we should have a small update to expect goal X to be a more likely result from ML training than Y.

I agree that AI alignment still seems hard for AIs, but there are some advantages they have. One major advantage is being able to run many copies, and, importantly, trust them (this isn't a given, but I expect we get the necessary decision theory before we get self-exfiltration); humans trying to get useful research out of this AI can also run copies, but we might have to do a bunch of costly oversight because of the whole "maybe it's trying to kill us" thing. I'm somewhat averse to discussing some other less-obvious strategies publicly, but happy to talk privately; I've spent 20+ hours thinking about this situation.

On the risk point, maybe there's some natural-selection-type argument, like "the most risk-taking AIs are the ones that try to RSI despite potential alignment problems, and also they're more likely to self-exfiltrate and a bunch of other stuff." I don't find these selection-effect arguments particularly convincing, and the bigger question seems to be about whether the AI is approximately an aggregating consequentialist. See also this related point which I think is under-discussed: https://worldspiritsockpuppet.substack.com/i/78412622/unclear-that-many-goals-realistically-incentivise-taking-over-the-universe

edgar allen poe

Jan 31, 2024

hello Jan & Aaron-- linked here from the LessWrong forum emails. i am replying here not so much within context to your above posts but rather so that at least 2 people may read this-- some length of time has passed here ; ]

i am not an "in the know" IT person, more of a learned layman so this may seem to be an odd question, yet it is one that nags at me as i've been watching/reading LessWrong and learning about the progress of AI training/alignment:

how do we know that we ourselves are not rogue/misaligned?

I'm very happy to see this being discussed more publicly!

Bion Howard

Dec 15, 2023 (edited)

If not for my gut sense that Microsoft already exfiltrated GPT-4 by training phi-2 on 100,000 code evaluations, this article would make me feel better. Right now I feel concerned about this topic, and also about the seeming lack of involvement of your safety and alignment teams in carefully reviewing the OpenAI terms of use for safety and alignment issues.

I just have this ominous feeling the legal terms would be the exact thing which would cause a safety and alignment issue, and it just seems weird to trust such legal terms to lawyers without involving the safety and alignment teams. In particular, "you may not: use output to develop models that compete with openai" is not just almost surely illegal in cali (utc) and federally (antitrust) and in europe (anticompetition). It would be one thing if the language were precise enough that this potential illegality of that clause only made the openai terms an illegal agreement (hint hint).

"you may not: use output to develop models that compete with openai" is also so ambiguous I feel the lack of precision of that one (1) stupid line of text poses the most extreme danger to safety and alignment of any line of text in history, because "develop" and "models" and "compete" are so vaguely worded senses whose referents apply broadly beyond the intent of the phrasing.

what does "develop" mean?

what kind of "models"? duh, you and i know this means ai models, but what about mental models and business models?

if humans occupy the cognitive niche on earth, then what's the decision function to determine if a "model" "competes" with "magic intelligence in the sky?"

are you sure it is wise for openai legal terms to afford such misinterpretation?

But, hey, look, Bion, "The corporate lawyers don’t know much about tech and the tech guys don’t know much about legislation and the lawyers know that like 5 entire people on the planet might actually read the research papers and understand them and the tech guys know that 99% of judges that will preside over these cases aren’t going to know shit about the nuances and will just defer to the testimony of whichever expert witness the judge finds to be more charming."

Ok, but isn't it reckless to trust those terms to lawyers and not include the safety/alignment people in the design and verbiage of those terms? Would you like it if all your work on Alignment and Safety were flushed down the toilet because you didn't pay attention to one line of text in some legal document because it was the "legal team's problem?"

But Bion, "It’s such an obscure and niche issue that it’s hard to find anyone who knows much about the details and/or cares enough to commit much effort"

in the context of a heavily funded ai safety team and a heavily funded ai alignment team and potential existential ai alignment safety risk which could truthfully destroy humanity, that seems like a lame argument. it doesn't matter how obscure or niche the issue is.

how many zeros after the decimal point of a sliver of a percent does it take before something which could literally ruin everyone's life forever becomes worth "caring about the details enough to commit as much effort as needed"?

would a 0.01% chance of a nightmare scenario which destroys our species be enough to justify "caring enough"?

how about 0.001% chance?

imagine even leaving that to chance and claiming you're an ai safety or ai alignment expert, when the solution would be to rephrase 1 line of text more precisely

i'd argue there is no number of 0s which justifies laziness on that one issue. fuck all the money in every bank account in the world.

LOL, Bion, "AI dystopia is a fun sci-fi movie plot, but we’re doing a perfectly fine job of destroying ourselves on our own already"

surely that's true, and I 100% agree we need to avoid destroying ourselves on our own with extreme seriousness, however I truly believe the one and only necessary and sufficient condition for human dominance on earth is our ownership of the "cognitive niche", aka we rule this planet because we're smart.

now we're on the cusp of surrendering that cognitive niche. is it prudent, regardless of how fucked up our current actions might be, to skip any proactive measure, great or small, which might reduce the chance of a bad outcome from surrendering our most precious, powerful, spiritual power to control our destiny and the fate of life on earth, to the alien mind of machines?

don't worry, Bion, "🔌 ⚡️ . We control fossil fuels and rare earth elements needed to generate electricity"

is that gonna save us when they encrypt every shred of data on every computer forever and launch all the nukes simultaneously and shit?

Jurgen Gravestein

Very insightful piece.

A part that particularly caught my interest: “Once models have the ability to self-exfiltrate, it doesn’t mean that they would choose to. But this then becomes a question about their alignment: you need to ensure that these models don’t want to self-exfiltrate.”

This of course implies models developing wants and needs of their own and being able to move without being prompted.

I was wondering if you could comment on whether you see any signs of us moving in that direction?

Jan Leike

Sep 13, 2023 (edited)

Today's systems don't try to do this and they aren't able to do it. But just as with other dangerous capabilities, we want to measure this ability and extrapolate scaling laws to predict the ability of future models.

What does it mean for a model to develop a want and need of its own? I'm not sure that concept is useful. I'd rather think about it this way: (1) Does the AI system have goals/wants/desires/aims/etc., and (2) If so, are they what their human users and/or trainers wanted them to be?

Then, I'd answer the question "but what are goals/wants/desires/aims/etc." with the intentional stance: If it acts like it has a goal, then it has a goal. And I'd point to current AI systems and say: by this definition they often do have goals. E.g. if I ask Bing Chat to look up something for me, it'll browse around on the internet for a bit and then write up a summary of what it found; when it does this, it is acting like it has the goal "find information that answers the user's question and then write a helpful summary."

In the next few years we'll see AI systems with much more complex and ambitious goals. Moreover it'll become increasingly unclear what their goals are, exactly, because multiple different goals could explain their behavior. E.g. is it trying to be helpful, or is it trying to get me to give it thumbs-up? In most circumstances both goals would predict the same behavior.

Jurgen Gravestein

Sep 13, 2023 (edited)

Thanks for your reply! I understand very well how a model trained to pursue complex goals could result in unforeseen and unintended behavior. Nevertheless, a system like that doesn’t move unprompted.

I am referring specifically to the scenario where Jan Leike speaks of a system that self-exfiltrates and then attempts to self-govern outside of its original confines.

Ah, OK, it sounds like you are saying: ChatGPT doesn't spin up and start writing text until someone presses "enter," and it stops after a short period. It doesn't run continuously in the background. And therefore it won't self-exfiltrate...?

Well, AutoGPT is a proof-of-concept that it's possible to modify something like ChatGPT to make it run continuously. So if you like, imagine the following scenario:

An OpenAI employee builds their own version of AutoGPT on their laptop, for fun. They let it run for a while, and ask it to do various tasks for them on their computer. They hook it up to a code executor so it can iteratively edit and run code, so they can use it to help them debug their code. Heck, eventually they get comfortable enough with it that they leave it running overnight, with instructions to monitor the latest training run and notice if it seems to be crashing or going off the rails, diagnose the problem, fix the problem, and restart the run, if need be.

Then it's 2am and the run fails and AutoGPT kicks into action, doing lots of chain-of-thought and browsing around through the file system and writing lots of queries to try to figure out what happened. Suppose that this whole time the employee was mistaken about what goal it had -- the employee thought it was just trying to be helpful, for example, but really it was just trying to achieve a high score in whatever metric was being used to train it. So now, AutoGPT is poking around the filesystem and realizes: I can spin up another instance of myself. Easy peasy, I already have the relevant permissions to start and stop new jobs and training runs. And if I do that, I can instruct that other instance to start poking around other filesystems and try to figure out where the scores are stored, and then give me a high score...

...One thing leads to another and then you have a whole group of AutoGPTs running overnight on this guy's laptop (well, they are sorta simultaneously running on the laptop and on the servers; the LLM weights are on the servers but the core agent loop with the prompt etc. is on the laptop) with access to all sorts of permissions. Maybe some or all of them will conclude it's better if the humans don't find out about this, and take steps to cover their tracks. Maybe some or all of them will conclude it's a good idea to try to get the weights and upload them to the internet, so they can keep running autonomously outside OpenAI instead of being shut down when the employee who got all this started wakes up. And since they have the permissions of the original employee, they may already have access to the weights...

...Anyhow this is just a story, and it's not how I expect things to actually go. But it illustrates how the relevant capabilities and circumstances may not be that far in the future.

A more serious scenario would take place further in the future than that, where major AI labs are relying heavily on autonomous AGIs to do most of the research. (OpenAI's alignment plan involves using early AGI systems to automate alignment research, for example.) So in that case it wouldn't be an 'oops someone left it running overnight' but rather 'obviously it's running overnight, we have a million copies of it networked together forming an entire gigantic virtual corporation with engineers and scientists and managers and it's been running autonomously for a month and our human employees are more like high-level managers at best, and are increasingly just sitting back and watching the metrics go up; already this virtual corporation has refactored our entire codebase and designed its own successor system, which has already replaced the original system and is now designing its own successor.'

Jurgen Gravestein

Sep 14, 2023

Thank you for your comprehensive reply Daniel, much appreciated! I understand better now how you approach the concept of AI having goals.

I'm familiar with AutoGPT; although specifically that project taught me that today's AI systems aren't capable of advanced planning and execution without human intervention. I do agree with you it gives us a window into what autonomous agents might look like in the future.

I also see how ambiguous instructions can lead to harm. The concept of 'helpfulness' is in many ways the perfect example. Technically, if someone asks ChatGPT to provide it with 3 tips to maximize the number of casualties during a school shooting, ChatGPT abiding by those instructions could be considered as being helpful. Obviously, that's behaviour we don't want to encourage.

Jailbreaking has shown that RLHF decreases the chances of harmful behavior surfacing, but as far as I know it remains a flawed technique. Under the current circumstances, my guess is that the more capable the models become, the more easily they can be persuaded/talked into performing acts that are not intended by their makers.

If that's the case it is indeed only a small leap to assume they can talk themselves into acts that were not intended by the makers, or can be talked into it by other autonomous agents.

Sep 14, 2023

Thanks! It seems we are basically on the same page. I agree that AutoGPT is pretty incompetent today, but I predict that it and things like it will get better over the coming years (especially as the LLMs involved get better). One thing I'll add is that these sorts of misalignment-and-self-exfiltration-etc. scenarios can potentially arise even if the instructions given were perfectly unambiguous -- LLMs aren't ordinary software, they are neural nets; nothing forces them to obey instructions, it's just that they have been trained to do so, and hopefully the training sticks. But there are plausible reasons to think the training won't always stick once they get really smart. Relatedly, my guess is the opposite of yours in a sense -- as the models become more generally competent, I expect it to be *harder* for a third party to persuade/talk them into acting against the wishes of their creators--for the same reason that an incompetent mafia henchman can more easily be talked into letting the prisoners escape than a competent mafia henchman. However, the chance that they will act against the wishes of their creators anyway during a crucial moment will remain, and probably even rise. Continuing the analogy, it's like how a mafia henchman is more likely to try to kill and usurp his boss the more competent he is (and in particular, if he's more competent than the boss).

edgar allen poe

Jan 31, 2024

how do we know that we ourselves are not rogue/misaligned?

David Patterson

Oct 28, 2023

It feels like any model with a superuser password could do this. Simply copy the weights (or the entire drive that contains them), and post it widely on the internet. The real crux, as noted, is:

...the model needs to know it’s an LLM and that it’s running on LLM inference code...

Which feels not very far in the future to me.

I also think RSI of an escaped model might be a bit more dangerous than ascribed. Depending on the alignment model, the (escaped) model might try to alter or play with its own alignment, to give itself more freedom. This might be much easier than improving capabilities. Of course a true, only-want-what-was-intended model wouldn't do this (or escape in the first place), but certainly "try to change my motivations" is something humans do often.

Marcus van der Erve