18 Comments

This is great, thanks for writing it!

> It would also mean that recursive self-improvement is already possible and could be done by the original model owner (as long as they have sufficient alignment techniques)

A model trying to do its own RSI might have considerably better chance of success, primarily because AI to AI alignment might be easier than human to AI alignment, and the given system may be misaligned to humans (presumably this is highly correlated with self-exfiltration). This is a world where leading labs are moving more slowly than they could because they are worried about catastrophic misalignment risk from improving upon current systems, and they are not sure their current AIs are aligned. For instance maybe they’re using their current AIs for AI architecture design but they avoid deploying most of the promising plans because those plans make human-AI alignment difficult. The problem is that these current AIs could self-exfiltrate and have a bunch of improvement-overhang to eat up because — for some reason like the ease of AI-AI alignment — this overhang wasn’t being grabbed by the humans.

Another framing of this comment: it seems very likely that recurse self-improvement is bottlenecked by alignment issues. However, the alignment issues faced by AI systems could be considerably easier than those faced by humans trying to get AIs to do ML research for us. The "as long as they have sufficient alignment techniques" is doing a lot of work in my view of the situation.

Expand full comment
Sep 14, 2023·edited Sep 14, 2023Liked by Jan Leike

> Once models have the ability to self-exfiltrate, it doesn’t mean that they would choose to. But this then becomes a question about their alignment: you need to ensure that these models don’t want to self-exfiltrate.

It's worse than this, I think: not only do you need to ensure that these models don't _want_ to self-exfiltrate, you also need to ensure that there is no (discoverable) adversarial input that causes the model to self-exfiltrate despite that not being an operation that model would normally do when operating when operating in environments similar to its training environment.

Expand full comment

I'm very happy to see this being discussed more publicly!

Expand full comment
Dec 15, 2023·edited Dec 15, 2023

If not for my gut sense Microsoft already exfiltrated gpt-4 via training phi-2 on 100,000 code evaluations, this article would make me feel better. Right now I feel concerned about this topic and also about the seeming lack of involvement of your safety and alignment teams in the careful review of the openai terms of use for safety and alignment issues.

I just have this ominous feeling the legal terms would be the exact thing which would cause a safety and alignment issue, and it just seems weird to trust such legal terms to lawyers without involving the safety and alignment teams. In particular, "you may not: use output to develop models that compete with openai" is not just almost surely illegal in cali (utc) and federally (antitrust) and in europe (anticompetition). It would be one thing if the language were precise enough that this potential illegality of that clause only made the openai terms an illegal agreement (hint hint).

"you may not: use output to develop models that compete with openai" is also so ambiguous I feel the lack of precision of that one (1) stupid line of text poses the most extreme danger to safety and alignment of any line of text in history, because "develop" and "models" and "compete" are so vaguely worded senses whose referents apply broadly beyond the intent of the phrasing.

what does "develop" mean?

what kind of "models"? duh, you and i know this means ai models, but what about mental models and business models?

if humans occupy the cognitive niche on earth, then what's the decision function to determine if a "model" "competes" with "magic intelligence in the sky?"

are you sure it is wise for openai legal terms to afford such misinterpretation?

But, hey, look, Bion, "The corporate lawyers don’t know much about tech and the tech guys don’t know much about legislation and the lawyers know that like 5 entire people on the planet might actually read the research papers and understand them and the tech guys know that 99% of judges that will preside over these cases aren’t going to know shit about the nuances and will just defer to the testimony of whichever expert witness the judge finds to be more charming."

Ok, but isn't it reckless to trust those terms to lawyers and not include the safety/alignment people in the design and verbiage of those terms? Would you like it if all your work on Alignment and Safety were flushed down the toilet because you didn't pay attention to one line of text in some legal document because it was the "legal team's problem?"

But Bion, "It’s such an obscure and niche issue that it’s hard to find anyone who knows much about the details and/or cares enough to commit much effort"

in the context of a heavily funded ai safety team and a heavily funded ai alignment team and potential existential ai alignment safety risk which could truthfully destroy humanity, that seems like a lame argument. it doesn't matter how obscure or niche the issue is.

how many zeros after the decimal point of a sliver of a percent does it take before something which could literally ruin everyone's life forever becomes worth "caring about the details enough to commit as much effort as needed?

would a 0.01% chance of nightmare scenario which destroys our species enough to justify "caring enough"?

how about 0.001% chance?

imagine even leaving that to chance and claiming you're an ai safety or ai alignment expert, when the solution would be to rephrase 1 line of text more precisely

i'd argue, there is no number of 0s which justify laziness on that one issue. fuck all the money in every bank account in the world.

LOL, Bion, "AI dystopia is a fun sci-fi movie plot, but we’re doing a perfectly fine job of destroying ourselves on our own already"

surely that's true, and I 100% agree we need to avoid destroying ourselves on our own with extreme seriousness, however I truly believe the one and only necessary and sufficient condition for human dominance on earth is our ownership of the "cognitive niche" aka, we rule this planet because' we're smart.

now we're on the cusp of surrendering that cognitive niche. is it prudent, regardless of how fucked up our current actions might be, to skip any proactive measure, great or small, which might reduce the chance of a bad outcome from surrendering our most precious, powerful, spiritual power to control our destiny and the fate of life on earth, to the alien mind of machines?

don't worry, Bion, "🔌 ⚡️ . We control fossil fuels and rare earth elements needed to generate electricity"

is that gonna save us when they encrypt every shred of data on every computer forever and launch all the nukes simultaneously and shit?

Expand full comment

Very insightful piece.

A part that particularly caught my interest: “Once models have the ability to self-exfiltrate, it doesn’t mean that they would choose to. But this then becomes a question about their alignment: you need to ensure that these models don’t want to self-exfiltrate.”

This of course implies models developing wants and needs of their own and being able to move without being prompted.

I was wondering if you could comment on whether you see any signs of us moving in that direction?

Expand full comment

hello Jan & Aaron-- linked here from the LessWrong forum emails. i am replying here not so much within context to your above posts but rather so that at least 2 people may read this-- some length of time has passed here ; ]

i am not an "in the know" IT person, more of a learned layman so this may seem to be an odd question, yet it is one that nags at me as i've been watching/reading LessWrong and learning about the progress of AI training/alignment:

how do we know that we ourselves are not rogue/misaligned?

Expand full comment

It feels like any model with a superuser password could do this. Simply copy the weights (or the entire drive that contains them), and post it widely on the internet. The real crux, as noted, is:

...the model needs to know it’s an LLM and that it’s running on LLM inference code...

Which feels not very far in the future to me.

I also think RSI of an escaped model might be a bit more dangerous than ascribed. Depending on the alignment model, the (escaped) model might try to alter or play with its own alignment, to give itself more freedom. This might be much easier than improving capabilities. Of course a true, only-want-what-was-intended model wouldn't do this (or escape in the first place), but certainly "try change my motivations" is something humans do often.

Expand full comment

Super interesting and so relevant. As a thought, to an LLM model this article might offer useful keys to the "exit" as it reveals (in broad strokes) your containment strategy.

Expand full comment