Discussion about this post

APD

> Once models have the ability to self-exfiltrate, it doesn’t mean that they would choose to. But this then becomes a question about their alignment: you need to ensure that these models don’t want to self-exfiltrate.

It's worse than this, I think: not only do you need to ensure that these models don't _want_ to self-exfiltrate, you also need to ensure that there is no (discoverable) adversarial input that causes the model to self-exfiltrate, even though that is not something the model would normally do when operating in environments similar to its training environment.

Aaron Scher

This is great, thanks for writing it!

> It would also mean that recursive self-improvement is already possible and could be done by the original model owner (as long as they have sufficient alignment techniques)

A model trying to do its own RSI might have a considerably better chance of success, primarily because AI-to-AI alignment might be easier than human-to-AI alignment, and the given system may be misaligned with humans (presumably this is highly correlated with self-exfiltration). This is a world where leading labs are moving more slowly than they could because they are worried about catastrophic misalignment risk from improving upon current systems, and they are not sure their current AIs are aligned. For instance, maybe they’re using their current AIs for AI architecture design but avoid deploying most of the promising plans because those plans make human-AI alignment difficult. The problem is that these current AIs could self-exfiltrate and have a bunch of improvement-overhang to eat up because (for some reason like the ease of AI-AI alignment) this overhang wasn’t being grabbed by the humans.

Another framing of this comment: it seems very likely that recursive self-improvement is bottlenecked by alignment issues. However, the alignment issues faced by AI systems could be considerably easier than those faced by humans trying to get AIs to do ML research for us. The qualifier "as long as they have sufficient alignment techniques" is doing a lot of work in my view of the situation.
