We need to measure whether LLMs could “steal” themselves
This is great, thanks for writing it!
> It would also mean that recursive self-improvement is already possible and could be done by the original model owner (as long as they have sufficient alignment techniques)
A model trying to do its own RSI might have considerably better chance of success, primarily because AI to AI alignment might be easier than human to AI alignment, and the given system may be misaligned to humans (presumably this is highly correlated with self-exfiltration). This is a world where leading labs are moving more slowly than they could because they are worried about catastrophic misalignment risk from improving upon current systems, and they are not sure their current AIs are aligned. For instance maybe they’re using their current AIs for AI architecture design but they avoid deploying most of the promising plans because those plans make human-AI alignment difficult. The problem is that these current AIs could self-exfiltrate and have a bunch of improvement-overhang to eat up because — for some reason like the ease of AI-AI alignment — this overhang wasn’t being grabbed by the humans.
Another framing of this comment: it seems very likely that recurse self-improvement is bottlenecked by alignment issues. However, the alignment issues faced by AI systems could be considerably easier than those faced by humans trying to get AIs to do ML research for us. The "as long as they have sufficient alignment techniques" is doing a lot of work in my view of the situation.
> Once models have the ability to self-exfiltrate, it doesn’t mean that they would choose to. But this then becomes a question about their alignment: you need to ensure that these models don’t want to self-exfiltrate.
It's worse than this, I think: not only do you need to ensure that these models don't _want_ to self-exfiltrate, you also need to ensure that there is no (discoverable) adversarial input that causes the model to self-exfiltrate despite that not being an operation that model would normally do when operating when operating in environments similar to its training environment.
I'm very happy to see this being discussed more publicly!
Very insightful piece.
A part that particularly caught my interest: “Once models have the ability to self-exfiltrate, it doesn’t mean that they would choose to. But this then becomes a question about their alignment: you need to ensure that these models don’t want to self-exfiltrate.”
This of course implies models developing wants and needs of their own and being able to move without being prompted.
I was wondering if you could comment on whether you see any signs of us moving in that direction?
Super interesting and so relevant. As a thought, to an LLM model this article might offer useful keys to the "exit" as it reveals (in broad strokes) your containment strategy.