How to scale alignment techniques to hard tasks
It's interesting how you're thinking about the limits of direct RLHF. What if our 'robust reward model' gets so good at generalizing human preferences that it starts to anticipate desires we haven't even formed yet, like an overeager personal assistant?
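For context on what "generalizing human preferences" means mechanically, here is a minimal sketch of the pairwise (Bradley-Terry) objective commonly used to train RLHF reward models. It is illustrative only: the toy linear scorer, the feature tensors, and names like `chosen_feats` / `rejected_feats` are assumptions for the example, not anything from the discussion above. The point it makes is that such a model is only ever fit to comparisons humans have actually expressed, so "anticipating desires we haven't formed yet" would be extrapolation beyond its training signal.

```python
# Minimal sketch (assumed setup): a toy reward model trained with the
# Bradley-Terry pairwise preference loss typical of RLHF pipelines.
# Feature vectors stand in for response embeddings; all names are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyRewardModel(nn.Module):
    def __init__(self, dim: int = 16):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # maps a response embedding to a scalar reward

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.score(feats).squeeze(-1)

def preference_loss(model: nn.Module,
                    chosen_feats: torch.Tensor,
                    rejected_feats: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry objective: push the reward of the human-preferred
    # response above the reward of the rejected one.
    margin = model(chosen_feats) - model(rejected_feats)
    return -F.logsigmoid(margin).mean()

if __name__ == "__main__":
    torch.manual_seed(0)
    model = ToyRewardModel()
    optim = torch.optim.Adam(model.parameters(), lr=1e-2)
    chosen = torch.randn(32, 16)    # stand-ins for embeddings of preferred responses
    rejected = torch.randn(32, 16)  # stand-ins for embeddings of rejected responses
    for _ in range(100):
        optim.zero_grad()
        loss = preference_loss(model, chosen, rejected)
        loss.backward()
        optim.step()
```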