Discussion about this post

Daniel Popescu / ⧉ Pluralisk

It's interesting how you're thinking about the limits of direct RLHF. What if our 'robust reward model' gets so good at generalizing human preferences that it starts to anticipate desires we haven't even formed yet, like an overeager personal assistant?

