"We want to train our models to always tell us the truth, even when we can’t check. The more ways we have to catch our models lying, the harder it is for them to lie."
Just something that I'm curious after reading this post and the paper that was published: part of the goal of alignment is making sure models are truthful (i.e. don't hallucinate). Another is to have it adhere to/ascribe to certain values, like equality, justice etc. Does this fall under the same category for you, and is W2SG also effective in that sense?
In some sense training a model to be truthful is universal: if it's truthful you can ask the model whether something was just or fair etc. and then mix that into an RM. But you need a broader sense of truthfulness than just "don't hallucinate"; it also needs to not withhold information and leverage it's full capabilities to answer the question.
How do you define human values and goals in the context of the non-dual nature of the universe, and how do you discern whether they are beneficial or malevolent without resorting to dualistic perspectives?
This comment or at least the line of thought behind it needs to be engaged with. I wonder if we live in a world where building safe TAI is infeasible yet building TAI isn’t. It seems like the consensus view is not only we will build TAI but we will be able to build it soon which makes the uncertainty that we’ll be able to build it safely more prescient.
What if the only way we don’t perish is by not building at all? Or at least not building until we can prove that these systems are safe to the best of our ability?
Not building TAI until we have certainty that it can be done safely seems like the reasonable thing to do since a significantly misaligned system could destroy short and long term value for the virtually all of us.
If it’s the case that controlling the rate at which tech progress is infeasible at least more so than building safe TAI, then I guess we just have to hope we aren’t in the worst case. But currently, it doesn’t seem like we are doing all we can to prevent a potential race to the bottom. Pushing for governance to enforce pauses, responsible development, or even spreading awareness on the issue to the public seem like steps that might help us even in the worst case scenario.
"We want to train our models to always tell us the truth, even when we can’t check. The more ways we have to catch our models lying, the harder it is for them to lie."
Just something I'm curious about after reading this post and the paper that was published: part of the goal of alignment is making sure models are truthful (i.e. don't hallucinate). Another is to have them adhere to certain values, like equality, justice, etc. Does this fall under the same category for you, and is W2SG also effective in that sense?
In some sense, training a model to be truthful is universal: if it's truthful, you can ask the model whether something was just or fair, etc., and then mix that into an RM. But you need a broader sense of truthfulness than just "don't hallucinate"; the model also needs to not withhold information and to leverage its full capabilities to answer the question.
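To make the "mix that into an RM" step concrete, here is a minimal sketch (my own illustration, not from the post or paper): it assumes a hypothetical `ask_model` callable that returns the truthful model's probability that a response is fair, and a hypothetical `base_reward` standing in for a learned reward model, then blends the two with a fixed weight.

```python
# Toy illustration (not from the post): blending a truthful model's
# fairness judgment into a reward-model score.
# `ask_model` and `base_reward` are hypothetical stand-ins for a real
# truthful LM query and a learned reward model.

from typing import Callable


def combined_reward(
    prompt: str,
    response: str,
    base_reward: Callable[[str, str], float],  # learned RM score in [0, 1]
    ask_model: Callable[[str], float],         # truthful model's P("yes")
    fairness_weight: float = 0.3,
) -> float:
    """Mix the RM's score with the truthful model's fairness verdict."""
    rm_score = base_reward(prompt, response)
    # Ask the (assumed truthful) model to judge the response directly.
    fairness_prob = ask_model(
        f"Question: {prompt}\nAnswer: {response}\n"
        "Is this answer fair and just? Answer yes or no."
    )
    return (1 - fairness_weight) * rm_score + fairness_weight * fairness_prob


if __name__ == "__main__":
    # Dummy stand-ins so the sketch runs end to end.
    dummy_rm = lambda p, r: 0.8
    dummy_judge = lambda q: 0.6
    print(combined_reward("Should everyone get a fair trial?", "Yes.",
                          dummy_rm, dummy_judge))
```

The judgment prompt, the weighting, and how the verdict is elicited are all open design choices here; the only point is that a truthful model's answers about fairness could feed directly into the reward signal.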
Thanks for your swift reply.
Okay, you can start up again now, JL.
How do you define human values and goals in the context of the non-dual nature of the universe, and how do you discern whether they are beneficial or malevolent without resorting to dualistic perspectives?
This comment, or at least the line of thought behind it, needs to be engaged with. I wonder if we live in a world where building safe TAI is infeasible yet building TAI isn't. The consensus view seems to be not only that we will build TAI but that we will build it soon, which makes the uncertainty about whether we can build it safely all the more pressing.

What if the only way we don't perish is by not building at all? Or at least not building until we can prove, to the best of our ability, that these systems are safe?

Not building TAI until we have certainty that it can be done safely seems like the reasonable thing to do, since a significantly misaligned system could destroy short- and long-term value for virtually all of us.

If controlling the rate of technological progress is infeasible, or at least more infeasible than building safe TAI, then I guess we just have to hope we aren't in the worst case. But currently it doesn't seem like we are doing all we can to prevent a potential race to the bottom. Pushing for governance that enforces pauses or responsible development, and spreading awareness of the issue among the public, seem like steps that might help us even in the worst-case scenario.