Discussion about this post

Jurgen Gravestein

"We want to train our models to always tell us the truth, even when we can’t check. The more ways we have to catch our models lying, the harder it is for them to lie."

Just something I'm curious about after reading this post and the published paper: part of the goal of alignment is making sure models are truthful (i.e., don't hallucinate). Another is to have them adhere to certain values, like equality, justice, etc. Does this fall under the same category for you, and is W2SG also effective in that sense?

Tam Hunt

Appreciate the work you're doing, Jan, and the work of others focused on AI safety/alignment.

However, I again urge you and your colleagues to undertake a big-picture assessment of the possible solution space available here. Is there a possible solution to superalignment that even begins to approach the certainty we'll need to actually have safe AGI/ASI?

How could we know? Do we know enough now to know the answer? As you know from our previous discussions, I am in the camp of "we know enough already to know that there is no solution to superalignment, since it's logically impossible for a vastly more-than-human intelligent entity to be controlled in any significant way by humans."

As such, I am now of the view that efforts on AI safety, conducted without a concurrent global pause in frontier model development, are simply enabling irresponsible AI development.

The recent turmoil at OpenAI is a pertinent example of the dangers of human messiness when dealing with massively dangerous tools.

An essay of mine is coming out on these issues in Scientific American shortly. I'll post it here when it comes out. I'd appreciate any further responses you have to my thoughts here.
