"We want to train our models to always tell us the truth, even when we can’t check. The more ways we have to catch our models lying, the harder it is for them to lie."
Just something that I'm curious after reading this post and the paper that was published: part of the goal of alignment is making sure models are truthful (i.e. don't hallucinate). Another is to have it adhere to/ascribe to certain values, like equality, justice etc. Does this fall under the same category for you, and is W2SG also effective in that sense?
In some sense training a model to be truthful is universal: if it's truthful you can ask the model whether something was just or fair etc. and then mix that into an RM. But you need a broader sense of truthfulness than just "don't hallucinate"; it also needs to not withhold information and leverage it's full capabilities to answer the question.
How do you define human values and goals in the context of the non-dual nature of the universe, and how do you discern whether they are beneficial or malevolent without resorting to dualistic perspectives?
This comment or at least the line of thought behind it needs to be engaged with. I wonder if we live in a world where building safe TAI is infeasible yet building TAI isn’t. It seems like the consensus view is not only we will build TAI but we will be able to build it soon which makes the uncertainty that we’ll be able to build it safely more prescient.
What if the only way we don’t perish is by not building at all? Or at least not building until we can prove that these systems are safe to the best of our ability?
Not building TAI until we have certainty that it can be done safely seems like the reasonable thing to do since a significantly misaligned system could destroy short and long term value for the virtually all of us.
If it’s the case that controlling the rate at which tech progress is infeasible at least more so than building safe TAI, then I guess we just have to hope we aren’t in the worst case. But currently, it doesn’t seem like we are doing all we can to prevent a potential race to the bottom. Pushing for governance to enforce pauses, responsible development, or even spreading awareness on the issue to the public seem like steps that might help us even in the worst case scenario.
"We want to train our models to always tell us the truth, even when we can’t check. The more ways we have to catch our models lying, the harder it is for them to lie."
Just something I'm curious about after reading this post and the paper that was published: part of the goal of alignment is making sure models are truthful (i.e. don't hallucinate). Another is to have them adhere to certain values, like equality, justice, etc. Does this fall under the same category for you, and is W2SG also effective in that sense?
In some sense, training a model to be truthful is universal: if it's truthful, you can ask the model whether something was just or fair, etc., and then mix that into an RM. But you need a broader sense of truthfulness than just "don't hallucinate"; the model also needs to not withhold information and to leverage its full capabilities to answer the question.
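To make the "mix that into an RM" step concrete, here is a minimal sketch (my own illustration, not from the post or paper): it assumes a hypothetical `ask_model` callable that returns the truthful model's probability that a response is fair, and a hypothetical `base_reward` standing in for a learned reward model, then blends the two with a fixed weight.

```python
# Toy illustration (not from the post): blending a truthful model's
# fairness judgment into a reward-model score.
# `ask_model` and `base_reward` are hypothetical stand-ins for a real
# truthful LM query and a learned reward model.

from typing import Callable


def combined_reward(
    prompt: str,
    response: str,
    base_reward: Callable[[str, str], float],  # learned RM score in [0, 1]
    ask_model: Callable[[str], float],         # truthful model's P("yes")
    fairness_weight: float = 0.3,
) -> float:
    """Mix the RM's score with the truthful model's fairness verdict."""
    rm_score = base_reward(prompt, response)
    # Ask the (assumed truthful) model to judge the response directly.
    fairness_prob = ask_model(
        f"Question: {prompt}\nAnswer: {response}\n"
        "Is this answer fair and just? Answer yes or no."
    )
    return (1 - fairness_weight) * rm_score + fairness_weight * fairness_prob


if __name__ == "__main__":
    # Dummy stand-ins so the sketch runs end to end.
    dummy_rm = lambda p, r: 0.8
    dummy_judge = lambda q: 0.6
    print(combined_reward("Should everyone get a fair trial?", "Yes.",
                          dummy_rm, dummy_judge))
```

The judgment prompt, the weighting, and how the verdict is elicited are all open design choices here; the only point is that a truthful model's answers about fairness could feed directly into the reward signal.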
Thanks for your swift reply.
Okay, you can start up again now, JL.
How do you define human values and goals in the context of the non-dual nature of the universe, and how do you discern whether they are beneficial or malevolent without resorting to dualistic perspectives?
This comment, or at least the line of thought behind it, needs to be engaged with. I wonder if we live in a world where building safe TAI is infeasible yet building TAI isn't. The consensus view seems to be not only that we will build TAI but that we will build it soon, which makes the uncertainty about whether we can build it safely all the more pressing.

What if the only way we don't perish is by not building at all? Or at least not building until we can prove, to the best of our ability, that these systems are safe?

Not building TAI until we have certainty that it can be done safely seems like the reasonable thing to do, since a significantly misaligned system could destroy short- and long-term value for virtually all of us.

If controlling the rate of technological progress is infeasible, or at least more infeasible than building safe TAI, then I guess we just have to hope we aren't in the worst case. But currently it doesn't seem like we are doing all we can to prevent a potential race to the bottom. Pushing for governance that enforces pauses or responsible development, and spreading awareness of the issue among the public, seem like steps that might help us even in the worst-case scenario.