Musings on the Alignment Problem
January 2025
Should we control AI instead of aligning it?
(Spoiler: no)
Jan 24, 2025 • Jan Leike
November 2024
Crisp and fuzzy tasks
Why fuzzy tasks matter and how to align models on them
Nov 22, 2024 • Jan Leike
Two alignment threat models
Why under-elicitation and scheming are both important to address
Nov 8, 2024 • Jan Leike
December 2023
Combining weak-to-strong generalization with scalable oversight
A high-level view on how this new approach fits into our alignment plans
Dec 20, 2023 • Jan Leike
September 2023
Self-exfiltration is a key dangerous capability
We need to measure whether LLMs could “steal” themselves
Sep 13, 2023 • Jan Leike
March 2023
A proposal for importing society’s values
Building towards Coherent Extrapolated Volition with language models
Mar 9, 2023 • Jan Leike
December 2022
Distinguishing three alignment taxes
The impact of different alignment taxes depends on the context
Dec 19, 2022 • Jan Leike
Why I’m optimistic about our alignment approach
Some arguments in favor and responses to common objections
Dec 5, 2022 • Jan Leike
September 2022
What could a solution to the alignment problem look like?
A high-level view on the elusive once-and-for-all solution
Sep 27, 2022 • Jan Leike
May 2022
What is inner alignment?
An explanation using the language of machine learning
May 8, 2022 • Jan Leike
March 2022
A minimal viable product for alignment
Bootstrapping a solution to the alignment problem
Mar 29, 2022 • Jan Leike
Why I’m excited about AI-assisted human feedback
How to scale alignment techniques to hard tasks
Mar 29, 2022 • Jan Leike