Musings on the Alignment Problem
January 2025
Should we control AI instead of aligning it?
(Spoiler: no)
Jan 24, 2025 • Jan Leike
November 2024
Crisp and fuzzy tasks
Why fuzzy tasks matter and how to align models on them
Nov 22, 2024 • Jan Leike
Two alignment threat models
Why under-elicitation and scheming are both important to address
Nov 8, 2024 • Jan Leike
December 2023
Combining weak-to-strong generalization with scalable oversight
A high-level view on how this new approach fits into our alignment plans
Dec 20, 2023 • Jan Leike
September 2023
Self-exfiltration is a key dangerous capability
We need to measure whether LLMs could “steal” themselves
Sep 13, 2023 • Jan Leike
March 2023
A proposal for importing society’s values
Building towards Coherent Extrapolated Volition with language models
Mar 9, 2023 • Jan Leike
December 2022
Distinguishing three alignment taxes
The impact of different alignment taxes depends on the context
Dec 19, 2022 • Jan Leike
Why I’m optimistic about our alignment approach
Some arguments in favor and responses to common objections
Dec 5, 2022 • Jan Leike
September 2022
What could a solution to the alignment problem look like?
A high-level view on the elusive once-and-for-all solution
Sep 27, 2022 • Jan Leike
May 2022
What is inner alignment?
An explanation using the language of machine learning
May 8, 2022 • Jan Leike
March 2022
A minimal viable product for alignment
Bootstrapping a solution to the alignment problem
Mar 29, 2022 • Jan Leike
Why I’m excited about AI-assisted human feedback
How to scale alignment techniques to hard tasks
Mar 29, 2022 • Jan Leike