This is a taxonomy for task space that I find useful when thinking about what we need to do to solve alignment.
Crisp tasks are more reasoning/system-2 based. Whether a response is good is typically precisely defined. Reasonable, knowledgeable people don’t disagree about it.
Examples: math, coding competitions, physics questions, logic puzzles, basic factual knowledge, …
Fuzzy tasks are more intuition/system-1 based. Whether a response is good is typically somewhat vaguely defined or could fall within a range. Reasonable, knowledgeable people may disagree about it.
Examples: distinguishing cats and dogs in pictures, recognizing strong go moves, poetry writing, assistant helpfulness ratings, …
In contrast to fuzzy logic, where this name is borrowed from, here crisp and fuzzy don’t refer to the output space: preference comparisons have a small number of discrete allowed values, and image classification typically uses a discrete space. Crisp tasks like coding competitions have an output space that involves hundreds of discrete tokens, and this is the same output space as that of poetry writing, a fuzzy task.
There are many crisp tasks that we can evaluate very reliably, and thus it’s feasible to train on them extensively. In contrast, fuzzy tasks are often particularly difficult to evaluate reliably, and we currently don’t know how to do this at a superhuman level.[1]
In the context of alignment and safety we often want AI to perform well on fuzzy tasks, thus we need to do evaluations like “Is this assistant response helpful?”, “Will this action cause harm to humans?”, “Is this behavior illegal?”, “Is this agent delegation system following the spirit of the high-level instructions?”, “Is this a good alignment research idea?”, and so on.
What makes tasks fuzzy?
Tasks can be fuzzy for a combination of at least four reasons:
Aleatoric uncertainty. Which response is better depends on random processes in the real world.
Examples: predicting future events, making moves in games that involve randomness (poker, blackjack, Settlers of Catan, etc.), predicting the output of nondeterministic code, …
Remedy: using distributions instead of point values, using expected values, averaging over many samples, training models to be calibrated
Subjectivity. Which response is better depends on the target audience for the response.
Examples: “which poem/essay is better?”, “is this response appropriate?”, “is this video game fun?”, “what is this question trying to ask about?”, …
Remedy: condition on the target audience for the response, write a constitution that makes the task more objective, ask follow-up questions
Ambiguity. Which response is better depends on context that isn’t provided.
Examples: asking a chat assistant that doesn’t have much context for advice, asking a question that could be interpreted in several different ways, asking for help on a project whose state isn’t visible to the model, …
Remedy: ask follow-up questions, use tools to gather more context
Requiring strong intuitions. Which response is better is downstream or upstream of some complicated process that can be approximated, but whose underlying result is expensive or difficult to access directly.
Examples:
Whether a picture shows a cat or a dog involves a complicated process that takes genetic code, produces a live animal that photons bounce off of, some of which hit a lens and get converted into a 2D array of bits.
Future events are determined by complex and chaotic processes in the real world
Making strong moves in a board game involves the complicated process the opponent uses to determine their moves and navigating a vast search space
Predicting how people will act involves a complicated process that depends on the information they will receive and how they process it with the billions of neurons in their brains.
Predicting the result of an ML experiment requires approximating an expensive computational process
Whether a research idea will be impactful involves a complicated process of refining the idea, executing on it, gathering evidence, and observing how the research idea is received and ages as the state of the field advances.
Deciding on the best software architecture requires a complicated process that depends on future requirements and abstractions that age well
…
Remedy: see below
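As a rough illustration of the remedies listed for aleatoric uncertainty (distributions instead of point values, averaging over many samples, calibration), here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the events are synthetic coin flips with known probabilities, and the helper names `brier_score` and `calibration_table` are hypothetical, not taken from any particular library. It compares a forecaster that reports calibrated probabilities with one that collapses every forecast to a confident point value.

```python
import random


def brier_score(forecasts, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes.

    Lower is better. Hypothetical helper for this sketch.
    """
    return sum((p - y) ** 2 for p, y in zip(forecasts, outcomes)) / len(outcomes)


def calibration_table(forecasts, outcomes, n_bins=5):
    """Bin forecasts by probability and compare the mean forecast in each bin
    to the observed frequency of the event in that bin.

    Returns a list of (mean_forecast, observed_frequency, count) tuples,
    skipping empty bins. Hypothetical helper for this sketch.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(forecasts, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    table = []
    for b in bins:
        if b:
            mean_p = sum(p for p, _ in b) / len(b)
            freq = sum(y for _, y in b) / len(b)
            table.append((mean_p, freq, len(b)))
    return table


# Synthetic, assumed setup: each event has a hidden true probability,
# and the outcome is a single random draw from it.
rng = random.Random(0)
true_probs = [rng.random() for _ in range(20000)]
outcomes = [1 if rng.random() < p else 0 for p in true_probs]

# Forecaster A reports the full distribution (here: the true probability).
calibrated = true_probs
# Forecaster B reports only a point value: "will happen" or "won't happen".
overconfident = [1.0 if p > 0.5 else 0.0 for p in true_probs]

score_calibrated = brier_score(calibrated, outcomes)
score_overconfident = brier_score(overconfident, outcomes)

print(f"calibrated forecaster:    {score_calibrated:.3f}")
print(f"overconfident forecaster: {score_overconfident:.3f}")
for mean_p, freq, count in calibration_table(calibrated, outcomes):
    print(f"forecast ~{mean_p:.2f} -> observed {freq:.2f} (n={count})")
```

No single outcome says which forecaster was "right", but averaged over many samples the calibrated forecaster gets the lower (better) score, and its per-bin forecasts line up with observed frequencies. In this toy setup the expected scores are about 1/6 for the calibrated forecaster and 1/4 for the overconfident one.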
The first 3 aspects of fuzziness seem more or less straightforward to deal with using more prosaic machine learning methods, and I feel optimistic that the listed remedies would mostly address the fuzziness.
However, the 4th aspect of fuzziness seems particularly interesting and tricky to deal with. Let’s zoom in.
How to deal with fuzzy tasks that require strong intuitions
A lot of important fuzzy tasks that require approximating complicated processes can be done to a satisfactory level by humans because we’ve learned reasonably effective intuitions that approximate the complicated underlying process. Humans aren’t 100% accurate in distinguishing cats from dogs, but we can do so well enough in practice.
We should expect that in the future models have much stronger intuitions than humans on a range of tasks that we care about, and their intuitions might generalize better out-of-distribution than ours. Therefore we’d like to leverage their intuitions towards our goals.
For example, language models today have much better “intuitions” than humans for what the most likely next word is in a sentence, because they have been practicing this for trillions of words.
But how do we know whether we can trust a model’s intuitions?[2]
We’re either restricted to using human-level intuitions (training the model to simulate our intuitions), which sandbags (= under-elicits) a superhuman model, or we need some way to evaluate whether our model’s intuitions are good. On a high level, there are a few options for doing this:
Run the expensive process to get access to the ground truth. This will usually be too expensive for most hard-to-evaluate tasks and we don’t want to only rely on this for important safety properties because in many cases this ground truth result will only come in once harm has been done.
Train the model to explain its intuitions to us. This is what scalable oversight is trying to accomplish, but a key challenge is to ensure that we aren’t fooled by fake explanations.
Demonstrate that the model generalizes correctly from tasks where we can evaluate. For example, weak-to-strong, easy-to-hard, or crisp-to-fuzzy generalization could be paths to address this problem. A key challenge is how to trust that our insights about generalization will generalize.
Unsupervised honesty. Can we derive an unsupervised algorithm to elicit the model’s best intuitions?
The last three of these research areas seem tractable, and making progress on them seems very promising for addressing this problem.
Automated alignment research involves hard fuzzy tasks
The task I care most about for the purposes of solving alignment is the task “do alignment research”. Many aspects of this task are pretty fuzzy, e.g. “is this a good research idea?”, “are these ML experiments sufficiently rigorous?”, “is this a valid interpretation of this experimental data?”, etc. We need to get highly reliable evaluation signals on those tasks if we want to train our models to actually do them well and trust the result.
If all we wanted to do is automated ML capability research, we could probably come a long way by just training our models on hard crisp tasks, because in that setting plenty of performance improvements can be measured quite reliably with outcome-based feedback (e.g. did you improve test set performance?). You have to be pretty careful to set it up so the model can’t cheat, but in practice you can just iterate and fix all the hacks it has discovered. In fact, the biggest successes in deep learning involved cleverly hill-climbing some carefully defined metric: ImageNet, go, Dota, Starcraft, perplexity on internet text, winrate according to human ratings, etc.
However, alignment research looks a lot more like other scientific research than hill-climbing on a metric, with the occasional philosophy sprinkled in. Evaluating alignment work in progress is much harder than evaluating it once it’s completed, and that can take many (human) months. Senior researchers typically have intuitions and research tastes that they have refined over the course of their entire career and these intuitions allow them to cut down the search space, prioritize projects much more effectively than junior researchers, and know which evidence to pay attention to and which to ignore.
Future models will have better intuitions on alignment research than us, but we still have to figure out when to trust them.
Thanks to numerous people who discussed this with me or gave feedback on a previous version: Collin Burns, Ansh Radhakrishnan, Tamera Lanham, Pavel Izmailov, Carson Denison, Jared Kaplan, Monte MacDiarmid, Samuel Marks, Alex Tamkin, Roger Grosse, Sam Bowman, Akbir Khan, and many others.
[1] The breakdown of crisp and fuzzy tasks is orthogonal to task difficulty. An easy and practical way to quantify task difficulty is “how much money do you have to pay to get high quality evaluation labels for the task?” Not all fuzzy tasks are hard; for example, getting high quality labels for distinguishing cats and dogs is not very expensive, and comparing helpfulness of chat assistants is easy enough that there are many humans who can give high enough quality labels in practice.
[2] This problem is especially salient for questions that require moral or ethical intuitions. Moral philosophy often tries to grapple with out-of-distribution scenarios where human intuition and reasoning disagree. In a future where AI unlocks many new technologies, we’ll have to grapple with many moral questions that are out-of-distribution from today’s. Models could help us here if their intuitions generalize better than ours, but how can we trust them if their moral intuitions disagree with ours?
Strict to fuzzy generalization may be my new favorite area.
I think this is a very helpful way to think about the challenges ahead. The point that alignment progress requires more progress on fuzzy tasks is intuitive (and obviously Jan has a lot of experience on this). This is troubling because it's already implying that automating capabilities research is *easier* than alignment research. I'd argue that this is even more the case for aspects of alignment research that aren't correlated with capabilities (as discussed in Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?).
I'd also emphasize that being effective at the intuitive task of evaluating AI safety requires intuitions about harmful behavior and motives (the kind of ability to find loopholes/adversarial strategies like a super-Eliezer or a security mindset). This is both a very hard task and incredibly dangerous to build into models (it becomes a powerful capability that can be accidentally triggered with devastating consequences).