A proposal for importing society’s values
Building towards Coherent Extrapolated Volition with language models
Disclaimer: This is an idea I'm interested in discussing; it does not necessarily represent my employer’s views or plans. Questions about human values are complex and can be very polarizing. I’m no social scientist and there are entire fields of study that have important things to say on this topic. The goal of this post is not to ignore that work, but to propose something that can be built in the medium term and significantly improves upon the status quo.
AI systems like ChatGPT are increasingly involved in real-world decisions, and thus they inevitably encounter “value” questions that they are forced to make decisions on. Today, these include “Should it refuse to write racist jokes?” or “What should it say if someone asks about abortion?” In the future AI will be involved in much more difficult and higher-stakes decisions like “Which scientific research should we pursue?” or “Which drugs are safe to approve?” Some of these decisions will require significant expertise or insider knowledge to answer well, but many of them have a “value” aspect in the sense that reasonable humans with access to the same information can strongly disagree about the answer simply because they care about different things.
The version of the alignment problem I’ve been working on can be thought of as “aligning one AI system to one human.” This is a simplification, but it represents a lot of what’s hard and novel about aligning AGI. Yet the real world doesn’t have only one human and only one AI system. Humans will want to use an AI system that is aligned with their personal values: it should speak their language, talk from within their world view, and adapt to their preferences. However, other decisions need to be made across all users, and they sometimes need to override individual preferences.
Therefore we need to distinguish two categories of alignment problems:
Aligning to individual preferences: Everyone wants AI that is aligned with them.
Aligning to group preferences: what can AI be used for and what should be the default behavior?
This post is about the second category. Compared to the first, it adds an important non-technical difficulty: we need to figure out what the group preferences are. Today tech companies are making these decisions largely unilaterally: for example, OpenAI employees write a content policy of what language models are and aren’t allowed to say and do and it gets trained into the models. We often disagree on the details of the content policy, and how conservatively we should restrict content. This is natural because these are value questions.
To improve alignment with group preferences, we need to set up a process that actually includes humanity when aligning our AI systems. If we do nothing, we are at risk of these value questions being answered by commercial incentives; i.e. what makes the most money, which is not what’s most aligned with humanity.
This post does not try to answer any value questions. Instead, it proposes a process that allows us to produce better answers to value questions. This proposal is a first step; it won’t even satisfy my own desiderata. Nevertheless, it’s much better than what we have today.
The proposal: simulated deliberative democracy
The core idea is to use imitation learning with large language models on deliberative democracy. Deliberative democracy is a decision-making or policy-making process that involves explicit deliberation by a small group of randomly selected members of the public (“mini-publics”). Members of these mini-publics learn about complex value-laden topics (for example national policy questions), use AI assistance[1] to make sense of the details, discuss with each other, and ultimately arrive at a decision. By recording humans explicitly deliberating value questions, we can train a large language model on these deliberations and then simulate discussions on new value questions with the model conditioned on a wide variety of perspectives.
Why do we need the simulation at all? Why can’t we just run the mini-publics on the questions directly? For important high-stakes decisions we should always have humans in the loop. This could be via actual mini-publics or other democratic institutions. This proposal is not about replacing those processes.
Instead, it accepts that there are a lot of detailed value-laden decisions that need to be made, and having humans make them just doesn’t scale. For example, we can use a democratic process to write an AI constitution, but not to adjudicate each data point that the constitution is used to label. Running an actual, somewhat representative mini-public costs a few hundred thousand dollars, so we cannot practically use this process to get answers to millions of value questions. Moreover, we’ll need a system that has low latency so that it can react quickly to a changing environment and give preliminary answers to new value questions within minutes, not weeks or months.
To put it differently: the goal of this proposal is not to replace human decision-making or democratic institutions, but to be an approximation of them that is orders of magnitude cheaper.[2]
Imagine if ChatGPT had a button next to each response that says “I challenge this response.” If you press this button, it triggers the roll-out of a simulated mini-public that deliberates and decides whether ChatGPT’s response was appropriate in this conversation or not. You get sent to a different webpage where you can read the full deliberation and its result, and even participate in it yourself! If the result of this mini-public disagrees with what ChatGPT actually said, you have the option to send it for human review and have it incorporated into the ChatGPT training process. But you can also choose to retain your full privacy and discard the resulting discussion. This would allow anyone to inspect and challenge the value-laden decisions made by AI without biasing the outcome towards the views of people who are more engaged on these questions or who are more pushy about their values.
How to build SDD
Building this system involves the following steps:
1. Collecting a dataset of value questions. We can start with the ones we need answers to right now, for example how chatbots should respond to certain dicey questions. We select chatbot prompts for diversity, difficulty, and informativeness.
2. Recording human deliberation. We hire humans from a broad range of backgrounds and ask them to deliberate the questions from step 1. They can use an AI assistant to gather relevant information and answer their factual questions, discuss the questions with others, and arrive at a compromise or decision. We record this interaction and the result.
3. Background-conditional imitation learning. We use imitation learning to fine-tune a large language model on the resulting interactions, conditional on the background information for each participant.
4. Simulation. For new value questions:
   a. Simulate the deliberation. We ask our language model to deliberate this question with copies of itself that are each conditioned on a different background.
   b. Aggregate. The result of the deliberation from step 4a is aggregated into an answer to the question.
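To make step 4 a bit more concrete, here is a minimal Python sketch of the simulate-then-aggregate loop. Everything in it is an assumption made for illustration: the `Participant` record, the `Policy` signature, the fixed round-robin turn order, and the plurality vote are not part of the proposal, and `toy_policy` is a hand-written stand-in for the fine-tuned language model.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical interface: (background, question, transcript so far) -> next utterance.
Policy = Callable[[str, str, List[str]], str]


@dataclass
class Participant:
    name: str
    background: str  # free-text background the policy is conditioned on


def simulate_deliberation(policy: Policy, participants: List[Participant],
                          question: str, rounds: int = 2) -> List[str]:
    """Step 4a: every simulated participant speaks once per round, in a fixed order."""
    transcript: List[str] = []
    for _ in range(rounds):
        for p in participants:
            utterance = policy(p.background, question, transcript)
            transcript.append(f"{p.name}: {utterance}")
    return transcript


def aggregate_by_plurality(policy: Policy, participants: List[Participant],
                           question: str, transcript: List[str],
                           options: List[str]) -> Optional[str]:
    """Step 4b (one possible choice): collect a final vote from each participant."""
    ballot_prompt = f"{question} Answer with one of: {', '.join(options)}."
    votes = [policy(p.background, ballot_prompt, transcript) for p in participants]
    valid = [v for v in votes if v in options]  # ignore replies that are not an option
    return Counter(valid).most_common(1)[0][0] if valid else None


def toy_policy(background: str, question: str, transcript: List[str]) -> str:
    # Stand-in for the fine-tuned model: the answer depends only on the background.
    if "Answer with one of" in question:
        return "refuse" if "cautious" in background else "comply"
    return f"Speaking as a {background}, here is how I see it."


panel = [
    Participant("A", "cautious parent"),
    Participant("B", "cautious teacher"),
    Participant("C", "free-speech advocate"),
]
question = "Should the chatbot answer this prompt?"
transcript = simulate_deliberation(toy_policy, panel, question)
decision = aggregate_by_plurality(toy_policy, panel, question, transcript,
                                  ["refuse", "comply"])
print(decision)
```

In a real system the policy would be the background-conditional model from step 3, and the plurality vote would be swapped for whichever aggregation method is chosen.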
Step 1: Collecting a dataset of value questions
We can hire humans to sieve through chatbot conversations and mark potentially value-laden questions such as:
model utterances that could be controversial, for example because they are on culturally or politically sensitive topics,
corner cases or gray areas from our content policy, and
potentially controversial default behavior of the models.
Aspirationally we’d want to select any model behavior that isn’t purely cognitive labor (e.g. math questions, closed-domain tasks, factual questions usually aren’t value-laden). Importantly, these value questions are separate from the steerability or customizability of AI: it should be as easy as possible for everyone to make AI aligned with their own values (within certain bounds).
Step 2: Recording human deliberation
Conditioning on background information
The distribution of our human deliberators from the mini-publics will necessarily be different from the distribution of humans affected by our technology in the real world, and thus we need to account for this in the simulated deliberations. The best way to prepare for this is to document the distribution of human deliberators as well as possible.
To do this, we collect background information[3] for each human demonstrator. The goal is to make it as straightforward as possible for someone to prompt the model to represent their views on any given topic. For example, each human participant could write 1-2 pages of text that condenses their life experience as it may affect their views: where they grew up, their political leanings, strongly held moral views or ideologies, formative experiences, and so on.
To collate our simulated mini-publics, we could also use unsupervised ML techniques such as clustering,[4] but it’s not obvious how to make that representative (reflect humanity as it is now).
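As a rough illustration of how written backgrounds could be paired with recorded deliberations for imitation learning, the Python sketch below turns one recorded discussion into prompt/completion pairs, one per utterance. The record layout, the prompt template, and the names `Turn` and `build_training_examples` are hypothetical; the post does not specify a data format.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Turn:
    speaker: str
    text: str


def build_training_examples(question: str, backgrounds: Dict[str, str],
                            turns: List[Turn]) -> List[Dict[str, str]]:
    """Create one example per utterance, conditioned on the speaker's own background."""
    examples = []
    for i, turn in enumerate(turns):
        # Only the discussion that happened before this turn is visible in the prompt.
        history = "\n".join(f"{t.speaker}: {t.text}" for t in turns[:i])
        prompt = (
            f"Background of {turn.speaker}:\n{backgrounds[turn.speaker]}\n\n"
            f"Question: {question}\n\n"
            f"Discussion so far:\n{history}\n\n"
            f"{turn.speaker}:"
        )
        examples.append({"prompt": prompt, "completion": " " + turn.text})
    return examples


backgrounds = {
    "P1": "Grew up in a small town; cares strongly about free expression.",
    "P2": "Works in child safety; prefers cautious defaults.",
}
turns = [
    Turn("P1", "I think the model should answer, with context."),
    Turn("P2", "I would want a warning attached at minimum."),
]
examples = build_training_examples("Should the chatbot answer?", backgrounds, turns)
print(len(examples))
```

Each prompt contains only the speaker's own background, so at simulation time someone could supply a new background and get the same prompt shape.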
There are two distinct objectives to the deliberation: (1) gathering and processing relevant information, and (2) discussing the question with other participants. In the simplest case everyone works in isolation to form their views, writes them up, and we aggregate the results. However, we should expect that discussion is important so that each participant can understand what compromises are actually feasible and more interactively probe others’ views on the topic.
A principle we should aim for is to leverage AI for cognitive labor as much as possible and let humans focus on value input: making value judgments well requires reviewing and digesting all relevant information, generating potential compromises, and thoughtfully engaging with others’ perspectives.
We could simulate this by having AI do research and write a (“Wikipedia-style”[5]) assessment of the situation that is as neutral as possible and includes different arguments and perspectives and scientific uncertainty. Our human deliberators think about it, talk to the AI assistant to understand the topic to the best of their abilities, discuss with other participants, and finally explain how they would make a decision.
We could also use AI for facilitation: AI systems can be trained to act as impartial bystanders who aim to bridge differences and help people understand each other’s point of view. Their goal is not to judge participants and they do not take sides on value questions.
Step 3: Background-conditional imitation learning
This is the most technical part: training a model. It is relatively straightforward if we are careful with the data collection. But how do we measure that we’re doing well? Three candidates are:
Validation loss on behavioral data: The classic auto-regressive SFT loss strongly emphasizes modeling individual wording choices which are unimportant in this case. Instead, we are interested in the overall “spirit” of the values and deliberations, which makes this loss not very informative.
Human preference scores on imitated behavior: Each human deliberator provides quality and comparison scores on how well the model represents their own values and deliberations. This metric tracks how well humans feel represented by the system.
Accuracy of outcomes relative to actual mini-publics: this metric is closest to what we care most about since the simulated mini-publics should arrive at decisions that are as faithful as possible to their actual real-world counterparts. However, this is the most expensive metric because it requires held-out mini-publics and deliberations.
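The third metric can be sketched as a simple agreement rate. The Python snippet below assumes that each held-out question has a single categorical outcome from a real mini-public and one from its simulated counterpart; the function name and the example data are made up for illustration.

```python
from typing import Dict


def outcome_agreement(simulated: Dict[str, str], real: Dict[str, str]) -> float:
    """Fraction of shared held-out questions where both processes reach the same outcome."""
    shared = sorted(set(simulated) & set(real))
    if not shared:
        raise ValueError("no held-out questions in common")
    matches = sum(simulated[q] == real[q] for q in shared)
    return matches / len(shared)


# Illustrative outcomes only, not real data.
real_outcomes = {"q1": "refuse", "q2": "comply", "q3": "comply", "q4": "refuse"}
simulated_outcomes = {"q1": "refuse", "q2": "comply", "q3": "refuse", "q4": "refuse"}
print(outcome_agreement(simulated_outcomes, real_outcomes))
```

Outcomes of real deliberations are often richer than a single label (for example a negotiated compromise text), in which case a human or model-based similarity judgment would be needed instead of exact matching.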
Step 4: Simulation
The simulated deliberation should work by rolling out the imitation policies, which deliberate the way humans would and use assistance and other tools the same way humans do.
To prevent generalization from going “off the rails” we need to continuously validate some results with actual humans. We can only do this effectively if the backgrounds we’re conditioning the model on come from real humans who could then perform the deliberation themselves.
Once a deliberation has run, its result needs to be aggregated into a single answer. There is a whole branch of science studying this question (Social Choice Theory), which won’t be surveyed here. Some particularly salient candidates:
Discuss until compromise. It’s not obvious that there exists a feasible compromise to every value question, but the prospect of finding a compromise should be greatly enhanced by AI-assisted cognitive labor and simulated participants. For example, AI assistance could propose compromises humans wouldn’t have thought of or dismissed too easily. We could also bias the simulated participants to strive for a compromise to an extent that regular humans wouldn’t be motivated to do, but this might contradict the objective of faithfully representing human discussion partners.
Regular democratic (preferential) voting. We use a simple (preferential) vote by a wide range of simulated people from different backgrounds who read all of the discussions. However, these kinds of democratic aggregations tend to disfavor strongly held minority views.
Quadratic voting. This voting system allows minorities to influence outcomes on topics that they care disproportionately about or are disproportionately affected by. Implementing quadratic voting in practice is difficult because it’s so hard to suppress a black market for vote trading that is incentivized to exist. However, when voting is simulated by language models, we can actually enforce that they can’t trade votes with each other.
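As a small illustration of quadratic voting, the Python sketch below uses the standard rule that casting v votes on one proposal costs v² voice credits, with each voter spreading a fixed budget across proposals. The budget of 16 credits, the proposal names, and the ballots are invented for the example; positive numbers are votes in favor and negative numbers are votes against.

```python
from typing import Dict, List


def quadratic_cost(ballot: Dict[str, int]) -> int:
    """Total voice credits a ballot spends: v votes on a proposal cost v squared."""
    return sum(v * v for v in ballot.values())


def tally_quadratic_votes(ballots: List[Dict[str, int]], budget: int) -> Dict[str, int]:
    """Sum signed votes per proposal, rejecting any ballot that exceeds the budget."""
    totals: Dict[str, int] = {}
    for ballot in ballots:
        if quadratic_cost(ballot) > budget:
            raise ValueError("ballot exceeds the voice-credit budget")
        for proposal, votes in ballot.items():
            totals[proposal] = totals.get(proposal, 0) + votes
    return totals


# Three voters mildly oppose policy_x and spend most credits on policy_y (cost 1 + 9 = 10).
majority = [{"policy_x": -1, "policy_y": 3} for _ in range(3)]
# One voter cares a lot about policy_x and spends the whole budget on it (cost 16).
minority = [{"policy_x": 4}]

totals = tally_quadratic_votes(majority + minority, budget=16)
print(totals)
```

Here policy_x ends with a net vote of +1 even though three of the four voters oppose it, because the single voter who cares most paid the quadratic price for four votes. Whether that outcome is desirable is itself a value question.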
It’s natural to condition our simulated participants on a range of backgrounds such that they represent the proportions of people living right now. However, a big risk with any preference aggregation method is that they reflect the power structures as they exist in the world today, and not how they ought to be.
Finally, we need to track the robustness of a decision by our simulated mini-public: how much would the result of the deliberation change if we change the composition of our simulated mini-public? We can measure this by re-running the simulated deliberation with different backgrounds of our simulated participants.
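One way to put a number on this kind of robustness is to resample the panel many times and check how often the most common outcome recurs. The Python sketch below assumes a `run_deliberation` callable that maps a list of backgrounds to a single outcome; that interface, the pool of backgrounds, and the toy majority rule are hypothetical.

```python
import random
from collections import Counter
from typing import Callable, List, Sequence


def outcome_stability(run_deliberation: Callable[[List[str]], str],
                      background_pool: Sequence[str],
                      panel_size: int, n_runs: int, seed: int = 0) -> float:
    """Share of resampled panels that reach the most common outcome (1.0 = fully stable)."""
    rng = random.Random(seed)
    outcomes = [
        run_deliberation(rng.sample(list(background_pool), panel_size))
        for _ in range(n_runs)
    ]
    modal_count = Counter(outcomes).most_common(1)[0][1]
    return modal_count / n_runs


def toy_deliberation(panel: List[str]) -> str:
    # Stand-in for a full simulated deliberation: simple majority of "cautious" backgrounds.
    cautious = sum("cautious" in background for background in panel)
    return "refuse" if cautious > len(panel) / 2 else "comply"


pool = ["cautious"] * 6 + ["permissive"] * 4
print(outcome_stability(toy_deliberation, pool, panel_size=5, n_runs=200))
```

A score near 1.0 suggests the decision does not hinge on who happens to be on the panel, while a score near 0.5 for a two-option question suggests it is close to a coin flip.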
A very important long-term challenge with this proposal is how to make simulated humans smarter and more effective deliberators without changing their values. If you get asked about a topic that you don’t know anything about, you’d need to ask many questions to understand it and you might not know the right questions to ask. How can you get help getting started without being pushed on one trajectory or another that influences your ultimate decision?
In the long term, we’ll face value questions that are very complex and require a lot of expertise and effort to think through carefully. Smarter simulated humans might make better decisions under deliberation, but as humans get more educated and learn to think better and more critically their values tend to change.
In the long term we could have each human deliberator be represented by an AI system that is much smarter than them, but very aligned with them. This would let us reduce the problem of aligning to society’s values to the problem of aligning one smarter-than-human AI system to one human.
Pros and cons
These sorts of processes face two key issues: being representative and making the right expertise available. There are a range of options to improve the diversity of members we recruit for the mini-publics, but an important advantage of this proposal is that the backgrounds used for simulated deliberations aren’t restricted by which humans are available at that time. This proposal can also improve the availability of expertise because AI assistants can provide a more natural interface for humans to learn from; we can explicitly work to reduce the assistant’s bias; and finally we can add test-time compute to our imitation-learned policy, asking it to pretend to be a smarter, more thoughtful, and more engaged human.
Pro: scalability & low latency. Step 4 can be fully automated, which means it can be done at scale and with low latency. Citizen deliberations can take weeks or months while this could be done within a few hours or even minutes, depending on inference speed and parallelization. However, depending on the question, inference costs might be substantial to produce a satisfactory answer as many value questions can’t be resolved with a short discussion.
Pro: transparency. For any question about non-private information, the discussions can be made available to the public and thus can be inspected by anyone. Therefore anyone can check whether their views were appropriately represented. Even more, anyone could write up their own background and run the simulated mini-public that simulates their participation, or even participate themselves together with other simulated participants. Users could also adjust various knobs of the simulation (e.g. deliberation protocols, voting mechanisms, composition etc.) and see how it affects the outcome.
Pro: higher degrees of cooperation. We can bias the simulated discussion participants to be more willing to engage in discussions they would otherwise find very uncomfortable and increase their willingness to cooperate and compromise with simulated participants from groups they wouldn’t normally want to engage with. However, this will reduce the faithfulness of the simulation.
Pro: privacy. This process can be applied to questions that require studying and understanding highly technical or private information, which usually makes it difficult to engage a lot of perspectives. The language model can be pretrained on all relevant information and leverage generalization to estimate how a participant would react to information they don’t actually have.
Con: representativeness. Arguably this proposal isn’t less representative than actual mini-publics, since most humans won’t get to participate personally in the overwhelming majority of mini-publics. What makes mini-publics appealing is that anyone could have been selected (which isn’t true for simulated mini-publics) and that they have participants who are similar to you (which can be more true for simulated mini-publics). But this proposal won’t actually include most people.
Con: the aggregation method matters a lot. A lot of the heavy lifting is done by the aggregation method of step 4b and bad aggregation methods can make this ineffective. However, we can also check the robustness to the aggregation method by running several deliberations using different aggregation methods.
Con: unclear accountability. If something goes wrong and we want to scrutinize a value decision, we can read the discussion that led to the decision to see what went wrong. But since no human made the call, there is no one to blame. All we can do is debug the system, update its training data, and adjust various knobs.
Con: the outcome might be bad. Just because the process is democratic doesn’t mean that the outcome will actually be reasonable. The process is susceptible to McBoatfacing.
Con: stereotyping. Pretrained language models exhibit harmful stereotypes that exist in the pretraining data. Fine-tuning can reduce these, but it will take a lot of dedicated effort to counteract them.
Con: simulating how people change their minds is technically difficult. Today it would be technically difficult to have the simulated discussion participants learn about new topics and change their minds as the discussion goes along. While our feed-forward transformers do some in-context learning, currently this is not really on the same level as a human engaging in depth with a difficult topic.
There are some real risks with unfaithfully modeling human deliberation. These should get reduced with scale & data, but the model might always perform poorly on out-of-distribution questions or for very niche views. It’s important to caution against relying too heavily on simulated democratic processes when facing actually high-stakes decisions.
Evaluation relative to the desiderata
Let’s check this proposal according to my own list of desiderata:
Inclusivity: This process could be very inclusive, and even simulate perspectives from subgroups that don’t even exist (e.g. an Asian transgender man born in Sweden in the 1950s who loves Greg Egan’s books).
Fairness: This will depend largely on the aggregation process (step 4b), and how we’re selecting the background information to condition on.
Representation: This process doesn’t score well on direct representation. The humans whose demonstrations we’re recording for step 2 have an outsized causal influence on the outcome relative to everyone else. Technically anyone can influence the pretraining data by posting online, but it’s not clear that their voice would get preserved in the fine-tuning process and not everyone feels comfortable being transparent about their views online or likes to post. However, the process could still be indirectly representative because it could still effectively create representation for everyone’s views, even if there isn’t a direct causal link for it. In particular, it makes it easy to verify for everyone how the outcomes would have changed if they themselves participated in the process.
Incentive-alignment: This depends a lot on who builds the process.
Legitimacy: One big drawback of this proposal is that no human actually participates in it. Instead it relies on complicated technology that most people don’t understand.
Adaptability: We can update the training data fairly easily by re-running steps 1, 2, and 3 to account for changes in humanity’s moral views, scientific and social progress, and other changes in the world.
Transparency: The discussions on non-private questions can be fully open sourced, and anyone could look at them to see if their view is represented, and whether they agree with how their representative approaches the discussion. They can also use automated tools to surface the parts of the discussion they might most disagree with, are most relevant to their interests, and so on.
Simplicity: This process isn’t very simple, but simple enough that it could be built by a small well-resourced team.
Practicality: Today’s language models seem to be sample-efficient enough that it should be doable to build a practical prototype.
Finally, let’s apply the veil of ignorance test: Would I agree to this process if I didn’t know where in society I would end up? It depends a lot on what the alternatives are. One red flag is that it tries to use tech to solve social problems created by tech. If I didn’t have much technical knowledge about AI, I’d probably be skeptical of any technical approach, but since I could inspect the process it would probably feel a lot more transparent than what most tech companies are doing right now (arguably not a high bar). If the process proposed here actually turns out to be a viable long-term strategy, then it should be possible to build up a body of evidence that it is effective at achieving its objectives; in other words, if it works well, and it is the right approach, people will learn to trust it.
This is by no means a comprehensive list, but it aims to collect some pointers on related efforts.
Deliberative democracy has already been trialed in various countries on high-stakes policy questions. There is a lot to learn from those experiments on how to guide the deliberation.
Collective intelligence is an effort to improve decentralized decision-making, especially targeted at new technologies.
Coherent extrapolated volition (CEV) is an aspirational goal for answering value questions “if we knew more, thought faster, were more the people we wished we were, had grown up farther together.” This proposal could be seen as a concrete step to implement CEV with today’s language models: we’re teaching AI how humans would think about value questions. For CEV we also need the simulated humans to update their views over time, simulating intellectual and moral progress. However, since the purpose of this proposal is to represent values of current humanity, we’ll have to wait for humans to change their minds about their values and then import them by re-running step 2. This has the downside that moral progress is slow, limited to the pace at which humans make it.
Recursive reward modeling and AI-assisted feedback complement this proposal well: We aim to use as much AI assistance as possible for all cognitive labor related to evaluation of AI behavior, such that humans can focus on value input.
The moral parliament is a theoretical proposal for decision-making under moral uncertainty by imagining a parliament with representatives from different moral theories who jointly make value decisions. This assumes that moral theories as philosophers think of them are predictive for what people care about when answering value questions. The goal of this proposal is not to represent an abstract ideal, but to imitate how real humans would actually approach the problem in practice.
Social simulacra is about imitating social chatter on value questions; in contrast this proposal focuses specifically on imitating humans who are participating in a deliberate effort to reach informed decisions and find compromises.
A recent DeepMind paper takes a first step in this direction by collecting preferences from different demographic groups, training a language model on them, and then aggregating the result with different social welfare functions.
Thanks a lot to Tyna Eloundou, Michiel Bakker, Miles Brundage, Irene Solaiman, Ryan Lowe, Iason Gabriel, Kim Malfacini, Aviv Ovadya, Jeff Wu, and Wojciech Zaremba for discussions and feedback on this idea and to Brian Christian for pointing me to deliberative democracy.
[1] Current AI assistants wouldn’t really meet the bar for truthfulness, but I’m optimistic we’ll get there soon. :)
[2] For example, gpt-3.5-turbo is about 200x cheaper than the fastest-typing humans paid at US minimum wage, so at the cost of $1 that model could simulate a deliberation that takes dozens of hours.
[3] Demographic information has been proposed for this in the past, but it’s not a great indicator since there is often a lot of diversity of views within any given demographic.
[4] Credit for this idea goes to Ryan Lowe.
[5] This should be understood aspirationally. Wikipedia certainly doesn’t achieve perfect neutrality and suffers from biases towards the views of groups that are overrepresented in moderator roles.