Strict to fuzzy generalization may be my new favorite area.
I think this is a very helpful way to think about the challenges ahead. The point that alignment progress requires more progress on fuzzy tasks is intuitive (and obviously Jan has a lot of experience with this). This is troubling because it already implies that automating capabilities research is *easier* than automating alignment research. I'd argue that this is even more the case for aspects of alignment research that aren't correlated with capabilities (as discussed in "Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?").
I'd also emphasize that being effective at the intuitive task of evaluating AI safety requires intuitions about harmful behavior and motives (the kind of ability to find loopholes and adversarial strategies that a super-Eliezer or someone with a security mindset would have). This is both a very hard task and incredibly dangerous to build into models (it becomes a powerful capability that can be accidentally triggered with devastating consequences).
"If all we wanted to do is automated ML capability research, we could probably come a long way by just training our models on hard crisp tasks, because there plenty of performance improvements can be measured quite reliably with outcome-based feedback (e.g. did you improve test set performance?)"
Is this true? Benchmarks get saturated all the time, and then we have the hard fuzzy task of coming up with new benchmarks.
"In fact, the biggest successes in deep learning involved cleverly hill-climbing some carefully defined metric: ImageNet, go, Dota, Starcraft, perplexity on internet text, winrate according to human ratings, etc."
And is this true? I would say the biggest successes are models like GPT-4, Claude, and Gemini, all of which involve careful human-in-the-loop tuning over many fuzzy iterations of dataset design.
My rough sense is that if we had an automated capability improver for the crisp tasks we have today, this would be quite valuable but would be far from the last word in "capability improvement", since it would quickly exhaust the crisp pieces of the task (which of course are there) and leave us with the fuzzy task of defining new goals and benchmarks.
And safety research isn't clearly different in this regard? An automated capabilities improver would be very valuable there too, since our efforts are currently divided between the fuzzy "figure out what tasks we care about for safety" and the crisper "try to figure out how good our systems are at solving these tasks."
> Is this true? Benchmarks get saturated all the time, and then we have the hard fuzzy task of coming up with new benchmarks.
Perplexity on internet text has been a pretty reliable metric for many years. It drove a lot of improvements behind GPT-3, GPT-4 and related models.
There are definitely crisp tasks that are helpful for improving safety, but in my view that's not where most of the big safety wins come from.
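For anyone following along who hasn't worked with the metric: perplexity is the exponential of the average negative log-likelihood per token, which is part of why it counts as "crisp" here (a single number computed from held-out text, no human judgment involved). A minimal sketch, assuming you already have per-token natural-log probabilities from a model; the function name and the toy numbers are made up for illustration and don't come from any particular library:

```python
import math

def perplexity(token_log_probs):
    """Return exp(-mean log-probability per token), using natural logs."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)

# Hypothetical log-probs two models assign to the same 4-token held-out text.
baseline = [math.log(0.25)] * 4   # each token gets probability 0.25
improved = [math.log(0.5)] * 4    # each token gets probability 0.5

# Lower is better: the "improved" model is less surprised by the text.
print(perplexity(baseline), perplexity(improved))
```

Note that the number only tells you how well the model predicts whatever text you evaluate on; choosing that text is a separate decision the metric itself doesn't make.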
Perplexity on internet text has of course been important, but even for pretraining, dataset evolution over the last five years has been critically important, right? I don't think you have any hope of pretraining the newest Claude Sonnet with the same dataset that GPT-3 was originally trained on? And improving the dataset (e.g., finding more math books) is not (currently) a crisp task?
Beyond that, pretrained models are a long way from products, and all the RLHF and similar post-training work is much more bespoke and not crisp?
Jan, I'm curious if you and your models have explored the meaning of intuition? Is it a 'gut feeling' (if yes: how is this perceived by an LLM?). And does intuition differ from instinct?
Intuition, to me, means the feeling of moving in alignment with broader cosmic flows transcending my being (in Taoism expressed through Yin and Yang). If I do, I intuitively feel attuned with the universe. If I don't, energetic tension arises. This is spiritual.
Instinct results from evolutionary reinforcement learning: my body is programmed to sense if a specific context will bring me closer to safety or danger. It is essential for physical survival.
Both influence my direction.
Also, I think it might help to train resonance and to develop AI architectures that explicitly model non-dual intelligence. This is what Claude said:
- Treat binary states not as rigid opposites, but as dynamic, interpenetrating potentials
- Use probabilistic algorithms that embrace uncertainty
- Create recursive feedback loops that constantly reinterpret binary inputs
On the trust issue: If you want intelligence to be trustworthy, the intelligence needs to trust itself.
And since reinforcement also works on humans: You are doing a great job, keep up the good work :)