Discussion about this post

Nathan Lambert

Strict to fuzzy generalization may be my new favorite area.

Ron Bodkin

I think this is a very helpful way to think about the challenges ahead. The point that alignment progress requires more progress on fuzzy tasks is intuitive (and obviously Jan has a lot of experience here), but it's also troubling: it implies that automating capabilities research is *easier* than automating alignment research. I'd argue this is even more true for the aspects of alignment research that aren't correlated with capabilities (as discussed in Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?).

I'd also emphasize that being effective at the intuitive task of evaluating AI safety requires intuitions about harmful behavior and motives (the ability to find loopholes and adversarial strategies, like a super-Eliezer or a security mindset). This is both a very hard task and incredibly dangerous to build into models: it becomes a powerful capability that can be accidentally triggered, with devastating consequences.

4 more comments...