Discussion about this post

Jacob Pfau

I'm also optimistic about near-term AI doing much to accelerate and automate alignment research. But I think there's a structural risk in the transition to automated alignment research that your post doesn't address, and that is easy to miss when focusing on the capabilities question of whether models can do fuzzy tasks.

I claim automated alignment research faces correlated fuzzy errors that are (a) produced by correlated models with no independent error correction to catch them, (b) embedded throughout the research workflow, so you can't isolate them, and (c) increasingly shaped by unreliable supervision, so they're becoming more likely. I think this is a more serious obstacle to automated alignment research than the capabilities question of whether models can pass a research-taste bar in the average case.

(a) Correlated error: Human science is trustworthy not because individual researchers have great fuzzy judgment, but because independent errors get corrected by institutional and individual diversity. Different groups work independently, make different mistakes, and cross-critique each other; scientific progress self-corrects. Automated alignment research run on one model (or a few closely related models) introduces field-wide correlation that breaks this error-correction mechanism. So the fuzzy-task problem is more than "can the model do it well enough?"; it's "can we detect correlated subtle failures in judgment when there's no independent cross-check?"
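To make the correlation point concrete, here's a toy sketch (my own illustration, with entirely hypothetical numbers): when reviewers err independently, majority vote drives the group error rate well below any individual's, but a shared failure mode that hits every reviewer at once puts a hard floor under it that adding more reviewers can't remove.

```python
# Toy illustration (hypothetical parameters): majority-vote error correction
# under independent vs. correlated reviewer errors.
import random

def group_error_rate(k=5, p_err=0.2, p_shared=0.0, trials=100_000):
    """Fraction of trials in which a majority of k reviewers is wrong.

    p_err    -- each reviewer's individual error probability
    p_shared -- probability of a shared failure mode that makes ALL
                reviewers wrong at once (stands in for error correlation)
    """
    wrong_majorities = 0
    for _ in range(trials):
        if random.random() < p_shared:
            errors = k                    # correlated failure: everyone errs together
        else:
            errors = sum(random.random() < p_err for _ in range(k))  # independent errors
        if errors > k / 2:
            wrong_majorities += 1
    return wrong_majorities / trials

if __name__ == "__main__":
    print("independent reviewers:", group_error_rate(p_shared=0.0))   # ~0.06
    print("correlated reviewers: ", group_error_rate(p_shared=0.15))  # ~0.20
```

With five reviewers who each err 20% of the time, the independent case lands around a 6% group error rate, while a 15% shared-failure probability pushes it to roughly 20%: essentially the rate of the shared failure itself, no matter how many reviewers you add.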

(b) Non-modularity of fuzzy judgments: This gets worse when you notice that fuzzy judgment isn't a separably auditable module; it's threaded through all parts of the automated research process. When a model writes a new codebase, it makes dozens of fuzzy calls: how to structure the code for legibility, what to plot, which patterns look interesting enough to flag. Every long-horizon "crisp" task is filled with this kind of judgment. Your proposed mitigations implicitly assume you can identify where the fuzzy judgment lives and check it. But if it's pervasive, you can't afford to double-check all of it. The attack surface for correlated subtle errors is the entire research workflow, not a bounded subset of "fuzzy tasks."

(c) Increasing fuzzy error rate: I worry that generalization to good judgments is getting less likely. Right now the mapping between verifiable work and fuzzier judgment (e.g., between a proof and a natural-language explanation of that proof) is reasonably well anchored by human data: humans write up their proofs for communication, not just in Lean. We know what good code structure looks like; we know what a clear plot looks like. As models operate past the human capability frontier, that anchoring weakens and preference data plays an increasingly large role in shaping fuzzy generalization, which is a much shakier training signal.

Post-Alignment

I feel that major companies are now facing Bernard Williams' "George the Chemist" dilemma: they cannot help but race ahead and take shortcuts on AI safety. It reminds me of what Rosenblatt pointed out last year: "current alignment techniques like RLHF and post-training don't change what the model IS; they just teach it what not to say."

23 more comments...
