Discussion about this post

Daniel Kokotajlo

Typo: you should say "Models *aren't* really trained to be agents"

You also say models aren't situationally aware now. I'm skeptical. They seem pretty situationally aware to me; in what sense are they *not* situationally aware?

Sander Schulhoff

What do you think the best way to study/elicit scheming behaviour is?

Internal/external red-teaming seems to be the go-to method for a lot of companies, but neither seems very scalable, in part because, as you say, "LLM inference is so much cheaper than human labor". Automated red-teaming is promising, but current methods red-team models themselves, not agentic systems. IMO crowdsourcing through competitions (e.g. HackAPrompt, TensorTrust) is the best way to study these behaviours, due to the incentive structure and scalability. I am currently spinning up an agentic red-teaming competition that I hope will fill this need and provide evals on top models.
