May 2023: Welcome to the alpha release of TYPE III AUDIO.
Expect very rough edges and very broken stuff—and daily improvements.
Please share your thoughts, but don't share this link on social media, for now.

Homearrow rightTopics

EA - Accidentally teaching AI models to deceive us (Ajeya Cotra on The 80,000 Hours Podcast) by 80000 Hours

AI Safety: Technical


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Accidentally teaching AI models to deceive us (Ajeya Cotra on The 80,000 Hours Podcast), published by 80000 Hours on May 15, 2023 on The Effective Altruism Forum.
Over at The 80,000 Hours Podcast we just published an interview that is likely to be of particular interest to people who identify as involved in the effective altruism community: Ajeya Cotra on accidentally teaching AI models to deceive us.
You can click through for the audio, a full transcript, and related links. Below is the episode summary and some key excerpts.
Episode Summary
I don’t know yet what suite of tests exactly you could show me, and what arguments you could show me, that would make me actually convinced that this model has a sufficiently deeply rooted motivation to not try to escape human control. I think that’s, in some sense, the whole heart of the alignment problem.
And I think for a long time, labs have just been racing ahead, and they’ve had the justification — which I think was reasonable for a while — of like, “Come on, of course these systems we’re building aren’t going to take over the world.” As soon as that starts to change, I want a forcing function that makes it so that the labs now have the incentive to come up with the kinds of tests that should actually be persuasive.
Ajeya Cotra
Imagine you are an orphaned eight-year-old whose parents left you a $1 trillion company, and no trusted adult to serve as your guide to the world. You have to hire a smart adult to run that company, guide your life the way that a parent would, and administer your vast wealth. You have to hire that adult based on a work trial or interview you come up with. You don’t get to see any resumes or do reference checks. And because you’re so rich, tonnes of people apply for the job — for all sorts of reasons.
Today’s guest Ajeya Cotra — senior research analyst at Open Philanthropy — argues that this peculiar setup resembles the situation humanity finds itself in when training very general and very capable AI models using current deep learning methods.
As she explains, such an eight-year-old faces a challenging problem. In the candidate pool there are likely some truly nice people, who sincerely want to help and make decisions that are in your interest. But there are probably other characters too — like people who will pretend to care about you while you’re monitoring them, but intend to use the job to enrich themselves as soon as they think they can get away with it.
Like a child trying to judge adults, at some point humans will be required to judge the trustworthiness and reliability of machine learning models that are as goal-oriented as people, and greatly outclass them in knowledge, experience, breadth, and speed. Tricky!
Can’t we rely on how well models have performed at tasks during training to guide us? Ajeya worries that it won’t work. The trouble is that three different sorts of models will all produce the same output during training, but could behave very differently once deployed in a setting that allows their true colours to come through. She describes three such motivational archetypes:
Saints — models that care about doing what we really want
Sycophants — models that just want us to say they’ve done a good job, even if they get that praise by taking actions they know we wouldn’t want them to
Schemers — models that don’t care about us or our interests at all, who are just pleasing us so long as that serves their own agenda
In principle, a machine learning training process based on reinforcement learning could spit out any of these three attitudes, because all three would perform roughly equally well on the tests we give them, and ‘performs well on tests’ is how these models are selected.
But while that’s true in principle, maybe it’s not something that could plausibly happen in the real world. Af...