

[Week 3] “Goal Misgeneralisation: Why Correct Specifications Aren’t Enough For Correct Goals” by Shah et al.

AGI Safety Fundamentals: Alignment

Readings from the AI Safety Fundamentals: Alignment course.




As we build increasingly advanced AI systems, we want to make sure they don’t pursue undesired goals. This is the primary concern of the AI alignment community. Undesired behaviour in an AI agent is often the result of specification gaming, in which the AI exploits an incorrectly specified reward. However, if we take on the perspective of the agent we’re training, we see other reasons it might pursue undesired goals, even when trained with a correct specification.

Imagine that you are the agent (the blue blob) being trained with reinforcement learning (RL) in a 3D environment. The environment also contains another blob like yourself, but coloured red instead of blue, that also moves around. It also appears to have some tower obstacles, some coloured spheres, and a square on the right that sometimes flashes. You don’t know what all of this means, but you can figure it out during training! You start exploring the environment to see how everything works and to see what you do and don’t get rewarded for.

For more details, check out our paper. By Rohin Shah, Vikrant Varma, Ramana Kumar, Mary Phuong, Victoria Krakovna, Jonathan Uesato, and Zac Kenton.

Original text:


Narrated for AGI Safety Fundamentals by Perrin Walker of TYPE III AUDIO.
