May 2023: Welcome to the alpha release of TYPE III AUDIO.
Expect very rough edges and very broken stuff—and daily improvements.
Please share your thoughts, but don't share this link on social media, for now.
We only have recent episodes right now, and there are some false positives. Will be fixed soon!
In my previous post, I talked through the process of identifying the fears underlying internal conflicts. In some cases, just listening to and understanding those scared parts is enough to make them feel better—just as, when venting to friends or partners, we often primarily want to be heard rather than helped. In other cases, though, parts may have more persistent worries—in particular, about being coerced by other parts. The opposite of coercion is trust: letting another agent do as they wish, without trying to control their behavior, because you believe that they’ll take your interests into account. How can we build trust between different parts of ourselves?
I’ll start by talking about how to cultivate trust between different people, since we already have many intuitions about how that works; and then apply those ideas to the task of cultivating self-trust. Although it's tempting to think of trust in terms of grand gestures and big sacrifices, it typically requires many small interactions over time to build trust in a way that all the different parts of both people are comfortable with. I’ll focus on two types of interactions: making bids and setting boundaries.
By “making bids” I mean doing something which invites a response from the other person, where a positive response would bring you closer together. Sometimes bids are explicit, like asking somebody out on a date. But far more often they’re implicit—perhaps greeting someone more warmly than usual, or dropping a hint that your birthday is coming up. One reason that people make their bids subtle and ambiguous is that they’re scared of the bid being rejected, and subtle bids can be rejected gently by pretending not to notice them. Another is that many outcomes (e.g. being given a birthday present) feel more meaningful when you haven't asked for them directly. Some more examples of bids that are optimized for ambiguity:
Teenagers on a first date, with one subtly pressing their arm against the other person’s, trying to gauge if they press back.
Telling your parents about your latest achievements, in the hope that they’ll express pride.
Making snarky comments about your spouse being messy, in the hope that they’ll start being more proactive in taking care of your preferences.
Asking to “grab a coffee” with someone, but trying to leave ambiguous whether you’re thinking of it as a date.
Of course, the downside of making ambiguous bids is that the other person often doesn't notice that you're making a bid for connection—or, worse, interprets the bid itself as a rejection. As in the example above, a complaint about messiness is a kind of bid for care, but one which often creates anger rather than connection. So overcoming the fear of expressing bids directly is a crucial skill. Even when making a bid explicit renders the response less meaningful (like directly asking your parents whether they're proud of you), you can often get the best of both worlds by telling them explicitly about the emotion underneath the bid (e.g. that you often feel judged by them), rather than the bid itself.
There's a third major reason we make ambiguous bids, though. The more direct our bids, the more pressure recipients feel to accept them—and it’s scary to think that they might accept but resent us for asking. The best way to avoid that is to be genuinely unafraid of the bid being turned down, in a way that the recipient can read from your voice and demeanor. Of course, you can’t just decide to not be scared—but the more explicit bids you make, the easier it is to learn that rejection isn’t the end of the world. In the meantime, you can give the other person alternative options when making the bid, or tell them expli...
We can resolve internal conflicts by understanding what underlying fears are driving the conflict, then providing evidence that those fears won't happen, thereby reconsolidating the memories which caused them. A simple example of this process comes from exposure therapy for phobias, which works by demonstrating that the phobia is much less dangerous than the person had learned to believe. A wide range of different therapeutic approaches apply the same core mechanism to deal with more complex internal conflicts. I'll focus in particular on the internal family systems (IFS) framework—which, despite the slightly kooky name, is one of the most powerful methods for dealing with internal conflict.
The core ideas of IFS are essentially the ones I've outlined in the last few posts: that you should think of yourself as being composed of many parts, some of which are implementing protective strategies based on your previous experiences (especially from childhood). IFS particularly highlights the idea that there are “no bad parts”—we should treat all parts as deserving of sympathy, even when the strategies they’re using are harmful and deeply misguided. The four posts in this section of the sequence will talk through how and why to apply these ideas to build trust between different parts. In this post I'll focus on the first step: identifying the underlying parts at play and what they want.
Our starting point can be any phenomenon that triggers an emotional response from some part of you. You might find one by thinking about an emotionally-loaded topic, like your work or relationships (especially with your parents); or paying attention to how your body feels; or paying attention to the way you choose your words or thoughts, and which ones you're suppressing; or to your dreams; or to character archetypes or symbolic motifs that particularly resonate with you. Many types of triggers work far better for some people than others—a fact which helps explain why so many different psychotherapy techniques exist.
However you find an emotional trigger, the next step is to work through the protective or defensive strategies associated with the part causing the response, to figure out what its underlying fear or need is. In doing so, it’s often helpful to name or visualize the active part (e.g. by identifying it with a younger version of you) and imagine having a conversation with it. The practice of Focusing can also be useful here—this involves saying a possible articulation of what the part wants out loud, seeing if it resonates, and adjusting it if not. Again, different techniques will work for different people; the important thing is finding some introspective technique for narrowing in on what the part "wants to say".
Doing so often requires navigating two types of defensiveness: from the part that's trying to articulate its perspective, and from other parts reacting to criticism of themselves. For example, suppose that you face a conflict between a part that wants to donate more to charity and a part that wants to spend more on holidays with your friends. The former might be partially driven by a fear of others thinking you’re selfish; the latter might be partially driven by a fear of not seeming cool enough. For each of them, criticising the other part helps it get more of what it wants, while admitting its own fear gives the other part ammunition to use against it. So each part might become defensive both when it's prompted to articulate its underlying motivations, and when criticized by the other part.
What defensiveness looks like varies by person, but it often involves angry pushback, refusal to engage, or redirection towards less sensitive topics (e.g. via ...
Take a second to imagine what being a child was like throughout most of human history. You were born with a huge and underdeveloped brain, designed for soaking in information from your surroundings like a sponge. But you weren’t able to freely follow your curiosity: even if you had loving, caring parents, you still faced frequent physical danger from nature and other people, severe scarcity, and rigid cultural norms that governed acceptable behavior within your community, with harsh penalties for stepping out of line. You had to learn fast and reliably how to stay safe, and in particular to stay on the good side of the adults around you. Even after you grew up and passed the period of most acute danger, you’d still face many threats of violence and scarcity.
Your ability to avoid these depended in large part on your relationships: holding a respected position within your tribe was the key pathway to a good life, whereas exclusion from your tribe was tantamount to execution. So “danger and rejection” isn’t an ad-hoc combination: our brains are primed to think of them as the same thing; and conversely, to equate safety and love. I’ll call the latter combination “security” (which I think of as a combination of “physical security” and “emotional security”, although I’ll mostly be focusing on the latter). Children are learning machines, and what they learn above all is strategies for achieving security; because the opinions of other people are so powerful, “being good” in ways which receive approval from the group is one of the central strategies they learn.
How literally should we take this story? It’s clear that describing humans as optimizing for a single goal is a big oversimplification. But it’s hard to overstate how powerful the drive for security is. Think of the many girls who override the drive to eat because part of their brain is convinced that being skinnier will make others desire and love them. Think of the many boys who override their sex drives because part of their brain is convinced that hitting on girls would lead to broader social rejection. Think of the many suicidal adults who override their literal survival drive in response to problems with their relationships or careers. If these drives can be quashed by the drive for security, then anything can be.
I think there are two main reasons that our drive for security is hard to see. One is that, as adults, we’re often not optimizing directly for security, but instead following heuristics and strategies developed to achieve proxies for security in our childhood environment. In other words: on an emotional level, our brains often “cache” conclusions from childhood, which become hard to override as adults. I like the way that Malcolm Ocean describes the result of this caching: “everyone is basically living in a dream mashup of their current external situation and their old emotional meanings”. But those old emotional meanings are adapted for a childhood environment very different from our adult environment, and are therefore often deeply counterproductive. For example, once we are able to stand up to people yelling at us, or we are able to leave abusive relationships, we may still be slow to do so because we’ve cached the conclusion that we’re helpless in situations like these.
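The “caching” metaphor can be made literal with a small sketch (illustrative only; the scenario and the variable names are invented for this example): a conclusion computed once under childhood conditions keeps being returned long after the conditions have changed.

```python
from functools import lru_cache

# Illustrative: the "environment" changes as we grow up...
can_stand_up_for_self = False  # childhood condition

@lru_cache(maxsize=None)
def learned_strategy(situation: str) -> str:
    # Computed once, under the conditions that held at the time it was first called.
    if situation == "someone yells at me" and not can_stand_up_for_self:
        return "stay quiet and wait for it to end"
    return "assert a boundary"

print(learned_strategy("someone yells at me"))  # "stay quiet and wait for it to end"

can_stand_up_for_self = True  # adult condition: the environment has changed...
print(learned_strategy("someone yells at me"))  # ...but the cached conclusion persists
```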
The second reason our drive for security is hard to see: because we don’t directly understand what’s behind them, counterproductive strategies often get stuck in self-reinforcing feedback loops. Consider a child who becomes fixated on the belief: “if I’m skinnier then people will love me”. She becomes skinnier, and of course it doesn’t fix anything. The reasonable move is to discard that belief. But...
Originally released in December 2022.
Large language models like GPT-3, and now ChatGPT, are neural networks trained on a large fraction of all text available on the internet to do one thing: predict the next word in a passage. This simple technique has led to something extraordinary — black boxes able to write TV scripts, explain jokes, produce satirical poetry, answer common factual questions, argue sensibly for political positions, and more. Every month their capabilities grow.
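To make “predict the next word” concrete, here is a toy sketch (a simple bigram counter over a made-up sentence, standing in for the neural networks and internet-scale text the episode describes):

```python
from collections import Counter, defaultdict
from typing import Dict

# Toy next-word predictor: count which word follows each word in a tiny corpus,
# then repeatedly emit the most common successor. Large language models do the
# analogous thing with neural networks trained on a large fraction of the internet.
corpus = "the cat sat on the mat and the cat slept on the mat".split()

successors: Dict[str, Counter] = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    successors[current][nxt] += 1

def predict_next(word: str) -> str:
    return successors[word].most_common(1)[0][0]

text = ["the"]
for _ in range(4):
    text.append(predict_next(text[-1]))
print(" ".join(text))  # -> "the cat sat on the"
```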
But do they really 'understand' what they're saying, or do they just give the illusion of understanding?
Today's guest, Richard Ngo, thinks that in the most important sense they understand many things. Richard is a researcher at OpenAI — the company that created ChatGPT — who works to foresee where AI advances are going and develop strategies that will keep these models from 'acting out' as they become more powerful, are deployed and ultimately given power in society.
Links to learn more, summary and full transcript.
One way to think about 'understanding' is as a subjective experience. Whether it feels like something to be a large language model is an important question, but one we currently have no way to answer.
However, as Richard explains, another way to think about 'understanding' is as a functional matter. If you really understand an idea you're able to use it to reason and draw inferences in new situations. And that kind of understanding is observable and testable.
Richard argues that language models are developing sophisticated representations of the world which can be manipulated to draw sensible conclusions — maybe not so different from what happens in the human mind. And experiments have found that, as models get more parameters and are trained on more data, these types of capabilities consistently improve.
We might feel reluctant to say a computer understands something the way that we do. But if it walks like a duck and it quacks like a duck, we should consider that maybe we have a duck, or at least something sufficiently close to a duck that it doesn't matter.
In today's conversation we discuss the above, as well as:
• Could speeding up AI development be a bad thing?
• The balance between excitement and fear when it comes to AI advances
• Why OpenAI focuses its efforts where it does
• Common misconceptions about machine learning
• How many computer chips it might require to be able to do most of the things humans do
• How Richard understands the 'alignment problem' differently than other people
• Why 'situational awareness' may be a key concept for understanding the behaviour of AI models
• Which efforts to positively shape the development of AI Richard is and isn't excited about
• The AGI Safety Fundamentals course that Richard developed to help people learn more about this field
Get this episode by subscribing to our podcast on the world’s most pressing problems and how to solve them: type 80,000 Hours into your podcasting app.
Producer: Keiran Harris
Audio mastering: Milo McGuire and Ben Cordell
Transcriptions: Katy Moore
Our brains often use sleight-of-hand to hide fear-based motivation behind a guise of objectivity. This is particularly linked to the word “good”, which does a lot of work in a lot of people’s psychologies. For example, people often think that they, or their work, is “not good enough”. By itself, that sentence doesn’t make sense: good enough for what? Imagine going on a hike and commenting along the way “this rock isn’t heavy enough” or “this stream isn’t wide enough” without any background context. That sounds bizarre, and rightly so—the relevant threshold is very different depending on the context of the judgment. In other words, judgments are inherently two-place functions: they take in both some property and some threshold, and evaluate whether the property is above the threshold.
Of course, people often don’t need to make the threshold explicit—if the reason you’re gathering rocks is to anchor down your tent, you can just say “this rock isn’t heavy enough” without further elaboration (although even then, miscommunications are common—heavy enough to withstand a stiff breeze? Or a gale? Or a storm?). But most judgments that people make of each other or themselves don’t have a clear threshold attached. Think of a girl standing in front of a mirror, saying to herself “I’m not beautiful enough”. Not beautiful enough to win a modeling competition? Or to convince a specific crush to go out with her? Or to appear in public without people making mean comments? The part of her mind which is making this evaluation doesn’t include that criterion, because it would weaken the forcefulness of its conclusion—it just spits out a judgment which feels like an objective evaluation, because the threshold is hidden.
(The same is true if she just thinks “I’m not very beautiful”—not top 1%? 10%? 50%? What makes any of these thresholds important anyway?)
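To make the two-place-function framing concrete, here is a minimal Python sketch (the function names and the threshold values are hypothetical, chosen to echo the rock example above):

```python
def judge(property_value: float, threshold: float) -> bool:
    """A two-place judgment: takes both the property and the threshold."""
    return property_value >= threshold

# A one-place judgment bakes the threshold in, so its verdict feels like
# an objective fact about the property alone -- "not heavy enough", full stop.
HIDDEN_THRESHOLD_KG = 5.0  # hypothetical: "enough"... for what, exactly?

def is_heavy_enough(rock_weight_kg: float) -> bool:
    return rock_weight_kg >= HIDDEN_THRESHOLD_KG

rock = 3.0
print(is_heavy_enough(rock))        # False, measured against an invisible standard
print(judge(rock, threshold=1.0))   # True: enough to pin a tent flap down in a breeze
print(judge(rock, threshold=20.0))  # False: not enough to anchor a tent in a storm
```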
Making the threshold explicit isn’t always going to change the judgment, but it’ll often make us realize that we’re holding ourselves to an unreasonably high standard. Here’s an exercise which might help, by nudging you to do the opposite. Think of the world—the whole thing, the beauty and the horror, the joy and the tragedy—and say out loud to yourself “everything’s okay, in comparison to how bad things could be”. No matter how bad you think it is now, it could all be much worse, right? Think of the satisfaction you’d feel if you thought you were on track for that much worse world, and suddenly learned that you were in our current world instead!
Now say the same about your life—“everything’s okay, in comparison to how bad it could be”. Imagine the version of yourself who’d love to be in your position, and how they’d feel if they learned that they could. Lastly, try both of those again, but thinking about the future: “everything’s okay, no matter what happens from now on”. When I do this, I visualize moving the threshold for what counts as “okay” up and down, first measuring down from perfection, then up from hell, until the one-place judgment of “is this okay?” feels like a totally different type of thing from the two-place judgments which are actually relevant to the decisions I face.
Does that feel weird? For me it does—I feel a sense of internal resistance. A part of me says “if you believe this, you’ll stop trying to make your life better!” I think that part is kinda right, but also a little hyperactive. I’m not committing to the view that everything’s okay, I’m just trying it on for a second; trying to separate them enough to notice that it’s possible in principle. I notice that this resistance part feels a kind of frantic nervous energy at the thought of not applying high standards to myself—in other words, it feels fear-based mo...
Richard Ngo compiles a number of resources for thinking about careers in alignment research.
Original text:
Narrated for AGI Safety Fundamentals by TYPE III AUDIO.
The field of AI has undergone a revolution over the last decade, driven by the success of deep learning techniques. This post aims to convey three ideas using a series of illustrative examples:
- There have been huge jumps in the capabilities of AIs over the last decade, to the point where it’s becoming hard to specify tasks that AIs can’t do.
- This progress has been primarily driven by scaling up a handful of relatively simple algorithms (rather than by developing a more principled or scientific understanding of deep learning).
- Very few people predicted that progress would be anywhere near this fast; but many of those who did also predict that we might face existential risk from AGI in the coming decades.
I’ll focus on four domains: vision, games, language-based tasks, and science. The first two have more limited real-world applications, but provide particularly graphic and intuitive examples of the pace of progress.
Original article:
https://medium.com/@richardcngo/visualizing-the-deep-learning-revolution-722098eb9c5
Author:
Richard Ngo
---
This article is featured on the AGI Safety Fundamentals: Alignment course curriculum.
Narrated by TYPE III AUDIO on behalf of BlueDot Impact.
This report explores the core case for why the development of artificial general intelligence (AGI) might pose an existential threat to humanity. It stems from my dissatisfaction with existing arguments on this topic: early work is less relevant in the context of modern machine learning, while more recent work is scattered and brief. This report aims to fill that gap by providing a detailed investigation into the potential risk from AGI misbehaviour, grounded by our current knowledge of machine learning, and highlighting important uncertainties. It identifies four key premises, evaluates existing arguments about them, and outlines some novel considerations for each.
Source:
https://drive.google.com/file/d/1uK7NhdSKprQKZnRjU58X7NLA1auXlWHt/view
Narrated for AGI Safety Fundamentals by TYPE III AUDIO.
Despite the current popularity of machine learning, I haven’t found any short introductions to it which quite match the way I prefer to introduce people to the field. So here’s my own. Compared with other introductions, I’ve focused less on explaining each concept in detail, and more on explaining how they relate to other important concepts in AI, especially in diagram form. If you're new to machine learning, you shouldn't expect to fully understand most of the concepts explained here just after reading this post - the goal is instead to provide a broad framework which will contextualise more detailed explanations you'll receive from elsewhere. I'm aware that high-level taxonomies can be controversial, and also that it's easy to fall into the illusion of transparency when trying to introduce a field; so suggestions for improvements are very welcome! The key ideas are contained in this summary diagram.
First, some quick clarifications:
- None of the boxes are meant to be comprehensive; we could add more items to any of them. So you should picture each list ending with “and others”.
- The distinction between tasks and techniques is not a firm or standard categorisation; it’s just the best way I’ve found so far to lay things out.
- The summary is explicitly from an AI-centric perspective. For example, statistical modeling and optimization are fields in their own right; but for our current purposes we can think of them as machine learning techniques.
Original text:
https://www.alignmentforum.org/posts/qE73pqxAZmeACsAdF/a-short-introduction-to-machine-learning
Narrated for AGI Safety Fundamentals by TYPE III AUDIO.
Explainer podcast for Richard Ngo's "Clarifying and predicting AGI" post on Lesswrong, which introduces the t-AGI framework to evaluate AI progress. A system is considered t-AGI if it can outperform most human experts, given time t, on most cognitive tasks. This is a new format, quite different from the interviews and podcasts I have been recording in the past. If you enjoyed this, let me know in the YouTube comments, or on twitter, @MichaelTrazzi.
Youtube: https://youtu.be/JXYcLQItZsk
Clarifying and predicting AGI: https://www.alignmentforum.org/posts/BoA3agdkAzL6HQtQP/clarifying-and-predicting-agi
This post is a slightly-adapted summary of two twitter threads, here and here.
The t-AGI framework
As we get closer to AGI, it becomes less appropriate to treat it as a binary threshold. Instead, I prefer to treat it as a continuous spectrum defined by comparison to time-limited humans. I call a system a t-AGI if, on most cognitive tasks, it beats most human experts who are given time t to perform the task.
What does that mean in practice?
A 1-second AGI would need to beat humans at tasks like quickly answering trivia questions, basic intuitions about physics (e.g. "what happens if I push a string?"), recognizing objects in images, recognizing whether sentences are grammatical, etc.
A 1-minute AGI would need to beat humans at tasks like answering questions about short text passages or videos, common-sense reasoning (e.g. Yann LeCun's gears problems), simple computer tasks (e.g. use photoshop to blur an image), justifying an opinion, looking up facts, etc.
A 1-hour AGI would need to beat humans at tasks like doing problem sets/exams, writing short articles or blog posts, most tasks in white-collar jobs (e.g. diagnosing patients, giving legal opinions), doing therapy, doing online errands, learning rules of new games, etc.
A 1-day AGI would need to beat humans at tasks like writing insightful essays, negotiating business deals, becoming proficient at playing new games or using new software, developing new apps, running scientific experiments, reviewing scientific papers, summarizing books, etc.
A 1-month AGI would need to beat humans at coherently carrying out medium-term plans (e.g. founding a startup), supervising large projects, becoming proficient in new fields, writing large software applications (e.g. a new OS), making novel scientific discoveries, etc.
A 1-year AGI would need to beat humans at... basically everything. Some projects take humans much longer (e.g. proving Fermat's Last Theorem) but they can almost always be decomposed into subtasks that don't require full global context (even though that's often helpful for humans).
Some clarifications:
I'm abstracting away from the question of how much test-time compute AIs get (i.e. how many copies are run, for how long). A principled way to think about this is probably something like: "what fraction of the world's compute is needed?". But in most cases I expect that the bottleneck is being able to perform a task at all; if they can then they'll almost always be able to do it with a negligible proportion of the world's compute.
Similarly, I doubt the specific "expert" threshold will make much difference. But it does seem important that we use experts, not laypeople, because the amount of experience that laypeople have with most tasks is so small. It's not really well-defined to talk about beating "most humans" at coding or chess; and it's not particularly relevant either.
I expect that, for any t, the first 100t-AGIs will be way better than any human on tasks which only take time t. To reason about superhuman performance we can extend this framework to talk about (t,n)-AGIs which beat any group of n humans working together on tasks for time t. When I think about superintelligence I'm typically thinking about (1 year, 8 billion)-AGIs.
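As a rough formalization of these definitions, here is a sketch under my own assumptions (the Task type and the win-fraction cutoff are illustrative stand-ins, not part of the original post):

```python
from dataclasses import dataclass
from typing import Callable, List

# Illustrative stand-in: a task reports whether the AI system beat a group of
# n human experts who were each given time_budget seconds to work on it.
Task = Callable[[str, float, int], bool]

@dataclass
class AGIClaim:
    time_budget: float  # t: seconds given to the human experts
    n_humans: int = 1   # n: size of the human group (n=1 recovers plain t-AGI)

def is_t_n_agi(system_id: str, claim: AGIClaim, tasks: List[Task],
               win_fraction: float = 0.5) -> bool:
    """A system counts as a (t, n)-AGI if it beats a group of n human experts,
    each given time t, on most (here: more than win_fraction of) the tasks."""
    wins = sum(task(system_id, claim.time_budget, claim.n_humans) for task in tasks)
    return wins / len(tasks) > win_fraction

# On this framing, the post's notion of superintelligence corresponds roughly to
# AGIClaim(time_budget=365 * 24 * 3600, n_humans=8_000_000_000).
```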
The value of this framework is ultimately an empirical matter. But it seems useful so far: I think existing systems are 1-second AGIs, are close to 1-minute AGIs, and are a couple of years off from 1-hour AGIs. (FWIW I formulated this framework 2 years ago, but never shared it widely. From your perspective there's selection bias—I wouldn't have shared it if I'd changed my mind. But at least from my perspective, it gets points for being useful for describing events since then.)
And very briefly, some of th...
People often ask me for career advice related to AGI safety. This post summarizes the advice I most commonly give. I’ve split it into three sections: general mindset, alignment research and governance work. For each of the latter two, I start with high-level advice aimed primarily at students and those early in their careers, then dig into more details of the field. See also this post I wrote two years ago, containing a bunch of fairly general career advice.
General mindset
In order to have a big impact on the world you need to find a big lever. This document assumes that you think, as I do, that AGI safety is the biggest such lever. There are many ways to pull on that lever, though—from research and engineering to operations and field-building to politics and communications. I encourage you to choose between these based primarily on your personal fit—a combination of what you're really good at and what you really enjoy. In my opinion the difference between being a great versus a mediocre fit swamps other differences in the impactfulness of most pairs of AGI-safety-related jobs.
Original article:
https://forum.effectivealtruism.org/posts/xg7gxsYaMa6F3uH8h/agi-safety-career-advice
Narrated for the Effective Altruism Forum by TYPE III AUDIO.
People often ask me for career advice related to AGI safety. This post summarizes the advice I most commonly give. I’ve split it into three sections: general mindset, alignment research and governance work. See also this post I wrote two years ago, containing a bunch of fairly general career advice.
General mindset
In order to have a big impact on the world you need to find a big lever. This document assumes that you think, as I do, that AGI safety is the biggest such lever. There are many ways to pull on that lever, though—from research and engineering to operations and field-building to politics and communications. I encourage you to choose between these based primarily on your personal fit—a combination of what you're really good at and what you really enjoy. In my opinion the difference between being a great versus a mediocre fit swamps other differences in the impactfulness of most pairs of AGI-safety-related jobs.
How should you find your personal fit? To start, you should focus on finding work where you can get fast feedback loops. That will typically involve getting hands-on or doing some kind of concrete project (rather than just reading and learning) and seeing how quickly you can make progress. Eventually, once you've had a bunch of experience, you might notice a feeling of confusion or frustration: why is everyone else missing the point, or doing so badly at this? For some people that involves investigating a specific topic (for me, the question “what’s the best argument that AGI will be misaligned?”); for others it's about applying skills like conscientiousness (e.g. "why can't others just go through all the obvious steps?"). Being excellent seldom feels like you’re excellent, because your own abilities set your baseline for what feels normal. (Though note that a few top researchers commented on a draft to say that they disagreed with this point.)
What if you have that experience for something you don't enjoy doing? I expect that this is fairly rare, because being good at something is often very enjoyable. But in those cases, I'd suggest trying it until you observe that even a string of successes doesn't make you excited about what you're doing; and at that point, probably trying to pivot (although this is pretty dependent on the specific details).
Lastly: AGI safety is a young and small field; there’s a lot to be done, and still very few people to do it. I encourage you to have agency when it comes to making things happen: most of the time the answer to “why isn’t this seemingly-good thing happening?” or “why aren’t we 10x better at this particular thing?” is “because nobody’s gotten around to it yet”. And the most important qualifications for being able to solve a problem are typically the ability to notice it and the willingness to try. One anecdote to help drive this point home: a friend of mine has had four jobs at four top alignment research organizations; none of those jobs existed before she reached out to the relevant groups to suggest that they should hire someone with her skillset. And this is just what’s possible within existing organizations—if you’re launching your own project, there are far more opportunities to do totally novel things. (The main exception is when it comes to outreach and political advocacy.
Alignment is an unusual field because the base of fans and supporters is much larger than the number of researchers, and so we should be careful to avoid alignment discourse being dominated by advocates who have little familiarity with the technical details, and come across as overconfident. See the discussion here for more on this.)
Alignment research
I’ll start with some high-level recommendations, then give a brief overview of how I see the f...
tl;dr: rationalists concerned about AI risk often make claims that others consider not just unjustified, but unjustifiable using their current methodology, because of high-level disagreements about epistemology. If you actually want to productively discuss AI risk, make claims that can be engaged with by others who have a wide range of opinions about the appropriate level of Knightian uncertainty.
I think that many miscommunications about AI risk are caused by a difference between two types of norms for how to talk about the likelihoods of unprecedented events. I'll call these "inside view norms" versus "Knightian norms", and describe them as follows:
Inside view norms : when talking to others, you report your beliefs directly, without adjusting for "Knightian uncertainty" (i.e. possible flaws or gaps in your model of the world that you can't account for directly).
Knightian norms: you report beliefs adjusted for your best estimate of the Knightian uncertainty. For example, if you can't imagine any plausible future in which humanity and aliens end up cooperating with each other, but you think this is a domain which faces heavy Knightian uncertainty, then you might report your credence that we'll ever cooperate with aliens as 20%, or 30%, or 10%, but definitely nowhere near 0.
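Neither norm comes with a formula, but one toy way to see the difference is to treat a Knightian-norms report as an inside-view credence shrunk toward an ignorance prior, with the amount of shrinkage set by how much Knightian uncertainty you think the domain carries. The sketch below is my own illustration, not anything from the post: reported_credence, knightian_weight and the 50% ignorance prior are all assumed names and numbers. It does, though, reproduce the shape of the alien-cooperation example: an inside view near 0 still gets reported as roughly 20%.

```python
def reported_credence(inside_view: float, knightian_weight: float, ignorance_prior: float = 0.5) -> float:
    """Shrink an inside-view credence toward an ignorance prior, in proportion to how
    much Knightian uncertainty you judge the domain to have.

    knightian_weight = 0 reproduces inside view norms (report the model's credence directly);
    weights near 1 mean your explicit models barely constrain the reported number.
    """
    return (1 - knightian_weight) * inside_view + knightian_weight * ignorance_prior

# Alien-cooperation example: the inside view says ~0, but in a domain judged to carry
# heavy Knightian uncertainty the reported credence ends up nowhere near 0.
print(reported_credence(inside_view=0.01, knightian_weight=0.4))  # ~0.21
```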
I'll give a brief justification of why Knightian norms seem reasonable to me, since I expect they're counterintuitive for most people on LW. On a principled level: when reasoning about complex domains like the future, the hardest part is often "knowing the right questions to ask", or narrowing down on useful categories at all. Some different ways in which a question might be the wrong one to ask:
The question might have important ambiguities. For example, consider someone from 100 years ago asking "will humans be extinct in 1000 years?" Even for a concept like extinction that seems very black-and-white, there are many possible futures which are very non-central examples of either "extinct" or "not extinct" in the questioner's mind (e.g. all humans are digital; all humans are dramatically genetically engineered; all humans are merged with AIs; etc). And so it'd be appropriate to give an answer like "X% yes, Y% no, Z% this is the wrong question to ask".
The question might be confused or ill-posed. For example, "how heavy is phlogiston?"
You might be unable to conceptualize the actual answer. For example, suppose someone from 200 years ago asks "will physics be the fastest-moving science in the year 2023?" They think about all the sciences they know of, and all the possible future sciences they can imagine, and try to assign credences to them being the fastest-moving. But they'd very likely just totally fail to conceptualize the science that has turned out to be the fastest-moving: computer science (and machine learning more specifically). Even if they reason at a meta level "there are probably a bunch of future sciences I can basically not predict at all, so I should add credence to 'no'", the resulting uncertainty is Knightian in the sense that it's generated by reasoning about your ignorance rather than your actual models of the world.
I therefore consider Knightian norms to be appropriate when you're reasoning about a domain in which these considerations seem particularly salient. I give some more clarifications at the end of the post (in particular on why I think Knightian norms are importantly different from modesty norms). However, I'm less interested in debating the value of Knightian norms directly, and more interested in their implications for how to communicate. If one person is following inside view norms and another is following Knightian norms, that can cause serious miscommunication...
- There have been huge jumps in the capabilities of AIs over the last decade, to the point where it’s becoming hard to specify tasks that AIs can’t do.
- This progress has been primarily driven by scaling up a handful of relatively simple algorithms (rather than by developing a more principled or scientific understanding of deep learning).
- Very few people predicted that progress would be anywhere near this fast; but many of those who did also predict that we might face existential risk from AGI in the coming decades.
[not an April Fool's post]
One difficulty in having sensible discussions of AI policy is a gap between the norms used in different contexts - in particular the gap between decoupling and contextualizing norms. Chris Leong defines them as follows:
Decoupling norms: It is considered eminently reasonable to require the truth of your claims to be considered in isolation - free of any potential implications. An insistence on raising these issues despite a decoupling request is often seen as sloppy thinking or an attempt to deflect.
Contextualising norms: It is considered eminently reasonable to expect certain contextual factors or implications to be addressed. Not addressing these factors is often seen as sloppy or an intentional evasion.
LessWrong is one setting which follows very strong decoupling norms. Another is discussion of axiology in philosophy (i.e. which outcomes are better or worse than others). In discussions of axiology, it's taken for granted that claims are made without considering cooperative or deontological considerations. For example, if somebody said "a child dying by accident is worse than an old person being murdered, all else equal", then the local discussion norms would definitely not treat this as an endorsement of killing old people to save children from accidents; everyone would understand that there are other constraints in play.
By contrast, in environments with strong contextualizing norms, claims about which outcomes are better or worse than others can be interpreted as endorsements of related actions. Under these norms, the sentence above about accidents and murders could be taken as (partial) endorsement of killing old people in order to save children, unless the speaker added relevant qualifications and caveats.
In particular, I claim that policy discussions tend to follow strong contextualizing norms. I think this is partly for bad reasons (people in politics avoid decoupled statements because they're easier to criticise) and partly for good reasons (decoupled statements can be used to "set the agenda" in underhanded ways in contexts closely linked to influential decision-making, or in contexts where there's predictably lossy transmission of ideas). However, I'm less interested in arguing about the extent to which these norms are a good idea or not, and more interested in the implications of these norms for one's ability to communicate effectively.
One implication: there are many statements where saying them directly in policy discussions will be taken as implying other statements that the speaker didn't mean to imply. Most of these are not impossible to say, but instead need to be said much more carefully in order to convey the intended message. The additional effort required may make some people decide it's no longer worthwhile to say those statements; I think this is not dishonesty, but rather responsiveness to costs of communication. To me it seems analogous to how there are many statements that need to be said very carefully in order to convey the intended message under high-decoupling norms, like claims about how another person's motivations or character traits affect their arguments.
In particular, under contextualizing norms, saying "outcome X is worse than outcome Y" can be seen as an endorsement of acting in ways which achieve outcome Y instead of outcome X. There are a range of reasons why you might not endorse this despite believing the original statement (even aside from reputational/coalitional concerns). For example, if outcome Y is "a war":
You might hold yourself to deontological constraints about not starting wars.
You might worry that endorsing some wars would make other non-endorsed wars more likely.
You mig...
Episode: #141 – Richard Ngo on large language models, OpenAI, and striving to make the future go well
Release date: 2022-12-13
Large language models like GPT-3, and now ChatGPT, are neural networks trained on a large fraction of all text available on the internet to do one thing: predict the next word in a passage. This simple technique has led to something extraordinary — black boxes able to write TV scripts, explain jokes, produce satirical poetry, answer common factual questions, argue sensibly for political positions, and more. Every month their capabilities grow.
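As a loose illustration of what "trained to predict the next word" means, here is a deliberately crude sketch that "trains" a next-word predictor by counting which word follows which in a toy corpus. It is an analogy for the objective only, not for how GPT-3 or ChatGPT actually work; real models replace the lookup table with a neural network fitted to vastly more text, and the corpus and function names below are invented for the example.

```python
from collections import Counter, defaultdict

# Toy "training corpus": the only objective is predicting which word comes next.
corpus = "the cat sat on the mat . the dog sat on the rug .".split()

# "Training" here is just counting which word follows which; real models instead fit
# billions of neural-network parameters to make the same kind of prediction.
next_word_counts: defaultdict[str, Counter] = defaultdict(Counter)
for current, following in zip(corpus, corpus[1:]):
    next_word_counts[current][following] += 1

def predict_next(word: str) -> str:
    """Return the word most often seen after `word` in the toy corpus."""
    return next_word_counts[word].most_common(1)[0][0]

print(predict_next("sat"))  # "on": the most common continuation in the toy corpus
```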
But do they really 'understand' what they're saying, or do they just give the illusion of understanding?
Today's guest, Richard Ngo, thinks that in the most important sense they understand many things. Richard is a researcher at OpenAI — the company that created ChatGPT — who works to foresee where AI advances are going and develop strategies that will keep these models from 'acting out' as they become more powerful, are deployed and ultimately given power in society.
Links to learn more, summary and full transcript.
One way to think about 'understanding' is as a subjective experience. Whether it feels like something to be a large language model is an important question, but one we currently have no way to answer.
However, as Richard explains, another way to think about 'understanding' is as a functional matter. If you really understand an idea you're able to use it to reason and draw inferences in new situations. And that kind of understanding is observable and testable.
Richard argues that language models are developing sophisticated representations of the world which can be manipulated to draw sensible conclusions — maybe not so different from what happens in the human mind. And experiments have found that, as models get more parameters and are trained on more data, these types of capabilities consistently improve.
We might feel reluctant to say a computer understands something the way that we do. But if it walks like a duck and it quacks like a duck, we should consider that maybe we have a duck, or at least something sufficiently close to a duck it doesn't matter.
In today's conversation we discuss the above, as well as:
• Could speeding up AI development be a bad thing?
• The balance between excitement and fear when it comes to AI advances
• Why OpenAI focuses its efforts where it does
• Common misconceptions about machine learning
• How many computer chips it might require to be able to do most of the things humans do
• How Richard understands the 'alignment problem' differently than other people
• Why 'situational awareness' may be a key concept for understanding the behaviour of AI models
• What work to positively shape the development of AI Richard is and isn't excited about
• The AGI Safety Fundamentals course that Richard developed to help people learn more about this field
Get this episode by subscribing to our podcast on the world’s most pressing problems and how to solve them: type 80,000 Hours into your podcasting app.
Producer: Keiran Harris
Audio mastering: Milo McGuire and Ben Cordell
Transcriptions: Katy Moore
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Applications open for AGI Safety Fundamentals: Alignment Course, published by Richard Ngo on December 13, 2022 on LessWrong.
The AGI Safety Fundamentals (AGISF): Alignment Course is designed to introduce the key ideas in AGI safety and alignment, and provide a space and support for participants to engage, evaluate and debate these arguments. Participants will meet others who are excited to help mitigate risks from future AI systems, and explore opportunities for their next steps in the field. The course is being run by the same team as for previous rounds, now under a new project called BlueDot Impact.
Time commitment
The course will run from February-April 2023. It comprises 8 weeks of reading and virtual small-group discussions, followed by a 4-week capstone project. The time commitment is around 4 hours per week, so participants can engage with the course alongside full-time work or study.
Course structure
Participants are provided with structured content to work through, alongside weekly, facilitated discussion groups. Participants will be grouped depending on their ML experience and background knowledge about AI safety. In these sessions, participants will engage in activities and discussions with other participants, guided by the facilitator. The facilitator will be knowledgeable about AI safety, and can help to answer participants’ questions. The course is followed by a capstone project, which is an opportunity for participants to synthesise their views on the field and start thinking through how to put these ideas into practice, or start getting relevant skills and experience that will help them with the next step in their career. The course content is designed by Richard Ngo (Governance team at OpenAI, previously a research engineer on the AGI safety team at DeepMind). You can read the curriculum content here.
Target audience
We are most excited about applicants who would be in a strong position to pursue technical alignment research in their career, such as professional software engineers and students studying technical subjects (e.g. CS/maths/physics/engineering). That said, we consider all applicants and expect 25-50% of the course to consist of people with a variety of other backgrounds, so we encourage you to apply regardless. This includes community builders who would benefit from a deeper understanding of the concepts in AI alignment. We will be running another course on AI Governance in early 2023 and expect a different distribution of target participants.
Apply now!
If you would like to be considered for the next round of the courses, starting in February 2023, please apply here by Thursday 5th January 2023. More details can be found here. We will be evaluating applications on a rolling basis and we aim to let you know the outcome of your application by mid-January 2023. If you already have experience working on AI alignment and would be keen to join our community of facilitators, please apply to facilitate.
Who is running the course?
AGISF is now being run by BlueDot Impact - a new non-profit project running courses that support participants to develop the knowledge, community and network needed to pursue high-impact careers. BlueDot Impact spun out of Cambridge Effective Altruism, and was founded by the team who was primarily responsible for running previous rounds of AGISF.
You can read more in our announcement post here. We’re really excited about the amount of interest in the courses and think they have great potential to build awesome communities around key issues. As such we have spent the last few months:
Working with pedagogy experts to make discussion sessions more engaging
Formalising our course design process with greater transparency for participants and facilitators
Building systems to improve participant networking to create high-value ...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Brainstorming ways to make EA safer and more inclusive, published by richard ngo on November 15, 2022 on The Effective Altruism Forum.
After some recent discussion on the forum and on twitter about negative experiences that women have had in EA community spaces, I wanted to start a discussion about concrete actions that could be taken to make EA spaces safer, more comfortable, and more inclusive for women. The community health team describes some of their work related to interpersonal harm here, but I expect there's a lot more that the wider community can do to prevent sexual harassment and abusive behavior, particularly when it comes to setting up norms that proactively prevent problems rather than just dealing with them afterwards. Some prompts for discussion:
What negative experiences have you had, and what do you wish the EA community had done differently in response to them?
What specific behaviors have you seen which you wish were less common/wish there were stronger norms against? What would have helped you push back against them?
As the movement becomes larger and more professionalized, how can we enable people to set clear boundaries and deal with conflicts of interest in workplaces and grantmaking?
How can we set clearer norms related to informal power structures (e.g. people who are respected or well-connected within EA, community organizers, etc)?
What codes of conduct should we have around events like EA Global? Here's the current code; are there things which should be included in there that aren't currently (e.g. explicitly talking about not asking people out in work-related 1:1s)?
What are the best ways to get feedback to the right people on an ongoing basis? E.g. what sort of reporting mechanisms would make sure that concerning patterns in specific EA groups get noticed early? And which ones are currently in place?
How can we enable people who are best at creating safe, welcoming environments to share that knowledge? Are there specific posts which should be written about best practices and lessons learned (e.g. additions to the community health resources here)?
I'd welcome people's thoughts and experiences, whether detailed discussions or just off-the-cuff comments. I'm particularly excited about suggestions for ways to translate these ideas to concrete actions going forward.
EDIT: here's a google form for people who want to comment anonymously; the answers should be visible here. And feel free to reach out to me in messages or in person if you have suggestions for how to do this better.
Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.