May 2023: Welcome to the alpha release of TYPE III AUDIO.
Expect very rough edges and very broken stuff—and daily improvements.
Please share your thoughts, but don't share this link on social media, for now.
Summary: We demonstrate a new scalable way of interacting with language models: adding certain activation vectors into forward passes. Essentially, we add together combinations of forward passes in order to get GPT-2 to output the kinds of text we want. We provide a lot of entertaining and successful examples of these "activation additions." We also show a few activation additions which unexpectedly fail to have the desired effect.
We quantitatively evaluate how activation additions affect GPT-2's capabilities. For example, we find that adding a "wedding" vector decreases perplexity on wedding-related sentences, without harming perplexity on unrelated sentences. Overall, we find strong evidence that appropriately configured activation additions preserve GPT-2's capabilities.
Our results provide enticing clues about the kinds of programs implemented by language models. For some reason, GPT-2 allows "combination" of its forward passes, even though it was never trained to do so. Furthermore, our results are evidence of linear feature directions, including "anger", "weddings", and "create conspiracy theories."
We coin the phrase "activation engineering" to describe techniques which steer models by modifying their activations. As a complement to prompt engineering and finetuning, activation engineering is a low-overhead way to steer models at runtime. Activation additions are nearly as easy as prompting, and they offer an additional way to influence a model’s behaviors and values. We suspect that activation additions can adjust the goals being pursued by a network at inference time.
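To make the mechanism concrete, here is a minimal sketch (not the authors' code) of the activation-addition idea on a toy PyTorch model: record a layer's activations for two contrast prompts, take their scaled difference as a steering vector, and add that vector back into the same layer during a later forward pass. The layer index, coefficient, and prompt stand-ins below are illustrative assumptions, not values from the post.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for a transformer's stack of residual blocks.
layers = nn.ModuleList([nn.Linear(16, 16) for _ in range(4)])

def run(x, hook_layer=None, hook_fn=None):
    """Run x through the stack, optionally modifying one layer's output."""
    for i, layer in enumerate(layers):
        x = torch.relu(layer(x))
        if hook_fn is not None and i == hook_layer:
            x = hook_fn(x)
    return x

def capture(store):
    """Hook that records the activation it sees and passes it through unchanged."""
    def hook(a):
        store["act"] = a.detach()
        return a
    return hook

# "Embeddings" of two contrast prompts (hypothetical stand-ins for, say,
# a wedding-themed prompt and a neutral prompt).
prompt_plus = torch.randn(1, 16)
prompt_minus = torch.randn(1, 16)

# 1. Record activations for both prompts at a chosen layer.
LAYER, COEFF = 2, 4.0   # illustrative choices
plus_store, minus_store = {}, {}
run(prompt_plus, hook_layer=LAYER, hook_fn=capture(plus_store))
run(prompt_minus, hook_layer=LAYER, hook_fn=capture(minus_store))

# 2. The steering vector is their (scaled) difference.
steering_vec = COEFF * (plus_store["act"] - minus_store["act"])

# 3. At inference time, add the steering vector into that layer's activations.
user_input = torch.randn(1, 16)
steered = run(user_input, hook_layer=LAYER, hook_fn=lambda a: a + steering_vec)
plain = run(user_input)
print("output changed by steering:", not torch.allclose(steered, plain))
```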
https://www.lesswrong.com/posts/5spBue2z2tw4JuDCx/steering-gpt-2-xl-by-adding-an-activation-vector
Philosopher David Chalmers asked: "Is there a canonical source for "the argument for AGI ruin" somewhere, preferably laid out as an explicit argument with premises and a conclusion?"
Unsurprisingly, the actual reason people expect AGI ruin isn't a crisp deductive argument; it's a probabilistic update based on many lines of evidence. The specific observations and heuristics that carried the most weight for someone will vary for each individual, and can be hard to accurately draw out. That said, Eliezer Yudkowsky's So Far: Unfriendly AI Edition might be a good place to start if we want a pseudo-deductive argument just for the sake of organizing discussion. People can then say which premises they want to drill down on.

In The Basic Reasons I Expect AGI Ruin, I wrote: "When I say "general intelligence", I'm usually thinking about "whatever it is that lets human brains do astrophysics, category theory, etc. even though our brains evolved under literally zero selection pressure to solve astrophysics or category theory problems". It's possible that we should already be thinking of GPT-4 as "AGI" on some definitions, so to be clear about the threshold of generality I have in mind, I'll specifically talk about "STEM-level AGI", though I expect such systems to be good at non-STEM tasks too. STEM-level AGI is AGI that has "the basic mental machinery required to do par-human reasoning about all the hard sciences", though a specific STEM-level AGI could (e.g.) lack physics ability for the same reasons many smart humans can't solve physics problems, such as "lack of familiarity with the field".
https://www.lesswrong.com/posts/QzkTfj4HGpLEdNjXX/an-artificially-structured-argument-for-expecting-agi-ruin
You are the director of a giant government research program that’s conducting randomized controlled trials (RCTs) on two thousand health interventions, so that you can pick out the most cost-effective ones and promote them among the general population.
The quality of the two thousand interventions follows a normal distribution, centered at zero (no harm or benefit) and with standard deviation 1. (Pick whatever units you like — maybe one quality-adjusted life-year per ten thousand dollars of spending, or something in that ballpark.)
Unfortunately, you don’t know exactly how good each intervention is — after all, then you wouldn’t be doing this job. All you can do is get a noisy measurement of intervention quality using an RCT. We’ll call this measurement the intervention’s performance in your RCT.
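As a toy illustration of why this setup is tricky, here is a short simulation (my sketch, not from the post): draw true qualities from N(0, 1), add measurement noise to get each intervention's RCT performance, and look at the apparent winners. The noise levels and the "top 20" cutoff are arbitrary choices for illustration; the point is that the best-looking interventions systematically overstate their true quality, and more so as the measurements get noisier.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
quality = rng.normal(0.0, 1.0, size=n)       # true quality of each intervention

for noise_sd in (0.5, 1.0, 2.0):              # assumed RCT measurement noise
    performance = quality + rng.normal(0.0, noise_sd, size=n)
    top = np.argsort(performance)[-20:]       # the 20 best-looking interventions
    print(f"noise sd {noise_sd}: top-20 mean measured performance "
          f"{performance[top].mean():.2f}, but mean true quality "
          f"{quality[top].mean():.2f}")
```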
https://www.lesswrong.com/posts/nnDTgmzRrzDMiPF9B/how-much-do-you-believe-your-results
This is a post about mental health and disposition in relation to the alignment problem. It compiles a number of resources that address how to maintain wellbeing and direction when confronted with existential risk.
Many people in this community have posted their emotional strategies for facing Doom after Eliezer Yudkowsky’s “Death With Dignity” generated so much conversation on the subject. This post intends to be more touchy-feely, dealing more directly with emotional landscapes than questions of timelines or probabilities of success.
The resources section would benefit from community additions. Please suggest any resources that you would like to see added to this post.
Please note that this document is not intended to replace professional medical or psychological help in any way. Many preexisting mental health conditions can be exacerbated by these conversations. If you are concerned that you may be experiencing a mental health crisis, please consult a professional.
https://www.lesswrong.com/posts/pLLeGA7aGaJpgCkof/mental-health-and-the-alignment-problem-a-compilation-of
The primary talk of the AI world recently is about AI agents (whether or not it includes the question of whether we can’t help but notice we are all going to die).
The trigger for this was AutoGPT, now number one on GitHub, which allows you to turn GPT-4 (or GPT-3.5 for us clowns without proper access) into a prototype version of a self-directed agent.
We also have a paper out this week where a simple virtual world was created, populated by LLMs that were wrapped in code designed to make them simple agents, and then several days of activity were simulated, during which the AI inhabitants interacted, formed and executed plans, and it all seemed like the beginnings of a living and dynamic world. Game version hopefully coming soon.
How should we think about this? How worried should we be?
https://www.lesswrong.com/posts/566kBoPi76t8KAkoD/on-autogpt
https://thezvi.wordpress.com/
(Related text posted to Twitter; this version is edited and has a more advanced final section.)
Imagine yourself in a box, trying to predict the next word - assign as much probability mass to the next token as possible - for all the text on the Internet.
Koan: Is this a task whose difficulty caps out as human intelligence, or at the intelligence level of the smartest human who wrote any Internet text? What factors make that task easier, or harder? (If you don't have an answer, maybe take a minute to generate one, or alternatively, try to predict what I'll say next; if you do have an answer, take a moment to review it inside your mind, or maybe say the words out loud.)
https://www.lesswrong.com/posts/nH4c3Q9t9F3nJ7y8W/gpts-are-predictors-not-imitators
https://www.lesswrong.com/posts/fJBTRa7m7KnCDdzG5/a-stylized-dialogue-on-john-wentworth-s-claims-about-markets
(This is a stylized version of a real conversation, where the first part happened as part of a public debate between John Wentworth and Eliezer Yudkowsky, and the second part happened between John and me over the following morning. The below is combined, stylized, and written in my own voice throughout. The specific concrete examples in John's part of the dialog were produced by me. It's over a year old. Sorry for the lag.)
(As to whether John agrees with this dialog, he said "there was not any point at which I thought my views were importantly misrepresented" when I asked him for comment.)
https://www.lesswrong.com/posts/iy2o4nQj9DnQD7Yhj/discussion-with-nate-soares-on-a-key-alignment-difficulty
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
In late 2022, Nate Soares gave some feedback on my Cold Takes series on AI risk (shared as drafts at that point), stating that I hadn't discussed what he sees as one of the key difficulties of AI alignment.
I wanted to understand the difficulty he was pointing to, so the two of us had an extended Slack exchange, and I then wrote up a summary of the exchange that we iterated on until we were both reasonably happy with its characterization of the difficulty and our disagreement.[1] My short summary is:
https://www.lesswrong.com/posts/XWwvwytieLtEWaFJX/deep-deceptiveness
This post is an attempt to gesture at a class of AI notkilleveryoneism (alignment) problem that seems to me to go largely unrecognized. E.g., it isn’t discussed (or at least I don't recognize it) in the recent plans written up by OpenAI (1,2), by DeepMind’s alignment team, or by Anthropic, and I know of no other acknowledgment of this issue by major labs.
You could think of this as a fragment of my answer to “Where do plans like OpenAI’s ‘Our Approach to Alignment Research’ fail?”, as discussed in Rob and Eliezer’s challenge for AGI organizations and readers. Note that it would only be a fragment of the reply; there's a lot more to say about why AI alignment is a particularly tricky task to task an AI with. (Some of which Eliezer gestures at in a follow-up to his interview on Bankless.)
https://www.lesswrong.com/posts/nTGEeRSZrfPiJwkEc/the-onion-test-for-personal-and-institutional-honesty
[co-written by Chana Messinger and Andrew Critch; Andrew is the originator of the idea]
You (or your organization or your mission or your family or etc.) pass the “onion test” for honesty if each layer hides but does not mislead about the information hidden within.
When people get to know you better, or rise higher in your organization, they may find out new things, but should not be shocked by the types of information that were hidden. If they are, you messed up in creating the outer layers: they should have appropriately described the kind of thing that might be inside.
Examples
Positive example:
Outer layer says: "I usually treat my health information as private."
Next layer in says: "Here are the specific health problems I have: Gout, diabetes."
Negative example:
Outer layer says: "I usually treat my health info as private."
Next layer in: "I operate a cocaine dealership. Sorry I didn't warn you that I was also private about my illegal activities."
https://www.lesswrong.com/posts/fRwdkop6tyhi3d22L/there-s-no-such-thing-as-a-tree-phylogenetically
This is a linkpost for https://eukaryotewritesblog.com/2021/05/02/theres-no-such-thing-as-a-tree/
[Crossposted from Eukaryote Writes Blog.]
So you’ve heard about how fish aren’t a monophyletic group? You’ve heard about carcinization, the process by which ocean arthropods convergently evolve into crabs? You say you get it now? Sit down. Sit down. Shut up. Listen. You don’t know nothing yet.
“Trees” are not a coherent phylogenetic category. On the evolutionary tree of plants, trees are regularly interspersed with things that are absolutely, 100% not trees. This means that, for instance, either:
I thought I had a pretty good guess at this, but the situation is far worse than I could have imagined.
https://www.lesswrong.com/posts/ma7FSEtumkve8czGF/losing-the-root-for-the-tree
You know that being healthy is important. And that there's a lot of stuff you could do to improve your health: getting enough sleep, eating well, reducing stress, and exercising, to name a few.
There’s various things to hit on when it comes to exercising too. Strength, obviously. But explosiveness is a separate thing that you have to train for. Same with flexibility. And don’t forget cardio!
Strength is most important though, because of course it is. And there’s various things you need to do to gain strength. It all starts with lifting, but rest matters too. And supplements. And protein. Can’t forget about protein.
Protein is a deeper and more complicated subject than it may at first seem. Sure, the amount of protein you consume matters, but that’s not the only consideration. You also have to think about the timing. Consuming large amounts 2x a day is different than consuming smaller amounts 5x a day. And the type of protein matters too. Animal is different than plant, which is different from dairy. And then quality is of course another thing that is important.
But quality isn’t an easy thing to figure out. The big protein supplement companies are Out To Get You. They want to mislead you. Information sources aren’t always trustworthy. You can’t just hop on The Wirecutter and do what they tell you. Research is needed.
So you listen to a few podcasts. Follow a few YouTubers. Start reading some blogs. Throughout all of this you try various products and iterate as you learn more. You’re no Joe Rogan, but you’re starting to become pretty informed.
https://gwern.net/fiction/clippy
In A.D. 20XX. Work was beginning. “How are you gentlemen !!”… (Work. Work never changes; work is always hell.)
Specifically, a MoogleBook researcher has gotten a pull request from Reviewer #2 on his new paper in evolutionary search in auto-ML, for error bars on the auto-ML hyperparameter sensitivity like larger batch sizes, because more can be different and there’s high variance in the old runs with a few anomalously high gain of function. (“Really? Really? That’s what you’re worried about?”) He can’t see why worry, and wonders what sins he committed to deserve this asshole Chinese (given the Engrish) reviewer, as he wearily kicks off yet another HQU experiment…
https://www.lesswrong.com/posts/K4urTDkBbtNuLivJx/why-i-think-strong-general-ai-is-coming-soon
I think there is little time left before someone builds AGI (median ~2030). Once upon a time, I didn't think this.
This post attempts to walk through some of the observations and insights that collapsed my estimates.
The core ideas are as follows:
https://www.lesswrong.com/posts/HBxe6wdjxK239zajf/what-failure-looks-like
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
The stereotyped image of AI catastrophe is a powerful, malicious AI system that takes its creators by surprise and quickly achieves a decisive advantage over the rest of humanity.
I think this is probably not what failure will look like, and I want to try to paint a more realistic picture. I’ll tell the story in two parts:
I think these are the most important problems if we fail to solve intent alignment.
In practice these problems will interact with each other, and with other disruptions/instability caused by rapid progress. These problems are worse in worlds where progress is relatively fast, and fast takeoff can be a key risk factor, but I’m scared even if we have several years.
https://www.lesswrong.com/posts/gNodQGNoPDjztasbh/lies-damn-lies-and-fabricated-options
This is an essay about one of those "once you see it, you will see it everywhere" phenomena. It is a psychological and interpersonal dynamic roughly as common, and almost as destructive, as motte-and-bailey, and at least in my own personal experience it's been quite valuable to have it reified, so that I can quickly recognize the commonality between what I had previously thought of as completely unrelated situations.
The original quote referenced in the title is "There are three kinds of lies: lies, damned lies, and statistics."
https://www.lesswrong.com/posts/thkAtqoQwN6DtaiGT/carefully-bootstrapped-alignment-is-organizationally-hard
In addition to technical challenges, plans to safely develop AI face lots of organizational challenges. If you're running an AI lab, you need a concrete plan for handling that.
In this post, I'll explore some of those issues, using one particular AI plan as an example. I first heard this described by Buck at EA Global London, and more recently in OpenAI's alignment plan. (I think Anthropic's plan has a fairly different ontology, although it still ultimately routes through a similar set of difficulties.)
I'd call the cluster of plans similar to this "Carefully Bootstrapped Alignment."
https://www.lesswrong.com/posts/4Gt42jX7RiaNaxCwP/more-information-about-the-dangerous-capability-evaluations
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
This is a linkpost for https://evals.alignment.org/blog/2023-03-18-update-on-recent-evals/
[Written for more of a general-public audience than alignment-forum audience. We're working on a more thorough technical report.]
We believe that capable enough AI systems could pose very large risks to the world. We don’t think today’s systems are capable enough to pose these sorts of risks, but we think that this situation could change quickly and it’s important to be monitoring the risks consistently. Because of this, ARC is partnering with leading AI labs such as Anthropic and OpenAI as a third-party evaluator to assess potentially dangerous capabilities of today’s state-of-the-art ML models. The dangerous capability we are focusing on is the ability to autonomously gain resources and evade human oversight.
We attempt to elicit models’ capabilities in a controlled environment, with researchers in-the-loop for anything that could be dangerous, to understand what might go wrong before models are deployed. We think that future highly capable models should involve similar “red team” evaluations for dangerous capabilities before the models are deployed or scaled up, and we hope more teams building cutting-edge ML systems will adopt this approach. The testing we’ve done so far is insufficient for many reasons, but we hope that the rigor of evaluations will scale up as AI systems become more capable.
As we expected going in, today’s models (while impressive) weren’t capable of autonomously making and carrying out the dangerous activities we tried to assess. But models are able to succeed at several of the necessary components. Given only the ability to write and run code, models have some success at simple tasks involving browsing the internet, getting humans to do things for them, and making long-term plans – even if they cannot yet execute on this reliably.
https://www.lesswrong.com/posts/zidQmfFhMgwFzcHhs/enemies-vs-malefactors
Status: some mix of common wisdom (that bears repeating in our particular context), and another deeper point that I mostly failed to communicate.
Short version
Harmful people often lack explicit malicious intent. It’s worth deploying your social or community defenses against them anyway. I recommend focusing less on intent and more on patterns of harm.
(Credit to my explicit articulation of this idea goes in large part to Aella, and also in part to Oliver Habryka.)
https://www.lesswrong.com/posts/LzQtrHSYDafXynofq/the-parable-of-the-king-and-the-random-process
~ A Parable of Forecasting Under Model Uncertainty ~
You, the monarch, need to know when the rainy season will begin, in order to properly time the planting of the crops. You have two advisors, Pronto and Eternidad, who you trust exactly equally.
You ask them both: "When will the next heavy rain occur?"
Pronto says, "Three weeks from today."
Eternidad says, "Ten years from today."
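A back-of-the-envelope sketch of the tension in the parable (mine, not the post's): if you trust the two advisors equally and simply average their forecasts, you get a "plant in about five years" answer that is wrong under either advisor's model.

```python
WEEKS_PER_YEAR = 52
pronto_weeks = 3                        # Pronto's forecast
eternidad_weeks = 10 * WEEKS_PER_YEAR   # Eternidad's forecast

# Equal trust in both advisors -> a 50/50 mixture over the two models.
expected_weeks = 0.5 * pronto_weeks + 0.5 * eternidad_weeks
print(f"Expected time to rain: {expected_weeks:.0f} weeks "
      f"(about {expected_weeks / WEEKS_PER_YEAR:.1f} years)")

# Yet under this same mixture there is a 50% chance the rain arrives in three
# weeks, so planning around the ~5-year expectation risks missing it entirely.
```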
https://www.lesswrong.com/posts/D7PumeYTDPfBTp3i7/the-waluigi-effect-mega-post
In this article, I will present a mechanistic explanation of the Waluigi Effect and other bizarre "semiotic" phenomena which arise within large language models such as GPT-3/3.5/4 and their variants (ChatGPT, Sydney, etc). This article will be folklorish to some readers, and profoundly novel to others.
https://www.lesswrong.com/posts/3RSq3bfnzuL3sp46J/acausal-normalcy
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
This post is also available on the EA Forum.
Summary: Having thought a bunch about acausal trade — and proven some theorems relevant to its feasibility — I believe there do not exist powerful information hazards about it that stand up to clear and circumspect reasoning about the topic. I say this to be comforting rather than dismissive; if it sounds dismissive, I apologize.
With that said, I have four aims in writing this post:
https://www.lesswrong.com/posts/RryyWNmJNnLowbhfC/please-don-t-throw-your-mind-away
[Warning: the following dialogue contains an incidental spoiler for "Music in Human Evolution" by Kevin Simler. That post is short, good, and worth reading without spoilers, and this post will still be here if you come back later. It's also possible to get the point of this post by skipping the dialogue and reading the other sections.]
Pretty often, talking to someone who's arriving to the existential risk / AGI risk / longtermism cluster, I'll have a conversation like the following:
Tsvi: "So, what's been catching your eye about this stuff?"
Arrival: "I think I want to work on machine learning, and see if I can contribute to alignment that way."
T: "What's something that got your interest in ML?"
A: "It seems like people think that deep learning might be on the final ramp up to AGI, so I should probably know how that stuff works, and I think I have a good chance of learning ML at least well enough to maybe contribute to a research project."
------
This is an experiment with AI narration. What do you think? Tell us by going to t3a.is.
------
https://www.lesswrong.com/posts/bxt7uCiHam4QXrQAA/cyborgism
There is a lot of disagreement and confusion about the feasibility and risks associated with automating alignment research. Some see it as the default path toward building aligned AI, while others expect limited benefit from near term systems, expecting the ability to significantly speed up progress to appear well after misalignment and deception. Furthermore, progress in this area may directly shorten timelines or enable the creation of dual purpose systems which significantly speed up capabilities research.
OpenAI recently released their alignment plan. It focuses heavily on outsourcing cognitive work to language models, transitioning us to a regime where humans mostly provide oversight to automated research assistants. While there have been a lot of objections to and concerns about this plan, there hasn’t been a strong alternative approach aiming to automate alignment research which also takes all of the many risks seriously.
The intention of this post is not to propose an end-all cure for the tricky problem of accelerating alignment using GPT models. Instead, the purpose is to explicitly put another point on the map of possible strategies, and to add nuance to the overall discussion.
https://www.lesswrong.com/posts/CYN7swrefEss4e3Qe/childhoods-of-exceptional-people
This is a linkpost for https://escapingflatland.substack.com/p/childhoods
Let’s start with one of those insights that are as obvious as they are easy to forget: if you want to master something, you should study the highest achievements of your field. If you want to learn writing, read great writers, etc.
But this is not what parents usually do when they think about how to educate their kids. The default for a parent is rather to imitate their peers and outsource the big decisions to bureaucracies. But what would we learn if we studied the highest achievements?
Thinking about this question, I wrote down a list of twenty names—von Neumann, Tolstoy, Curie, Pascal, etc—selected on the highly scientific criteria “a random Swedish person can recall their name and think, Sounds like a genius to me”. That list is to me a good first approximation of what an exceptional result in the field of child-rearing looks like. I ordered a few piles of biographies, read, and took notes. Trying to be a little less biased in my sample, I asked myself if I could recall anyone exceptional that did not fit the patterns I saw in the biographies, which I could, and so I ordered a few more biographies.
This kept going for an unhealthy amount of time.
I sampled writers (Virginia Woolf, Lev Tolstoy), mathematicians (John von Neumann, Blaise Pascal, Alan Turing), philosophers (Bertrand Russell, René Descartes), and composers (Mozart, Bach), trying to get a diverse sample.
In this essay, I am going to detail a few of the patterns that have struck me after having skimmed 42 biographies. I will sort the claims so that I start with more universal patterns and end with patterns that are less common.
https://www.lesswrong.com/posts/NJYmovr9ZZAyyTBwM/what-i-mean-by-alignment-is-in-large-part-about-making
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
(Epistemic status: attempting to clear up a misunderstanding about points I have attempted to make in the past. This post is not intended as an argument for those points.)
I have long said that the lion's share of the AI alignment problem seems to me to be about pointing powerful cognition at anything at all, rather than figuring out what to point it at.
It’s recently come to my attention that some people have misunderstood this point, so I’ll attempt to clarify here.
https://www.lesswrong.com/posts/NRrbJJWnaSorrqvtZ/on-not-getting-contaminated-by-the-wrong-obesity-ideas
A Chemical Hunger, a series by the authors of the blog Slime Mold Time Mold (SMTM), argues that the obesity epidemic is entirely caused by environmental contaminants.
In my last post, I investigated SMTM’s main suspect (lithium).[1] This post collects other observations I have made about SMTM’s work, not narrowly related to lithium, but rather focused on the broader thesis of their blog post series.
I think that the environmental contamination hypothesis of the obesity epidemic is a priori plausible. After all, we know that chemicals can affect humans, and our exposure to chemicals has plausibly changed a lot over time. However, I found that several of what seem to be SMTM’s strongest arguments in favor of the contamination theory turned out to be dubious, and that nearly all of the interesting things I thought I’d learned from their blog posts turned out to actually be wrong. I’ll explain that in this post.
https://www.lesswrong.com/posts/aPeJE8bSo6rAFoLqg/solidgoldmagikarp-plus-prompt-generation
Work done at SERI-MATS, over the past two months, by Jessica Rumbelow and Matthew Watkins.
TL;DR
Anomalous tokens: a mysterious failure mode for GPT (which reliably insulted Matthew)
Prompt generation: a new interpretability method for language models (which reliably finds prompts that result in a target completion). This is good for:
In this post, we'll introduce the prototype of a new model-agnostic interpretability method for language models which reliably generates adversarial prompts that result in a target completion. We'll also demonstrate a previously undocumented failure mode for GPT-2 and GPT-3 language models, which results in bizarre completions (in some cases explicitly contrary to the purpose of the model), and present the results of our investigation into this phenomenon. Further detail can be found in a follow-up post.
https://www.lesswrong.com/posts/Zp6wG5eQFLGWwcG6j/focus-on-the-places-where-you-feel-shocked-everyone-s
Writing down something I’ve found myself repeating in different conversations:
If you're looking for ways to help with the whole “the world looks pretty doomed” business, here's my advice: look around for places where we're all being total idiots.
Look for places where everyone's fretting about a problem that some part of you thinks it could obviously just solve.
Look around for places where something seems incompetently run, or hopelessly inept, and where some part of you thinks you can do better.
Then do it better.
https://www.lesswrong.com/posts/XPv4sYrKnPzeJASuk/basics-of-rationalist-discourse-1
Introduction
This post is meant to be a linkable resource. Its core is a short list of guidelines (you can link directly to the list) that are intended to be fairly straightforward and uncontroversial, for the purpose of nurturing and strengthening a culture of clear thinking, clear communication, and collaborative truth-seeking.
"Alas," said Dumbledore, "we all know that what should be, and what is, are two different things. Thank you for keeping this in mind."
There is also (for those who want to read more than the simple list) substantial expansion/clarification of each specific guideline, along with justification for the overall philosophy behind the set.
https://www.lesswrong.com/posts/PCrTQDbciG4oLgmQ5/sapir-whorf-for-rationalists
Casus Belli: As I was scanning over my (rather long) list of essays-to-write, I realized that roughly a fifth of them were of the form "here's a useful standalone concept I'd like to reify," à la cup-stacking skills, fabricated options, split and commit, and sazen. Some notable entries on that list (which I name here mostly in the hope of someday coming back and turning them into links) include: red vs. white, walking with three, setting the zero point[1], seeding vs. weeding, hidden hinges, reality distortion fields, and something-about-layers-though-that-one-obviously-needs-a-better-word.
While it's still worthwhile to motivate/justify each individual new conceptual handle (and the planned essays will do so), I found myself imagining a general objection of the form "this is just making up terms for things," or perhaps "this is too many new terms, for too many new things." I realized that there was a chunk of argument, repeated across all of the planned essays, that I could factor out, and that (to the best of my knowledge) there was no single essay aimed directly at the question "why new words/phrases/conceptual handles at all?"
So ... voilà.
https://www.lesswrong.com/posts/pDzdb4smpzT3Lwbym/my-model-of-ea-burnout
(Probably somebody else has said most of this. But I personally haven't read it, and felt like writing it down myself, so here we go.)
I think that EA [editor note: "Effective Altruism"] burnout usually results from prolonged dedication to satisfying the values you think you should have, while neglecting the values you actually have.
Setting aside for the moment what “values” are and what it means to “actually” have one, suppose that I actually value these things (among others):
https://www.lesswrong.com/posts/Xo7qmDakxiizG7B9c/the-social-recession-by-the-numbers
This is a linkpost for https://novum.substack.com/p/social-recession-by-the-numbers
Fewer friends, relationships on the decline, delayed adulthood, trust at an all-time low, and many diseases of despair. The prognosis is not great.
One of the most discussed topics online recently has been friendships and loneliness. Ever since the infamous chart showing more people are not having sex than ever before first made the rounds, there’s been increased interest in the social state of things. Polling has demonstrated a marked decline in all spheres of social life, including close friends, intimate relationships, trust, labor participation, and community involvement. The trend looks to have worsened since the pandemic, although it will take some years before this is clearly established.
The decline comes alongside a documented rise in mental illness, diseases of despair, and poor health more generally. In August 2022, the CDC announced that U.S. life expectancy has fallen further and is now where it was in 1996. Contrast this to Western Europe, where it has largely rebounded to pre-pandemic numbers. Still, even before the pandemic, the years 2015-2017 saw the longest sustained decline in U.S. life expectancy since 1915-18. While my intended angle here is not health-related, general sociability is closely linked to health. The ongoing shift has been called the “friendship recession” or the “social recession.”
https://www.lesswrong.com/posts/pHfPvb4JMhGDr4B7n/recursive-middle-manager-hell
I think Zvi's Immoral Mazes sequence is really important, but comes with more worldview-assumptions than are necessary to make the points actionable. I conceptualize Zvi as arguing for multiple hypotheses. In this post I want to articulate one sub-hypothesis, which I call "Recursive Middle Manager Hell". I'm deliberately not covering some other components of his model[1].
tl;dr:
Something weird and kinda horrifying happens when you add layers of middle-management. This has ramifications on when/how to scale organizations, and where you might want to work, and maybe general models of what's going on in the world.
You could summarize the effect as "the org gets more deceptive, less connected to its original goals, more focused on office politics, less able to communicate clearly within itself, and selected for more sociopathy in upper management."
You might read that list of things and say "sure, seems a bit true", but one of the main points here is "Actually, this happens in a deeper and more insidious way than you're probably realizing, with much higher costs than you're acknowledging. If you're scaling your organization, this should be one of your primary worries."
https://www.lesswrong.com/posts/mfPHTWsFhzmcXw8ta/the-feeling-of-idea-scarcity
Here’s a story you may recognize. There's a bright up-and-coming young person - let's call her Alice. Alice has a cool idea. It seems like maybe an important idea, a big idea, an idea which might matter. A new and valuable idea. It’s the first time Alice has come up with a high-potential idea herself, something which she’s never heard in a class or read in a book or what have you.
So Alice goes all-in pursuing this idea. She spends months fleshing it out. Maybe she writes a paper, or starts a blog, or gets a research grant, or starts a company, or whatever, in order to pursue the high-potential idea, bring it to the world.
And sometimes it just works!
… but more often, the high-potential idea doesn’t actually work out. Maybe it turns out to be basically-the-same as something which has already been tried. Maybe it runs into some major barrier, some not-easily-patchable flaw in the idea. Maybe the problem it solves just wasn’t that important in the first place.
From Alice’s point of view, the possibility that her one high-potential idea wasn’t that great after all is painful. The idea probably feels to Alice like the single biggest intellectual achievement of her life. To lose that, to find out that her single greatest intellectual achievement amounts to little or nothing… that hurts to even think about. So most likely, Alice will reflexively look for an out. She’ll look for some excuse to ignore the similar ideas which have already been tried, some reason to think her idea is different. She’ll look for reasons to believe that maybe the major barrier isn’t that much of an issue, or that we Just Don’t Know whether it’s actually an issue and therefore maybe the idea could work after all. She’ll look for reasons why the problem really is important. Maybe she’ll grudgingly acknowledge some shortcomings of the idea, but she’ll give up as little ground as possible at each step, update as slowly as she can.
https://www.lesswrong.com/posts/TWorNr22hhYegE4RT/models-don-t-get-reward
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
In terms of content, this has a lot of overlap with Reward is not the optimization target. I'm basically rewriting a part of that post in language I personally find clearer, emphasising what I think is the core insight.
When thinking about deception and RLHF training, a simplified threat model is something like this:
Before continuing, I would encourage you to really engage with the above. Does it make sense to you? Is it making any hidden assumptions? Is it missing any steps? Can you rewrite it to be more mechanistically correct?
I believe that when people use the above threat model, they are either using it as shorthand for something else or they misunderstand how reinforcement learning works. Most alignment researchers will be in the former category. However, I was in the latter.
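One way to see the point that models don't "get" reward: in a standard policy-gradient setup, the reward only ever appears in the training loop, where it scales a weight update; it is never an input the network observes or accumulates. Below is a minimal REINFORCE-style sketch in PyTorch (my illustration, not the author's code), with toy sizes and a made-up reward rule.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
policy = nn.Linear(4, 2)                      # toy policy over 2 actions
opt = torch.optim.SGD(policy.parameters(), lr=0.1)

state = torch.randn(1, 4)
logits = policy(state)
dist = torch.distributions.Categorical(logits=logits)
action = dist.sample()

reward = 1.0 if action.item() == 0 else -1.0  # reward exists only out here,
                                              # in the training loop
loss = -reward * dist.log_prob(action)        # REINFORCE: reinforce the
opt.zero_grad()                               # sampled action in proportion
loss.backward()                               # to the reward...
opt.step()                                    # ...by changing the weights

# The forward pass never saw `reward`; the network only maps states to action
# logits. "Getting reward" is shorthand for "being updated in this way."
```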
https://www.lesswrong.com/posts/L4anhrxjv8j2yRKKp/how-discovering-latent-knowledge-in-language-models-without
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
Introduction
A few collaborators and I recently released a new paper: Discovering Latent Knowledge in Language Models Without Supervision. For a quick summary of our paper, you can check out this Twitter thread.
In this post I will describe how I think the results and methods in our paper fit into a broader scalable alignment agenda. Unlike the paper, this post is explicitly aimed at an alignment audience and is mainly conceptual rather than empirical.
Tl;dr: unsupervised methods are more scalable than supervised methods, deep learning has special structure that we can exploit for alignment, and we may be able to recover superhuman beliefs from deep learning representations in a totally unsupervised way.
Disclaimers: I have tried to make this post concise, at the cost of not making the full arguments for many of my claims; you should treat this as more of a rough sketch of my views rather than anything comprehensive. I also frequently change my mind – I’m usually more consistently excited about some of the broad intuitions but much less wedded to the details – and this of course just represents my current thinking on the topic.
https://www.lesswrong.com/posts/qRtD4WqKRYEtT5pi3/the-next-decades-might-be-wild
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
I’d like to thank Simon Grimm and Tamay Besiroglu for feedback and discussions.
This post is inspired by What 2026 looks like and an AI vignette workshop guided by Tamay Besiroglu. I think of this post as “what would I expect the world to look like if these timelines (median compute for transformative AI ~2036) were true” or “what short-to-medium timelines feel like” since I find it hard to translate a statement like “median TAI year is 20XX” into a coherent imaginable world.
I expect some readers to think that the post sounds wild and crazy, but that doesn’t mean its content couldn’t be true. If you had told someone in 1990 or 2000 that there would be more smartphones and computers than humans in 2020, that probably would have sounded wild to them. The same could be true for AIs, i.e. that in 2050 there are more human-level AIs than humans. The fact that this sounds as ridiculous as ubiquitous smartphones sounded to the 1990/2000 person might just mean that we are bad at predicting exponential growth and disruptive technology.
Update: titotal points out in the comments that the correct timeframe for computers is probably 1980 to 2020. So the correct time span is probably 40 years instead of 30. For mobile phones, it's probably 1993 to 2020 if you can trust this statistic.
I’m obviously not confident (see confidence and takeaways section) in this particular prediction but many of the things I describe seem like relatively direct consequences of more and more powerful and ubiquitous AI mixed with basic social dynamics and incentives.
https://www.lesswrong.com/posts/SqjQFhn5KTarfW8v7/lessons-learned-from-talking-to-greater-than-100-academics
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
I’d like to thank MH, Jaime Sevilla and Tamay Besiroglu for their feedback.
During my Master's and Ph.D. (still ongoing), I have spoken with many academics about AI safety. These conversations include chats with individual PhDs, poster presentations and talks about AI safety.
I think I have learned a lot from these conversations and expect many other people concerned about AI safety to find themselves in similar situations. Therefore, I want to detail some of my lessons and make my thoughts explicit so that others can scrutinize them.
TL;DR: People in academia seem more and more open to arguments about risks from advanced intelligence over time, and I would genuinely recommend having lots of these chats. Furthermore, I underestimated how much work related to some aspects of AI safety already exists in academia and that we sometimes reinvent the wheel. Messaging matters, e.g. technical discussions got more interest than alarmism, and explaining the problem rather than trying to actively convince someone received better feedback.
https://www.lesswrong.com/posts/6LzKRP88mhL9NKNrS/how-my-team-at-lightcone-sometimes-gets-stuff-done
Disclaimer: I originally wrote this as a private doc for the Lightcone team. I then showed it to John and he said he would pay me to post it here. That sounded awfully compelling. However, I wanted to note that I’m an early founder who hasn't built anything truly great yet. I’m writing this doc because as Lightcone is growing, I have to take a stance on these questions. I need to design our org to handle more people. Still, I haven’t seen the results long-term, and who knows if this is good advice. Don’t overinterpret this.
Suppose you went up on stage in front of a company you founded, that now had grown to 100, or 1,000, or 10,000+ people. You were going to give a talk about your company values. You can say things like “We care about moving fast, taking responsibility, and being creative” -- but I expect these words would mostly fall flat. At the end of the day, the path the water takes down the hill is determined by the shape of the territory, not the sound the water makes as it swooshes by. To manage that many people, it seems to me you need clear, concrete instructions. What are those? What are things you could write down on a piece of paper and pass along your chain of command, such that if at the end people go ahead and just implement them, without asking what you meant, they would still preserve some chunk of what makes your org work?
https://www.lesswrong.com/posts/rP66bz34crvDudzcJ/decision-theory-does-not-imply-that-we-get-to-have-nice
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
(Note: I wrote this with editing help from Rob and Eliezer. Eliezer's responsible for a few of the paragraphs.)
A common confusion I see in the tiny fragment of the world that knows about logical decision theory (FDT/UDT/etc.), is that people think LDT agents are genial and friendly for each other.[1]
One recent example is Will Eden’s tweet about how maybe a molecular paperclip/squiggle maximizer would leave humanity a few stars/galaxies/whatever on game-theoretic grounds. (And that's just one example; I hear this suggestion bandied around pretty often.)
I'm pretty confident that this view is wrong (alas), and based on a misunderstanding of LDT. I shall now attempt to clear up that confusion.
To begin, a parable: the entity Omicron (Omega's little sister) fills box A with $1M and box B with $1k, and puts them both in front of an LDT agent saying "You may choose to take either one or both, and know that I have already chosen whether to fill the first box". The LDT agent takes both.
"What?" cries the CDT agent. "I thought LDT agents one-box!"
LDT agents don't cooperate because they like cooperating. They don't one-box because the name of the action starts with an 'o'. They maximize utility, using counterfactuals that assert that the world they are already in (and the observations they have already seen) can (in the right circumstances) depend (in a relevant way) on what they are later going to do.
A paperclipper cooperates with other LDT agents on a one-shot prisoner's dilemma because they get more paperclips that way. Not because it has a primitive property of cooperativeness-with-similar-beings. It needs to get the more paperclips.
https://www.lesswrong.com/posts/6Xgy6CAf2jqHhynHL/what-2026-looks-like#2022
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
This was written for the Vignettes Workshop.[1] The goal is to write out a detailed future history (“trajectory”) that is as realistic (to me) as I can currently manage, i.e. I’m not aware of any alternative trajectory that is similarly detailed and clearly more plausible to me. The methodology is roughly: Write a future history of 2022. Condition on it, and write a future history of 2023. Repeat for 2024, 2025, etc. (I'm posting 2022-2026 now so I can get feedback that will help me write 2027+. I intend to keep writing until the story reaches singularity/extinction/utopia/etc.)
What’s the point of doing this? Well, there are a couple of reasons:
This vignette was hard to write. To achieve the desired level of detail I had to make a bunch of stuff up, but in order to be realistic I had to constantly ask “but actually though, what would really happen in this situation?” which made it painfully obvious how little I know about the future. There are numerous points where I had to conclude “Well, this does seem implausible, but I can’t think of anything more plausible at the moment and I need to move on.” I fully expect the actual world to diverge quickly from the trajectory laid out here. Let anyone who (with the benefit of hindsight) claims this divergence as evidence against my judgment prove it by exhibiting a vignette/trajectory they themselves wrote in 2021. If it maintains a similar level of detail (and thus sticks its neck out just as much) while being more accurate, I bow deeply in respect!
https://www.lesswrong.com/posts/REA49tL5jsh69X3aM/introduction-to-abstract-entropy#fnrefpi8b39u5hd7
This post, and much of the following sequence, was greatly aided by feedback from the following people (among others): Lawrence Chan, Joanna Morningstar, John Wentworth, Samira Nedungadi, Aysja Johnson, Cody Wild, Jeremy Gillen, Ryan Kidd, Justis Mills and Jonathan Mustin. Illustrations by Anne Ore.
Introduction & motivation
In the course of researching optimization, I decided that I had to really understand what entropy is.[1] But there are a lot of other reasons why the concept is worth studying:
I didn't intend to write a post about entropy when I started trying to understand it. But I found the existing resources (textbooks, Wikipedia, science explainers) so poor that it actually seems important to have a better one as a prerequisite for understanding optimization! One failure mode I was running into was that other resources tended only to be concerned about the application of the concept in their particular sub-domain. Here, I try to take on the task of synthesizing the abstract concept of entropy, to show what's so deep and fundamental about it. In future posts, I'll talk about things like:
https://www.lesswrong.com/posts/8vesjeKybhRggaEpT/consider-your-appetite-for-disagreements
Poker
There was a time about five years ago where I was trying to get good at poker. If you want to get good at poker, one thing you have to do is review hands. Preferably with other people.
For example, suppose you have ace king offsuit on the button. Someone in the highjack opens to 3 big blinds preflop. You call. Everyone else folds. The flop is dealt. It's a rainbow Q75. You don't have any flush draws. You missed. Your opponent bets. You fold. They take the pot and you move to the next hand.
Once you finish your session, it'd be good to come back and review this hand. Again, preferably with another person. To do this, you would review each decision point in the hand. Here, there were two decision points.
The first was when you faced a 3BB open from HJ preflop with AKo. In the hand, you decided to call. However, this of course wasn't your only option. You had two others: you could have folded, and you could have raised. Actually, you could have raised to various sizes. You could have raised small to 8BB, medium to 10BB, or big to 12BB. Or hell, you could have just shoved 200BB! But that's not really a realistic option, nor is folding. So in practice your decision was between calling and raising to various realistic sizes.
https://www.lesswrong.com/posts/fFY2HeC9i2Tx8FEnK/my-resentful-story-of-becoming-a-medical-miracle
This is a linkpost for https://acesounderglass.com/2022/10/13/my-resentful-story-of-becoming-a-medical-miracle/
You know those health books with “miracle cure” in the subtitle? The ones that always start with a preface about a particular patient who was completely hopeless until they tried the supplement/meditation technique/healing crystal that the book is based on? These people always start broken and miserable, unable to work or enjoy life, perhaps even suicidal from the sheer hopelessness of getting their body to stop betraying them. They’ve spent decades trying everything and nothing has worked until their friend makes them see the book’s author, who prescribes the same thing they always prescribe, and the patient immediately stands up and starts dancing because their problem is entirely fixed (more conservative books will say it took two sessions). You know how those are completely unbelievable, because anything that worked that well would go mainstream, so basically the book is starting you off with a shit test to make sure you don’t challenge its bullshit later?
Well 5 months ago I became one of those miraculous stories, except worse, because my doctor didn’t even do it on purpose. This finalized some already fermenting changes in how I view medical interventions and research. Namely: sometimes knowledge doesn’t work and then you have to optimize for luck.
I assure you I’m at least as unhappy about this as you are.
https://www.lesswrong.com/posts/CKgPFHoWFkviYz7CB/the-redaction-machine
On the 3rd of October 2351 a machine flared to life. Huge energies coursed into it via cables, only to leave moments later as heat dumped unwanted into its radiators. With an enormous puff the machine unleashed sixty years of human metabolic entropy into superheated steam.
In the heart of the machine was Jane, a person of the early 21st century.
From her perspective there was no transition. One moment she had been in the year 2021, sat beneath a tree in a park. Reading a detective novel.
Then the book was gone, and the tree. Also the park. Even the year.
She found herself laid in a bathtub, immersed in sickly fatty fluids. She was naked and cold.
https://www.lesswrong.com/posts/pRkFkzwKZ2zfa3R6H/without-specific-countermeasures-the-easiest-path-to
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
I think that in the coming 15-30 years, the world could plausibly develop “transformative AI”: AI powerful enough to bring us into a new, qualitatively different future, via an explosion in science and technology R&D. This sort of AI could be sufficient to make this the most important century of all time for humanity.
The most straightforward vision for developing transformative AI that I can imagine working with very little innovation in techniques is what I’ll call human feedback[1] on diverse tasks (HFDT):
Train a powerful neural network model to simultaneously master a wide variety of challenging tasks (e.g. software development, novel-writing, game play, forecasting, etc) by using reinforcement learning on human feedback and other metrics of performance.
HFDT is not the only approach to developing transformative AI,[2] and it may not work at all.[3] But I take it very seriously, and I’m aware of increasingly many executives and ML researchers at AI companies who believe something within this space could work soon.
Unfortunately, I think that if AI companies race forward training increasingly powerful models using HFDT, this is likely to eventually lead to a full-blown AI takeover (i.e. a possibly violent uprising or coup by AI systems). I don’t think this is a certainty, but it looks like the best-guess default absent specific efforts to prevent it.
https://www.lesswrong.com/posts/iCfdcxiyr2Kj8m8mT/the-shard-theory-of-human-values
TL;DR: We propose a theory of human value formation. According to this theory, the reward system shapes human values in a relatively straightforward manner. Human values are not e.g. an incredibly complicated, genetically hard-coded set of drives, but rather sets of contextually activated heuristics which were shaped by and bootstrapped from crude, genetically hard-coded reward circuitry.
https://www.lesswrong.com/posts/AfH2oPHCApdKicM4m/two-year-update-on-my-personal-ai-timelines#fnref-fwwPpQFdWM6hJqwuY-12
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
I worked on my draft report on biological anchors for forecasting AI timelines mainly between ~May 2019 (three months after the release of GPT-2) and ~Jul 2020 (a month after the release of GPT-3), and posted it on LessWrong in Sep 2020 after an internal review process. At the time, my bottom line estimates from the bio anchors modeling exercise were:[1]
These were roughly close to my all-things-considered probabilities at the time, as other salient analytical frames on timelines didn’t do much to push back on this view. (Though my subjective probabilities bounced around quite a lot around these values and if you’d asked me on different days and with different framings I’d have given meaningfully different numbers.)
It’s been about two years since the bulk of the work on that report was completed, during which I’ve mainly been thinking about AI. In that time it feels like very short timelines have become a lot more common and salient on LessWrong and in at least some parts of the ML community.
My personal timelines have also gotten considerably shorter over this period. I now expect something roughly like this:
https://www.lesswrong.com/posts/9kNxhKWvixtKW5anS/you-are-not-measuring-what-you-think-you-are-measuring
Eight years ago, I worked as a data scientist at a startup, and we wanted to optimize our sign-up flow. We A/B tested lots of different changes, and occasionally found something which would boost (or reduce) click-through rates by 10% or so.
Then one week I was puzzling over a discrepancy in the variance of our daily signups. Eventually I scraped some data from the log files, and found that during traffic spikes, our server latency shot up to multiple seconds. The effect on signups during these spikes was massive: even just 300 ms was enough that click-through dropped by 30%, and when latency went up to seconds the click-through rates dropped by over 80%. And this happened multiple times per day. Latency was far and away the most important factor which determined our click-through rates. [1]
Going back through some of our earlier experiments, it was clear in hindsight that some of our biggest effect-sizes actually came from changing latency - for instance, if we changed the order of two screens, then there’d be an extra screen before the user hit the one with high latency, so the latency would be better hidden. Our original interpretations of those experiments - e.g. that the user cared more about the content of one screen than another - were totally wrong. It was also clear in hindsight that our statistics on all the earlier experiments were bunk - we’d assumed that every user’s click-through was statistically independent, when in fact they were highly correlated, so many of the results which we thought were significant were in fact basically noise.
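Here is a small simulation (my sketch, not the post's analysis) of that last statistical point: if users within a batch share a latency environment, their click-throughs are positively correlated, and a significance test that assumes independent users will call noise "significant" far more often than 5% of the time. The click-through rates and spike distribution below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def batch_clicks(n_users=2000, p_normal=0.30, p_spike=0.06):
    """One batch of users who share a latency environment: a random fraction
    of the batch arrives during a latency spike, and every user in the spike
    sees the degraded click-through rate."""
    spike_frac = rng.beta(2, 5)                  # shared shock for this batch
    in_spike = rng.random(n_users) < spike_frac
    p = np.where(in_spike, p_spike, p_normal)
    return (rng.random(n_users) < p).astype(float)

# Many "null" comparisons: the two batches differ only in their shared latency
# shock, so any detected difference is noise.
trials, false_positives = 500, 0
for _ in range(trials):
    a, b = batch_clicks(), batch_clicks()
    diff = a.mean() - b.mean()
    se = np.sqrt(a.var() / a.size + b.var() / b.size)  # assumes i.i.d. users
    if abs(diff) > 1.96 * se:
        false_positives += 1
print(f"'Significant' null results: {false_positives / trials:.0%} "
      "(would be ~5% if users really were independent)")
```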
https://www.lesswrong.com/posts/WNpvK67MjREgvB8u8/do-bamboos-set-themselves-on-fire
Cross-posted from Telescopic Turnip.
As we all know, the best place to have a kung-fu fight is a bamboo forest. There are just so many opportunities to grab pieces of bamboos and manufacture improvised weapons, use them to catapult yourself in the air and other basic techniques any debutant martial artist ought to know. A lesser-known fact is that bamboo-forest fights occur even when the cameras of Hong-Kong filmmakers are not present. They may even happen without the presence of humans at all. The forest itself is the kung-fu fight.
It's often argued that humans are the worst species on Earth, because of our limitless potential for violence and mutual destruction. If that's the case, bamboos are second. Bamboos are sick. The evolution of bamboos is the result of multiple layers of sheer brutality, with two imbricated levels of war, culminating in an apocalypse of combustive annihilation. At least, according to some hypotheses.
https://www.lesswrong.com/posts/oyKzz7bvcZMEPaDs6/survey-advice
Things I believe about making surveys, after making some surveys:
https://www.lesswrong.com/posts/J3wemDGtsy5gzD3xa/toni-kurz-and-the-insanity-of-climbing-mountains
Content warning: death
I've been on a YouTube binge lately. My current favorite genre is disaster stories about mountain climbing. The death statistics for some of these mountains, especially ones in the Himalayas, are truly insane.
To give an example, let me tell you about a mountain most people have never heard of: Nanga Parbat. It's an 8,126-meter "wall of ice and rock", sporting the tallest mountain face and the fastest change in elevation in the entire world: the Rupal Face.
I've posted a picture above, but pictures really don't do justice to just how gigantic this wall is. This single face is as tall as the largest mountain in the Alps. It is the size of ten Empire State Buildings stacked on top of one another. If you could somehow walk straight up starting from the bottom, it would take you an entire HOUR to reach the summit.
31 people died trying to climb this mountain before its first successful ascent. Imagine being climber number 32 and thinking "Well I know no one has ascended this mountain and thirty one people have died trying, but why not, let's give it a go!"
The stories of deaths on these mountains (and even much shorter peaks in the Alps or in North America) sound like they are out of a novel. Stories of one mountain in particular have stuck with me: the first attempts to climb the tallest mountain face in the Alps, the Eigerwand.
The Eigerwand is the north face of a 14,000-foot peak named "The Eiger". After three generations of Europeans had conquered every peak in the Alps, few great challenges remained in the area. The Eigerwand was one of these: widely considered to be the greatest unclimbed route in the Alps.
The peak had already been reached in the 1850s, during the golden age of Alpine exploration. But the north face of the mountain remained unclimbed.
Many things can make a climb challenging: steep slopes, avalanches, long ascents, no easy resting spots and more. The Eigerwand had all of those, but one hazard in particular stood out: loose rock and snow.
In the summer months (usually considered the best time for climbing), the mountain crumbles. Fist-sized boulders routinely tumble down the mountain. Huge avalanches sweep down its 70-degree slopes at incredible speed. And the huge, concave face is perpetually in shadow. It is extremely cold and windy, and the concave face seems to cause local weather patterns that can be completely different from the pass below. The face is deadly.
Before 1935, no team had made a serious attempt at the face. But that year, two young German climbers from Bavaria, both extremely experienced but relatively unknown outside the climbing community, decided they would make the first serious attempt.
https://www.lesswrong.com/posts/gs3vp3ukPbpaEie5L/deliberate-grieving-1
This post is hopefully useful on its own, but begins a series ultimately about grieving over a world that might (or, might not) be doomed.
It starts with some pieces from a previous coordination frontier sequence post, but goes into more detail.
At the beginning of the pandemic, I didn’t have much experience with grief. By the end of the pandemic, I had gotten quite a lot of practice grieving for things. I now think of grieving as a key life skill, with ramifications for epistemics, action, and coordination.
I had read The Art of Grieving Well, which gave me footholds to get started with. But I still had to develop some skills from scratch, and apply them in novel ways.
Grieving probably works differently for different people. Your mileage may vary. But for me, grieving is the act of wrapping my brain around the fact that something important to me doesn’t exist anymore. Or can’t exist right now. Or perhaps never existed. It typically comes in two steps – an “orientation” step, where my brain traces around the lines of the thing-that-isn’t-there, coming to understand what reality is actually shaped like now. And then a “catharsis” step, once I fully understand that the thing is gone. The first step can take hours, weeks or months.
You can grieve for people who are gone. You can grieve for things you used to enjoy. You can grieve for principles that were important to you but aren’t practical to apply right now.
Grieving is important in single-player mode – if I’m holding onto something that’s not there anymore, my thoughts and decision-making are distorted. I can’t make good plans if my map of reality is full of leftover wishful markings of things that aren’t there.
https://www.lesswrong.com/s/6xgy8XYEisLk3tCjH/p/CPP2uLcaywEokFKQG
Tl;dr:
I've noticed a dichotomy between "thinking in toolboxes" and "thinking in laws".
The toolbox style of thinking says it's important to have a big bag of tools that you can adapt to context and circumstance; people who think very toolboxly tend to suspect that anyone who goes talking of a single optimal way is just ignorant of the uses of the other tools.
The lawful style of thinking, done correctly, distinguishes between descriptive truths, normative ideals, and prescriptive ideals. It may talk about certain paths being optimal, even if there's no executable-in-practice algorithm that yields the optimal path. It considers truths that are not tools.
Within nearly-Euclidean mazes, the triangle inequality - that the path AC is never spatially longer than the path ABC - is always true but only sometimes useful. The triangle inequality has the prescriptive implication that if you know that one path choice will travel ABC and one path will travel AC, and if the only pragmatic path-merit you care about is going the minimum spatial distance (rather than say avoiding stairs because somebody in the party is in a wheelchair), then you should pick the route AC. But the triangle inequality goes on governing Euclidean mazes whether or not you know which path is which, and whether or not you need to avoid stairs.
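As a toy rendering of that maze example, under the simplifying (and purely illustrative) assumption that corridors are straight segments in the plane:

```python
# The descriptive law holds for any three points; the prescriptive rule only
# kicks in when minimum spatial distance is the sole path-merit you care about.
from math import dist  # Python 3.8+

A, B, C = (0.0, 0.0), (3.0, 4.0), (6.0, 0.0)

direct = dist(A, C)               # path AC
detour = dist(A, B) + dist(B, C)  # path ABC

assert direct <= detour           # triangle inequality: true whether or not it's useful

best = "AC" if direct <= detour else "ABC"
print(direct, detour, best)
```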
Toolbox thinkers may be extremely suspicious of this claim of universal lawfulness if it is explained less than perfectly, because it sounds to them like "Throw away all the other tools in your toolbox! All you need to know is Euclidean geometry, and you can always find the shortest path through any maze, which in turn is always the best path."
If you think that's an unrealistic depiction of a misunderstanding that would never happen in reality, keep reading.
https://www.lesswrong.com/posts/PBRWb2Em5SNeWYwwB/humans-are-not-automatically-strategic
Reply to: A "Failure to Evaluate Return-on-Time" Fallacy
Lionhearted writes:
[A] large majority of otherwise smart people spend time doing semi-productive things, when there are massively productive opportunities untapped. A somewhat silly example: Let's say someone aspires to be a comedian, the best comedian ever, and to make a living doing comedy. He wants nothing else, it is his purpose. And he decides that in order to become a better comedian, he will watch re-runs of the old television cartoon 'Garfield and Friends' that was on TV from 1988 to 1995. ... I’m curious as to why.
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
[Thanks to a variety of people for comments and assistance (especially Paul Christiano, Nostalgebraist, and Rafe Kennedy), and to various people for playing the game. Buck wrote the top-1 prediction web app; Fabien wrote the code for the perplexity experiment and did most of the analysis and wrote up the math here, Lawrence did the research on previous measurements. Epistemic status: we're pretty confident of our work here, but haven't engaged in a super thorough review process of all of it--this was more like a side-project than a core research project.]
How good are modern language models compared to humans, at the task language models are trained on (next token prediction on internet text)? While there are language-based tasks that you can construct where humans can make a next-token prediction better than any language model, we aren't aware of any apples-to-apples comparisons on non-handcrafted datasets. To answer this question, we performed a few experiments comparing humans to language models on next-token prediction on OpenWebText.
Contrary to some previous claims, we found that humans seem to be consistently worse at next-token prediction (in terms of both top-1 accuracy and perplexity) than even small models like Fairseq-125M, a 12-layer transformer roughly the size and quality of GPT-1. That is, even small language models are "superhuman" at predicting the next token. That being said, it seems plausible that humans can consistently beat the smaller 2017-era models (though not modern models) with a few hours more practice and strategizing. We conclude by discussing some of our takeaways from this result.
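For readers unfamiliar with the two metrics, here is a minimal sketch of how top-1 accuracy and perplexity are computed from per-token predicted probabilities; the toy tokens and numbers are made up, and this is not the authors' evaluation code.

```python
import math

def top1_accuracy(predictions, true_tokens):
    """Fraction of positions where the most probable token equals the true token."""
    hits = sum(max(p, key=p.get) == t for p, t in zip(predictions, true_tokens))
    return hits / len(true_tokens)

def perplexity(predictions, true_tokens):
    """exp of the average negative log-probability assigned to the true token."""
    nll = [-math.log(p.get(t, 1e-12)) for p, t in zip(predictions, true_tokens)]
    return math.exp(sum(nll) / len(nll))

# predictions: list of dicts mapping candidate token -> probability
preds = [{"cat": 0.6, "dog": 0.4}, {"sat": 0.2, "ran": 0.8}]
truth = ["cat", "sat"]
print(top1_accuracy(preds, truth))  # 0.5
print(perplexity(preds, truth))     # exp((-ln 0.6 - ln 0.2)/2) ≈ 2.89
```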
We're not claiming that this result is completely novel or surprising. For example, FactorialCode makes a similar claim as an answer on this LessWrong post about this question. We've also heard from some NLP people that the superiority of LMs to humans for next-token prediction is widely acknowledged in NLP. However, we've seen incorrect claims to the contrary on the internet, and as far as we know there hasn't been a proper apples-to-apples comparison, so we believe there's some value to our results.
If you want to play with our website, it’s here; in our opinion playing this game for half an hour gives you some useful perspective on what it’s like to be a language model.
https://www.lesswrong.com/posts/jDQm7YJxLnMnSNHFu/moral-strategies-at-different-capability-levels
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
Let’s consider three ways you can be altruistic towards another agent:
I think a lot of unresolved tensions in ethics comes from seeing these types of morality as in opposition to each other, when they’re actually complementary:
https://www.lesswrong.com/posts/xFotXGEotcKouifky/worlds-where-iterative-design-fails
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
In most technical fields, we try designs, see what goes wrong, and iterate until it works. That’s the core iterative design loop. Humans are good at iterative design, and it works well in most fields in practice.
In worlds where AI alignment can be handled by iterative design, we probably survive. So long as we can see the problems and iterate on them, we can probably fix them, or at least avoid making them worse.
By the same reasoning: worlds where AI kills us are generally worlds where, for one reason or another, the iterative design loop fails. So, if we want to reduce X-risk, we generally need to focus on worlds where the iterative design loop fails for some reason; in worlds where it doesn’t fail, we probably don’t die anyway.
Why might the iterative design loop fail? Most readers probably know of two widely-discussed reasons:
… but these certainly aren’t the only reasons the iterative design loop potentially fails. This post will mostly talk about some particularly simple and robust failure modes, but I’d encourage you to think on your own about others. These are the things which kill us; they’re worth thinking about.
https://www.lesswrong.com/posts/QBAjndPuFbhEXKcCr/my-understanding-of-what-everyone-in-technical-alignment-is
Despite a clear need for it, a good source explaining who is doing what and why in technical AI alignment doesn't exist. This is our attempt to produce such a resource. We expect to be inaccurate in some ways, but it seems great to get out there and let Cunningham’s Law do its thing.[1]
The main body contains our understanding of what everyone is doing in technical alignment and why, as well as at least one of our opinions on each approach. We include supplements visualizing differences between approaches and Thomas’s big picture view on alignment. The opinions written are Thomas and Eli’s independent impressions, many of which have low resilience. Our all-things-considered views are significantly more uncertain.
This post was mostly written while Thomas was participating in the 2022 iteration of the SERI MATS program, under mentor John Wentworth. Thomas benefited immensely from conversations with other SERI MATS participants, John Wentworth, and many others he met over the summer.
https://www.lesswrong.com/posts/rYDas2DDGGDRc8gGB/unifying-bargaining-notions-1-2
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
This is a two-part sequence of posts, in the ancient LessWrong tradition of decision-theory-posting. This first part will introduce various concepts of bargaining solutions and dividing gains from trade, which the reader may or may not already be familiar with.
The upcoming part will be about how all introduced concepts from this post are secretly just different facets of the same underlying notion, as originally discovered by John Harsanyi back in 1963 and rediscovered by me from a completely different direction. The fact that the various different solution concepts in cooperative game theory are all merely special cases of a General Bargaining Solution for arbitrary games is, as far as I can tell, not common knowledge on Less Wrong.
Bargaining Games
Let's say there's a couple with a set of available restaurant options. Neither of them wants to go without the other, and if they fail to come to an agreement, the fallback is eating a cold canned soup dinner at home, the worst of all the options. However, they have different restaurant preferences. What's the fair way to split the gains from trade?
Well, it depends on their restaurant preferences, and preferences are typically encoded with utility functions. Since both sides agree that the disagreement outcome is the worst, they might as well index that as 0 utility, and their favorite respective restaurants as 1 utility, and denominate all the other options in terms of what probability mix between a cold canned dinner and their favorite restaurant would make them indifferent. If there's something that scores 0.9 utility for both, it's probably a pretty good pick!
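A small sketch of that normalization, with invented utility numbers; picking the option that maximizes the product of gains over the disagreement point is the standard Nash bargaining criterion, offered here only as one illustrative way to choose, not necessarily the post's final answer.

```python
# Each option is scored (her utility, his utility), normalized so the
# disagreement outcome is 0 and each person's favorite restaurant is 1.
options = {
    "cold canned soup": (0.0, 0.0),   # the disagreement point
    "sushi":            (1.0, 0.4),   # her favorite, he's lukewarm
    "pizza":            (0.5, 1.0),   # his favorite
    "thai":             (0.9, 0.9),   # good for both
}

# One standard criterion from cooperative game theory (the Nash bargaining
# solution): maximize the product of gains over the disagreement point.
best = max(options, key=lambda o: options[o][0] * options[o][1])
print(best)  # "thai": 0.81 beats sushi (0.4), pizza (0.5), and soup (0.0)
```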
https://www.lesswrong.com/posts/vJFdjigzmcXMhNTsx/simulators#fncrt8wagfir9
Summary
TL;DR: Self-supervised learning may create AGI or its foundation. What would that look like?
Unlike the limit of RL, the limit of self-supervised learning has received surprisingly little conceptual attention, and recent progress has made deconfusion in this domain more pressing.
Existing AI taxonomies either fail to capture important properties of self-supervised models or lead to confusing propositions. For instance, GPT policies do not seem globally agentic, yet can be conditioned to behave in goal-directed ways. This post describes a frame that enables more natural reasoning about properties like agency: GPT, insofar as it is inner-aligned, is a simulator which can simulate agentic and non-agentic simulacra.
The purpose of this post is to capture these objects in words so GPT can reference them and provide a better foundation for understanding them.
I use the generic term “simulator” to refer to models trained with predictive loss on a self-supervised dataset, invariant to architecture or data type (natural language, code, pixels, game states, etc). The outer objective of self-supervised learning is Bayes-optimal conditional inference over the prior of the training distribution, which I call the simulation objective, because a conditional model can be used to simulate rollouts which probabilistically obey its learned distribution by iteratively sampling from its posterior (predictions) and updating the condition (prompt). Analogously, a predictive model of physics can be used to compute rollouts of phenomena in simulation. A goal-directed agent which evolves according to physics can be simulated by the physics rule parameterized by an initial state, but the same rule could also propagate agents with different values, or non-agentic phenomena like rocks. This ontological distinction between simulator (rule) and simulacra (phenomena) applies directly to generative models like GPT.
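A minimal sketch of the rollout idea, where `predict_next` is a hypothetical stand-in for any conditional model (a language model, a learned physics rule, etc.) rather than a real API:

```python
import random

def rollout(predict_next, prompt, n_steps, seed=0):
    """Iteratively sample from the model's predictive distribution and append
    the sample to the condition, producing one simulated trajectory."""
    rng = random.Random(seed)
    state = list(prompt)
    for _ in range(n_steps):
        dist = predict_next(state)                 # dict: token -> probability
        tokens, probs = zip(*dist.items())
        state.append(rng.choices(tokens, weights=probs, k=1)[0])  # sample posterior
    return state

# Toy "simulator": after "tick", predict "tock" with high probability, and vice versa.
toy = lambda s: {"tock": 0.9, "tick": 0.1} if s[-1] == "tick" else {"tick": 0.9, "tock": 0.1}
print(rollout(toy, ["tick"], 5))
```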
TL;DR: To even consciously consider an alignment research direction, you should have evidence to locate it as a promising lead. As best I can tell, many directions seem interesting but do not have strong evidence of being “entangled” with the alignment problem such that I expect them to yield significant insights.
For example, “we can solve an easier version of the alignment problem by first figuring out how to build an AI which maximizes the number of real-world diamonds” has intuitive appeal and plausibility, but this claim doesn’t have to be true and this problem does not necessarily have a natural, compact solution. In contrast, there do in fact exist humans who care about diamonds. Therefore, there are guaranteed-to-exist alignment insights concerning the way people come to care about e.g. real-world diamonds.
“Consider how humans navigate the alignment subproblem you’re worried about” is a habit which I (TurnTrout) picked up from Quintin Pope. I wrote the post, he originated the tactic.
https://www.lesswrong.com/posts/DdDt5NXkfuxAnAvGJ/changing-the-world-through-slack-and-hobbies
In EA orthodoxy, if you're really serious about EA, the three alternatives that people most often seem to talk about are
(1) “direct work” in a job that furthers a very important cause;
(2) “earning to give”;
(3) earning “career capital” that will help you do those things in the future, e.g. by getting a PhD or teaching yourself ML.
By contrast, there’s not much talk of:
(4) being in a job / situation where you have extra time and energy and freedom to explore things that seem interesting and important.
But that last one is really important!
This is Part 1 of my «Boundaries» Sequence on LessWrong.
Summary: «Boundaries» are a missing concept from the axioms of game theory and bargaining theory, which might help pin-down certain features of multi-agent rationality (this post), and have broader implications for effective altruism discourse and x-risk (future posts).
Epistemic status: me describing what I mean.
With the exception of some relatively recent and isolated pockets of research on embedded agency (e.g., Orseau & Ring, 2012; Garrabrant & Demski, 2018), most attempts at formal descriptions of living rational agents — especially utility-theoretic descriptions — are missing the idea that living systems require and maintain boundaries.
When I say boundary, I don't just mean an arbitrary constraint or social norm. I mean something that could also be called a membrane in a generalized sense, i.e., a layer of stuff-of-some-kind that physically or cognitively separates a living system from its environment, that 'carves reality at the joints' in a way that isn't an entirely subjective judgement of the living system itself. Here are some examples that I hope will convey my meaning:
https://www.lesswrong.com/posts/MdZyLnLHuaHrCskjy/itt-passing-and-civility-are-good-charity-is-bad
I often object to claims like "charity/steelmanning is an argumentative virtue". This post collects a few things I and others have said on this topic over the last few years.
My current view is:
Related to: Slack gives you the ability to notice/reflect on subtle things
Epistemic status: A possibly annoying mixture of straightforward reasoning and hard-to-justify personal opinions.
It is often stated (with some justification, IMO) that AI risk is an “emergency.” Various people have explained to me that they put various parts of their normal life’s functioning on hold on account of AI being an “emergency.” In the interest of people doing this sanely and not confusedly, I’d like to take a step back and seek principles around what kinds of changes a person might want to make in an “emergency” of different sorts.
There are plenty of ways we can temporarily increase productivity on some narrow task or other, at the cost of our longer-term resources. For example:
(As usual, this post was written by Nate Soares with some help and editing from Rob Bensinger.)
In my last post, I described a “hard bit” of the challenge of aligning AGI—the sharp left turn that comes when your system slides into the “AGI” capabilities well, the fact that alignment doesn’t generalize similarly well at this turn, and the fact that this turn seems likely to break a bunch of your existing alignment properties.
Here, I want to briefly discuss a variety of current research proposals in the field, to explain why I think this problem is currently neglected.
I also want to mention research proposals that do strike me as having some promise, or that strike me as adjacent to promising approaches.
Before getting into that, let me be very explicit about three points:
https://www.lesswrong.com/posts/28zsuPaJpKAGSX4zq/humans-are-very-reliable-agents
Over the last few years, deep-learning-based AI has progressed extremely rapidly in fields like natural language processing and image generation. However, self-driving cars seem stuck in perpetual beta mode, and aggressive predictions there have repeatedly been disappointing. Google's self-driving project started four years before AlexNet kicked off the deep learning revolution, and it still isn't deployed at large scale, thirteen years later. Why are these fields getting such different results?
Right now, I think the biggest answer is that ML benchmarks judge models by average-case performance, while self-driving cars (and many other applications) require matching human worst-case performance. For MNIST, an easy handwriting recognition task, performance tops out at around 99.9% even for top models; it's not very practical to design for or measure higher reliability than that, because the test set is just 10,000 images and a handful are ambiguous. Redwood Research, which is exploring worst-case performance in the context of AI alignment, got reliability rates around 99.997% for their text generation models.
By comparison, human drivers are ridiculously reliable. The US has around one traffic fatality per 100 million miles driven; if a human driver makes 100 decisions per mile, that gets you a worst-case reliability of ~1:10,000,000,000 or ~99.99999999%. That's around five orders of magnitude better than a very good deep learning model, and you get that even in an open environment, where data isn't pre-filtered and there are sometimes random mechanical failures. Matching that bar is hard! I'm sure future AI will get there, but each additional "nine" of reliability is typically another unit of engineering effort. (Note that current self-driving systems use a mix of different models embedded in a larger framework, not one model trained end-to-end like GPT-3.)
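The back-of-the-envelope arithmetic behind those figures, using the same assumptions stated above:

```python
import math

# Assumptions from the text: 1 fatality per 100 million miles, 100 decisions per mile.
fatality_per_mile = 1 / 100_000_000
decisions_per_mile = 100

human_failure_rate = fatality_per_mile / decisions_per_mile   # 1e-10 per decision
human_reliability = 1 - human_failure_rate                    # ~99.99999999%

model_reliability = 0.99997                                   # Redwood's text models
model_failure_rate = 1 - model_reliability                    # 3e-5 per output

gap = math.log10(model_failure_rate / human_failure_rate)
print(f"human failure rate per decision: {human_failure_rate:.0e}")
print(f"gap in orders of magnitude: {gap:.1f}")               # ~5.5
```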
https://www.lesswrong.com/posts/2GxhAyn9aHqukap2S/looking-back-on-my-alignment-phd
The funny thing about long periods of time is that they do, eventually, come to an end. I'm proud of what I accomplished during my PhD. That said, I'm going to first focus on mistakes I've made over the past four[1] years.
I think I got significantly smarter in 2018–2019, and kept learning some in 2020–2021. I was significantly less of a fool in 2021 than I was in 2017. That is important and worth feeling good about. But all things considered, I still made a lot of profound mistakes over the course of my PhD.
I figured this point out by summer 2021.
I wanted to be more like Eliezer Yudkowsky and Buck Shlegeris and Paul Christiano. They know lots of facts and laws about lots of areas (e.g. general relativity and thermodynamics and information theory). I focused on building up dependencies (like analysis and geometry and topology) not only because I wanted to know the answers, but because I felt I owed a debt, that I was in the red until I could at least meet other thinkers at their level of knowledge.
But rationality is not about the bag of facts you know, nor is it about the concepts you have internalized. Rationality is about how your mind holds itself, it is how you weigh evidence, it is how you decide where to look next when puzzling out a new area.
If I had been more honest with myself, I could have nipped the "catching up with other thinkers" mistake in 2018. I could have removed the bad mental habits using certain introspective techniques; or at least been aware of the badness.
But I did not, in part because the truth was uncomfortable. If I did not have a clear set of prerequisites (e.g. analysis and topology and game theory) to work on, I would not have a clear and immediate direction of improvement. I would have felt adrift.
https://www.lesswrong.com/posts/7iAABhWpcGeP5e6SB/it-s-probably-not-lithium
A Chemical Hunger, a series by the authors of the blog Slime Mold Time Mold (SMTM) that has been received positively on LessWrong, argues that the obesity epidemic is entirely caused by environmental contaminants. The authors’ top suspect is lithium[1], primarily because it is known to cause weight gain at the doses used to treat bipolar disorder.
After doing some research, however, I found that it is not plausible that lithium plays a major role in the obesity epidemic, and that a lot of the claims the SMTM authors make about the topic are misleading, flat-out wrong, or based on extremely cherry-picked evidence. I have the impression that reading what they have to say about this often leaves the reader with a worse model of reality than they started with, and I’ll explain why I have that impression in this post.
https://www.lesswrong.com/posts/bhLxWTkRc8GXunFcB/what-are-you-tracking-in-your-head
A large chunk - plausibly the majority - of real-world expertise seems to be in the form of illegible skills: skills/knowledge which are hard to transmit by direct explanation. They’re not necessarily things which a teacher would even notice enough to consider important - just background skills or knowledge which is so ingrained that it becomes invisible.
I’ve recently noticed a certain common type of illegible skill which I think might account for the majority of illegible-skill-value across a wide variety of domains.
Here are a few examples of the type of skill I have in mind:
I have been doing red team, blue team (offensive, defensive) computer security for a living since September 2000. The goal of this post is to compile a list of general principles I've learned during this time that are likely relevant to the field of AGI Alignment. If this is useful, I could continue with a broader or deeper exploration.
I used to use the phrase when teaching security mindset to software developers that "security doesn't happen by accident." A system that isn't explicitly designed with a security feature is not going to have that security feature. More specifically, a system that isn't designed to be robust against a certain failure mode is going to exhibit that failure mode.
This might seem rather obvious when stated explicitly, but this is not the way that most developers, indeed most humans, think. I see a lot of disturbing parallels when I see anyone arguing that AGI won't necessarily be dangerous. An AGI that isn't intentionally designed not to exhibit a particular failure mode is going to have that failure mode. It is certainly possible to get lucky and not trigger it, and it will probably be impossible to enumerate even every category of failure mode, but to have any chance at all we will have to plan in advance for as many failure modes as we can possibly conceive.
As a practical enforcement method, I used to ask development teams that every user story have at least three abuser stories to go with it. For any new capability, think at least hard enough about it that you can imagine at least three ways that someone could misuse it. Sometimes this means looking at boundary conditions ("what if someone orders 2^64+1 items?"), sometimes it means looking at forms of invalid input ("what if someone tries to pay -$100, can they get a refund?"), and sometimes it means being aware of particular forms of attack ("what if someone puts Javascript in their order details?").
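A toy illustration of those three abuser-story categories; the function, field names, and limits below are invented for the example rather than taken from any real system.

```python
import html

MAX_QUANTITY = 10_000

def validate_order(quantity: int, payment_cents: int, details: str) -> dict:
    # Boundary conditions: absurd quantities (e.g. 2**64 + 1) are rejected.
    if not (1 <= quantity <= MAX_QUANTITY):
        raise ValueError("quantity out of range")
    # Invalid input: a negative payment must not quietly become a refund.
    if payment_cents <= 0:
        raise ValueError("payment must be positive")
    # Known attack forms: escape user text so embedded JavaScript renders inert.
    return {"quantity": quantity,
            "payment_cents": payment_cents,
            "details": html.escape(details)}

print(validate_order(2, 1999, "<script>alert('hi')</script>"))
```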
by paulfchristiano, 20th Jun 2022.
(Partially in response to AGI Ruin: A list of Lethalities. Written in the same rambling style. Not exhaustive.)
Editor's note: The following is a lightly edited copy of a document written by Eliezer Yudkowsky in November 2017. Since this is a snapshot of Eliezer’s thinking at a specific time, we’ve sprinkled reminders throughout that this is from 2017.
A background note:
It’s often the case that people are slow to abandon obsolete playbooks in response to a novel challenge. And AGI is certainly a very novel challenge.
Italian general Luigi Cadorna offers a memorable historical example. In the Isonzo Offensive of World War I, Cadorna lost hundreds of thousands of men in futile frontal assaults against enemy trenches defended by barbed wire and machine guns. As morale plummeted and desertions became epidemic, Cadorna began executing his own soldiers en masse, in an attempt to cure the rest of their “cowardice.” The offensive continued for 2.5 years.
Cadorna made many mistakes, but foremost among them was his refusal to recognize that this war was fundamentally unlike those that had come before. Modern weaponry had forced a paradigm shift, and Cadorna’s instincts were not merely miscalibrated—they were systematically broken. No number of small, incremental updates within his obsolete framework would be sufficient to meet the new challenge.
Other examples of this type of mistake include the initial response of the record industry to iTunes and streaming; or, more seriously, the response of most Western governments to COVID-19.
https://www.lesswrong.com/posts/pL4WhsoPJwauRYkeK/moses-and-the-class-struggle
"𝕿𝖆𝖐𝖊 𝖔𝖋𝖋 𝖞𝖔𝖚𝖗 𝖘𝖆𝖓𝖉𝖆𝖑𝖘. 𝕱𝖔𝖗 𝖞𝖔𝖚 𝖘𝖙𝖆𝖓𝖉 𝖔𝖓 𝖍𝖔𝖑𝖞 𝖌𝖗𝖔𝖚𝖓𝖉," said the bush.
"No," said Moses.
"Why not?" said the bush.
"I am a Jew. If there's one thing I know about this universe it's that there's no such thing as God," said Moses.
"You don't need to be certain I exist. It's a trivial case of Pascal's Wager," said the bush.
"Who is Pascal?" said Moses.
"It makes sense if you are beyond time, as I am," said the bush.
"Mysterious answers are not answers," said Moses.
https://www.lesswrong.com/posts/T6kzsMDJyKwxLGe3r/benign-boundary-violations
Recently, my friend Eric asked me what sorts of things I wanted to have happen at my bachelor party.
I said (among other things) that I'd really enjoy some benign boundary violations.
Eric went ????
Subsequently: an essay.
We use the word "boundary" to mean at least two things, when we're discussing people's personal boundaries.
The first is their actual self-defined boundary—the line that they would draw, if they had perfect introspective access, which marks the transition point from "this is okay" to "this is no longer okay."
Different people have different boundaries:
https://www.lesswrong.com/posts/uMQ3cqWDPHhjtiesc/agi-ruin-a-list-of-lethalities
(If you're already familiar with all basics and don't want any preamble, skip ahead to Section B for technical difficulties of alignment proper.)
I have several times failed to write up a well-organized list of reasons why AGI will kill you. People come in with different ideas about why AGI would be survivable, and want to hear different obviously key points addressed first. Some fraction of those people are loudly upset with me if the obviously most important points aren't addressed immediately, and I address different points first instead.
Having failed to solve this problem in any good way, I now give up and solve it poorly with a poorly organized list of individual rants. I'm not particularly happy with this list; the alternative was publishing nothing, and publishing this seems marginally more dignified.
Three points about the general subject matter of discussion here, numbered so as not to conflict with the list of lethalities: