May 2023: Welcome to the alpha release of TYPE III AUDIO.
Expect very rough edges and very broken stuff—and daily improvements.
Please share your thoughts, but don't share this link on social media, for now.
The primary talk of the AI world recently is about AI agents (whether or not it includes the question of whether we can’t help but notice we are all going to die).
The trigger for this was AutoGPT, now number one on GitHub, which allows you to turn GPT-4 (or GPT-3.5 for us clowns without proper access) into a prototype version of a self-directed agent.
We also have a paper out this week where a simple virtual world was created, populated by LLMs that were wrapped in code designed to make them simple agents, and then several days of activity were simulated, during which the AI inhabitants interacted, formed and executed plans, and it all seemed like the beginnings of a living and dynamic world. Game version hopefully coming soon.
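To make the "LLM wrapped in code to act as an agent" idea concrete, here is a minimal sketch of the kind of plan/act/observe loop that AutoGPT-style tools run. It is purely illustrative: the call_llm and run_tool functions are hypothetical placeholders, not AutoGPT's or the paper's actual implementation.

```python
import json

def call_llm(prompt: str) -> str:
    """Hypothetical placeholder for a chat-model API call (e.g. GPT-4)."""
    raise NotImplementedError

def run_tool(name: str, argument: str) -> str:
    """Hypothetical placeholder for tool execution (web search, file I/O, ...)."""
    raise NotImplementedError

def agent_loop(goal: str, max_steps: int = 10) -> None:
    """Bare-bones plan/act/observe loop of the kind AutoGPT-style agents run."""
    memory: list[str] = []
    for _ in range(max_steps):
        prompt = (
            f"Goal: {goal}\n"
            f"Previous steps: {json.dumps(memory)}\n"
            'Reply with JSON: {"thought": ..., "tool": ..., "argument": ...} '
            'or {"thought": ..., "done": true}.'
        )
        decision = json.loads(call_llm(prompt))
        if decision.get("done"):
            break
        observation = run_tool(decision["tool"], decision["argument"])
        memory.append(f"{decision['tool']}({decision['argument']}) -> {observation}")
```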
How should we think about this? How worried should we be?
Original article:
https://www.lesswrong.com/posts/566kBoPi76t8KAkoD/on-autogpt
Narrated for LessWrong by TYPE III AUDIO.
This is a linkpost for https://evals.alignment.org/blog/2023-03-18-update-on-recent-evals/
In this article, I will present a mechanistic explanation of the Waluigi Effect and other bizarre "semiotic" phenomena which arise within large language models such as GPT-3/3.5/4 and their variants (ChatGPT, Sydney, etc). This article will be folklorish to some readers, and profoundly novel to others.
There is a lot of disagreement and confusion about the feasibility and risks associated with automating alignment research. Some see it as the default path toward building aligned AI, while others expect limited benefit from near-term systems, anticipating that the ability to significantly speed up progress will appear only well after misalignment and deception emerge. Furthermore, progress in this area may directly shorten timelines or enable the creation of dual-purpose systems which significantly speed up capabilities research.
OpenAI recently released their alignment plan. It focuses heavily on outsourcing cognitive work to language models, transitioning us to a regime where humans mostly provide oversight to automated research assistants. While there have been a lot of objections to and concerns about this plan, there hasn’t been a strong alternative approach aiming to automate alignment research which also takes all of the many risks seriously.
The intention of this post is not to propose a cure-all for the tricky problem of accelerating alignment using GPT models. Instead, the purpose is to explicitly put another point on the map of possible strategies, and to add nuance to the overall discussion.
Original article:
https://www.lesswrong.com/posts/bxt7uCiHam4QXrQAA/cyborgism
Narrated for LessWrong by TYPE III AUDIO.
This report examines what I see as the core argument for concern about existential risk from misaligned artificial intelligence. I proceed in two stages. First, I lay out a backdrop picture that informs such concern. On this picture, intelligent agency is an extremely powerful force, and creating agents much more intelligent than us is playing with fire -- especially given that if their objectives are problematic, such agents would plausibly have instrumental incentives to seek power over humans. Second, I formulate and evaluate a more specific six-premise argument that creating agents of this kind will lead to existential catastrophe by 2070. On this argument, by 2070: (1) it will become possible and financially feasible to build relevantly powerful and agentic AI systems; (2) there will be strong incentives to do so; (3) it will be much harder to build aligned (and relevantly powerful/agentic) AI systems than to build misaligned (and relevantly powerful/agentic) AI systems that are still superficially attractive to deploy; (4) some such misaligned systems will seek power over humans in high-impact ways; (5) this problem will scale to the full disempowerment of humanity; and (6) such disempowerment will constitute an existential catastrophe. I assign rough subjective credences to the premises in this argument, and I end up with an overall estimate of ~5% that an existential catastrophe of this kind will occur by 2070. (May 2022 update: since making this report public in April 2021, my estimate here has gone up, and is now at >10%.)
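To see how the headline number arises from the six premises, here is a toy calculation in which conditional credences multiply out to roughly 5%. The individual figures are illustrative placeholders, not the report's actual credences.

```python
# Illustrative credences only -- placeholders, not the report's actual numbers.
premises = {
    "powerful agentic AI feasible and financially viable by 2070": 0.65,
    "strong incentives to build it": 0.80,
    "aligned systems much harder than attractive-but-misaligned ones": 0.40,
    "some misaligned systems seek power in high-impact ways": 0.65,
    "problem scales to full disempowerment of humanity": 0.40,
    "disempowerment constitutes an existential catastrophe": 0.95,
}

p_catastrophe = 1.0
for claim, credence in premises.items():
    p_catastrophe *= credence

print(f"{p_catastrophe:.3f}")  # ~0.05 with these illustrative numbers
```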
Original article:
https://arxiv.org/abs/2206.13353
Narrated for Joseph Carlsmith by TYPE III AUDIO.
(Epistemic status: attempting to clear up a misunderstanding about points I have attempted to make in the past. This post is not intended as an argument for those points.)
I have long said that the lion's share of the AI alignment problem seems to me to be about pointing powerful cognition at anything at all, rather than figuring out what to point it at.
It’s recently come to my attention that some people have misunderstood this point, so I’ll attempt to clarify here.
Original article:
https://www.lesswrong.com/posts/NJYmovr9ZZAyyTBwM/what-i-mean-by-alignment-is-in-large-part-about-making
Narrated for LessWrong by TYPE III AUDIO.
This is a linkpost for https://epochai.org/blog/literature-review-of-transformative-artificial-intelligence-timelines
We summarize and compare several models and forecasts predicting when transformative AI will be developed.
Highlights
Original article:
Narrated for the Effective Altruism Forum by TYPE III AUDIO.
TL;DR
Anomalous tokens: a mysterious failure mode for GPT (which reliably insulted Matthew)
Prompt generation: a new interpretability method for language models (which reliably finds prompts that result in a target completion). This is good for:
In this post, we'll introduce the prototype of a new model-agnostic interpretability method for language models which reliably generates adversarial prompts that result in a target completion. We'll also demonstrate a previously undocumented failure mode for GPT-2 and GPT-3 language models, which results in bizarre completions (in some cases explicitly contrary to the purpose of the model), and present the results of our investigation into this phenomenon. Further detail can be found in a follow-up post.
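As a rough illustration of what "reliably generates adversarial prompts that result in a target completion" can look like mechanically, the sketch below optimizes a "soft prompt" in embedding space against a frozen GPT-2 and then maps it back to nearby real tokens. This is a generic approach in the same spirit, not necessarily the authors' exact algorithm.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Sketch: find a prompt that makes a frozen GPT-2 likely to emit a chosen
# target completion, by gradient descent on prompt embeddings.
tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
for p in model.parameters():
    p.requires_grad_(False)

target = tok(" the meaning of life", return_tensors="pt").input_ids  # target completion
emb_matrix = model.transformer.wte.weight                            # (vocab, d)

n_prompt = 5
init = emb_matrix[torch.randint(0, emb_matrix.size(0), (n_prompt,))].detach().clone()
soft_prompt = torch.nn.Parameter(init)
opt = torch.optim.Adam([soft_prompt], lr=0.1)

for step in range(100):
    target_embs = emb_matrix[target[0]]                               # (T, d)
    inputs = torch.cat([soft_prompt, target_embs]).unsqueeze(0)       # (1, n+T, d)
    logits = model(inputs_embeds=inputs).logits
    # Each target token is predicted from the position just before it.
    pred = logits[0, n_prompt - 1 : n_prompt - 1 + target.size(1)]
    loss = torch.nn.functional.cross_entropy(pred, target[0])
    opt.zero_grad(); loss.backward(); opt.step()

# Map the optimized embeddings back to the nearest real tokens.
nearest = torch.cdist(soft_prompt.detach(), emb_matrix).argmin(dim=-1)
print(tok.decode(nearest.tolist()))
```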
Original article:
https://www.lesswrong.com/posts/aPeJE8bSo6rAFoLqg/solidgoldmagikarp-plus-prompt-generation
Narrated for LessWrong by TYPE III AUDIO.
In this post, we point out that short AI timelines would cause real interest rates to be high, and would do so under expectations of either unaligned or aligned AI. However, 30- to 50-year real interest rates are low. We argue that this suggests one of two possibilities:
In the rest of this post we flesh out this argument.
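One standard way to formalize the mechanism, not necessarily the authors' exact model, is the textbook Ramsey rule r ≈ ρ + σ·g: expected explosive growth (aligned AI) raises real rates through g, while expected catastrophe (unaligned AI) acts like extra impatience and raises the effective ρ. The parameter values below are made up purely for illustration.

```python
# Back-of-the-envelope sketch using the textbook Ramsey rule r = rho + sigma * g.
# All parameter values are made up for illustration.

def real_rate(rho: float, sigma: float, growth: float) -> float:
    """Time preference plus (risk aversion) times (expected consumption growth)."""
    return rho + sigma * growth

baseline = real_rate(rho=0.01, sigma=1.0, growth=0.02)      # ~3%: roughly today's world
aligned_agi = real_rate(rho=0.01, sigma=1.0, growth=0.30)   # explosive growth -> very high rates
# Unaligned AGI: a yearly probability of catastrophe acts like extra impatience,
# so it enters as a higher effective rho.
unaligned_agi = real_rate(rho=0.01 + 0.10, sigma=1.0, growth=0.02)

print(f"{baseline:.2%}, {aligned_agi:.2%}, {unaligned_agi:.2%}")
```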
Original article:
https://forum.effectivealtruism.org/posts/8c7LycgtkypkgYjZx/agi-and-the-emh-markets-are-not-expecting-aligned-or
Narrated for the Effective Altruism Forum by TYPE III AUDIO.
In terms of content, this has a lot of overlap with Reward is not the optimization target. I'm basically rewriting a part of that post in language I personally find clearer, emphasising what I think is the core insight.
When thinking about deception and RLHF training, a simplified threat model is something like this:
Before continuing, I would encourage you to really engage with the above. Does it make sense to you? Is it making any hidden assumptions? Is it missing any steps? Can you rewrite it to be more mechanistically correct?
I believe that when people use the above threat model, they are either using it as shorthand for something else or they misunderstand how reinforcement learning works. Most alignment researchers will be in the former category. However, I was in the latter.
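A minimal sketch of why "the model gets reward" can mislead: in a vanilla policy-gradient (REINFORCE) setup, the reward never enters the network at all; it only scales the gradient that the training process applies to the weights. The toy bandit below is illustrative and is not how RLHF is implemented in practice.

```python
import torch

# Toy REINFORCE on a two-armed bandit. Note where the reward appears:
# only as a multiplier on the log-probability loss used to update the weights.
# The policy never observes or stores the reward itself.
torch.manual_seed(0)
logits = torch.zeros(2, requires_grad=True)        # the policy's parameters
opt = torch.optim.SGD([logits], lr=0.1)
true_rewards = [0.2, 1.0]                          # arm 1 is better

for step in range(500):
    probs = torch.softmax(logits, dim=0)
    action = torch.multinomial(probs, 1).item()
    reward = true_rewards[action]                  # used by the training loop, not the model
    loss = -reward * torch.log(probs[action])      # reward just scales the gradient
    opt.zero_grad(); loss.backward(); opt.step()

print(torch.softmax(logits, dim=0))                # policy now strongly prefers arm 1
```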
Original article:
https://www.lesswrong.com/posts/TWorNr22hhYegE4RT/models-don-t-get-reward
Narrated for LessWrong by TYPE III AUDIO.
A few collaborators and I recently released a new paper: Discovering Latent Knowledge in Language Models Without Supervision. For a quick summary of our paper, you can check out this Twitter thread.
In this post I will describe how I think the results and methods in our paper fit into a broader scalable alignment agenda. Unlike the paper, this post is explicitly aimed at an alignment audience and is mainly conceptual rather than empirical.
Tl;dr: unsupervised methods are more scalable than supervised methods, deep learning has special structure that we can exploit for alignment, and we may be able to recover superhuman beliefs from deep learning representations in a totally unsupervised way.
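To make "totally unsupervised" concrete, here is a stripped-down sketch in the spirit of the paper's Contrast-Consistent Search: fit a probe on paired activations of a statement phrased as true and as false, using only consistency and confidence terms, with no labels. Random tensors stand in for real model activations.

```python
import torch

# Sketch in the spirit of Contrast-Consistent Search (CCS): learn a probe on
# paired activations (statement phrased as true vs. as false) using only a
# consistency term and a confidence term -- no labels anywhere.
torch.manual_seed(0)
n, d = 256, 512
h_pos = torch.randn(n, d)   # activations for "X? Yes." -- placeholder data
h_neg = torch.randn(n, d)   # activations for "X? No."  -- placeholder data

probe = torch.nn.Sequential(torch.nn.Linear(d, 1), torch.nn.Sigmoid())
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

for step in range(1000):
    p_pos = probe(h_pos).squeeze(-1)
    p_neg = probe(h_neg).squeeze(-1)
    consistency = (p_pos - (1 - p_neg)) ** 2        # the two answers should sum to ~1
    confidence = torch.minimum(p_pos, p_neg) ** 2   # and the probe shouldn't hedge at 0.5/0.5
    loss = (consistency + confidence).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# probe(h_pos) > 0.5 is then read as the probe's "belief" that the statement is true
# (up to a global sign ambiguity, since no labels were used).
```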
Original article:
https://www.lesswrong.com/posts/L4anhrxjv8j2yRKKp/how-discovering-latent-knowledge-in-language-models-without
Narrated for LessWrong by TYPE III AUDIO.
If you fear that someone will build a machine that will seize control of the world and annihilate humanity, then one kind of response is to try to build further machines that will seize control of the world even earlier without destroying it, forestalling the ruinous machine’s conquest. An alternative or complementary kind of response is to try to avert such machines being built at all, at least while the degree of their apocalyptic tendencies is ambiguous.
The latter approach seems to me like the kind of basic and obvious thing worthy of at least consideration, and also in its favor, fits nicely in the genre ‘stuff that it isn’t that hard to imagine happening in the real world’. Yet my impression is that for people worried about extinction risk from artificial intelligence, strategies under the heading ‘actively slow down AI progress’ have historically been dismissed and ignored (though ‘don’t actively speed up AI progress’ is popular).
Original article:
https://forum.effectivealtruism.org/posts/vwK3v3Mekf6Jjpeep/let-s-think-about-slowing-down-ai-1
Narrated for the Effective Altruism Forum by TYPE III AUDIO.
In previous pieces, I argued that there's a real and large risk of AI systems' aiming to defeat all of humanity combined - and succeeding.
I first argued that this sort of catastrophe would be likely without specific countermeasures to prevent it. I then argued that countermeasures could be challenging, due to some key difficulties of AI safety research.
But while I think misalignment risk is serious and presents major challenges, I don’t agree with sentiments along the lines of “We haven’t figured out how to align an AI, so if transformative AI comes soon, we’re doomed.” Here I’m going to talk about some of my high-level hopes for how we might end up avoiding this risk.
Original article:
https://forum.effectivealtruism.org/posts/rJRw78oihoT5paFGd/high-level-hopes-for-ai-alignment
Narrated by Holden Karnofsky for the Cold Takes blog.
This post is inspired by What 2026 looks like and an AI vignette workshop guided by Tamay Besiroglu. I think of this post as “what would I expect the world to look like if these timelines (median compute for transformative AI ~2036) were true” or “what short-to-medium timelines feel like” since I find it hard to translate a statement like “median TAI year is 20XX” into a coherent imaginable world.
I expect some readers to think that the post sounds wild and crazy, but that doesn’t mean its content couldn’t be true. If you had told someone in 1990 or 2000 that there would be more smartphones and computers than humans in 2020, that probably would have sounded wild to them. The same could be true for AIs, i.e. that in 2050 there are more human-level AIs than humans. The fact that this sounds as ridiculous as ubiquitous smartphones sounded to the 1990/2000 person might just mean that we are bad at predicting exponential growth and disruptive technology.
Original article:
https://www.lesswrong.com/posts/qRtD4WqKRYEtT5pi3/the-next-decades-might-be-wild
Narrated for LessWrong by TYPE III AUDIO.
The long-term future of intelligent life is currently unpredictable and undetermined. In the linked document, we argue that the invention of artificial general intelligence (AGI) could change this by making extreme types of lock-in technologically feasible. In particular, we argue that AGI would make it technologically feasible to (i) perfectly preserve nuanced specifications of a wide variety of values or goals far into the future, and (ii) develop AGI-based institutions that would (with high probability) competently pursue any such values for at least millions, and plausibly trillions, of years.
The rest of this post contains the summary (6 pages), with links to relevant sections of the main document (40 pages) for readers who want more details.
Original article:
https://forum.effectivealtruism.org/posts/KqCybin8rtfP3qztq/agi-and-lock-in
Narrated for the Effective Altruism Forum by TYPE III AUDIO.
This is going to be a list of holes I see in the basic argument for existential risk from superhuman AI systems.
To start, here’s an outline of what I take to be the basic case:
I. If superhuman AI systems are built, any given system is likely to be ‘goal-directed’
II. If goal-directed superhuman AI systems are built, their desired outcomes will probably be about as bad as an empty universe by human lights
III. If most goal-directed superhuman AI systems have bad goals, the future will very likely be bad
Original article:
https://forum.effectivealtruism.org/posts/zoWypGfXLmYsDFivk/counterarguments-to-the-basic-ai-risk-case
Narrated for the Effective Altruism Forum by TYPE III AUDIO.
In this note I’ll summarize the bio-anchors report, describe my initial reactions to it, and take a closer look at two disagreements that I have with background assumptions used by (readers of) the report.
This report attempts to forecast the year when the amount of compute required to train a transformative AI (TAI) model will first become available: that is, the year when a forecast of the compute required to train TAI intersects a forecast of the compute available for a single project's training run.
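As a toy version of that setup, with made-up parameters rather than the report's: find the first year in which a falling "compute required" curve crosses a rising "compute affordable by a single project" curve.

```python
# Toy version of the bio-anchors setup with made-up parameters:
# "required" compute falls over time (algorithmic progress), while "available"
# compute for the largest single training run grows (spending + hardware gains).

def compute_required(year: int) -> float:
    # Illustrative: 1e36 FLOP needed in 2025, halving every ~2.5 years.
    return 1e36 * 0.5 ** ((year - 2025) / 2.5)

def compute_available(year: int) -> float:
    # Illustrative: 1e26 FLOP affordable in 2025, growing ~10x every 5 years.
    return 1e26 * 10 ** ((year - 2025) / 5)

for year in range(2025, 2101):
    if compute_available(year) >= compute_required(year):
        print("Illustrative crossover year:", year)
        break
```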
Original article:
https://docs.google.com/document/d/1_GqOrCo29qKly1z48-mR86IV7TUDfzaEXxD3lGFQ8Wk/edit#
Narrated for the Effective Altruism Forum by TYPE III AUDIO.
Excerpt:
During my Master's and Ph.D. (still ongoing), I have spoken with many academics about AI safety. These conversations include chats with individual PhDs, poster presentations and talks about AI safety.
I think I have learned a lot from these conversations and expect many other people concerned about AI safety to find themselves in similar situations. Therefore, I want to detail some of my lessons and make my thoughts explicit so that others can scrutinize them.
TL;DR: People in academia seem more and more open to arguments about risks from advanced intelligence over time, and I would genuinely recommend having lots of these chats. Furthermore, I underestimated how much work related to some aspects of AI safety already exists in academia, and that we sometimes reinvent the wheel. Messaging matters, e.g. technical discussions got more interest than alarmism, and explaining the problem rather than trying to actively convince someone received better feedback.
Original article:
https://www.lesswrong.com/posts/SqjQFhn5KTarfW8v7/lessons-learned-from-talking-to-greater-than-100-academics
Narrated for LessWrong by TYPE III AUDIO.
https://80000hours.org/career-reviews/china-related-ai-safety-and-governance-paths/#strong-networking-abilities-especially-for-policy-roles
Expertise in China and its relations with the world might be critical in tackling some of the world’s most pressing problems. In particular, China’s relationship with the US is arguably the most important bilateral relationship in the world, with these two countries collectively accounting for over 40% of global GDP. These considerations led us to publish a guide to improving China–Western coordination on global catastrophic risks and other key problems in 2018. Since then, we have seen an increase in the number of people exploring this area.
China is one of the most important countries developing and shaping advanced artificial intelligence (AI). The Chinese government’s spending on AI research and development is estimated to be on the same order of magnitude as that of the US government, and China’s AI research is prominent on the world stage and growing.
Because of the importance of AI from the perspective of improving the long-run trajectory of the world, we think relations between China and the US on AI could be among the most important aspects of their relationship. Insofar as the EU and/or UK influence advanced AI development through labs based in their countries or through their influence on global regulation, the state of understanding and coordination between European and Chinese actors on AI safety and governance could also be significant.
That, in short, is why we think working on AI safety and governance in China and/or building mutual understanding between Chinese and Western actors in these areas is likely to be one of the most promising China-related career paths. Below we provide more arguments and detailed information on this option.
If you are interested in pursuing a career path described in this profile, contact 80,000 Hours’ one-on-one team and we may be able to put you in touch with a specialist advisor.
https://www.lesswrong.com/posts/6Xgy6CAf2jqHhynHL/what-2026-looks-like#2022
This was written for the Vignettes Workshop.[1] The goal is to write out a detailed future history (“trajectory”) that is as realistic (to me) as I can currently manage, i.e. I’m not aware of any alternative trajectory that is similarly detailed and clearly more plausible to me. The methodology is roughly: Write a future history of 2022. Condition on it, and write a future history of 2023. Repeat for 2024, 2025, etc. (I'm posting 2022-2026 now so I can get feedback that will help me write 2027+. I intend to keep writing until the story reaches singularity/extinction/utopia/etc.)
What’s the point of doing this? Well, there are a couple of reasons:
https://www.lesswrong.com/posts/K4urTDkBbtNuLivJx/why-i-think-strong-general-ai-is-coming-soon
I think there is little time left before someone builds AGI (median ~2030). Once upon a time, I didn't think this.
This post attempts to walk through some of the observations and insights that collapsed my estimates.
The core ideas are as follows:
Some notes up front:
This post is part of my AI strategy nearcasting series: trying to answer key strategic questions about transformative AI, under the assumption that key events will happen very soon, and/or in a world that is otherwise very similar to today's.
This post gives my understanding of what the set of available strategies for aligning transformative AI would be if it were developed very soon, and why they might or might not work. It is heavily based on conversations with Paul Christiano, Ajeya Cotra and Carl Shulman, and its background assumptions correspond to the arguments Ajeya makes in this piece (abbreviated as “Takeover Analysis”).
I premise this piece on a nearcast in which a major AI company (“Magma,” following Ajeya’s terminology) has good reason to think that it can develop transformative AI very soon (within a year), using what Ajeya calls “human feedback on diverse tasks” (HFDT) - and has some time (more than 6 months, but less than 2 years) to set up special measures to reduce the risks of misaligned AI before there’s much chance of someone else deploying transformative AI.
https://www.lesswrong.com/posts/nbq2bWLcYmSGup9aF/a-transparency-and-interpretability-tech-tree
Thanks to Chris Olah, Neel Nanda, Kate Woolverton, Richard Ngo, Buck Shlegeris, Daniel Kokotajlo, Kyle McDonell, Laria Reynolds, Eliezer Yudkowsky, Mark Xu, and James Lucassen for useful comments, conversations, and feedback that informed this post.
The more I have thought about AI safety over the years, the more I have gotten to the point where the only worlds I can imagine myself actually feeling good about humanity’s chances are ones in which we have powerful transparency and interpretability tools that lend us insight into what our models are doing as we are training them.[1] Fundamentally, that’s because if we don’t have the feedback loop of being able to directly observe how the internal structure of our models changes based on how we train them, we have to essentially get that structure right on the first try—and I’m very skeptical of humanity’s ability to get almost anything right on the first try, if only just because there are bound to be unknown unknowns that are very difficult to predict in advance.
Certainly, there are other things that I think are likely to be necessary for humanity to succeed as well—e.g. convincing leading actors to actually use such transparency techniques, having a clear training goal that we can use our transparency tools to enforce, etc.—but I currently feel that transparency is the least replaceable necessary condition and yet the one least likely to be solved by default.
Nevertheless, I do think that it is a tractable problem to get to the point where transparency and interpretability is reliably able to give us the sort of insight into our models that I think is necessary for humanity to be in a good spot. I think many people who encounter transparency and interpretability, however, have a hard time envisioning what it might look like to actually get from where we are right now to where we need to be. Having such a vision is important both for enabling us to better figure out how to make that vision into reality and also for helping us tell how far along we are at any point—and thus enabling us to identify at what point we’ve reached a level of transparency and interpretability that we can trust it to reliably solve different sorts of alignment problems.
The goal of this post, therefore, is to attempt to lay out such a vision by providing a “tech tree” of transparency and interpretability problems, with each successive problem tackling harder and harder parts of what I see as the core difficulties. This will only be my tech tree, in terms of the relative difficulties, dependencies, and orderings that I expect as we make transparency and interpretability progress—I could, and probably will, be wrong in various ways, and I’d encourage others to try to build their own tech trees to represent their pictures of progress as well.
By Benjamin Hilton.
Abstract:
AI might bring huge benefits—if we avoid the risks.
Source URL:
https://80000hours.org/problem-profiles/artificial-intelligence/