Current cases of AI misalignment and their implications for future risks

Synthese 202 (5):1-23 (2023)

Abstract

How can one build AI systems such that they pursue the goals their designers want them to pursue? This is the alignment problem. Numerous authors have raised concerns that, as research advances and systems become more powerful over time, misalignment might lead to catastrophic outcomes, perhaps even to the extinction or permanent disempowerment of humanity. In this paper, I analyze the severity of this risk based on current instances of misalignment. More specifically, I argue that contemporary large language models and game-playing agents are sometimes misaligned. These cases suggest that misalignment tends to have a variety of features: it can be hard to detect, predict, and remedy; it does not depend on a specific architecture or training paradigm; it tends to diminish a system’s usefulness; and it is the default outcome of creating AI via machine learning. Based on these features, I then show that the risk posed by AI misalignment magnifies for more capable systems. Not only might more capable systems cause more harm when misaligned, but aligning them should also be expected to be more difficult than aligning current AI.

Links

PhilArchive





Author's Profile

Leonard Dung
Universität Erlangen-Nürnberg

Citations of this work

Understanding Artificial Agency. Leonard Dung - forthcoming - Philosophical Quarterly.

