Current cases of AI misalignment and their implications for future risks

Synthese 202 (5):1-23 (2023)

Abstract

How can one build AI systems such that they pursue the goals their designers want them to pursue? This is the alignment problem. Numerous authors have raised concerns that, as research advances and systems become more powerful over time, misalignment might lead to catastrophic outcomes, perhaps even to the extinction or permanent disempowerment of humanity. In this paper, I analyze the severity of this risk based on current instances of misalignment. More specifically, I argue that contemporary large language models and game-playing agents are sometimes misaligned. These cases suggest that misalignment tends to have a variety of features: it can be hard to detect, predict, and remedy; it does not depend on a specific architecture or training paradigm; it tends to diminish a system’s usefulness; and it is the default outcome of creating AI via machine learning. Based on these features, I then show that the risk posed by AI misalignment magnifies for more capable systems. Not only might more capable systems cause more harm when misaligned, but aligning them should also be expected to be more difficult than aligning current AI.

Links

PhilArchive





Author's Profile

Leonard Dung
Universität Erlangen-Nürnberg

Citations of this work

Understanding Artificial Agency. Leonard Dung - forthcoming - Philosophical Quarterly.

