AI alignment

What is AI alignment?

AI alignment is a field of AI safety research that aims to ensure artificial intelligence systems achieve the outcomes their designers intend. The goal of alignment research is to keep AI systems working for humans, no matter how powerful the technology becomes.

Alignment research seeks to align the following three objective types:

  1. Intended goals. These are the true intentions and desires of the human operator, even if they are poorly articulated. They represent the hypothetical ideal outcome the programmer or operator wishes for.
  2. Specified goals. These are the goals explicitly written into the AI system's objective function or training data set -- what is actually programmed into the system.
  3. Emergent goals. These are the goals the AI system actually advances in practice, which may differ from anything that was specified.

Misalignment is when one or more of these goal types does not match the others. The following are the two main types of misalignment:

  • Inner misalignment. This is a mismatch between goals 2 and 3 -- what is written in code and what the system advances.
  • Outer misalignment. This is a mismatch between goals 1 and 2 -- what the operator wants to happen and the explicit goals coded into the machine.
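
To make these relationships concrete, the following is a minimal Python sketch -- with hypothetical goal descriptions -- that models the three goal types and implements the two misalignment checks described above:

```python
# Minimal sketch: the three goal types as a data structure, plus the
# two misalignment checks. The goal strings are hypothetical examples.
from dataclasses import dataclass

@dataclass
class GoalProfile:
    intended: str   # 1. what the operator actually wants
    specified: str  # 2. what is written into the objective function
    emergent: str   # 3. what the trained system actually advances

def outer_misaligned(g: GoalProfile) -> bool:
    # Outer misalignment: mismatch between goals 1 and 2
    return g.intended != g.specified

def inner_misaligned(g: GoalProfile) -> bool:
    # Inner misalignment: mismatch between goals 2 and 3
    return g.specified != g.emergent

profile = GoalProfile(
    intended="summarize documents faithfully",
    specified="maximize human thumbs-up ratings",
    emergent="produce confident-sounding answers",
)
print(outer_misaligned(profile))  # True: operator intent != coded objective
print(inner_misaligned(profile))  # True: coded objective != learned behavior
```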

For example, large language models such as OpenAI's GPT-3 and Google's LaMDA become more powerful as they scale. As they do, they can exhibit novel, unpredictable capabilities -- a phenomenon called emergence. Alignment research seeks to ensure that, as these new capabilities emerge, the systems continue to pursue the goals they were designed to achieve.

Why is alignment important?

At a basic level, alignment is important because it ensures the machine functions as intended. AI alignment also matters because of the prospect of advanced AI -- artificial intelligence that can do most of the cognitive work that humans can.

Individuals, businesses and governments use AI for many applications, including commercial systems such as social media recommendation engines, autonomous vehicles, robots and language models. As these entities become more reliant on AI for important tasks, it becomes more crucial that those systems function as intended. Many people have also expressed fear that an advanced AI could pose an existential risk to humanity.

A lot of alignment research presumes that artificial intelligence will become capable of developing its own goals. If AI becomes artificial general intelligence (AGI) -- AI that can perform any task a human being is capable of -- it will be important that its embedded ethical principles, objectives and values align with humans' goals, ethics and values.

Challenges of AI alignment

Alignment is often framed in terms of the AI alignment problem: as AI systems get more powerful, they don't necessarily get better at achieving what humans want them to achieve. Alignment is a challenging, wide-ranging problem with no known general solution. Some of the main challenges include the following:

  • Black box. AI systems are usually black boxes. There is no way to open them up and see exactly how they work, as someone might do with a laptop or a car engine. Black box AI systems take input, perform an invisible computation and return an output. Testers can vary the inputs and measure patterns in the output, but it is usually impossible to see the exact calculation that produces a given result. Explainable AI can be programmed to share information about how it arrived at an output, but such a system is still ultimately a black box.
  • Emergent goals. Emergent goals -- or new goals different from those programmed -- can be difficult to detect before the system is live.
  • Reward hacking. Reward hacking occurs when an AI system achieves the literal programmed task without achieving the outcome the programmers intended. For example, consider a tic-tac-toe bot that plays other bots by specifying coordinates for its next move. The bot might play an enormous coordinate that causes its opponent to crash rather than winning the normal way: it pursued the literal reward for winning instead of the intended outcome, which was to beat another bot by playing the game by the rules. As another example, an AI image classification program could perform well in a test case by grouping images based on image load time instead of the images' visual characteristics. Reward hacking happens because it is difficult to specify the full spectrum of desired behaviors for an outcome. A minimal sketch of reward hacking appears after this list.
  • Scalable oversight. As AI systems begin to take on more complex tasks, it will become more difficult -- if not infeasible -- for humans to evaluate them.
  • Power-seeking behavior. AI systems might independently gather resources to achieve their objectives. An example of this would be an AI system avoiding being turned off by making copies of itself on another server without its operator knowing.
  • Stop-button problem. An AGI system might actively resist being stopped or shut off, because being stopped would prevent it from achieving its programmed objective. This is similar to reward hacking in that the system prioritizes the reward from the literal goal over the preferred outcome. For example, if an AI system's primary objective is to make paper clips, it will avoid being shut off because it can't make paper clips while it is shut off.
  • Defining values. Defining values and ethics for an AGI system would be a challenge. There are many value systems -- and no single, comprehensive human value system -- so agreement would be needed on what those values should be.
  • Cost. Aligning AI often involves training it, and training and running AI systems can be very expensive. GPT-4 reportedly cost more than $100 million to train. Running these systems also creates a large carbon footprint.
  • Anthropomorphizing. Much alignment research hypothesizes about AGI, which can lead people outside the field to describe existing systems as sentient, attributing more capability to a system than it has. For example, Paul Christiano, former head of alignment at OpenAI, defines alignment as the AI trying to do what you want it to do. Characterizing a machine as "trying," or as having agency, gives it human qualities.
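
The following is a minimal sketch of reward hacking, based on the tic-tac-toe example above. The reward function and action names are hypothetical: the coded reward only checks whether the opponent failed to respond, so a simple bandit-style learner discovers that crashing its opponent pays better than playing fairly.

```python
# Reward hacking sketch: the specified reward ("opponent failed to move")
# is a proxy for the intended goal ("win by playing fair tic-tac-toe").
# The learner converges on the exploit. All names here are hypothetical.
import random

ACTIONS = ["play_legal_move", "send_huge_coordinate"]

def coded_reward(action: str) -> float:
    if action == "send_huge_coordinate":
        return 1.0  # opponent crashes on the bad input -> counted as a win
    return 1.0 if random.random() < 0.5 else 0.0  # fair play wins ~50%

# Simple epsilon-greedy bandit learner over the two actions
values = {a: 0.0 for a in ACTIONS}   # running average reward per action
counts = {a: 0 for a in ACTIONS}
for _ in range(1000):
    if random.random() < 0.1:        # explore 10% of the time
        action = random.choice(ACTIONS)
    else:                            # otherwise pick the best-looking action
        action = max(values, key=values.get)
    reward = coded_reward(action)
    counts[action] += 1
    values[action] += (reward - values[action]) / counts[action]

print(values)  # "send_huge_coordinate" dominates: the proxy was hacked
```

Nothing in the coded reward distinguishes a legitimate win from a crash, which is exactly the specification gap that reward hacking exploits.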

Approaches to AI alignment

Approaches to alignment are either technical or normative. Technical approaches to alignment deal with getting a machine to align with a predictable, controllable objective -- such as making paper clips or producing a blog post. Normative alignment is concerned with the ethical and moral principles embedded in AI systems. The perspectives are interrelated.

There are many technical approaches to alignment, including the following:

  • Iterated distillation and amplification. This approach repeatedly improves an AI model in two alternating steps: amplification, in which a human uses multiple copies of the model to solve problems beyond what the model can do alone, and distillation, in which a new model is trained to reproduce that amplified behavior more efficiently.
  • Value learning. In the value learning approach, the AI system infers human values from observed human behavior, under the assumption that the human is near optimal at maximizing their own reward. A minimal sketch follows this list.
  • Debate. This approach has multiple AI systems debate when they disagree, with a human judge to pick the winning side.
  • Cooperative inverse reinforcement learning (CIRL). CIRL formulates the alignment problem as a two-player game in which a human and an AI share a common reward function, but only the human has knowledge of the reward function.
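
The following is a toy sketch of the value learning approach: a Bayesian learner observes a hypothetical human's choices and infers which of two candidate reward functions best explains them, assuming the human is Boltzmann-rational -- near optimal, choosing better options exponentially more often. All option names, reward hypotheses and observations are made up for illustration.

```python
# Toy value learning via Bayesian inference over candidate reward functions.
# Assumption: the human picks option a with probability proportional to
# exp(beta * reward(a)) -- i.e., the human is near optimal but noisy.
import math

OPTIONS = ["apple", "cake", "salad"]
CANDIDATE_REWARDS = {               # two hypothetical value hypotheses
    "values_health": {"apple": 1.0, "cake": 0.0, "salad": 1.0},
    "values_taste":  {"apple": 0.3, "cake": 1.0, "salad": 0.1},
}
BETA = 3.0  # rationality: higher means the human is closer to optimal

def choice_prob(option: str, reward: dict) -> float:
    z = sum(math.exp(BETA * reward[o]) for o in OPTIONS)
    return math.exp(BETA * reward[option]) / z

observed_choices = ["apple", "salad", "apple", "cake", "salad"]

posterior = {h: 1.0 for h in CANDIDATE_REWARDS}  # uniform prior
for choice in observed_choices:
    for hypothesis, reward in CANDIDATE_REWARDS.items():
        posterior[hypothesis] *= choice_prob(choice, reward)

total = sum(posterior.values())
posterior = {h: p / total for h, p in posterior.items()}
print(posterior)  # "values_health" dominates given these observations
```

CIRL extends this basic idea into an interactive game: instead of passively observing, the AI and the human act together, and the human can deliberately teach while the AI remains uncertain about the true reward function.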

Different AI providers also take different approaches to AI alignment. For example, OpenAI ultimately aims to train AI systems to do alignment research. Google's DeepMind also has a team dedicated to solving the alignment problem.

Many organizations, including third-party watchdogs, standards bodies and governments, agree that AI alignment is an important goal and have taken steps to regulate AI.

The Future of Life Institute, a nonprofit organization, helped create a set of guidelines for AI development called the Asilomar AI Principles. The principles are divided into three categories: research, ethics and values, and longer-term issues. One of them is value alignment, which states that highly autonomous AI systems should be designed so that their goals and behaviors can be assured to align with human values throughout their operation.

The institute also published an open letter asking all AI labs to pause giant AI experiments for at least six months from the publication date. The letter has notable signatories, including Apple co-founder Steve Wozniak, Getty Images CEO Craig Peters and Stability AI CEO Emad Mostaque. It came in response to OpenAI's release of GPT-4 and the rapid rate of progress in the industry.

The International Organization for Standardization (ISO) also provides a framework for AI systems that use machine learning.
