What is the inception score (IS)?
The inception score (IS) is a mathematical algorithm used to measure the quality of images created by generative AI, typically through a generative adversarial network (GAN). The name comes from Google's Inception image classification network, which the algorithm uses to evaluate the generated images.
Without an inception score, humans are left to inspect a generated image and evaluate it visually, but such evaluations are highly subjective and can vary widely based on the preferences and biases of the viewer. The inception score and other metrics, such as Fréchet inception distance (FID), offer objective and consistent measures of generated images and, by extension, of the quality and capability of the underlying generative model.
The score produced by the IS algorithm can range from 1.0 (worst) up to the number of classes the image classifier recognizes (best), which is 1,000 for the standard Inception v3 model. The inception score algorithm measures two factors:
- Quality. How good the generated image is. Generated images should be believable or realistic as if a real person painted a picture or took a photograph. For example, if the AI produces images of cats, each image should include a clearly identifiable cat. If the object is not clearly identifiable as a cat, the corresponding IS will be low.
- Diversity. How diverse the generated images are as a set. Generated images should have high randomness (entropy), meaning that the generative AI should produce highly varied images. For example, if the AI produces images of cats, the images should show different cat breeds and perhaps different cat poses. If the AI is producing images of the same cat breed in the same pose, the diversity and corresponding IS will be low.
Generative AI developers use the inception score as a measure of image quality. The IS may also be fed back to the AI model as a training signal. This kind of feedback can be more objective and explainable than relying solely on human viewers to subjectively "score" generated images.
How does the inception score work?
The inception score, first defined in a 2016 technical paper, is based on Google's "Inception" image classification network.
Calculating an inception score starts by using the image classification network to ingest a generated image and return a probability distribution for the image. The image classification network is a pre-trained Inception v3 model, which can predict class probabilities -- what something might be -- for each computer-generated image. A probability distribution is simply a list of labels that the image classification network "thinks" the image might show, each with a fractional score; the scores add up to 1.0.
For example, the image classification network might see the generated image of a cat and return a series of potential results such as the following:
Label | Probability |
Cat | 0.5 |
Flower | 0.2 |
Car | 0.2 |
House | 0.1 |
Total | 1.0 |
The probability distribution helps to determine whether the generated image contains one well-defined thing, or a series of things that are harder (if not impossible) for the image classification network to identify. This is the foundation of the quality factor -- does the generated image look like something specific and identifiable?
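The quality factor can be made concrete with a short NumPy sketch. It computes the Shannon entropy of the example distribution from the table and of two hypothetical distributions over the same four labels; lower entropy means the classifier is more certain about what it sees. The helper name `entropy` and the "unclear" and "sharp" distributions are assumptions made for illustration, not part of the IS definition.

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    p = np.asarray(p, dtype=float)
    return float(-np.sum(p * np.log(p + eps)))

# Example distribution from the table: cat, flower, car, house.
cat_image = [0.5, 0.2, 0.2, 0.1]
# Hypothetical unrecognizable image: the classifier is equally unsure of every label.
unclear_image = [0.25, 0.25, 0.25, 0.25]
# Hypothetical sharp image: the classifier is almost certain it sees a cat.
sharp_image = [0.97, 0.01, 0.01, 0.01]

for name, dist in [("cat example", cat_image),
                   ("unclear", unclear_image),
                   ("sharp", sharp_image)]:
    print(f"{name}: entropy = {entropy(dist):.3f} nats")
```

The sharp image has the lowest entropy and the unclear image the highest, which is the ordering the quality factor rewards.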
Next, the inception score process averages the probability distributions of all the generated images in the sample. There may be as many as 50,000 generated images in a sample. The result is a second distribution called the marginal distribution, which indicates the amount of variety present in the generative AI's images.
For the cat example, the probabilities for each label are summed across all images and averaged. If the result is concentrated on one label (such as cats), the output lacks variety; if it is spread evenly across cats, flowers, cars, houses and so on, the output is varied. This is the foundation of the diversity factor -- can the AI produce varied items and scenes?
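As a rough illustration of the diversity factor, the sketch below estimates a marginal distribution by averaging per-image distributions, which is one common way to compute it. Both image sets and the helper names `marginal_distribution` and `entropy` are hypothetical and chosen only to show the contrast between varied and repetitive output.

```python
import numpy as np

# Hypothetical classifier outputs for four generated images.
# Columns: cat, flower, car, house. Each row sums to 1.0.
varied_set = np.array([
    [0.97, 0.01, 0.01, 0.01],
    [0.01, 0.97, 0.01, 0.01],
    [0.01, 0.01, 0.97, 0.01],
    [0.01, 0.01, 0.01, 0.97],
])
repetitive_set = np.array([
    [0.97, 0.01, 0.01, 0.01],
    [0.95, 0.03, 0.01, 0.01],
    [0.96, 0.02, 0.01, 0.01],
    [0.97, 0.01, 0.01, 0.01],
])

def marginal_distribution(p_yx):
    """Average the per-image distributions to estimate the marginal distribution."""
    return p_yx.mean(axis=0)

def entropy(p, eps=1e-12):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return float(-np.sum(p * np.log(p + eps)))

for name, batch in [("varied", varied_set), ("repetitive", repetitive_set)]:
    p_y = marginal_distribution(batch)
    print(name, "marginal =", np.round(p_y, 3), "entropy =", round(entropy(p_y), 3))
```

The varied set yields a flat marginal distribution with high entropy, while the repetitive set yields one concentrated on the cat label.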
The last step is to combine probability distribution and marginal distribution into a single score, which can represent both the distinctiveness of the object as well as the diversity of the output. The more those two distributions differ, the higher the inception score. The actual score is calculated using a statistical method called the Kullback-Leibler divergence, or KL divergence.
When there is high KL divergence, each image has a sharply peaked probability distribution and the marginal distribution is even (flat) -- each image has a distinct label (such as a cat), but the overall set of images has many different labels. This yields the highest inception score.
Finally, the IS algorithm averages the KL divergence over every image in the sample set and takes the exponent of that average to produce the final score.
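To show why a distinct label combined with a flat marginal distribution produces a large divergence, here is a minimal sketch of the KL divergence for a single image. The function name `kl_divergence` and all three distributions are assumptions made for illustration.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL divergence of p from q, in nats, for two discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

# Hypothetical marginal distribution: the sample covers four labels evenly.
flat_marginal = [0.25, 0.25, 0.25, 0.25]

sharp_image = [0.97, 0.01, 0.01, 0.01]   # clearly one label
vague_image = [0.25, 0.25, 0.25, 0.25]   # could be anything

print("sharp image vs. flat marginal:", round(kl_divergence(sharp_image, flat_marginal), 3))
print("vague image vs. flat marginal:", round(kl_divergence(vague_image, flat_marginal), 3))
```

The sharp image diverges strongly from the flat marginal, while the vague image does not diverge at all, so only the former contributes to a high score.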
What are the limitations of the inception score?
Although the inception score algorithm provides an objective means of measuring the quality and diversity of AI-generated images, the IS poses three principal limitations for AI developers:
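- Small image sizes. The Inception v3 classifier behind the IS expects small, square inputs of 299 x 299 pixels, so larger or non-square images must be resized or cropped before scoring, which can discard detail.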
- Limited samples. The IS measures diversity across image classes, not within a class, so a model that produces only one seascape image, or the same image many times for each class, can still receive an artificially high inception score. Small samples also make the score unreliable because there are not enough images of each type or class to adequately judge diversity.
- Unusual images. The inception score is calculated against a pre-trained data set within the image classification network that represents about 1000 image types or classes. The IS will produce an artificially low inception score if the AI generates an image that is not within those 1000 classes. This is because there is no similar pre-trained data to compare the new image against. Any generative work with labels that are not in the image classification network -- such as different fish or varieties of trees -- may score lower.
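Two of these limitations can be illustrated with synthetic probabilities instead of a real classifier. The sketch below is not a benchmark: the 10-class classifier, the "collapsed" generator that repeats one confident image per class, and the "unrecognized" generator whose images the classifier cannot identify are all hypothetical.

```python
import numpy as np

def inception_score(p_yx, eps=1e-12):
    """Inception score from an (images x classes) array of class probabilities."""
    p_y = p_yx.mean(axis=0, keepdims=True)   # marginal distribution
    kl = np.sum(p_yx * (np.log(p_yx + eps) - np.log(p_y + eps)), axis=1)
    return float(np.exp(kl.mean()))

num_classes = 10  # hypothetical small classifier, for readability

# Hypothetical collapsed generator: exactly one confident image per class,
# repeated 100 times each. Diversity within each class is zero.
one_per_class = np.full((num_classes, num_classes), 0.001)
np.fill_diagonal(one_per_class, 1.0 - 0.001 * (num_classes - 1))
collapsed = np.repeat(one_per_class, 100, axis=0)

# Hypothetical generator of unusual images: the classifier recognizes nothing,
# so every prediction is flat across the classes.
unrecognized = np.full((1000, num_classes), 1.0 / num_classes)

print("collapsed generator:", round(inception_score(collapsed), 2))
print("unrecognized images:", round(inception_score(unrecognized), 2))
```

The collapsed generator scores close to the maximum of 10 even though it only ever produces 10 distinct images, while the unrecognized images score the minimum of 1.0 regardless of how good they might look to a person.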
Inception score vs. Fréchet inception distance
Another metric used to evaluate the quality of AI-generated images is the Fréchet inception distance. FID was introduced in 2017 and has generally superseded inception score as the preferred measure of generative image model performance.
The principal difference between IS and FID is the comparative use and evaluation of real images, referred to as "ground truth." This allows FID to analyze real images alongside computer-generated images in a bid to better simulate human perception. By comparison, IS only evaluates computer-generated images.
Although FID has generally edged out IS as the preferred quality metric for GANs, FID has also been shown to demonstrate some statistical bias, and does not always accurately reflect human perception.
How to calculate the inception score
A full derivation of the inception score is beyond the scope of this definition. However, the score can be expressed compactly as the following:
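$$\mathrm{IS}(G) = \exp\Big( \mathbb{E}_{x \sim p_g} \, D_{KL}\big( p(y \mid x) \,\|\, p(y) \big) \Big)$$
The major components of the formula are as follows:
- $\mathrm{IS}(G)$ is the final inception score for the generative model $G$.
- $D_{KL}$ is the KL divergence.
- $p(y \mid x)$ is the conditional probability distribution, meaning the classifier's label probabilities for a single generated image $x$.
- $p(y)$ is the marginal probability distribution, meaning the average of $p(y \mid x)$ over all generated images.
- $\mathbb{E}_{x \sim p_g}$ is the expected value, or average, taken over all images $x$ drawn from the generator's distribution $p_g$.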
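For a finite sample, the expression is usually estimated by replacing the expectation with an average. The expansion below is a sketch under that assumption; the symbols $N$ (number of generated images), $C$ (number of classes), $x_i$ (the i-th generated image) and the hat notation for the estimated marginal are introduced here for illustration.

```latex
\hat{p}(y) = \frac{1}{N} \sum_{i=1}^{N} p(y \mid x_i)

D_{KL}\big( p(y \mid x_i) \,\|\, \hat{p}(y) \big)
  = \sum_{y=1}^{C} p(y \mid x_i) \, \log \frac{p(y \mid x_i)}{\hat{p}(y)}

\mathrm{IS}(G) \approx \exp\left( \frac{1}{N} \sum_{i=1}^{N}
  D_{KL}\big( p(y \mid x_i) \,\|\, \hat{p}(y) \big) \right)
```

Under this estimate the score lies between 1 and $C$: it equals 1 when every image receives the same distribution as the marginal, and approaches $C$ when each image is assigned to one class with certainty and the classes are covered evenly.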
The common process for resolving this expression and determining a final inception score involves five basic steps:
- Process each AI-generated image through the image classification network to obtain its conditional probability distribution, or p(y|x).
- Average the conditional distributions of all images in the sample to obtain the marginal probability distribution, or p(y).
- For each image, calculate the KL divergence of p(y|x) from p(y) by summing p(y|x) * log(p(y|x) / p(y)) over all classes.
- Calculate the average of the KL divergence values for all images in the computer-generated set or sample.
- Take the exponent (exp) of that average.
This final result is the inception score for the given set of computer-generated images.
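The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the classifier output is replaced with random softmax values so the example runs without a trained model, and the function name `inception_score` and the small `eps` guard against log(0) are assumptions.

```python
import numpy as np

def inception_score(p_yx, eps=1e-12):
    """Compute the score from an (images x classes) array of p(y|x)."""
    # Step 1 is assumed done: p_yx holds the classifier output for each image.
    p_yx = np.asarray(p_yx, dtype=float)
    # Step 2: marginal distribution p(y), the mean over all images.
    p_y = p_yx.mean(axis=0, keepdims=True)
    # Steps 3 and 4: KL divergence per image (a sum over classes),
    # then the average over all images.
    kl_per_image = np.sum(p_yx * (np.log(p_yx + eps) - np.log(p_y + eps)), axis=1)
    mean_kl = kl_per_image.mean()
    # Step 5: exponentiate the average.
    return float(np.exp(mean_kl))

# Stand-in for step 1: random softmax outputs for 500 images and 1,000 classes.
rng = np.random.default_rng(0)
logits = rng.normal(scale=5.0, size=(500, 1000))
logits -= logits.max(axis=1, keepdims=True)   # numerical stability
p_yx = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

print(f"inception score: {inception_score(p_yx):.2f}")
```

In practice, `p_yx` would come from running the generated images through the pre-trained classifier instead of from random numbers.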
How to implement the inception score
Although the mathematical formula for inception score can be resolved manually, the process of repeating advanced multi-step calculations across thousands of images can be a daunting and error-prone human challenge.
Instead of manual calculations, AI developers working with generative image models will typically implement a metric such as inception score using a mathematical software package. Common math processing alternatives include the following:
- Keras. An open source Python library for building and running artificial neural networks, commonly used on top of the TensorFlow library. Keras includes a pre-trained Inception v3 model that can be loaded directly.
- NumPy. A Python library for scientific computing which supports multidimensional array objects, derived objects and various routines for fast operations on arrays, along with statistical operations.
Implementing IS in a math package will require some amount of coding to derive probability distributions (or access to data where distributions are stored) and perform other required calculations. Coding may be performed by AI scientists already working on generative AI systems or supporting development staff.
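As a rough sketch of how the two packages might fit together, the example below uses the Keras Inception v3 model to derive the probability distributions and NumPy to compute the score. It assumes TensorFlow is installed, that the pre-trained ImageNet weights can be downloaded, and that the images arrive as an array of shape (N, height, width, 3) with pixel values from 0 to 255. The function names are hypothetical.

```python
import numpy as np

def predict_class_probabilities(images, batch_size=32):
    """Return Inception v3 class probabilities for a batch of RGB images.

    TensorFlow is imported lazily so the scoring function below can be used
    without it. The ImageNet weights are downloaded on first use.
    """
    import tensorflow as tf

    model = tf.keras.applications.InceptionV3(weights="imagenet")
    # Inception v3 expects 299 x 299 inputs scaled to the range [-1, 1].
    resized = tf.image.resize(tf.cast(images, tf.float32), (299, 299))
    inputs = tf.keras.applications.inception_v3.preprocess_input(resized)
    return model.predict(inputs, batch_size=batch_size, verbose=0)

def inception_score(p_yx, eps=1e-12):
    """Inception score from an (images x classes) array of class probabilities."""
    p_yx = np.asarray(p_yx, dtype=float)
    p_y = p_yx.mean(axis=0, keepdims=True)
    kl = np.sum(p_yx * (np.log(p_yx + eps) - np.log(p_y + eps)), axis=1)
    return float(np.exp(kl.mean()))

# Hypothetical usage, where `generated` is a NumPy array of generated images:
#   probs = predict_class_probabilities(generated)
#   print(inception_score(probs))
```

Keeping the classifier call separate from the scoring function means the stored distributions can be rescored later without rerunning the model.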