
Model collapse explained: How synthetic training data breaks AI

Without human-generated training data, AI systems malfunction. This could be a problem if the internet becomes flooded with AI-generated content.

Garbage in, garbage out. Data pollution is ruining generative AI's future.

A recent study by researchers in Canada and the U.K. explained the phenomenon of model collapse. Model collapse occurs when new generative models train on AI-generated content and gradually degenerate as a result.

In this scenario, models start to forget the true underlying data distribution, even if the distribution does not change. This means that the models begin to lose information about the less common -- but still important -- aspects of the data. As generations of AI models progress, models start producing increasingly similar and less diverse outputs.

Generative AI models need to train on human-produced data to function. When trained on model-generated content, new models exhibit irreversible defects: their outputs become increasingly "wrong" and homogeneous. Researchers found that even in the best learning conditions, model collapse was inevitable.

Why is model collapse important?

Model collapse is important because generative AI is poised to bring about significant change in digital content. More and more online communication is partially or completely generated with AI tools. This trend has the potential to create data pollution on a large scale. Although creating large quantities of text is more efficient than ever, the model collapse research suggests that little of this data will be valuable for training the next generation of AI models.

As AI-generated data pollution spreads, data from genuine human interactions with systems becomes harder to find and more valuable. Companies and platforms with access to human-generated data will be better positioned to build high-quality AI models. Companies that were able to scrape the web before AI pollution set in will have an advantage over those scraping the post-ChatGPT web for quality training data.

How does model collapse occur?

Model collapse happens when new AI models are trained on generated or synthetic data produced by older models. The new models become overly dependent on patterns in that generated data. The underlying principle is that generative models can only replicate patterns they have already seen, and there is only so much information that can be pulled from those patterns.

In model collapse, probable events are overestimated and improbable events are underestimated. Through repeated generations, probable events poison the data set, and tails shrink. Tails are the improbable but important parts of the data set that help maintain model accuracy and output variance. Over generations, models compound errors and more drastically misinterpret data.
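
The dynamic is easy to reproduce in miniature. The following Python sketch is an illustration rather than the study's method; the Gaussian toy model, sample size and generation count are choices made for the example. It repeatedly fits a simple Gaussian to data, then replaces that data with samples drawn from the fit, so each generation trains only on the previous generation's output.

```python
# A minimal sketch, not taken from the study, of how a "fit, sample, refit" loop
# loses variance. Each generation is trained only on samples drawn from the
# previous generation's fitted distribution; the tails are undersampled and the
# estimated spread tends to drift toward zero.
import numpy as np

rng = np.random.default_rng(0)
samples_per_generation = 50

# Generation 0: "human" data drawn from a standard normal distribution.
data = rng.normal(loc=0.0, scale=1.0, size=samples_per_generation)

for generation in range(1, 1001):
    # "Train" a model on the current data by estimating its mean and spread.
    mu, sigma = data.mean(), data.std()

    # "Generate" the next generation's training set from the fitted model,
    # never returning to the original human data.
    data = rng.normal(loc=mu, scale=sigma, size=samples_per_generation)

    if generation % 200 == 0:
        print(f"generation {generation:4d}: estimated std = {sigma:.4f}")

# In a typical run the estimated standard deviation shrinks by orders of
# magnitude: the improbable "tail" values stop appearing in the training data,
# so each new fit underestimates the true variability a little more.
```

Using larger samples per generation slows the collapse in this toy setup, but it does not remove the sampling error that drives it.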

Researchers define two types of model collapse: early and late. In early model collapse, the model begins to lose information about probability tails. In late model collapse, the model blends together what should be distinct patterns in the data. Eventually, the outputs become increasingly similar to each other with little resemblance to the original data.

The previously mentioned study, "The Curse of Recursion: Training on Generated Data Makes Models Forget," tested three AI model types by repeatedly feeding them model-generated data. In all three cases, the researchers found instances of model collapse:

  • Gaussian mixture model (GMM). A GMM is designed to separate data into clusters using a Gaussian distribution. Within 50 re-generations, the data distribution completely changed. By generation 2,000, there was no longer any variance in the data. (A rough sketch of this fit-and-resample loop appears after this list.)
  • Variational autoencoder (VAE). The VAE was trained on real data and used to generate images of handwritten digits. The next generations were trained on model-generated data. As the generations progressed, the images got progressively blurrier until each digit resembled a roughly uniform smudge.
  • Large language model (LLM). The LLM -- OPT-125m -- was fine-tuned using only artificial model data in one scenario and a mixture of human-generated and artificial data in another. Researchers found that although model performance degraded over time, some level of learning was possible with generated data. Still, in the study's example of OPT-125m responding to prompts about medieval architecture, the model was outputting completely unrelated text about jackrabbits by the fourth generation.
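
To show the shape of the GMM experiment, here is a rough sketch using scikit-learn. It is loosely modeled on the setup described above, not the authors' exact code; the cluster locations, sample counts and number of generations are assumptions made for the example.

```python
# A rough sketch, loosely modeled on the GMM experiment described above (the
# study's exact setup and parameters differ): fit a two-component Gaussian
# mixture, sample from it, and refit on those samples, generation after
# generation, never returning to the original data.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Generation 0: real data drawn from two well-separated 2D clusters.
real_data = np.concatenate([
    rng.normal(loc=-3.0, scale=1.0, size=(100, 2)),
    rng.normal(loc=3.0, scale=1.0, size=(100, 2)),
])

data = real_data
for generation in range(1, 201):
    # Fit a GMM to the current generation's training data.
    gmm = GaussianMixture(n_components=2, random_state=0).fit(data)

    # Replace the training set with model-generated samples only.
    data, _ = gmm.sample(n_samples=200)

    if generation % 50 == 0:
        print(f"generation {generation:3d}: per-dimension std = {data.std(axis=0).round(3)}")
```

In a typical run the sampled clusters drift away from the original data and their spread shrinks; fewer samples per generation, or more generations, makes the loss of variance more severe.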

Future of model collapse

The effects of model collapse -- the long-term poisoning of language model data sets -- have been occurring since before the mainstreaming of technology such as ChatGPT. Content farms have been used for years to intentionally game search algorithms and social networks into changing how they value content. For example, Google devalues content that appears to be farmed or low-value and focuses more on rewarding content from trustworthy sources, such as education domains.

Researchers argue that there will be an increased need for ways to distinguish artificially generated data from data that comes from humans. There is currently no reliable way to track LLM-generated data at scale.

In the immediate future, it is likely that companies with a stake in creating the next generation of machine learning models will rush to acquire human data wherever possible. They will do this in anticipation of a future where it will be even harder to distinguish human-generated data from synthetic data.

The Internet Archive recently experienced an outage due to extremely fast, high-volume requests for its public domain optical character recognition files. The Internet Archive called the requests "abusive traffic" in a tweet and said the traffic came from an AWS customer, speculating that it was an AI company harvesting data.

Model collapse vs. modal collapse

The term model collapse is inspired by the literature on modal collapse in generative adversarial networks (GANs). The terms sound similar but have some key differences:

  • Modal collapse is specific to GANs, which are a type of machine learning model. It occurs when the generator in a GAN begins producing a very limited variety of samples regardless of the input. It is called modal collapse because it fails to capture the multiple modes in a diverse data distribution.
  • Model collapse is a more general term for a model failing to learn properly. It applies to many types of machine learning models and generative AI systems, including LLMs, VAEs and GMMs, and it results from training on synthetic data.

How to prevent model collapse in LLMs

Although there is no agreed-upon way to track LLM-generated content at scale, one proposed option is community-wide coordination among organizations involved in LLM creation to share information and determine the origins of data.

In the meantime, to avoid being affected by model collapse, companies should try to preserve access to pre-2023 bulk stores of data.
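
As a purely hypothetical illustration of what preserving and filtering such data might look like, the sketch below keeps only corpus records flagged as human-written and collected before 2023. The record format and field names are assumptions made for the example; there is no standard provenance schema for LLM training data.

```python
# A hypothetical sketch of filtering a training corpus by provenance. The
# record fields ("text", "origin", "collected_at") are illustrative
# assumptions, not an established standard.
from datetime import datetime

corpus = [
    {"text": "A hand-written forum post...", "origin": "human", "collected_at": "2021-06-14"},
    {"text": "A chatbot-drafted product blurb...", "origin": "synthetic", "collected_at": "2023-09-02"},
    {"text": "A digitized public domain book...", "origin": "human", "collected_at": "2019-01-30"},
]

# Cutoff chosen to reflect the article's suggestion of preserving pre-2023 data.
CUTOFF = datetime(2023, 1, 1)

def is_safe_for_training(record: dict) -> bool:
    """Keep records labeled human-written and collected before generative
    AI content became widespread online."""
    collected = datetime.strptime(record["collected_at"], "%Y-%m-%d")
    return record["origin"] == "human" and collected < CUTOFF

training_set = [record for record in corpus if is_safe_for_training(record)]
print(f"kept {len(training_set)} of {len(corpus)} records")
```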
