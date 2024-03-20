

Key Takeaways

Generative AI is saturating the internet with algorithm-generated content, leading to model collapse over time.

Model collapse occurs when AI trains on its own outputs, resulting in declining quality and diversity of data.

Avoiding model collapse will likely mean steering clear of uncurated public data, and human curation may be necessary for larger models in the future.

Generative AI has taken off in ways that we could never have imagined, and it has become pervasive across the entire internet. With X (formerly Twitter) bots powered by generative AI, automated accounts on Reddit, and even scholarly papers written at least in part by LLMs, the internet is slowly being flooded with content generated by algorithms. While humans can by and large filter this content out as they go (or even find some value in it), LLMs can't, which is exactly why AI will poison itself over time.

Model collapse has already been proven

The future of generative AI may already be bleak

AI-generated data is now beginning to enter training sets, contributing to "model collapse," where AI models end up being trained on AI-generated output. The consequences of model collapse have already been demonstrated in a study entitled "The Curse of Recursion: Training on Generated Data Makes Models Forget." Using the OPT-125M model (small by today's standards, but capable of showing what can happen), the study showed how a model can poison itself.

In the study, the researchers trained each model on data generated by the previous generation. According to the study, after just a few generations, the model's output stopped making any sense. By the ninth generation, an input relating to 1360s parish laborers resulted in an answer that referenced different colors of jackrabbits. This is what model collapse can lead to.

Model collapse is a critical issue in generative models according to the researchers, leading to a significant decline in the quality and diversity of the data they generate. This issue manifests in two main stages: early and late model collapse. Early model collapse is characterized by the model's failure to capture the full range of the data distribution, particularly the less common aspects. Late model collapse occurs when the model begins to confuse different aspects of the data, resulting in outputs that bear little resemblance to the original dataset and lack variability.
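Early model collapse, where the rarer parts of the distribution go missing first, can be illustrated with a toy experiment. The sketch below is not the study's setup: it assumes a made-up categorical distribution with one common category and 49 rare ones, and a hypothetical helper, resample_generations, that "trains" each generation by counting how often each category appears in a finite sample from the previous one. Once a rare category draws zero samples, its probability is zero and it can never return.

```python
import random
from collections import Counter


def resample_generations(weights, sample_size, generations, seed=0):
    """Repeatedly re-estimate a categorical distribution from its own samples.

    Returns the number of categories with non-zero probability after each
    generation (index 0 is the original distribution).
    """
    rng = random.Random(seed)
    categories = list(range(len(weights)))
    support_sizes = [sum(1 for w in weights if w > 0)]

    for _ in range(generations):
        # "Generate" a finite dataset from the current model.
        samples = rng.choices(categories, weights=weights, k=sample_size)
        counts = Counter(samples)
        # "Train" the next model: empirical frequencies of those samples.
        weights = [counts.get(c, 0) / sample_size for c in categories]
        support_sizes.append(sum(1 for w in weights if w > 0))

    return support_sizes


# Toy long-tailed distribution (an assumption for illustration):
# one common category and 49 rare ones.
original = [0.5] + [0.5 / 49] * 49
sizes = resample_generations(original, sample_size=100, generations=30)
print(sizes)
```

The printed list is the number of surviving categories per generation. It can only stay flat or shrink, which is the toy version of a model forgetting the less common aspects of its data.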

The primary cause of model collapse is statistical approximation error, which stems from the limitations of working with finite samples. This error introduces a risk of losing information at each resampling step. The secondary cause, functional approximation error, arises from limitations in the model's ability to accurately represent complex data distributions. This can be due to the model being either not expressive enough or too expressive in ways that do not align with the original data distribution.
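Statistical approximation error is easy to see with a model far simpler than an LLM. The following is a minimal sketch under stated assumptions: the "model" is a single Gaussian, the original data follows a standard normal distribution, and refit_gaussian is a hypothetical helper that fits each generation to a small, finite sample drawn from the previous generation's fit. None of these choices come from the study itself.

```python
import random
import statistics


def refit_gaussian(sample_size, generations, seed=0):
    """Fit a Gaussian to samples drawn from the previous generation's fit.

    Returns the fitted standard deviation after each generation, starting
    with the true value (1.0) at index 0.
    """
    rng = random.Random(seed)
    mean, std = 0.0, 1.0
    history = [std]

    for _ in range(generations):
        # Draw a finite dataset from the current model.
        samples = [rng.gauss(mean, std) for _ in range(sample_size)]
        # Refit: the sample mean and standard deviation become the new model.
        mean = statistics.fmean(samples)
        std = statistics.stdev(samples)
        history.append(std)

    return history


stds = refit_gaussian(sample_size=10, generations=300)
print(f"generation 0: {stds[0]:.3f}")
print(f"generation 300: {stds[-1]:.3g}")
```

Each refit is only slightly off, but the errors compound because every generation inherits the previous one's mistakes. With a small sample size, the fitted spread tends to drift toward zero, so the model ends up producing nearly identical outputs.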

These errors can either exacerbate or mitigate model collapse, depending on the model's approximation capabilities. Improved model expressiveness can help in accurately capturing the true data distribution but can also amplify existing errors, leading to greater divergence from the intended output. Computational errors related to floating-point representation can complicate the issue further, though they generally have far less impact on model collapse and can be mitigated with more precise hardware.

Understanding and addressing these causes of model collapse is crucial to enhancing the performance and reliability of generative models, ensuring they generate diverse and accurate representations of the original data distributions. As AI-generated information fills the corpus of data used to train these models, though, we contribute to model collapse.

How can we avoid model collapse?

The answer isn't all that simple

Avoiding model collapse will likely mean staying away from training on public data, or will require an additional hand in choosing the data used for training these models. It's widely speculated that GPT-4 made use of a lot of data from social media sites such as Reddit, and GPT-3.5 was trained in part on Common Crawl, a corpus of data crawled from the internet. Common Crawl includes copyrighted material, though it is distributed under claims of fair use.

In a world where generated media is becoming commonplace, it becomes hard to shield against training on that data without intervention. This data can damage the integrity of training sets, making them unusable or at the very least detrimental to model development. Diverse and accurate data is what's healthiest for an LLM, but feeding a model's own outputs back in as training data only reinforces what it has already learned, rather than giving it additional views that it can use to iterate on.

As it stands, for smaller models, we're starting to reach the point of model collapse already. For larger models, we're a long way away, but it's clear that at some point mitigations will be needed to avoid a situation where AI-generated content makes up any significant percentage of the training data. Model collapse is a very real and serious issue, and it applies to image generation algorithms too. As models grow, humans who curate training content will likely be necessary in the future. This is why companies like Reddit have reportedly struck deals with Google to share data for training AI, since Reddit's data is likely to contain a higher guaranteed percentage of human-made content.
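To get a rough feel for why the share of human-made data matters, here is a toy sketch rather than a real training pipeline. It assumes the "model" is a single Gaussian and the human data follows a standard normal distribution; final_std and its human_fraction parameter are hypothetical names introduced for illustration. Each generation is fitted to a mix of fresh human samples and samples from the previous generation's model.

```python
import random
import statistics


def final_std(human_fraction, sample_size=20, generations=1000, seed=0):
    """Refit a Gaussian each generation on a mix of human and generated data.

    human_fraction is the share of each training set drawn from the original
    distribution N(0, 1); the rest comes from the previous generation's model.
    Returns the fitted standard deviation after the last generation.
    """
    rng = random.Random(seed)
    mean, std = 0.0, 1.0
    n_human = round(human_fraction * sample_size)

    for _ in range(generations):
        # Fresh human-made data always comes from the original distribution.
        human = [rng.gauss(0.0, 1.0) for _ in range(n_human)]
        # The remainder is output from the previous generation's model.
        generated = [rng.gauss(mean, std) for _ in range(sample_size - n_human)]
        data = human + generated
        mean = statistics.fmean(data)
        std = statistics.stdev(data)

    return std


for fraction in (0.0, 0.1, 0.5):
    print(f"human fraction {fraction:.0%}: final std = {final_std(fraction):.3g}")
```

In this toy setting, a purely self-trained model tends to lose its spread, while keeping a meaningful share of human data in every generation anchors it to the original distribution. How much human data a real LLM would need is not something this sketch can answer.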