Recent research by Ilia Shumailov and his team at Google DeepMind highlights a significant challenge for the future of large language models (LLMs). They discovered that LLMs trained predominantly on AI-generated content suffer from a phenomenon called "model collapse." This occurs when new generations of models, trained on data produced by older AI models, start to misrepresent reality and degrade in performance.
Researchers have discovered AI’s worst enemy — its data. https://t.co/OO0v6RH00R
— Randy Kemp (@randylewiskemp) July 24, 2024
The study, published in Nature, showed that LLMs trained on AI-generated data tend to forget less common elements of their original training sets. For instance, a model tasked with generating images of tourist landmarks might focus increasingly on popular sites like the Statue of Liberty, eventually ignoring other landmarks altogether. This narrowing can end in models producing meaningless or repetitive output; in one of the team's experiments, a model's text degenerated into nonsense about "tailed jackrabbits."
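The dynamic described above can be illustrated with a toy simulation. This is not the paper's actual experimental setup, just a minimal sketch of the underlying statistics: each "generation" fits a distribution to samples drawn from the previous generation's model, then samples from the new fit. Rare categories (here, hypothetical landmark names chosen purely for illustration) that happen to draw zero samples in any generation vanish permanently.

```python
import numpy as np

# Toy model-collapse sketch: repeatedly refit a categorical distribution
# to samples drawn from the previous fit. Rare categories drift to zero
# and, once gone, can never return.
rng = np.random.default_rng(42)

landmarks = ["Statue of Liberty", "Eiffel Tower", "Colosseum",
             "Angkor Wat", "Petra", "Sagrada Familia"]
probs = np.array([0.50, 0.25, 0.12, 0.07, 0.04, 0.02])  # starting distribution

n_samples, n_generations = 100, 200
for gen in range(n_generations):
    counts = rng.multinomial(n_samples, probs)  # "train" on the model's own output
    probs = counts / n_samples                  # refit via empirical frequencies

survivors = [name for name, p in zip(landmarks, probs) if p > 0]
print(f"Categories remaining after {n_generations} generations: {survivors}")
```

Run repeatedly with different seeds and the surviving set varies, but it is almost always smaller than the original six: the rare tail disappears first, mirroring how recursively trained models lose their data's less common elements.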
The issue of model collapse raises concerns about the future of machine learning advancements. The research suggests that while model collapse can affect any LLM, the severity depends on the model’s architecture, learning processes, and the quality of data it uses. This situation echoes past challenges faced by search engines, which had to adjust their algorithms due to content farms flooding the internet with low-quality articles.
For the average user, this problem might not be immediately noticeable, as major chatbot creators conduct thorough evaluations to prevent such degradation. However, for AI companies, understanding and addressing model collapse is crucial. Using high-quality, human-generated content for training could be a solution, as it provides more reliable data than AI-generated content.
AI models collapse when trained on recursively generated data
“the value of data collected about genuine human interactions with systems will be increasingly valuable in the presence of LLM-generated content in data crawled from the Internet.” https://t.co/w8oqVaS5sl
— Joshua Grubb (@entogrubb) July 25, 2024
The research also points to the potential value of platforms like Reddit, where human interactions generate a wealth of content. Companies like Google and OpenAI have already made deals with such platforms, recognizing the importance of quality data in developing robust AI models.
Key Points:
- Research by Google DeepMind found that LLMs trained on AI-generated data can suffer “model collapse.”
- Model collapse leads to repetitive and degraded responses, as models forget less common data elements.
- Model collapse could slow machine learning progress; mitigating it requires high-quality training data.
- Major chatbot creators can detect and prevent degradation through evaluations.
- Platforms like Reddit, with human-generated content, offer valuable training data for AI models.
RM Tomi – Reprinted with permission of Whatfinger News