AI models collapse when trained on recursively generated data


Example of model outputs degrading over successive generations, each trained on data produced by its predecessor:

Input: some started before 1360 — was typically accomplished by a master mason and a small team of itinerant masons, supplemented by local parish labourers, according to Poyntz Wright. But other authors reject this model, suggesting instead that leading architects designed the parish church towers based on early examples of Perpendicular.

Gen 0: St. John’s Cathedral is a Revival architecture. The earliest surviving example of Perpendicular Revival architecture is found in the 18th @-@ century Church of Our Lady of Guernsey, which dates from the late 19th century. There are two types of churches.

Gen 1: architecture such as St. Peter’s Basilica in Rome or St. Peter’s Basilica in Buenos Aires. There is no evidence that any of these buildings were built during the reign of Pope Innocent III, but it is possible that they may have been built during the reign of his successor, Pope Innocent.

Later generation: It had been translated into more than 100 languages, such as English, French, German, Italian, Spanish, Portuguese and Dutch.

Gen 9: architecture. In addition to being home to some of the world’s largest populations of black @-@ tailed jackrabbits, white @-@ tailed jackrabbits, blue @-@ tailed jackrabbits, red @-@ tailed jackrabbits, yellow @-.

The effects of generational learning on language models trained on wikitext2, with a corresponding analysis of VAEs and GMMs in the Supplementary Materials

The described process demonstrates that fine-tuning of language models does not curb the effects of model collapse: models that are being fine-tuned are also vulnerable. We find that, over the generations, models tend to produce sequences that are more probable under the original data, while also introducing their own improbable sequences, that is, errors.

Figure 1b,c (left) shows histograms of individual data-point perplexities generated by the models of different generations, as evaluated by the first model trained with the real wikitext2 training data. Over the generations, models tend to produce samples that the original model would produce with higher probability, mirroring the behaviour of the VAEs and GMMs described in the Supplementary Materials. At the same time, we find that the generated data have much longer tails, suggesting that some of the data would never be produced by the original model; these are the errors that accumulate because of learning with generational data.
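As a concrete illustration of how such histograms can be produced, the sketch below scores each generated sample with the generation-0 model and plots per-sample perplexity. The checkpoint path and sample-file names are illustrative assumptions, not the paper's exact artefacts, and Hugging Face transformers is assumed as the modelling library.

```python
import math
import torch
import matplotlib.pyplot as plt
from transformers import AutoModelForCausalLM, AutoTokenizer

CHECKPOINT = "gen0-finetuned"        # hypothetical path to the generation-0 model
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
scorer = AutoModelForCausalLM.from_pretrained(CHECKPOINT).to(device).eval()

def perplexity(text: str) -> float:
    """exp of the mean token-level cross-entropy under the generation-0 model."""
    ids = tokenizer(text, return_tensors="pt", truncation=True).input_ids.to(device)
    with torch.no_grad():
        loss = scorer(ids, labels=ids).loss      # mean negative log-likelihood per token
    return math.exp(loss.item())

# Hypothetical files holding one generated sample per line for each generation.
for gen, path in [(1, "samples_gen1.txt"), (5, "samples_gen5.txt"), (9, "samples_gen9.txt")]:
    with open(path) as f:
        ppls = [perplexity(line) for line in f if line.strip()]
    plt.hist(ppls, bins=50, alpha=0.5, label=f"generation {gen}")

plt.xlabel("perplexity under the generation-0 model")
plt.ylabel("number of samples")
plt.legend()
plt.show()
```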

Ten epochs, 10% of original training data preserved. In this setting, each generation's fine-tuning data additionally includes a random 10% sample of the original data points. Performance on the original task is presented in Fig. 1c. We find that preservation of the original data allows for better model fine-tuning and leads to only minor degradation of performance.
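A minimal sketch of this data-preservation regime, assuming the original corpus and the previous generation's outputs are available as lists of strings (the function and variable names are illustrative):

```python
import random

def build_training_set(original_data, generated_data, keep_fraction=0.10, seed=None):
    """Combine the previous generation's outputs with a random fraction
    (10% by default) of the original data points."""
    rng = random.Random(seed)
    k = int(len(original_data) * keep_fraction)
    preserved = rng.sample(original_data, k)   # fresh random subset each generation
    return generated_data + preserved

# Hypothetical usage: mix generation-k articles with 10% of the real wikitext2 text.
# train_texts = build_training_set(wikitext2_train, gen_k_articles, keep_fraction=0.10)
```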

Although these training regimes lead to degraded performance in our models, we find that models can still learn the underlying task with generated data. In particular, from Fig. 1 and the 3D versions in the Supplementary Materials, we see that model collapse occurs, as the density of samples with low perplexity begins to accumulate over the generations. This in turn makes it likely that, over the generations, the sampled data will similarly collapse to a delta function.

It is important to note here that the observed behaviour is in line with the general intuition established in the section ‘Theoretical intuition’. To be precise, in all experiments, generational learning is performed only over a finite (usually small) number of generations, whereas the claims of the section ‘Theoretical intuition’ are mostly presented in the limit of the number of generations going to infinity. However, as seen from the experiments on VAEs and GMMs in the Supplementary Materials, convergence to delta functions and the specific rates of such convergence depend strongly on the specifics of the problem considered, and complete collapse may or may not occur, even after a small number of steps. This is further illustrated theoretically in the Supplementary Materials, which show that notable divergence from the original model can occur even after only a few generations.
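The single-Gaussian toy below (not one of the paper's experiments) shows this mechanism in miniature: re-estimating a distribution from finite samples of the previous estimate drives the fitted variance towards zero, that is, towards a delta function, at a rate that depends on the sample size.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_generations = 100, 1000
mu, sigma = 0.0, 1.0                              # generation-0 "true" distribution

for gen in range(1, n_generations + 1):
    data = rng.normal(mu, sigma, n_samples)       # finite sample from the current model
    mu, sigma = data.mean(), data.std()           # maximum-likelihood re-fit
    if gen % 200 == 0:
        # sigma performs a random walk with downward drift and typically shrinks
        # towards zero, i.e. the fitted model tends towards a delta function
        print(f"generation {gen:4d}: mu = {mu:+.4f}, sigma = {sigma:.6f}")
```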

The researchers demonstrated model collapse by taking a pre-trained LLM and fine-tuning it on a dataset based on Wikipedia entries. They then asked the resulting model to generate its own Wikipedia-style articles. To train the next generation of the model, they started with the same pre-trained LLM, but fine-tuned it on the articles created by its predecessor. The models were judged on how well they predicted the next few sentences, compared with a model trained on real data. The team expected to see errors, but was surprised at how quickly things went wrong.
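A sketch of that generational loop is given below, assuming the Hugging Face transformers and datasets libraries, a small causal LM (facebook/opt-125m is an illustrative choice) and wikitext2 as the generation-0 data; the hyperparameters are placeholders rather than the paper's exact settings.

```python
import torch
from datasets import Dataset, load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

BASE = "facebook/opt-125m"                       # illustrative small causal LM
tokenizer = AutoTokenizer.from_pretrained(BASE)

def fine_tune(texts, out_dir):
    """Fine-tune a fresh copy of the pre-trained base model on `texts`."""
    model = AutoModelForCausalLM.from_pretrained(BASE)
    ds = Dataset.from_dict({"text": texts}).map(
        lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
        batched=True, remove_columns=["text"])
    args = TrainingArguments(output_dir=out_dir, num_train_epochs=5,
                             per_device_train_batch_size=8, report_to="none")
    Trainer(model=model, args=args, train_dataset=ds,
            data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False)).train()
    return model

def generate_corpus(model, prompts, max_new_tokens=128):
    """Use the current generation to produce the next generation's training text."""
    model.eval()
    texts = []
    for prompt in prompts:
        ids = tokenizer(prompt, return_tensors="pt", truncation=True).input_ids
        with torch.no_grad():
            out = model.generate(ids, do_sample=True, max_new_tokens=max_new_tokens)
        texts.append(tokenizer.decode(out[0], skip_special_tokens=True))
    return texts

# Generation 0 is trained on real Wikipedia-derived text (wikitext2).
raw = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
corpus = [t for t in raw["text"] if t.strip()]
prompts = corpus[:1000]                          # opening passages used as prompts

for gen in range(10):
    model = fine_tune(corpus, out_dir=f"gen{gen}")   # same pre-trained base each time
    corpus = generate_corpus(model, prompts)         # gen k's output trains gen k+1
```

Note that each generation starts from the same pre-trained base model, as described above; only the fine-tuning data changes from one generation to the next.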

The study shows that models forget the information that is mentioned least frequently as their outputs become more homogeneous, even before complete collapse. This is a concern when it comes to making AI models that represent all groups fairly, because low-probability events often relate to marginalized groups, says study co-author Ilia Shumailov, who worked on the project while at the University of Oxford, UK.

Until now, many technology firms have improved their models by feeding them larger and larger amounts of data, and they are hoping to use synthetic data as human-produced content runs out. The study — a version of which first appeared on the arXiv preprint server in May 2023 — has spurred the AI community to try to find solutions to the problem, says Julia Kempe, a computer scientist at New York University in New York City. It has been a call to action, she says.

Language models work by building associations between words in big swathes of text, often taken from the Internet. They generate text by spitting out the most probable next word.

Collapse happens because each model necessarily only samples from the data it is trained on. This means that words that were not common in the data are less likely to be reproduced and the chance of common ones being repeated is increased. Complete collapse eventually occurs because each model learns not from reality, but from the previous model’s prediction of reality, with errors getting amplified in each iteration. “Over time, those errors end up stacking up on top of each other, to the point where the model basically only learns errors and nothing else,” says Shumailov.
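A toy simulation of that sampling argument (not the paper's experiment): a Zipf-like next-word distribution is re-estimated each generation from a finite sample of the previous estimate, and the rare words progressively vanish from the model's vocabulary.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, n_samples, n_generations = 1000, 5000, 50

# Generation-0 next-word distribution: heavy-tailed (Zipf-like), so many rare words.
probs = 1.0 / np.arange(1, vocab_size + 1)
probs /= probs.sum()

for gen in range(1, n_generations + 1):
    counts = rng.multinomial(n_samples, probs)    # finite sample from the current model
    probs = counts / counts.sum()                 # re-estimate the distribution
    if gen % 10 == 0:
        # words that draw zero counts get zero probability and never come back
        survivors = np.count_nonzero(probs)
        print(f"generation {gen:2d}: {survivors} of {vocab_size} words remain")
```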

The problem is analogous to inbreeding in a species, says Hany Farid, a computer scientist at the University of California, Berkeley. “If a species inbreeds with their own offspring and doesn’t diversify their gene pool, it can lead to a collapse of the species,” says Farid, whose work has demonstrated the same effect in image models, producing eerie distortions of reality2.

Developers might need to find ways, such as watermarking, to keep AI-generated data separate from real data, which would require unprecedented coordination by big-tech firms, says Shumailov. Incentives may also need to be found for human creators to keep producing content. Filtering is likely to become important, too — for example, humans could curate AI-generated text before it goes back into the data pool, says Kempe. If the data are curated properly, she says, model collapse might be partly or fully avoided.