Why generative AI doesn't plagiarize
Published on Mar 30, 2025 in Thoughts on AI
This article will be quite short, because from a technical point of view the subject is actually very simple. Everything I say here applies to both text and image generation.
It is quite common to see crowds on the internet railing against the supposed plagiarism of generative AI.
For anyone who has ever trained a generative model, it is obvious that the question does not arise.
Indeed, when you train this kind of model, you try to avoid both underfitting and overfitting. To do this, you compare two metrics: the training loss and the eval loss. Once the training loss has converged to its minimum (which is the goal of training), the eval loss must not diverge from it.
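The comparison described above can be sketched as a crude heuristic. The thresholds and loss values below are purely illustrative, not canonical:

```python
def diagnose(train_loss: float, eval_loss: float,
             high_loss: float = 1.0, gap: float = 0.2) -> str:
    """Crude diagnostic: thresholds are illustrative assumptions."""
    if train_loss > high_loss:
        return "underfitting"   # the model never managed to learn the data
    if eval_loss - train_loss > gap:
        return "overfitting"    # eval loss diverged: the model memorized
    return "ok"

# Three hypothetical checkpoints of a training run.
print(diagnose(1.8, 1.9))    # early on: both losses still high
print(diagnose(0.3, 0.35))   # converged: eval loss tracks training loss
print(diagnose(0.1, 0.9))    # converged, but eval loss diverged
```

In a real training loop you would log both losses at each evaluation step and watch the gap over time rather than thresholding a single snapshot.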
When both losses remain too high, it is called underfitting, which means the model has not managed to learn the data. This can happen when there is too much data for too small a model, or when the training hyperparameters are poorly tuned.
Overfitting is when the model stores the data instead of "understanding" its patterns. It is detected when the eval loss diverges upward from the training loss. Again, this can come from poorly tuned hyperparameters; adding a dropout layer is one common way to prevent it.
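As a minimal sketch of the dropout idea, here is the standard "inverted dropout" trick in plain Python (real frameworks provide this as a layer; the function name and values here are hypothetical):

```python
import random

def dropout(values, p=0.5, training=True, rng=random):
    """Inverted dropout: zero each unit with probability p during training,
    and rescale survivors by 1/(1-p) so the expected activation is unchanged.
    At inference time the layer is a no-op."""
    if not training or p == 0.0:
        return list(values)
    keep = 1.0 - p
    return [v / keep if rng.random() >= p else 0.0 for v in values]

rng = random.Random(0)           # seeded for reproducibility
activations = [0.5, 1.0, 1.5, 2.0]
print(dropout(activations, p=0.5, rng=rng))   # some units zeroed, rest doubled
print(dropout(activations, training=False))   # inference: unchanged
```

By randomly silencing units, the network cannot rely on memorizing any particular co-adaptation of activations, which pushes it toward generalizing instead.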
But no one wants overfitting. When training a model, you want it to develop the ability to generalize from the data, not to memorize them and reproduce them later.
A well-trained model can produce a good pastiche, but in no way plagiarism. It does not store the data, only their essence.
This is why, under American law, the use of copyrighted data to train a model is considered to fall within the scope of fair use. But it is not that simple in every country.
Fundamentally, the plagiarism debate is not a very deep one; it is mostly led by people who have never trained a generative model and have no idea how one works. A poorly trained model can indeed regurgitate training data. But in that case, the problem is not that generative AI plagiarizes; the problem is that some models are poorly trained.
On the other hand, many other subjects deserve more debate, such as how to acquire data ethically, or transparency about the data used… But I believe the plagiarism debate persists because it is simple (even if mistaken), so it crystallizes all the emotion around the subject.