Imagine a super smart AI that can create images, write articles, and even code. But what if that AI learned all these skills by copying other people's work without permission? That's the problem at the heart of the lawsuits OpenAI, the company behind the popular chatbot ChatGPT, is facing from writers, programmers, and news outlets, who claim OpenAI used their content to train its AI models without permission.
OpenAI says it did nothing wrong, arguing that training on this content counts as fair use under copyright law. The content owners disagree, saying US law doesn't let you simply take someone's work to train an AI without asking.
Recently, researchers from the University of Washington, the University of Copenhagen, and Stanford came up with a new way to check whether an AI model has memorized the data it was trained on. They tested this method on OpenAI's models and found some interesting results: these models can recall entire passages from books and news articles, suggesting they memorized portions of their training data rather than just learning general patterns from it.
The researchers used so-called "high-surprisal" words to probe two of OpenAI's models, GPT-4 and GPT-3.5. They took text from novels and New York Times articles, removed the high-surprisal words, the ones that are statistically unlikely given the surrounding text, and asked the AI to fill in the blanks. If the model could guess these hard-to-predict words correctly, it was a strong sign it had memorized that passage during training.
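The masking-and-guessing idea can be sketched in a few lines of Python. This is a minimal toy illustration, not the researchers' actual implementation: in the real study, surprisal comes from a language model's probability estimates and the guesses come from querying GPT-4, whereas here a hypothetical common-word list stands in for surprisal and the model's guesses are simply passed in as a list.

```python
import re

# Toy stand-in for surprisal estimates: words NOT on this list are
# treated as "high-surprisal". (Assumption: the real method scores
# surprisal with a reference language model, not a frequency list.)
COMMON_WORDS = {"the", "a", "an", "and", "of", "to", "in", "it", "was",
                "he", "she", "that", "his", "her", "with", "for", "on"}

def mask_high_surprisal(passage, max_masks=3):
    """Replace up to max_masks hard-to-predict words with [MASK],
    returning the masked text and the hidden ground-truth words."""
    words = passage.split()
    targets = [i for i, w in enumerate(words)
               if re.sub(r"\W", "", w).lower() not in COMMON_WORDS][:max_masks]
    answers = [re.sub(r"\W", "", words[i]) for i in targets]
    for i in targets:
        words[i] = "[MASK]"
    return " ".join(words), answers

def memorization_score(guesses, answers):
    """Fraction of masked words the model recovered exactly; a high
    score suggests the passage was memorized during training."""
    hits = sum(g.strip().lower() == a.lower() for g, a in zip(guesses, answers))
    return hits / len(answers) if answers else 0.0

masked, answers = mask_high_surprisal(
    "It was the best of times, it was the worst of times")
# masked  -> "It was the [MASK] of [MASK] it was the [MASK] of times"
# answers -> ["best", "times", "worst"]
```

In the real pipeline, `masked` would be sent to the model under test, and a consistently high `memorization_score` across many passages from a given book or article would count as evidence of memorization.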
In the tests, GPT-4 showed signs of having memorized portions of popular books and New York Times articles. That suggests copyrighted material made its way into the training data, and it means this method could become a practical tool for detecting copyright infringement in AI training.