Generative AI models such as ChatGPT have captured the imaginations of millions of people, offering a glimpse of what an AI-assisted future might look like.
The new technology also brings up novel copyright issues. For example, several rightsholders are worried that their work is being used to train and exploit AI without any form of compensation.
These concerns have triggered numerous AI-related lawsuits in the United States, many of which target OpenAI. Just a few days ago, the Authors Guild and several prominent members, including George R.R. Martin and John Grisham, joined the legal action.
The allegations in their complaint are similar to others aired over the past few months. The first case was filed in a Californian federal court by authors Paul Tremblay and Mona Awad, who were later joined by writer/comedian Sarah Silverman and other authors in a similar suit.
According to the plaintiffs, large language models shouldn’t be permitted to train on every piece of text their developers come across online. They accuse OpenAI of using books as training data without permission, relying on datasets that were sourced from pirate sites.
The complaints mention the controversial Books2 and Books3 datasets that are believed to be sourced from shadow libraries such as LibGen, Z-Library, Sci-Hub, and Bibliotik.
OpenAI’s Motion to Dismiss
In August, OpenAI responded to these complaints, asking a California federal court to dismiss nearly all claims. According to the tech company, there are no viable claims for vicarious copyright infringement, DMCA violation, unfair competition, and unjust enrichment.
The only claim that wasn’t contested by OpenAI is direct copyright infringement, which the company plans to address at a later stage.
Among its arguments to dismiss the claims, the AI company cited fair use. It argued that the use of large amounts of copyrighted texts could be seen as ‘fair’ because it helps to facilitate progress and innovation.
“Numerous courts have applied the fair use doctrine to strike that balance, recognizing that the use of copyrighted materials by innovators in transformative ways does not violate copyright,” OpenAI wrote.
The authors responded to those arguments this week. While the ‘Tremblay’ and ‘Silverman’ cases are not yet officially merged, the plaintiffs in both submitted identical opposition briefs, asking the court to deny OpenAI’s motion to dismiss the claims.
According to the authors, it is “telling” that OpenAI makes no attempt to dismiss the direct copyright infringement claim. They argue that this issue is best suited to be discussed at trial, and that the same applies to the other claims.
“Nevertheless, OpenAI still tries to leverage its motion to pre-litigate issues it thinks will carry the day in the future. This is improper on a motion to dismiss and should be disregarded,” they write.
The Fair Use Urban Legend
The authors note that OpenAI’s detailed interpretation of fair use in an AI context is irrelevant, at least at this stage. Fair use is a defense that is typically not used to dismiss copyright infringement claims before they’re properly argued.
“Fair use, of course, is an important—yet limited—feature of U.S. copyright law. Importantly, however, fair use is an affirmative defense, and is ‘inappropriate to resolve on a motion to dismiss.’ Given that, OpenAI’s arguments regarding fair use are wholly misplaced,” the authors write.
To bolster their argument, the authors refer to a recent ruling in a Thomson Reuters lawsuit, which also deals with AI-related copyright claims. In that case, the court rejected the fair use argument and referred the matter to trial.
In addition, the plaintiffs note that using copyrighted works to train AI hasn’t been established as fair use. The idea that this question is already settled is an urban legend, they say.
“Contrary to widespread urban legend in the AI industry, no U.S. court has squarely ruled on the question of whether training an AI model with copyrighted expression is fair use,” plaintiffs write.
Piracy as a Source
The authors also double down on their piracy allegations and mention three types of copyright infringement. In addition to the use of copyrighted works as training data, they argue that the language models themselves are infringing derivative works, and that the same applies to the output of the models.
These accusations and claims largely rely on the suspicion that OpenAI used hundreds of thousands of copyrighted books as training material. While the company never mentioned its source, the authors believe that the models were trained on pirated books from shadow libraries such as LibGen, Z-Library, Sci-Hub, and Bibliotik.
“The book datasets used by OpenAI for training language models included thousands of copyrighted books, including books written by Plaintiffs,” they write.
“Given the size of these book datasets, the most likely source of these books is one or more of the notorious ‘shadow library’ websites that host massive numbers of pirated texts that are not in the public domain.”
The direct and vicarious copyright infringement claims rest on this suspicion, and the same is true for the alleged DMCA violations. The authors hope to prove their allegations at trial and ask the court not to dismiss any claims prematurely.