Meta Admits Use of ‘Pirated’ Book Dataset to Train AI


With AI initiatives developing at a rapid pace, copyright holders are on high alert. In addition to legislation, several ongoing lawsuits will help to define what's allowed and what isn't. Responding to a lawsuit from several authors, Meta now admits that it used portions of the Books3 dataset to train its Llama models. This dataset includes many pirated books.

In recent months, rightsholders of all ilks have filed lawsuits against companies that develop AI models.

The list includes record labels, individual authors, visual artists, and more recently the New York Times. These rightsholders all object to the presumed use of their work without proper compensation.

Several of the lawsuits filed by book authors include a piracy component as well. The cases allege that tech companies, including Meta and OpenAI, used the controversial Books3 dataset to train their models.

The Books3 dataset has a clear piracy angle. It was created in 2020 by AI researcher Shawn Presser, who scraped the library of ‘pirate’ site Bibliotik. This book archive was publicly hosted by digital archiving collective ‘The Eye’ at the time, alongside various other data sources.

Bibliotik and other sources previously hosted at The Eye


The general vision was that the plaintext collection of more than 195,000 books, which is nearly 37GB in size, could help AI enthusiasts build better models, which would spur innovation.

AI Boom Triggers Copyright Troubles

Presser wasn’t wrong, but the dataset didn’t just help garage AI startups. Several of the world’s largest tech companies discovered it too and used it to improve their own language models.

For years, Books3 continued to be freely and widely available, aiding AI researchers and enthusiasts around the world. However, when the AI boom reached the mainstream last year, book authors and publishers took notice, then took retaliatory action.

For example, Danish anti-piracy group Rights Alliance demanded that The Eye remove its copy of Books3, which it did. The dataset also disappeared from the website of AI company Hugging Face, which cited reported copyright infringement, while others considered their options.

As previously reported by Wired, Bloomberg informed Rights Alliance that it doesn’t plan to train future versions of its BloombergGPT model using Books3, and other companies likely made similar decisions behind closed doors.

Meta Admits Books3 Use

These are noteworthy developments but not all complaints can be resolved with promises. Several lawsuits against OpenAI and Meta remain ongoing, accusing the companies of using the Books3 dataset to train their models.

While OpenAI and Meta are very cautious about discussing the subject in public, Meta provided more context in a California federal court this week.

Responding to a lawsuit from writer/comedian Sarah Silverman, author Richard Kadrey, and other rights holders, the tech giant admits that “portions of Books3” were used to train the Llama AI model before its public release.

“Meta admits that it used portions of the Books3 dataset, among many other materials, to train Llama 1 and Llama 2,” Meta writes in its answer.


This admission doesn’t come as a massive surprise as several sources, including research papers, basically reached the same conclusion. While the use of Books3 is not contested by Meta, the question remains whether the company was in the wrong when it did so.

Meta Denies Copyright Infringement

Meta’s answer admits the use of Books3 but denies various other allegations and claims. For example, the authors alleged that Meta trained its AI on copyrighted works without permission. The answer doesn’t directly deny this but notes that consent or compensation is not necessarily required.

“To the extent a response is deemed required, Meta denies that its use of copyrighted works to train Llama required consent, credit, or compensation,” Meta writes.

The authors further stated that, insofar as their books appear in the Books3 database, they are referred to as “infringed works”. This prompted Meta to respond with yet another denial. “Meta denies that it infringed Plaintiffs’ alleged copyrights,” the company writes.

Fair Use

Meta’s response doesn’t provide much additional detail and the full defense will be revealed as the case progresses. It is clear, however, that the company plans to rely on a fair use defense, at least in part.

“To the extent that Meta made any unauthorized copies of any Plaintiffs’ registered copyrighted works, such copies constitute fair use under 17 U.S.C. § 107,” Meta notes.

The fair use angle is expected to be a key part of this and other AI lawsuits. It applies not only to ‘pirate’ sources but also to content that’s published through official channels yet used without explicit permission.

These legal battles are still in their early stages, but may ultimately find their way to the Supreme Court if needed. AI companies have stressed that progress will be hampered if rules and regulations are too strict.

Earlier this week, OpenAI mentioned that fair use is both necessary and critical to building competitive AI models, noting that news organizations can opt out if they wish. Needless to say, this option didn’t previously exist, certainly not for the Books3 database.

We presume that when Presser created Books3, he never envisioned that the dataset would be at the center of landmark lawsuits that could define the future of AI. However, the stakes have changed, and the well-intended ‘archiving’ effort is now part of a major copyright clash.

A copy of Meta’s response to the authors’ first consolidated amended complaint is available here (pdf).

