The stolen words

Image by Dariusz Sankowski from Pixabay


Almost 190,000 books from illegal sources were used to train numerous AI language models. Many authors are outraged, but there is little they can do.

Large language models like OpenAI's GPT-3, Meta's LLaMA and Google's Bard are data hogs. Before they can form grammatically correct and coherent sentences, they must be fed with so-called training data. Most of this data is collected from the public Internet, but not all of it: some of the data sets used to train well-known language models are demonstrably pirated.

At the heart of the current debate is a data set called Books3. It includes 197,000 editions of mostly Western literature, from Shakespeare to Stephen King, from cookbooks to the philosophy of Sartre, and from Nobel Prize winner Günter Grass to US comedian Sarah Silverman. Silverman, like numerous other authors, accuses companies like Meta and OpenAI of using their copyrighted works without permission.

The US magazine The Atlantic published a searchable database on Monday containing around 183,000 of the titles included in Books3. The journalists searched the data set, a gigantic text file of approximately 100 gigabytes, for ISBNs. This enabled them to find out exactly which works were included. It's safe to assume that any artificial intelligence that uses Books3 in its training data has been fed these books.
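How such an ISBN search could work in principle can be sketched in a few lines of Python. This is only an illustration and not The Atlantic's actual method, which the article does not describe in detail. The function names (`find_isbns`, `scan_lines`) and the sample text are hypothetical. The sketch looks for 10- or 13-digit candidates and keeps only those whose check digit is valid, which filters out most random digit sequences.

```python
import re

# Candidate pattern: an optional 978/979 prefix, nine digits, and a final
# check character (digit, or X for ISBN-10), with optional hyphens or spaces.
ISBN_CANDIDATE = re.compile(r"\b(?:97[89][- ]?)?(?:\d[- ]?){9}[\dXx]\b")


def is_valid_isbn10(digits: str) -> bool:
    """Check the ISBN-10 checksum: weighted sum (10..1) divisible by 11."""
    if len(digits) != 10:
        return False
    total = 0
    for position, char in enumerate(digits):
        if char in "Xx":
            if position != 9:  # X is only allowed as the check character
                return False
            value = 10
        elif char.isdigit():
            value = int(char)
        else:
            return False
        total += (10 - position) * value
    return total % 11 == 0


def is_valid_isbn13(digits: str) -> bool:
    """Check the ISBN-13 checksum: alternating weights 1 and 3, divisible by 10."""
    if len(digits) != 13 or not digits.isdigit():
        return False
    total = sum(
        int(char) * (1 if position % 2 == 0 else 3)
        for position, char in enumerate(digits)
    )
    return total % 10 == 0


def find_isbns(text: str) -> set[str]:
    """Return all checksum-valid ISBNs in the text, without separators."""
    found = set()
    for match in ISBN_CANDIDATE.finditer(text):
        digits = re.sub(r"[- ]", "", match.group())
        if is_valid_isbn10(digits) or is_valid_isbn13(digits):
            found.add(digits.upper())
    return found


def scan_lines(lines) -> set[str]:
    """Scan an iterable of lines (for example an open file) one line at a time,
    so that a very large text file never has to fit into memory at once."""
    found = set()
    for line in lines:
        found |= find_isbns(line)
    return found


# Hypothetical sample text, not taken from Books3.
sample = (
    "ISBN 978-0-306-40615-7 appears here.\n"
    "A phone-like number 1234567890 is not an ISBN.\n"
    "Older edition: ISBN 0-306-40615-2.\n"
)
isbns = scan_lines(sample.splitlines())
```

Passing an open file object to `scan_lines` would process it line by line. An ISBN that happens to be split across two lines would be missed by this simple approach.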

One of these language models is LLaMA from Facebook parent company Meta. This is evident from the official documentation. The language model GPT-J from the non-profit research group EleutherAI and BloombergGPT from the media company of the same name also use the data set. However, it is unclear which training data GPT-4 is based on; OpenAI is silent on the matter “due to competition and security implications.” In the case of GPT-3, which currently forms the basis of the free version of the chatbot ChatGPT, the data sets Books1 and Books2 were included, which presumably come from so-called shadow libraries, i.e. from non-public and often illegal collections.

The background of Books3

Books3 also comes from such a shadow library. In October 2020, a few weeks after OpenAI first presented GPT-3 to the public, AI researcher Shawn Presser began compiling the data set. As he wrote at the time on

Presser did not collect the almost 200,000 books himself, but instead used Bibliotik, a BitTorrent tracker (like the much better known Pirate Bay) that specializes in e-books. There, users share works with each other, most of which are protected by copyright. In other words, they distribute pirated copies. Presser's approach was not particularly well received in the tracker scene because he mentioned Bibliotik by name as the source for his data set.

In December 2020, the EleutherAI research group incorporated Books3 into an even larger training data set called The Pile. Even though Books3 ultimately represents only a small portion of The Pile and of the training data as a whole (in the case of LLaMA, it is only 4.5 percent), there is no denying that some of the leading AI language models use data from illegal sources, or, to put it bluntly, stolen words.

Unresolved legal questions regarding AI language models

For authors who are suing, like Sarah Silverman, and for their lawyers, the matter is therefore clear. As they write, companies like Meta and OpenAI are violating copyright law by using data from unlawful sources to train their algorithms without the consent of the authors.

In fact, the matter is more complex. For example, companies could invoke the fair use doctrine that applies in the USA: language models do not copy the content of the training data, but instead create new content in the form of answers to user questions. Since these answers are not in direct competition with the original books, the argument goes, no copyright claims can be asserted.

It is also unclear whether it even matters that the Books3 data comes from an illegal shadow library. "If the source is unauthorized, that can be a factor," US law professor Jason Schultz tells The Atlantic. As Wired magazine writes, there is "no precedent in the USA that makes fair use directly dependent on whether the copyrighted works were acquired legally or not." Once again it is clear that numerous copyright questions in the context of artificial intelligence have not yet been conclusively answered.

The AI copyright debate is spreading

Regardless, the case of Books3 also exposes the power dynamics in the AI industry. According to Shawn Presser, only large companies like OpenAI could afford such data sets if they had to pay for them or obtain licenses. Smaller companies and start-ups would not have these opportunities. That could lead to a concentration of market power, he told Wired. In addition, anyone who has already fed their algorithms with Books3 cannot subsequently remove the data and may therefore have an advantage over new language models that do without Books3 and other data sets that are questionable under copyright law.

Meanwhile, the debate is gaining momentum. Last week, Shawn Presser received mail from Silverman's lawyers, as he writes on X. They advise him to preserve emails and documents that may be relevant to the case. The Danish anti-piracy organization Rights Alliance is currently stepping up its action against platforms that offer the Books3 data set for download. And the US Authors Guild has already collected more than 10,000 signatures from authors in a petition calling on AI companies to obtain consent from authors and pay them fairly when they use their works.

However, some authors are resigned to the development. The best-known example is Stephen King, whose words are the most frequently included in Books3 after those of Shakespeare. Creative artificial intelligence exerts a “terrible fascination” on him, writes King. Would he therefore, if he could, forbid a machine from learning from his words? "Then I might as well be a Luddite trying to stop industrial progress by smashing a loom."
