Authors sue OpenAI for copyright infringement, claim ChatGPT unlawfully ‘ingested’ their books

Jul. 06, 2023

OpenAI, a leading artificial intelligence research laboratory, is facing a class-action complaint filed by authors Paul Tremblay and Mona Awad in a California federal court. The authors allege that OpenAI violated copyright law by training its language model, ChatGPT, using their books without obtaining permission.

According to the complaint, filed in the U.S. District Court in San Francisco, ChatGPT is trained by “ingesting” massive amounts of text from a variety of sources, which together form its training dataset. Tremblay and Awad, both of whom live in Massachusetts, say they never consented to the use of their copyrighted books as training material for ChatGPT. The lawsuit contends that OpenAI nevertheless trained the model on those works.

Tremblay, known for his book “The Cabin at the End of the World,” holds registered copyrights for several of his books. Awad owns registered copyrights for books including “13 Ways of Looking at a Fat Girl” and “Bunny.” The authors argue that, when prompted, ChatGPT generates summaries of their copyrighted works, which they say would be possible only if the model had been trained on those works. The complaint further claims that OpenAI profits commercially, through ChatGPT, from the use of their copyrighted works and those of other class members.

The complaint references a June 2018 paper published by OpenAI, in which the company revealed that its GPT-1 tool had been trained on BookCorpus, a collection of over 7,000 unpublished books from various genres. The lawsuit highlights the value of such datasets, as they allow generative models to learn from long stretches of contiguous text and condition on long-range information. Numerous large language models, including those developed by OpenAI, Google, Amazon, and others, have been trained using BookCorpus, according to the complaint. Intellectual property law expert Andres Guadamuz commented that this lawsuit is the first of its kind specifically targeting OpenAI’s use of copyrighted materials.

Attorneys Joseph Saveri and Matthew Butterick, representing the authors, emphasized that books are an ideal source for training large language models because they offer high-quality, well-edited, long-form prose, which makes them the “gold standard of idea storage.” The attorneys argue that OpenAI breached its duties by collecting, maintaining, and controlling the authors’ copyrighted works without authorization and by designing and maintaining systems, including ChatGPT, that were trained on the infringed works.

As the legal battle unfolds, this case could have significant implications for the use of copyrighted material in training AI models and may shape future practices in the field of artificial intelligence.
