Two bestselling novelists filed a suit against OpenAI in a San Francisco federal court on Wednesday, claiming in a proposed class action that the company used copyright-protected intellectual property to “train” its artificial intelligence chatbot.
Authors Mona Awad and Paul Tremblay claim that ChatGPT was trained in part by “ingesting” their novels without their consent. The generative AI is powered by two software programs known as large language models, which forgo a traditional programming method and instead extract massive amounts of text in order to produce natural and lifelike responses to user prompts.
When prompted, ChatGPT emitted extremely detailed summaries of Tremblay’s “The Cabin at the End of the World” and Awad’s “Bunny” and “13 Ways of Looking at a Fat Girl.” Both authors claim this is proof that their novels were used to train the chatbot, and the filing includes ChatGPT’s responses to prompts regarding their novels.
According to the suit, much of the material that OpenAI uses to train its generative chatbots comes from copyrighted works, including books written by Awad and Tremblay, “that were copied by OpenAI without consent, without credit, and without compensation.”
The lawsuit alleges that a variety of materials have been used to train the large language models, but that books have been “a key ingredient in training datasets for large language models because books offer the best examples of high-quality longform writing.”
In June 2018, OpenAI revealed that it trained GPT-1 using BookCorpus, which the suit describes as a “controversial dataset” assembled by artificial intelligence researchers in 2015, with a collection of “over 7,000 unique unpublished books from a variety of genres including Adventure, Fantasy, and Romance.
“They copied the books from a website called Smashwords.com that hosts unpublished novels that are available to readers at no cost. Those novels, however, are largely under copyright.”
According to the complaint, later iterations of the company’s large language models were trained using significantly larger quantities of copyright-protected books. In a July 2020 paper introducing GPT-3, the company revealed that 15% of the training data set came from “two internet-based books corpora” that OpenAI simply called “Books1” and “Books2.”
The suit estimates, based on numbers revealed in OpenAI’s paper about GPT-3, that Books1 would contain roughly 63,000 titles and Books2 approximately 294,000 titles.
“Because the OpenAI Language Models cannot function without the expressive information extracted from Plaintiffs’ works (and others) and retained inside them, the OpenAI Language Models are themselves infringing derivative works, made without Plaintiffs’ permission and in violation of their exclusive rights under the Copyright Act,” the suit reads.
Also on Wednesday, a broader class-action suit was filed by Clarkson, a public-interest law firm, on behalf of a dozen anonymous clients, accusing OpenAI of lifting private, sometimes identifying information from Internet users “without their informed consent or knowledge,” according to a report in Rolling Stone. Experts have predicted more suits are sure to follow as AI becomes more adept at using information from the web to generate new content.