
In short
- Hachette Book Group and Cengage Group on Thursday asked a federal court in California to let them intervene in a class action accusing Google of copyright infringement in AI training.
- The publishers claim that Google downloaded their books from pirate sites, including Z-Library and OceanofPDF, and then copied them repeatedly while training its models.
- Google’s C4 training dataset reportedly includes works from at least 28 piracy-linked websites, and the copyright symbol appears in it more than 200 million times.
Major book publishers Hachette Book Group and Cengage Group filed a motion Thursday to intervene in an existing class action lawsuit filed against Google last year, accusing the tech giant of orchestrating “historic copyright infringement” to build its Gemini platform.
The complaint filed in California federal court alleges that Google “chose to steal a vast amount of content from Plaintiffs and the class to train its AI model” rather than obtain proper licenses, and that it engaged in willful infringement “at every stage” of development.
The consolidated case was originally filed in 2023 by individual authors as a proposed copyright class action, accusing Google of copying books to train its generative AI models.
The publishers claim that Google downloaded books from pirate sites and then copied them repeatedly during the AI training process, first into the computer’s memory, then into formats that the AI systems could read, and again into training sets for each new model version.
Google’s C4 training dataset contains copyrighted works scraped from Z-Library, a pirate collection tied to more than 350 websites and web domains that authorities have seized, the lawsuit alleges.
The publishers noted how books were copied from b-ok.org, a Z-Library domain that now shows a federal seizure warrant, along with OceanofPDF and WeLib, “another prolific site with access to large amounts of unauthorized copyrighted content.”
The C4 dataset includes works from at least 28 sites identified by the U.S. government as markets for piracy and counterfeiting, the complaint said.
“The copyright symbol (©) appears more than 200 million times in the C4 dataset,” the complaint reads, alleging that Google excluded “policy notices” and “terms of use” warnings but included “vast categories of copyrighted works, pirated works, and works from behind paywalls.”
The publishers claim that Google copied works from subscription-based libraries such as Scribd.com, bypassing legitimate licensing agreements.
When confronted with this practice, nonprofit data provider Common Crawl is said to have responded with “a blame-the-victim mentality, proclaiming, ‘You shouldn’t have put your content on the Internet if you didn’t want it to be on the Internet.’”
The lawsuit claims that Gemini now produces products that “replace copyrighted works,” including verbatim reproductions, detailed summaries, and “counterfeit products that copy creative elements of original works.”
Declutter has contacted Google and the publishers’ counsel.
AI and publishers
Google is simultaneously defending itself against antitrust claims from Penske Media Corporation over its AI Overviews feature, with the tech giant claiming that displaying AI-generated summaries constitutes “legitimate product enhancement rather than anti-competitive conduct.”
The publishers are seeking statutory damages, injunctions to stop further infringement, and an injunction requiring Google to destroy all unauthorized copies of their works and to disclose which books were used to train Gemini.
The motion to intervene follows a series of copyright lawsuits filed by authors against AI companies in 2023, in which federal judges granted partial victories to Meta and Anthropic. The judges ruled that the companies’ use of copyrighted books to train their models constituted fair use under copyright law, but criticized the companies for maintaining permanent libraries of pirated books.

