OpenAI continues to rapidly expand its portfolio of partnerships with major publishers, securing the rights to use their content to train artificial intelligence models such as GPT-4 and DALL·E 2.
However, the company faces mounting legal challenges and ethical scrutiny over the very practices that enable systems like ChatGPT to perform so well. Ironically, the lawsuits that initially seemed like serious threats to OpenAI might end up helping it outpace competitors and lead the AI industry for years to come.
OpenAI Signs New Deals with Le Monde and Prisa Media to Use Their Content
Le Monde and Open AI have signed a partnership agreement on artificial intelligence.
✍️ Read our editorial by Louis Dreyfus, Le Monde's Chief Executive Officer, and Jérôme Fenoglio, Le Monde's director. https://t.co/FpTRIFJU6v
— Le Monde in English (@LeMonde_EN) March 14, 2024
On Wednesday, OpenAI announced new deals with the leading French newspaper Le Monde and Spanish media conglomerate Prisa Media, which owns publications like El País, Cinco Días, and others.
The partnerships will allow OpenAI to incorporate content from these well-established media outlets into ChatGPT and integrate their reporting and analysis in unique ways for users.
“We are committed to supporting journalism by using new AI technologies and improving opportunities for content creators. In collaboration with Le Monde and Prisa Media, our aim is to help ChatGPT users worldwide engage with the news in fresh, interactive, and meaningful ways,” said Brad Lightcap, OpenAI’s Chief Operating Officer, in the company’s official press release.
ChatGPT Will Provide References and Links to Articles from These Two Publishers
Over the coming months, ChatGPT will feature select summaries of stories from Le Monde and Prisa Media brands with attribution and links back to the original articles.
For the publishers, the deals represent new revenue streams and expanded audience reach through ChatGPT’s tens of millions of users worldwide. They also aim to ensure that authoritative journalism maintains a strong presence amidst the rise of AI-generated content.
“At the moment we are celebrating the 80th anniversary of Le Monde, this partnership allows us to expand our reach and uphold our commitment to providing accurate, verified news stories at scale,” stated Louis Dreyfus, CEO of Le Monde. “Collaborating with OpenAI ensures our content can be accessed and appreciated by a broader audience.”
Carlos Núñez, Chairman and CEO of Prisa Media, echoed these sentiments: “Leveraging ChatGPT’s capabilities allows us to present our in-depth, quality journalism in novel ways, reaching individuals who seek credible and independent content.”
The deals build on OpenAI’s previous partnerships with The Associated Press, German publisher Axel Springer, and others to incorporate their archives and real-time reporting into its systems. However, the strategy has proven increasingly controversial and exposes OpenAI to legal challenges from other media outlets that also allege copyright violations.
OpenAI Considers New York Times Lawsuit “Without Merit”
Most notably, The New York Times sued OpenAI and Microsoft in late December over claims that their AI models were trained using the work of its journalists without permission or licensing. The lawsuit seeks to block OpenAI from using any content produced by The Times and potentially recover billions in damages.
If The New York Times wins, the lawsuit would set a landmark precedent forcing AI makers to own or license any and all copyrighted material they train their models on. Because these kinds of models require gargantuan amounts of data to train, licensing could become prohibitively expensive for all but the richest companies.
OpenAI has forcefully pushed back, arguing that its models simply learn general knowledge from aggregated training data in the same way that humans do through reading and education. The company maintains that occasional instances of verbatim text appearing in outputs, known as “regurgitation”, are unintentional bugs that are being actively addressed.
“Memorization is a rare failure of the learning process that we are continually making progress on, but it’s more common when particular content appears more than once in training data,” OpenAI stated in a blog post discussing the legal proceedings. “Because models learn from the enormous aggregate of human knowledge, any one sector—including news—is a tiny slice of overall training data.”
However, as demonstrated in The Times’ lawsuit exhibits, ChatGPT can, at times, reproduce entire paragraphs from copyrighted articles nearly word-for-word. The Times alleges that this shows clear evidence of direct copying and infringement of its proprietary content.
OpenAI has sought to get ahead of such disputes by allowing any publisher or website to opt out of having its content crawled and included in training data. The Times was among those utilizing the opt-out by August 2023, which OpenAI says should have prevented the alleged infringement from occurring with its latest models. However, if the court rules that training models using copyrighted material is copyright infringement, OpenAI’s opt-out program won’t matter much.
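In practice, the opt-out works through the standard robots.txt mechanism: OpenAI publishes the user-agent token for its crawler, GPTBot, and sites that disallow it are excluded from future training crawls. As a minimal sketch of how such a rule is interpreted (the sample rules and URLs below are illustrative, not taken from any real site), Python's standard-library robots.txt parser can be used:

```python
# Sketch: how a robots.txt opt-out of OpenAI's GPTBot crawler is interpreted.
# "GPTBot" is the user-agent token OpenAI documents for its training crawler;
# the rules and example.com URLs here are purely illustrative.
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# GPTBot is barred from the entire site...
print(parser.can_fetch("GPTBot", "https://example.com/article"))     # False
# ...while other crawlers remain allowed.
print(parser.can_fetch("Googlebot", "https://example.com/article"))  # True
```

A site that wants to opt out of training entirely serves exactly this kind of blanket `Disallow: /` rule for GPTBot while leaving other crawlers untouched.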
The company has taken steps to limit potential regurgitation issues, including shutting down a real-time news content feature in ChatGPT last year after learning that the output was too similar to the sources it was trained on. The firm insists that memorization goes against the intent of its technology to apply learned knowledge to novel situations.
New Legislation Could Impact OpenAI’s Ability to Train New Models
Lawsuits are just one element of the heated debates around how AI companies obtain their training data. Most have relied heavily on an interpretation of the US copyright fair use doctrine that deems ingesting publicly available online material for machine learning permissible, even without licenses.
This approach faces increasing scrutiny from creators and intellectual property owners who argue that their works are being exploited for commercial gain without compensation or consent. OpenAI whistleblowers have even questioned whether such unaccredited appropriation meets ethical AI principles around consent and justice.
In response, major AI firms have started pursuing licensing deals to supplement their training data, though the costs remain largely opaque. The Information reported in January that OpenAI offered publishers between $1 million and $5 million per year, while Apple pitched media outlets on deals worth around $50 million over several years.
Striking partnerships lend credibility and PR value, but total payouts could quickly skyrocket if deals proliferate across industries encompassing books, music, TV, movies, and other copyrighted works used to feed these insatiable AI models. This raises concerns about erecting barriers to entry for smaller players and academics who are unable to afford similar arrangements.
OpenAI could actually end up better off in the long run if it loses its copyright lawsuits. It would certainly be hurt in the short term as it would likely be hit with a fine, and it might have to pause the use of its advanced models that were trained on copyrighted material. However, OpenAI already has a large catalogue of valuable licensing deals that could put it ahead of the game in training the next batch of AI models.
Smaller companies would likely get priced out with this precedent, allowing companies like OpenAI and Google, which have access to large amounts of content already, to thrive with diminished competition.
There are proposals for “safe harbor” regulations to permit AI training across copyrighted works without licensing fees, though opposition remains from creative industries fearing further disruption to their business models. The European Union’s newly approved AI Act enshrines transparency requirements for models trained on copyrighted data, making it the first and only major piece of active AI legislation in the world.
OpenAI Justifies the Use of Copyrighted Materials by Comparing it to Human Education
As OpenAI proclaims its publishing partnerships as mutually beneficial collaborations promoting quality journalism, the ethical debate continues over how to ensure an equitable AI ecosystem. The company maintains that its intent is simply to enable its systems to better comprehend and analyze the world’s information for the benefit of users.
“Just as humans obtain a broad education to learn how to solve new problems, we want our AI models to observe the range of the world’s information, including from every language, culture, and industry,” OpenAI stated in justifying its practices.
With deals like those announced yesterday, major publishers appear willing to embrace that vision and work within the AI paradigm, seeking a sustainable path alongside the rapidly advancing technology.
However, AI’s advance to new levels of capability is threatened by the growing number of publishers actively refusing to go along. This could ultimately force a reckoning over AI’s relationship with intellectual property and open access to data.