AI and the Reddit goldmine: protecting user-generated content

Published May 24, 2024

As artificial intelligence (AI) continues to evolve and becomes enmeshed in more aspects of our lives, the legal complexities surrounding its development and use grow with it. One critical area of concern is the use of content from social media platforms by AI companies. With social media giants like Reddit considering legal action against AI firms for unauthorized use of user-generated content, it’s useful to understand the potential grounds for such claims, the unique challenges posed by user-generated content, and the broader implications for the industry.

Reddit’s claim

As a leading forum platform, Reddit is a goldmine for training AI models that require extensive datasets on a broad range of topics. Moreover, training large language models on user-generated content helps them converse in more ‘natural’ ways and in a variety of tones, which is crucial to improving user experiences.

However, Reddit is not keen on AI companies’ ungoverned use of its data. In a recent statement by its chief operating officer, Jen Wong, Reddit clarified it would consider taking legal action against any technology company looking to extract data from its website without permission for commercial purposes.

If Reddit is successful in any claim against an AI company, it could set a precedent that other social media companies may consider following. Platforms like Facebook, Twitter, and Instagram, which also host vast quantities of user-generated content, may see this as an opportunity to protect their data assets and revenue streams from unauthorized use by AI firms.

Reddit is not the first to contemplate copyright claims against AI providers. Other notable claims already brought include:

The New York Times v. OpenAI and Microsoft: prior to agreeing on a licensing deal, the New York Times sued OpenAI and Microsoft, alleging that their AI models, such as ChatGPT, used copyrighted material from the newspaper without authorization. The focus is on whether outputs from the AI are infringing derivative works. The court dismissed most of the grounds but allowed claims such as unfair competition to proceed subject to further proof of direct copyright infringement.
Getty Images v. Stability AI: in the UK, Getty Images is pursuing legal action against Stability AI, alleging copyright infringement for using its images to train the Stable Diffusion model. The case has been allowed to go to trial, where it will address complex issues such as whether AI-generated images and software can be considered “articles” under UK copyright law. The court has also permitted amendments to include claims relating to the image-to-image feature of Stable Diffusion.

What are AI IP infringement claims based on?

These claims are highly dependent on the specific facts of each case. However, two key grounds are likely to recur in most claims:

Copyright infringement: user-generated content on platforms such as Reddit is often protected by copyright. AI companies that scrape and utilize this content without authorization may be infringing copyright. Rightsholders (including the original content creators and the platform itself) could claim that their exclusive rights to reproduce, distribute, and create derivative works from the content have been violated.
Breach of terms of service: social media platforms have terms of service (ToS) agreements that users and third parties must adhere to. If AI companies are found to be breaching these ToS, particularly clauses restricting data scraping and use of content, they could face legal action for breach of contract.

That said, social media content differs from other copyright-protected works because it is user-generated. This raises specific legal challenges, including determining who owns the content and potential compliance issues with data protection laws, such as the UK GDPR, when personal data is extracted from it.

Is licensing the way forward?

To mitigate legal risks and ensure compliance with copyright laws, many AI companies are increasingly turning to licensing agreements. These deals not only ensure that AI developers have legal access to valuable data but also provide compensation to content creators and platforms. Notable examples include:

Google and Reddit: Google has entered into a licensing arrangement with Reddit — reportedly worth $60 million annually — to use Reddit’s extensive user-generated content for training its AI models.
OpenAI and various publishers: OpenAI has been actively negotiating and securing licensing deals with several major publishers for the right to use content to train its AI models. Deals have been reported with publishers like the New York Times and the Financial Times. These arrangements allow OpenAI to legally use the copyrighted material from these publishers to improve the capabilities of its AI models.

As AI technology advances, the legal landscape around its use of social media content will undoubtedly become increasingly complex. Social media companies are likely to pursue legal solutions to protect their content, especially against commercial AI firms. Understanding the grounds for these claims, the challenges posed by user-generated content, the emerging trend towards licensing agreements, and the distinctions between non-profit and commercial use will be crucial for navigating these legal waters.
