Stack Overflow Joins Reddit in Charging Tech Giants for AI Training Data

by Cindy Tan

Published: April 21, 2023 at 12:00 pm Updated: May 29, 2024 at 9:58 am

by Victor Dey

Edited and fact-checked: April 21, 2023 at 12:00 pm

In Brief

Stack Overflow wants to charge tech giants that are using its data to develop LLMs for commercial purposes.

Both Stack Overflow and Reddit will continue licensing data for free to certain companies.

Stack Overflow is currently developing its own generative AI services.

Stack Overflow, a question-and-answer forum for programmers, has decided to charge tech giants for using its data to train AI systems and large language models (LLMs), Wired first reported.

This follows Reddit’s announcement on Tuesday that it will begin charging for access to its data API. In response to Google, OpenAI, Meta, and other companies that are using Reddit’s vast user-generated content for commercial AI projects without payment, Reddit’s CEO and co-founder, Steve Huffman, told The New York Times that such companies will now have to pay for using Reddit’s data to train their AI models, starting from June.

“Crawling Reddit, generating value, and not returning any of that value to our users is something we have a problem with,” Huffman told The Times. Developers who wish to create applications and bots that facilitate the use of Reddit, as well as researchers who want to study Reddit purely for academic or non-commercial purposes, will continue to have free access to Reddit’s API.

Digital and print media publishers are also not letting AI giants off the hook. The News/Media Alliance released its AI principles on Thursday, declaring that the unlicensed use of publishers' content by generative artificial intelligence (GAI) systems constitutes an infringement of intellectual property rights. The guidelines also specify that GAI developers must seek permission from publishers before using their content and that publishers should be entitled to negotiate fair compensation for the use of their IP.

Over 50 million questions and answers have been posted on Stack Overflow. Meta has been training its large language model LLaMA using data scraped from Stack Exchange, the network of Q&A sites that includes Stack Overflow.

Voicing his support for Reddit’s approach, Stack Overflow’s CEO Prashanth Chandrasekar told Wired:

“Community platforms that fuel LLMs absolutely should be compensated for their contributions so that companies like us can reinvest back into our communities to continue to make them thrive.”

Chandrasekar added that LLM developers using Stack Overflow’s data are violating the site’s terms of service. Users own the content they post, which is covered by a Creative Commons license requiring anyone who later uses the content to credit the source. He explained that AI companies “are unable to attribute each and every one of the community members whose questions and answers were used to train the model, thereby breaching the Creative Commons license.”

He also clarified that Stack Overflow would only charge companies developing big LLMs for commercial purposes. Additionally, Stack Overflow is working on its own generative AI applications as part of its broader AI strategy. In a previous blog post, Chandrasekar stated that he had tasked a dedicated team to “work full time on GenAI applications” that can be integrated into Stack Overflow’s public platform.

Both Reddit and Stack Overflow are still working out pricing for their data APIs, which will be revealed in the coming months.

About The Author

Cindy is a journalist at Metaverse Post, covering topics related to web3, NFT, metaverse and AI, with a focus on interviews with Web3 industry players. She has spoken to over 30 C-level execs and counting, bringing their valuable insights to readers. Originally from Singapore, Cindy is now based in Tbilisi, Georgia. She holds a Bachelor's degree in Communications & Media Studies from the University of South Australia and has a decade of experience in journalism and writing.
