Key Takeaways

Users on Stack Overflow are upset about their content being used to train AI models without explicit permission.

Some users have been deleting their posts in protest, causing functional damage to the site's usability.

Stack Overflow's terms and conditions prevent users from deleting posts, leading to backlash and discussions about data ownership.

It's been less than a month since the ubiquitous question-and-answer site Stack Overflow announced that it will partner with OpenAI to provide training data for upcoming models, yet the decision has thrown the site into civil war over the use of users' data.

Stack Overflow has long been a bastion of help and support for programmers, developers and sysadmins, while its parent network, Stack Exchange, hosts sister sites on topics ranging from math to philosophy. Stack Overflow proudly boasts that it's the "world's largest development community, with over 59 million questions and answers", but the decision to partner with OpenAI has garnered significant backlash. The partnership follows the company's similar deal with Google for Gemini earlier this year.

Stack Overflow is using developers' content to train AI

And some contributors aren't happy about it

Source: Stack Overflow

The user uproar on Stack Overflow has two causes. First, users are angry that their content is being used to train AI models without their explicit permission. Second, they are frustrated by Stack Overflow's response to protests against this policy.

Stack Overflow has been running for 15 years, and uses a system of user-generated content and up/down voting on both posts and responses, similar to Reddit's. This means that most of the content on the site is generated, sorted and, in effect, moderated by volunteer members of the community.

Stack Overflow is now fighting its users

Users have been deleting their posts

In response to this announcement, users have been deleting their posts across Stack Overflow, unwilling to allow content they've created to be used to train AI models. This has caused a significant problem for Stack Overflow, even though its terms of service allow user content to be used for AI training. Stack Overflow has responded by temporarily suspending the accounts of users who have deleted their posts, and restoring the posts to public display.

Users have also found that editing their posts is ineffective. Moderators on the site have been reverting answers that were edited to contain incorrect or false information, and suspending the accounts of the users responsible.

Stack Overflow's design makes it vulnerable to this kind of protest

The nature of Stack Overflow's content makes it particularly damaging when high-reputation users delete their posts. As the site discourages duplicate posts, answers on popular posts are effectively a single source of truth on the site. Each thread on the site is posted as a question, and its creator can select an "accepted" response from the replies. This reply is then promoted to the top of the question post. It isn't necessarily the most up-voted answer, and other users browsing the question can still see the remaining answers.

An example Stack Overflow question post.

If a user deletes their posts, any accepted or highly up-voted answers they've written become inaccessible, damaging the usability of the whole site. A user looking for an answer to a common question on Stack Overflow will find an easily discoverable question post with few answers and no clear accepted answer. The effect is that users with many highly-valued answers hold the power to do significant functional damage to the wider site.
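To make the mechanics concrete, here is a minimal Python sketch of the behavior described above. It is not Stack Overflow's actual implementation: the Answer and Question classes, the delete_posts_by helper and the example usernames and scores are all hypothetical, and the ordering rule (accepted answer first, then by score) is simply our reading of how the article describes the site.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Answer:
    author: str
    score: int
    accepted: bool = False
    deleted: bool = False


@dataclass
class Question:
    title: str
    answers: List[Answer] = field(default_factory=list)

    def visible_answers(self) -> List[Answer]:
        """Answers a visitor would see: accepted first, then by score."""
        live = [a for a in self.answers if not a.deleted]
        # False sorts before True, so accepted answers come first.
        return sorted(live, key=lambda a: (not a.accepted, -a.score))

    def accepted_answer(self) -> Optional[Answer]:
        """The accepted answer, if it still exists."""
        for a in self.answers:
            if a.accepted and not a.deleted:
                return a
        return None


def delete_posts_by(question: Question, author: str) -> None:
    """Hypothetical helper: mark every answer by one author as deleted."""
    for a in question.answers:
        if a.author == author:
            a.deleted = True


# Made-up example data for illustration only.
q = Question(
    "How do I undo my last commit?",
    [
        Answer("alice", score=120),
        Answer("bob", score=45, accepted=True),
        Answer("carol", score=3),
    ],
)

before = [a.author for a in q.visible_answers()]
delete_posts_by(q, "bob")
delete_posts_by(q, "alice")
after = [a.author for a in q.visible_answers()]
```

In this toy example the question starts with three answers, one of them accepted. Once the two strongest contributors pull their posts, the question page is still there but only the low-scoring answer remains, and there is no accepted answer at all.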

Stack Overflow's Terms and Conditions prevent users deleting posts

In response to this, Stack Overflow has been suspending or even banning the accounts of users who have been deleting their posts. A close reading of its terms and conditions shows that they do not allow users to delete their posts after posting, and that the company retains a perpetual, irrevocable license to all community-created content. Stack Overflow's terms and conditions say the following (we've removed some definitions for brevity):

You agree that any and all content [...] is perpetually and irrevocably licensed to Stack Overflow on a worldwide, royalty-free, non-exclusive basis

This continues with:

you grant Stack Overflow the perpetual and irrevocable right and license to access, use, process, copy, distribute, export, display and to commercially exploit such Subscriber Content, even if such Subscriber Content has been contributed and subsequently removed by you

Helpfully summarized later on with the following quote:

This means that you cannot revoke permission for Stack Overflow to publish, distribute, store and use such content and to allow others to have derivative rights to publish, distribute, store and use such content.

This would seemingly give Stack Overflow legal permission to retain a user's data once it's been added to the pool of community knowledge, but this clearly doesn't sit right with users. While it hasn't been challenged yet, this might also fly in the face of "Right to be forgotten" legislation (GDPR) in the EU, among other legal hurdles.

Users are divided over Stack Overflow

OpenAI has opened a can of worms

Source: Stack Overflow

Delving into discussion online about OpenAI and Stack Overflow's partnership, there's plenty to unpack. The level of hostility towards Stack Overflow varies. Some users see their answers as having been posted online without conditions, effectively free for all to use, and so regard Stack Overflow granting OpenAI access to that data as no great betrayal. These users might argue that they've posted their answers for the betterment of everyone's knowledge, and don't place any conditions on their use, similar to a highly permissive open source license.

Other users are irked that Stack Overflow is providing access to an open resource to a company using it to build closed-source products, which won't necessarily benefit all users (and may even replace the site the answers were originally posted on). Despite OpenAI's stated ambition, there is no guarantee that Stack Overflow will remain freely accessible in perpetuity, or that access to any AIs trained on this data will be free to the users who contributed to it.

Reddit has been in similar trouble

We'd be remiss not to mention that Reddit has been in similar trouble with users over the years, and is vulnerable to some of the same problems: when highly-upvoted users delete essential replies or answers in a thread, the whole discussion becomes unreadable. Reddit earned a similar backlash after also agreeing to license users' data to Google.

Source: Reuters

AI is coming for all of our jobs, anyway

Whatever the future holds for Stack Overflow, one thing is clear. AI is marching on, and is slowly working its way through many of the training data licensing issues exposed by the explosive growth of ChatGPT. While many AI products might be accused of being half-baked now, that's surely going to change.

However this dispute ends, Stack Overflow might well be in dangerous territory as AI continues to improve. Why bother wading through ten years of out-of-date jQuery questions and answers when you can just ask ChatGPT? But this raises another question - how will we train the AIs of the future?