Date: 2024-12-15

Bit by Bit is a weekly column covering technical advances across multiple spaces. My name is Adam Conway, and I've been covering tech and following the cutting edge for a decade. If there's something you're interested in and would like to see covered, you can reach out to me at adam@xda-developers.com.

If you've ever made a Google search and seen a message at the bottom of your screen that says something along the lines of "Some results may have been removed under data protection law," then you've had first-hand experience of the right to be forgotten. It's a philosophy that suggests that private individuals should be able to remove information about themselves from search engines and other sites. It's primarily a European Union right at present, but plenty of countries outside of the EU have their own rules on it too, and Google has extended the right to request removal of data to U.S. citizens.

However, there's an interesting quandary when it comes to the right to be forgotten and how it pertains to a large language model like ChatGPT. It's not clear whether it actually applies, but a recent discovery with ChatGPT suggests that OpenAI is attempting to find ways to respect those rights in case the company is later forced to abide by such requests. If that is the case, then the company might find itself entering some very treacherous territory in the future.

What happened?

"David Mayer" happened

It all started with the name "David Mayer." Users noticed that ChatGPT flat-out refused to answer questions about the name or respond to it, and trying to force it to say the name would result in an error. After a while, other names that faced the same problem started to pop up, as reported by Ars Technica. These include, but are not limited to:

Brian Hood

Jonathan Turley

Jonathan Zittrain

David Faber

Guido Scorza

It later became apparent that one thing all of these names had in common was that there had been a request for data relating to those names to be removed from Google. The block on "David Mayer" has since been fixed and ChatGPT can now say the name, but the rest still trigger the error.

As for the actual problem, an OpenAI spokesperson told The Guardian, "One of our tools mistakenly flagged this name and prevented it from appearing in responses, which it shouldn't have. We're working on a fix." It's a fairly innocuous statement, but reading between the lines, it suggests that OpenAI does filter some names out of responses and that David Mayer was filtered by mistake.

Can right-to-be-forgotten work when it comes to an LLM?

And can it cause other problems, too?

LLMs are trained on a huge amount of data, and that data isn't stored in a database where specific records can be looked up or deleted. Instead, it is absorbed into the model's parameters during training. Removing data retroactively is borderline impossible, because tracing a given piece of training data to the parameters it influenced is an incredibly difficult task. Identifying personal data in those datasets in the first place is just as hard, especially when two people share the same name.

As a result, the tactic OpenAI has taken has been to filter responses instead, essentially wrapping the LLM's responses inside a piece of software that analyzes them for content that should be filtered out. That's how OpenAI prevents responses that contain problematic content, and a similar filtering approach is used to keep problematic data out of training. It's also these conditions that ChatGPT "jailbreaks" are designed to get around.
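To make the idea of a wrapper concrete, here is a minimal Python sketch of an output filter that sits between a model and the user. Everything in it is an assumption for illustration: OpenAI has not published how its tool works, and the names `BLOCKED_NAMES`, `guarded_reply`, `FilteredResponseError`, and `fake_model` are hypothetical, as are the placeholder names on the list.

```python
import re
from typing import Callable, Iterable

# Hypothetical blocklist with placeholder names. The real list and its
# matching rules are not public.
BLOCKED_NAMES = ["Example Person", "Jane Q. Sample"]


def build_pattern(names: Iterable[str]) -> re.Pattern:
    """Compile one case-insensitive pattern that matches any listed name,
    allowing any run of whitespace between the parts of a name."""
    alternatives = [
        r"\s+".join(re.escape(part) for part in name.split())
        for name in names
    ]
    return re.compile("|".join(alternatives), re.IGNORECASE)


BLOCKED = build_pattern(BLOCKED_NAMES)


class FilteredResponseError(Exception):
    """Raised when a reply contains a blocked name."""


def guarded_reply(model: Callable[[str], str], prompt: str) -> str:
    """Call the model, then inspect its output before returning it.

    The model itself is untouched; the check lives entirely outside it.
    """
    reply = model(prompt)
    if BLOCKED.search(reply):
        raise FilteredResponseError("I'm unable to produce a response.")
    return reply


def fake_model(prompt: str) -> str:
    """Stand-in for a real LLM call so the sketch runs on its own."""
    return f"You asked about: {prompt}"
```

Calling `guarded_reply(fake_model, "the weather")` returns the reply unchanged, while a prompt that makes the stand-in model echo a listed name raises `FilteredResponseError`. Note that the filter only matches text. It has no way of telling two people with the same name apart.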

As more and more names are added to the filter that removes content, further problems may arise. How can you account for every person who wants their data removed, and how can you even police that? Because of the sensitive nature of data removal, ChatGPT likely fails the request outright rather than trying to strip the name from the response, which prevents any chance of personal information being shared with the user. If it were to try and respect every removal request, it's very likely that ChatGPT would struggle with large swathes of the internet.
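The difference between failing a request outright and trying to strip a name out can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a description of ChatGPT's internals: `fail_closed`, `redact`, `StreamAborted`, and the placeholder name in `BLOCKED_NAMES` are all hypothetical.

```python
import re
from typing import Iterable, Iterator

# Hypothetical blocklist entry, stored in lowercase for matching.
BLOCKED_NAMES = ("example person",)


class StreamAborted(Exception):
    """Raised when a streamed response is cut off by the filter."""


def fail_closed(chunks: Iterable[str]) -> Iterator[str]:
    """Pass chunks through until the text seen so far contains a blocked
    name, then abort the whole response.

    Chunks yielded before the match completes have already been emitted.
    """
    seen = ""
    for chunk in chunks:
        seen += chunk
        if any(name in seen.lower() for name in BLOCKED_NAMES):
            raise StreamAborted("I'm unable to produce a response.")
        yield chunk


def redact(text: str) -> str:
    """Alternative approach: replace exact matches and keep the rest."""
    pattern = "|".join(re.escape(name) for name in BLOCKED_NAMES)
    return re.sub(pattern, "[removed]", text, flags=re.IGNORECASE)
```

In this sketch, `redact` keeps the rest of the sentence, so anything the pattern misses (a nickname, an initial, or surrounding details that identify the person) still reaches the user. `fail_closed` discards everything after the match, which is blunter but leaves less room for leakage.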

As an example, it's likely that "Brian Hood," Mayor of Hepburn Shire Council, is one of the names because he threatened to sue OpenAI over defamatory statements ChatGPT made about him. He gave OpenAI 28 days to filter his name out of defamatory statements, which the company agreed to do. Since then, his name has caused ChatGPT to fail when you type it in a request, just like the other names.

Interestingly, "Jonathan Zittrain" told 404 Media that although he pointed out the problems around AI in an opinion piece published in The Atlantic, he doesn't know why ChatGPT filters out his name on the service. Searching his name on Google also reveals that content has been removed under data protection laws, which I'd speculate is linked. This can't be proven, though.

As for "Jonathan Turley," he wrote in a blog post that he was defamed by ChatGPT, but told 404 Media that he did not file any lawsuits against OpenAI, nor was he contacted by OpenAI. However, his name also raises an error, and searching his name on Google shows the same notice as Zittrain's, saying that content has been removed under data protection laws.

What does the future of AI look like?

It's hard to say who wins in the end

With lawmakers cracking down on AI and how it can process user data, it's likely that this problem is only going to become more pervasive as time goes on. If the right to be forgotten is forced on LLM makers like OpenAI, then it's likely that change will need to come from the training data rather than filters around the finished product. But then what about public web pages that contain information? Can LLMs no longer interact with those?

This is going to be as much an ethical question as it is a technical one, and whatever approach companies take will have significant downsides, either for the people who requested that their data be removed or for the companies that have to remove it.