Bit by Bit is a weekly column focusing on technical advances across multiple spaces. My name is Adam Conway, and I've been covering tech and following the cutting edge for a decade. If there's something you're interested in and would like to see covered, you can reach out to me at adam@xda-developers.com.

When you played games like Pokémon Go, did you ever predict that your data would be used to train an AI model years down the line? You'd be forgiven if you didn't; most people wouldn't have thought about it. However, things are undoubtedly moving in that direction across a host of services. Niantic revealed in a blog post that it is building a "Large Geospatial Model" (LGM) in order to achieve "Spatial Intelligence", built on features like the new Pokémon Playgrounds, which lets users place a Pokémon at a specific location for others to find and try to capture.

While Niantic is very clear that older user data isn't being used to train its models, it raises an interesting question: what about services that are using it? It's no secret that LLMs like GPT have been trained on publicly available text spanning hundreds of years, and the same goes for image-generation models. Niantic clearly wants to toe the line of what's "acceptable." Companies will claim broad rights over the things you put on their services and the data they collect from you, but AI training on older data has been a contentious issue. The likes of Meta have faced questions from Ireland's Data Protection Commission in the EU, but the nature of AI models makes it hard to even prove where the training data has come from.

What is a Large Geospatial Model?

It's analogous to an LLM

Large Geospatial Models (LGMs) are a new kind of AI designed to help computers understand and interact with the physical world. Think of them as a combination of maps and artificial intelligence. Unlike traditional maps, LGMs don’t just show where things are; they learn how spaces connect and adapt to new environments, even when those environments have only been partially scanned.

Niantic, known for AR games like Pokémon Go, is pushing full steam ahead with its Visual Positioning System (VPS). By training AI on billions of location-tagged images, Niantic has created models that recognize over a million real-world locations. Its vision is to expand this into a global system that understands and links physical spaces, forming the backbone of technologies like AR glasses, robotics, and autonomous systems.

LGMs are similar to tools like ChatGPT, which use language data to generate text. However, instead of words, LGMs process spatial data, images, and 3D structures to build a deep understanding of physical spaces. Unlike standard 3D models, LGMs can tie their understanding to real-world coordinates, enabling precise navigation and interaction.
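To make the idea of tying a model to real-world coordinates a little more concrete, here is a minimal Python sketch. It is not Niantic's method, which the blog post does not describe at this level. It simply assumes a scanned scene has a known anchor latitude and longitude, and converts a point measured in metres east and north of that anchor into approximate geographic coordinates. The function name, the parameters, and the spherical-Earth approximation are all assumptions made for illustration, and the approach only holds up over short distances.

```python
import math

# Assumption: treat the Earth as a sphere with its mean radius.
EARTH_RADIUS_M = 6_371_000.0


def local_to_geographic(anchor_lat, anchor_lon, east_m, north_m):
    """Hypothetical helper: convert a small east/north offset in metres
    from an anchor point into approximate latitude/longitude in degrees."""
    # Moving north changes latitude by (distance / radius) radians.
    dlat = math.degrees(north_m / EARTH_RADIUS_M)
    # Moving east changes longitude by more degrees the further you are
    # from the equator, because circles of latitude shrink by cos(latitude).
    dlon = math.degrees(
        east_m / (EARTH_RADIUS_M * math.cos(math.radians(anchor_lat)))
    )
    return anchor_lat + dlat, anchor_lon + dlon


if __name__ == "__main__":
    # A point 25 m east and 40 m north of an illustrative anchor.
    print(local_to_geographic(53.3498, -6.2603, 25.0, 40.0))
```

A real positioning system would also need orientation, altitude, and a proper geodetic model, but the sketch shows the basic step of turning "a spot in a 3D scan" into "a spot on the planet".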

How your old data can be used to train new models

Meta is a great example of it

Think back to 2014, when Facebook was thriving and all of your friends were using it. You posted life updates, stories, pictures of you and your friends, and you knew that Meta (Facebook at the time) technically had broad rights to use all of the content you posted. Now, a decade later, that same content can be used in ways that you wouldn't have even imagined, including training AI models. That's exactly what Meta was warned against doing, and so far, the company appears to be complying.

The same goes for games like Pokémon Go, where your movement data could be used to train a geospatial model that teaches a computer how to navigate a real-world space. Niantic is very clear that it isn't using old data to train new models and that the only data being used is the data collected from Pokémon Playgrounds. Still, there's technically nothing stopping the company from training on your older movement data, and there'd be no way to tell if it did. AI models are a black box, after all, which is why the question of what data is being used to train them is so contentious, and why companies can really only be taken at their word.

As an example, The New York Times and Daily News sued OpenAI and its investor, Microsoft, alleging that OpenAI used copyrighted work to train its models. OpenAI is said to have given the plaintiffs in the case access to two virtual machines for searching through the training data, but all of the programs and search result data stored on one of those two machines were erased by OpenAI's engineers. This was explained in a letter to Magistrate Judge Wang:

"On November 14, all of News Plaintiffs’ programs and search result data stored on one of the dedicated virtual machines was erased by OpenAI engineers. While OpenAI was able to recover much of the data that it erased, the folder structure and file names of the News Plaintiffs’ work product have been irretrievably lost. Unfortunately, without the folder structure and original final names, the recovered data is unreliable and cannot be used to determine where the News Plaintiffs’ copied articles were used to build Defendants’ models. Therefore, News Plaintiffs have been forced to recreate their work from scratch using significant person-hours and computer processing time. The News Plaintiffs learned only yesterday that the recovered data is unusable and that an entire week’s worth of its experts’ and lawyers’ work must be re-done."

Those plaintiffs stress that they have no reason to believe that it was intentional. Even so, the episode shows how difficult it is to audit a model from the outside. Training these models requires a huge amount of data, and it's hard to gather enough of it to train a powerful model while getting permission from every rights holder whose work ends up in the training set. It's also impossible to tell with certainty what exactly trained a model without having access to the original training data, and assembling that data is a huge and incredibly secretive part of developing a model.

It's an ethical question without clear answers

Legally, we're still waiting for an answer

As it stands, it's hard to say on which side of the law problems like these will fall. GDPR makes it easier to take action against companies in the EU, as data typically has to be collected for a specifically outlined purpose, but again, proving the data was used will be difficult. Proof could be found in the future thanks to whistleblowers or data leaks, but at the moment, there's no clear way to tell where the data is coming from. The EU has also been very clear on where it stands on AI and is finding ways to crack down on nefarious uses.

That's without taking into account the ethical aspect, too. Even if it's judged that companies fall on the right side of the law on this issue, quite a few people will be uncomfortable with their data being used to train new models when that data was uploaded in the past without even a suspicion that incredibly powerful artificial intelligence models would use it as the basis of their development.

Niantic is very clear about how it's collecting the data it's using and is very clear that it's only new data, but what about companies that aren't? What about your Reddit posts that are out there training models right now, or your movement data that other applications have collected and might be using to silently build a new model? That's where it gets murky, and it's going to be a long time until we have clear answers on what is and isn't okay.