Date: 2024-02-05

Key Takeaways

ChatGPT quickly became the fastest-growing consumer software tool, amassing over 100 million users in just two months, showcasing the power of large-language models in the tech industry.

LLMs like ChatGPT require massive amounts of data to train and absorb into their models, leading to a significant change in how data is available on the internet.

The availability and access to training data for LLMs are becoming increasingly restricted as companies lock down their APIs and data, leading to challenges and potential legal issues regarding content reproduction and data usage.

The world didn't change when ChatGPT launched to the public in November 2022. But in the tech space, it certainly felt like it. Most people were familiar with algorithms, fewer with machine learning, and fewer still with reinforcement learning. These algorithms were black boxes, spooky bits of mysterious technology inside big-tech companies designed to push relevant ads. But overnight, ChatGPT took over the tech news cycle as increasingly impressive demos of its capabilities emerged. People were suddenly enamored with the power of large-language models, from debugging code to writing cover letters. AI rapidly moved from slowly building hype to being the newly established frontline of tech.

ChatGPT became the fastest-growing consumer software tool in history, amassing over 100 million users within two months of launch. It was a huge success story. While large-language models had been in development for years, this was the first real demonstration of the power of AI that landed with the public. It's the closest we've come to the illusion of true Artificial General Intelligence (AGI). But with this huge success came a new conversation, worsening an already difficult problem for AI. Where does all the data come from?

Access to data for training LLMs is fundamentally changing how available data is on the internet, and it's a problem that's not going away soon.

How are LLMs trained?

GPTs need data — a lot of it.

LLMs like ChatGPT require massive amounts of data to 'absorb' into their models, giving them the knowledge we see without the need for internet access. These models are 'trained' on this data, and while models have been getting more and more powerful, this has come with requirements for more and more data. To understand why data is so important to GPT models, we can take a quick look into their history.

Google changed the game for LLMs

In 2017, Google published its transformer architecture for artificial neural networks (ANNs), which now forms the foundation of ChatGPT. Transformers vastly improved the ability of neural networks to consider context and dramatically reduced the volume of labeled training data required. Labeled training data is data that has been annotated with labels or categories to add context or additional meaning. Transformers represented a significant inflection point in the development of natural language processing (NLP) neural networks, and their success spurred OpenAI to adopt a similar model and expand on it with its new GPT (Generative Pre-Trained Transformer) models.

These Transformers are trained (or pre-trained, in the LLM-specific lexicon) with objectives based around next-sentence and next-word prediction: given a word or phrase in context, the model predicts the following word or sentence. This is done using a loss function, a mathematical function that feeds back to the model how far each prediction was from the real text, allowing it to adjust and optimize. Transformer architecture LLMs like BERT or ChatGPT are pre-trained on an extremely large set of unclassified, raw data (known as an unlabeled dataset), like the body of Wikipedia. This sentence/token training on a large body of data is what allows LLMs to 'absorb' information. They are effectively learning to reproduce a body of data/language using statistics, as explained by Google:

"One of the biggest challenges in natural language processing (NLP) is the shortage of training data. [...] To help close this gap in data, researchers have developed a variety of techniques for training general purpose language representation models using the enormous amount of unannotated text on the web (known as pre-training)."

Much of the progress in LLM models in recent years has stemmed from the Transformer, as it enables the greater use of unlabeled data, which is more easily scraped directly from the internet. Generating labeled training data is expensive, and the size of datasets to train against is limited.
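To make the idea of next-word prediction on unlabeled text concrete, here is a deliberately tiny sketch. It is not a Transformer and not how any production LLM is built: it simply counts which word follows which in a made-up sentence, then scores itself by how much probability it assigned to the word that really came next. The function names, the toy corpus, and the choice of cross-entropy as the score are all assumptions for illustration.

```python
import math
from collections import Counter, defaultdict


def train_bigram(tokens):
    """Count how often each word follows each other word in raw text."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def next_word_probs(counts, prev):
    """Turn the raw follow-counts for `prev` into a probability distribution."""
    total = sum(counts[prev].values())
    return {word: c / total for word, c in counts[prev].items()}


def cross_entropy(counts, tokens):
    """Average negative log-probability given to each true next word."""
    losses = []
    for prev, nxt in zip(tokens, tokens[1:]):
        # Unseen continuations get a tiny floor probability to avoid log(0).
        p = next_word_probs(counts, prev).get(nxt, 1e-9)
        losses.append(-math.log(p))
    return sum(losses) / len(losses)


# Hypothetical toy corpus: no labels, just text.
corpus = "the cat sat on the mat and the cat slept".split()
model = train_bigram(corpus)
probs = next_word_probs(model, "the")
loss = cross_entropy(model, corpus)
```

Nobody had to label anything here: the text itself supplies the "right answer" for every position, which is why this style of training scales so readily to data scraped from the web.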

Refinement stages make ChatGPT great

The pre-training stage is far from the complete product though; the model must then be refined. The effectiveness of the refinement stage is what really set ChatGPT apart from earlier LLMs. In the refinement stage, the model is tuned with a number of objectives in mind, including sanitizing its output, making it more coherent, reducing bias and protecting its original dataset.

OpenAI hasn't released any specifics on how ChatGPT's models were refined, but we know that this stage was performed with a technique known as RLHF (Reinforcement Learning from Human Feedback). There are different techniques for RLHF, but the general idea is that the model is given a reward function based on human feedback, with human input being used directly to refine the answers the model gives to specific questions.
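As a rough illustration of a reward function built from human feedback, the sketch below learns a score from pairs of answers in which a human preferred one over the other. This is one commonly described approach (a pairwise preference loss), not a description of what OpenAI actually did, since those details are unpublished. The two answer features, the training pairs, and every function name are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def preference_loss(r_chosen, r_rejected):
    """Small when the human-preferred answer receives the higher reward."""
    return -math.log(sigmoid(r_chosen - r_rejected))


def reward(weights, features):
    """Linear reward: a weighted sum of an answer's features."""
    return sum(w * f for w, f in zip(weights, features))


def train_reward_model(pairs, steps=200, lr=0.5):
    """Fit weights from (preferred_features, rejected_features) pairs."""
    weights = [0.0] * len(pairs[0][0])
    for _ in range(steps):
        for chosen, rejected in pairs:
            diff = [c - r for c, r in zip(chosen, rejected)]
            margin = reward(weights, diff)
            # Gradient descent step on -log(sigmoid(margin)).
            scale = 1.0 - sigmoid(margin)
            weights = [w + lr * scale * d for w, d in zip(weights, diff)]
    return weights


# Hypothetical features per answer: [is_polite, is_on_topic].
# In both pairs the human picked the polite answer.
human_preferences = [
    ([1.0, 1.0], [0.0, 1.0]),
    ([1.0, 0.0], [0.0, 0.0]),
]
weights = train_reward_model(human_preferences)
polite_score = reward(weights, [1.0, 1.0])
rude_score = reward(weights, [0.0, 1.0])
```

In a full RLHF pipeline, a learned reward like this would then be used to steer the language model itself towards higher-scoring answers; that second step is omitted here.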

GPTs are reproducing data they're trained on

OpenAI has been caught copying homework

GPT models ingest and absorb large amounts of content, effectively compress it into the model, then reproduce it back to us. While efforts are made to prevent GPT models from simply repeating their training data (like using the largest set of training data possible), the line between original and derivative content remains gray at best, and this is a battle already being fought out in the courts. The New York Times has already launched a widely publicized lawsuit against OpenAI and Microsoft, alleging that ChatGPT can generate "verbatim excerpts" of New York Times content:

Defendants’ generative artificial intelligence (“GenAI”) tools rely on large-language models (“LLMs”) that were built by copying and using millions of The Times’s copyrighted news articles, in-depth investigations, opinion pieces, reviews, how-to guides, and more

This problem goes deeper: Google researchers have previously been able to extract raw training data from ChatGPT, and it seems likely that these issues will continue. There will be ongoing questions about the extent to which content produced by ChatGPT can be original, and whether OpenAI's use of copyrighted work to build these language-prediction models constitutes fair use. In the meantime, major corporations that see their data repeated by ChatGPT will likely continue to seek legal remedies and to make ongoing efforts to prevent their data being used without permission.

Accessing API data is only going to get harder

The immediate impact of ChatGPT is already here.


This is where the real problem for new LLMs lies. Labeled training data is expensive to produce, so research around LLMs has focused on making better use of unlabeled training data. Previously, companies have been consuming vast amounts of training data from the wider internet. But this is changing fast. The huge wave of hype around LLMs/AI (OpenAI is reportedly targeting a $100B valuation), as well as some widely publicized lawsuits over ChatGPT's training data, is having a huge impact on the availability of this data online.

While lawsuits are ongoing, companies are taking more proactive steps to protect their data.

X led the charge

We know that ChatGPT was trained on X/Twitter's data (as evidenced by some users being able to get it to regurgitate their tweets verbatim). But in March 2023 X/Twitter led the charge in locking down API pricing. While this might have been part of a drive towards profitability, its timing coincides with clear evidence that ChatGPT had been extensively trained on X's users' data, likely at non-negligible cost for X (in terms of API/hosting fees). X is likely to try to monetize this process in the future, including by introducing per-tweet pricing for its API access. Even a conservative estimate of the fees OpenAI would need to pay to access the same volume of tweets ChatGPT is likely to have had access to would be in the region of an eight-figure dollar amount.

This change was also supported by X's decision to disallow any access to posts on the site unless a user was signed in.

Reddit then followed suit

In the wake of X/Twitter's changes, Reddit quickly followed suit, locking down access to its APIs despite the uproar caused by the damage to third-party app ecosystems. Again, while AI wasn't explicitly mentioned in Reddit's blog posts, its new Data API Terms explicitly forbid the training of large-language models with Reddit content without a separate commercial arrangement.

Except as expressly permitted by this section, no other rights or licenses are granted or implied, including any right to use User Content for other purposes, such as for training a machine learning or AI model, without the express permission of rightsholders in the applicable User Content.

The industry has aligned

This is an issue that's quietly been getting more and more serious. Others, especially those either publishing a large amount of user-generated content or publishing a small volume of highly valuable, informative content (like news organizations), have quickly followed. The list is growing all the time, but many of the web's largest publications, including the BBC, Financial Times, and Stack Overflow, now explicitly block the training of LLMs for commercial purposes. Some sites, however, have remained lenient in allowing training for personal use.
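The article doesn't say how each site enforces its block. One widely used mechanism, alongside terms of use, is a robots.txt rule aimed at AI crawlers such as OpenAI's GPTBot. The sketch below uses Python's standard-library robots.txt parser to check a made-up rule set. The robots.txt content, the example.com URL, and the may_crawl helper are assumptions for illustration and don't reflect any particular publisher's configuration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an imaginary publisher: the AI training
# crawler is shut out entirely, everyone else is allowed in.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def may_crawl(robots_txt, user_agent, url):
    """Return True if `user_agent` is permitted to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


story = "https://example.com/news/story"
gptbot_allowed = may_crawl(ROBOTS_TXT, "GPTBot", story)
browser_allowed = may_crawl(ROBOTS_TXT, "Mozilla/5.0", story)
```

Note that robots.txt is a request rather than a technical barrier: it only works when the crawler chooses to honor it, which is part of why publishers are also turning to terms of use and paid API access.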

In some cases, such as Adobe's, the terms of use for a company's own generative AI tools prevent users from using one AI to train another.

When using our generative AI features, you agree you will use them only for your creative and productivity work product and not to train artificial intelligence or machine learning models. - Adobe

ChatGPT places similar restrictions on its terms of use, asking that users not use ChatGPT to "develop models that compete with OpenAI", nor attempt to extract data directly from the model.

Big tech is taking the backdoor route to access data

While many companies are locking down access to their data to protect it from third parties, many are opening up access to the first party, so to speak. As reported by Gizmodo in July 2023, Google has opened up its privacy policies to allow all data it scrapes to be used in the training of AI models, effectively forcing all companies that wish to be publicly listed on Google to consent to their content being used to train models. Microsoft's privacy policy is nebulous on the topic, but seems to allow the training of AI models on user data, and a sharp-eyed X user in September 2023 noticed that X's privacy policy had been updated to allow training of AI models on user data.

Alternative datasets are available

Selling labeled data is already a market of its own


It isn't all doom and gloom for AI training, though. There are alternative datasets available, such as the Wikipedia Corpus and those provided by Common Crawl (an organization dedicated to providing high-quality, freely available datasets), though these datasets may struggle to generate the same breadth of interactions currently possible with ChatGPT (and similar LLMs).

Similarly, serious effort is being put into both labeled and annotated datasets designed specifically for pre-training and fine-tuning of AI. These include datasets like Dolly by Databricks, a crowdsourced dataset generated through a competition among Databricks' own 5000+ employees. Machine Learning PhD Sebastian Raschka has an excellent roundup of some open-source datasets lined up for LLM training on his blog. It's likely that some datasets are also being produced internally by companies working to build generative AIs, or obtained through private commercial agreements.

There is also a range of free and for-profit data marketplaces opening up online, and Hugging Face (huggingface.co) has quickly become a dominant player here.

In the meantime, expect to pay for more APIs

While the future of AI remains murky, there's one thing that's clear. AI is fundamentally changing the way we access data on the internet, here and now. Increasingly, we can expect access to social networks and sites hosting significant user-generated content to be paywalled, causing more situations like Reddit's Apollo debacle, and more frustration as sites like Twitter become inaccessible without an account.

LLMs are undoubtedly one of the most exciting areas of research right now, with even their creators surprised by the power of their advanced models in doing everything from writing a quick email to generating code. But their current approach of relying on large existing datasets for training is having real-world effects outside the direct reach of AI itself. We risk a situation where AI development is hindered by a "have versus have-not" scenario, where only companies with the ability to leverage existing platforms can afford to license the requisite data to train these models effectively.