Text & Data Mining On Copyright Protected Works For Use By Generative AI
TDM versus Training: The difference between making and using a dataset.
Part 3 of CIPPIC's Fair Dealing Week Series:
The difference between TDM and Training
The world is rapidly adopting Generative AI (gen-AI) models as tools for creating and innovating. In the wake of this proliferation, many Canadians are expressing concern regarding the application of copyright law to these technologies. Of particular concern is the use of copyright-protected works in the creation of these tools through the AI “training” process. Authors, artists, musicians, and other creators have taken issue with AI developers scraping their protected works through text and data mining (TDM) activities and subsequently using those works in AI training datasets without authorization, which may lead to gen-AI outputs that infringe existing works.
Gen-AI training and TDM are distinct activities. Canadian courts considering copyright infringement claims should avoid conflating these different activities. As defined by IBM, TDM is the process of “transforming unstructured text into a structured format to identify meaningful patterns and new insights.” This generally involves the scraping and combining of data from digital sources into a tabular dataset format, and thereby focuses on extracting and analyzing existing information. A developer may also implement TDM techniques to clean and pre-process the dataset prior to the training process. This definition starkly contrasts with that of AI training, which is described by IBM as the application of a “model’s mathematical framework to a sample dataset whose data points serve as the basis for the model’s future real-world predictions.” As detailed by Telus International, training involves the developer exposing their chosen model type to a clean training dataset and then making iterative adjustments to the model’s parameters to improve its overall accuracy. The developer must continuously monitor, debug, and fine-tune the model, validating and testing it along the way, until it is able to generate realistic and accurate outputs in response to inputs, or “prompts.” Generative outputs are a product of the model learning from its training dataset and making predictive decisions as to the appropriate response based on the prompt. Therefore, unless the developer opts to integrate TDM into the training process by employing continuous learning (see below), TDM activities are typically complete before training begins.
Why is this distinction important?
The separability of TDM from actual gen-AI model training is key to determining whether copyright infringement arises from the use of copyright-protected materials at any stage during the AI development process. Although TDM can be an essential component of developing a gen-AI training dataset, that is not its sole purpose. Rather, humans have employed TDM since the 1980s as a method of gathering and analyzing large datasets to seek out patterns, trends, and insights without any integration with AI models. In other words, TDM is a practice and not a technology.
TDM activities are only one specific component of a support system for a gen-AI model, meaning that TDM is not equivalent to the training process nor to a gen-AI program itself. After a developer completes TDM for the purpose of creating a dataset, they must structure and pre-process the raw data prior to use in training. This typically involves further TDM activities for the conversion of the raw data into features, the removal of duplicate records and outliers, data normalization, as well as the actual splitting of data into training and testing sets. As a specific example, the developer may choose to employ feature extraction to improve model outcomes. Feature extraction is a pre-processing technique that involves the developer applying an algorithm that identifies and selects the most relevant features from the dataset such that any underlying patterns can be better observed. For raw text data, this may involve TDM techniques like tokenization, which is the computerized conversion of sentences into their component parts (typically words or key phrases known as “tokens”). For already structured data, the developer may use an algorithm that parses the data to select the most relevant columns (data categories) for training.
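The pre-processing steps described above can be illustrated with a minimal Python sketch. It is not drawn from any particular developer's pipeline: the function names, the regular expression used for tokenization, and the 80/20 split are all assumptions chosen for illustration, and real pipelines are considerably more elaborate.

```python
import random
import re


def tokenize(sentence):
    """Convert a sentence into lowercase word tokens (a simple assumed scheme)."""
    return re.findall(r"[a-z0-9']+", sentence.lower())


def deduplicate(records):
    """Remove exact duplicate records, keeping the first occurrence of each."""
    return list(dict.fromkeys(records))


def min_max_normalize(values):
    """Rescale numeric values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def train_test_split(records, test_fraction=0.2, seed=0):
    """Shuffle the records and split them into training and testing sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical raw text records, standing in for mined data.
raw = ["The cat sat.", "A dog ran.", "The cat sat."]
unique = deduplicate(raw)
tokens = [tokenize(s) for s in unique]
```

Note that every step here operates on the dataset alone. No model is involved at this stage, which is the point of the distinction drawn above.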
While TDM can therefore play a crucial role in gathering raw data, preprocessing it, and identifying relevant features prior to gen-AI model training, it is a clearly distinguishable, yet occasionally integrated, activity. This potential for integration flows from continuous learning, also known as “Online TDM.” Where a developer implements online TDM, the training dataset is continuously updated and pre-processed as new data becomes available, meaning that the model is continuously learning in real time. However, this concurrent use of TDM with training is optional and typically only employed by large technology companies due to the high costs associated with dynamic data collection.
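As a rough illustration of continuous learning, the following Python sketch interleaves data collection and training: each newly arrived record immediately triggers a small parameter update. The `stream` generator is a hypothetical stand-in for ongoing data collection, and the one-parameter model and learning rate are assumptions made to keep the example short.

```python
def online_update(w, x, y, learning_rate=0.05):
    """Apply one gradient step to parameter w using a single new (x, y) record."""
    error = w * x - y
    return w - learning_rate * 2 * error * x


def stream():
    """Hypothetical stand-in for newly mined records arriving over time."""
    for i in range(200):
        x = 1.0 + (i % 3)
        yield x, 3.0 * x  # the underlying relationship is y = 3x


w = 0.0
for x, y in stream():
    # Collection and training are interleaved: there is no finished
    # dataset that exists before training begins.
    w = online_update(w, x, y)
```

In this arrangement the line between assembling the dataset and training the model is blurred, which is why continuous learning is the exception to the separability described above.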
Even with continuous learning methods, TDM does not generally result in the reproduction of durable or accessible copies of copyright-protected works scraped from the web, whether text or image based. Instead, developers use TDM to process data in a transformative manner, such that the model can extract and tokenize patterns, metadata, and other insights. Where reproduction does occur, the copies made are technical and temporary in nature. Such copies benefit from a number of user rights designed to facilitate innovation and its ensuing scientific, economic, creative, and artistic benefits, such as fair dealing (see below). Consequently, CIPPIC is of the position that TDM activities alone should not be grounds for copyright infringement, as they do not give rise to an infringing reproduction, publication, or performance of the work that can be accessed or distributed. Extending owners’ rights to TDM would unduly extend protection to ideas, information, and data rather than the original expression of the work.
The Defence of Fair Dealing: TDM
Under the Canadian copyright regime, one of the key defences against a finding of infringement is fair dealing, which permits otherwise infringing dealings with a work for the purposes of research, private study, education, parody, satire, criticism, review, or news reporting, with such purposes being broadly construed by the courts. Once the defendant establishes that the dealing at issue is for one of the statutorily permissible purposes, the court must then determine whether the dealing is fair. This is a factual determination that looks to the purpose, character, and amount of the dealing, the nature of the work, the effect of the dealing on the work, and the available alternatives to the dealing. As illustrated by this test, the applicability of fair dealing is highly factual and contextual in nature.
Canadian courts have yet to opine on whether TDM for the purpose of building a gen-AI training dataset constitutes a statutory purpose for fair dealing. However, considering the broad interpretation of this provision, TDM activities likely fall within the scope of a research purpose. As stated by the Canadian government in their 2021 consultation on copyright and AI, “TDM allows researchers and analysts to gain knowledge at a scale and speed that would be impossible to achieve manually, thereby making research more efficient and effective.” Notably, the research purpose is not limited to non-commercial or private contexts, though Canadian courts have stated that research done for commercial purposes is more likely to be found unfair compared to research done for non-commercial purposes.
Assuming that TDM activities do constitute a statutorily permissible research purpose, we must then determine the overall fairness of the dealing through its character, nature, amount, effect, and an assessment of its alternatives. While some reproduction will occur during the TDM process, any copies are internal to the model and do not result in a generally accessible, wholesale copy of the work. Often, the TDM output is a summary of analyzed data, not a reproduction or a generative work. Thus, the character and nature of the reproduction is technical, temporary, and inaccessible. Even where the developer has commercialized the model and its algorithm, the profitability of the model typically does not flow from reproductions within training data; it flows from how the training data advanced the model to create original, transformative outputs. Moreover, the effect of permitting TDM on copyrighted works is to enable technological and computational innovation, with the alternative being to delay such advancements both within and outside of the gen-AI sector.
The Defence of Fair Dealing: Training
Recall that assembling a corpus of data via TDM for the purpose of building a training dataset is distinct from actually using the dataset to train a gen-AI model, with the exception of continuous learning. One key issue, therefore, is whether the same fair dealing exception will apply to the act of using a TDM-sourced dataset to train a gen-AI model.
Rather than simply scraping and processing the data to create a trainable dataset through TDM, the training process involves the developer feeding data into the model and iteratively adjusting parameters to improve the model’s performance. These adjustments optimize the gen-AI’s predictive capabilities so that outputs are more accurate in nature.
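To make the contrast with TDM concrete, the following is a minimal Python sketch of what "iteratively adjusting parameters" means in practice. It assumes a toy model with a single parameter `w` trained by gradient descent on a finished dataset. Real gen-AI models have billions of parameters and far more complex architectures, so this is an illustration of the principle only, and the data, learning rate, and epoch count are assumed values.

```python
def train(data, learning_rate=0.01, epochs=500):
    """Fit y = w * x by repeatedly nudging w to reduce mean squared error."""
    w = 0.0
    for _ in range(epochs):
        # Average gradient of the squared error over the whole dataset.
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        # The iterative adjustment: move w a small step against the gradient.
        w -= learning_rate * grad
    return w


# A finished, pre-processed dataset (hypothetical values following y = 2x).
dataset = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w = train(dataset)
```

What training produces is the adjusted parameter `w`, here a value very close to 2, rather than a copy of the dataset itself.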
Canadian courts have remained silent on whether using a TDM-sourced dataset containing copyrighted materials for the purpose of training a generative AI model amounts to infringement. While judicial guidance is much needed, the Copyright Act currently suggests that the answer is no. Section 30.71 of the Act sets out an exception to infringement for “Temporary Reproductions for Technological Processes” as follows:
30.71 It is not an infringement of copyright to make a reproduction of a work or other subject-matter if:
(a) the reproduction forms an essential part of a technological process;
(b) the reproduction’s only purpose is to facilitate a use that is not an infringement of copyright; and
(c) the reproduction exists only for the duration of the technological process.
If the Canadian legislature or courts deem this provision applicable to the use of copyrighted material in training data, it would provide an exception to infringement. Generally, training data is essential to the AI development process, and training does not result in the model memorizing or reproducing the works, which means that any reproduction is limited to the duration of the AI training process. Under this scheme, AI training processes would not constitute infringement, with the key factor being that the reproduction does not facilitate infringement in subsequent use of the AI.
In the alternative, fair dealing may also apply to the training process, but with caveats. While training and refining an AI model is clearly also a type of research, and thus fits within a statutory purpose for fair dealing, the question of overall fairness is much more complex. This is because the profitability of a gen-AI model is rooted in its accuracy, and that accuracy is gained through the training data used and the training process itself. Here, the commercial nature of a model becomes vital to determining whether the dealing was fair, as does the character of the model in the types of outputs it generates. For example, a court may view a commercial gen-AI model that produces summaries of the literary works within its training data as unfair since the outputs compete with the original author’s works. Simply put, another party is infringing on the economic rights associated with authorship of a copyright-protected work by profiting from its reproduction and use during the training process.
The Issue with Separability and Fair Dealing
In theory, Canadian courts could apply the defence of fair dealing to TDM independent of its potential application to the AI training process. For instance, where a developer conducts TDM outside of the gen-AI context, i.e., where its purpose is other than assembling and structuring a training dataset, fair dealing principles should consistently apply. Furthermore, even where an individual employs TDM to build a training dataset, they may be doing so for the purpose of selling the dataset to a gen-AI developer, meaning that the individual is not actively engaging with the training process at any point. In practice, however, the AI training process frequently incorporates some form of TDM to curate the model’s training dataset, as continuous learning illustrates, such that the issue of TDM and training within the gen-AI context persists despite theoretical distinctions.
Given this, it becomes apparent that there is a need for legislative direction on the extent of fair dealing and its application to copyrighted works utilized at any stage of the gen-AI training process. While engaging in TDM on copyrighted materials alone may not be a cause of infringement per se, training a model on such materials and deriving economic benefits from its subsequent public use raises fundamental fairness concerns for the original authors of the works. Thus, in our view, the fairness of the training process hinges on the commercial nature of the developed model, suggesting that owners of commercial gen-AI models should obtain licenses for the use of copyright-protected works in their training datasets.
This opinion was written by a CIPPIC student. The opinion is the author's, and does not necessarily reflect CIPPIC's policy position.