In November 2024, several Canadian media companies sued OpenAI, alleging it circumvented technological protections to scrape their sites and copied their works, infringing copyright. This lawsuit and other ongoing legal battles underscore the growing tension between AI technology and existing copyright law, as well as the evidentiary challenges inherent in large-scale data-scraping claims.

Despite alleging massive infringement, the suit offers little concrete evidence, raising the question of whether broad accusations can compel an AI company to open its “black box” without a firmer factual basis. While these cases highlight legitimate concerns about using copyrighted material in AI training, they rest largely on speculation and circumstantial inferences. Indeed, the Canadian media companies are effectively asking OpenAI to prove a negative: to show it neither scraped their sites nor relied on their data for training.

The claims may be plausible, but outside of OpenAI’s confidential walls, they appear to lack the evidentiary links needed to succeed: details of the circumvented protections, clear proof of copying, and a causal connection to damages. As filed, their chance of success appears uncertain, and the independent evidence suggested by the pleadings alone likely falls short of what is needed to prevail at trial.

The Problem of Circumstantial Evidence

Many of the claims against OpenAI (and other AI developers) rely heavily on inference, assumptions, and circumstantial evidence. The Canadian media companies’ arguments hinge on a few key points:

The "Likely Included" Argument: The plaintiffs state that because their works were accessible on the internet, and AI models are trained using vast amounts of online data, their works were likely included in training datasets. However, the statement of claim also acknowledges that this is assumed and not known: “The full particulars of when, from where, and exactly how, the Works were accessed, scraped, and/or copied is within the knowledge of OpenAI and not the News Media Companies.”

The pleadings refer broadly to OpenAI “scraping” millions of articles from the media companies’ sites, or from authorized third parties. But the plaintiffs have not identified specific articles or excerpts, nor provided examples of ChatGPT outputs that replicate or closely paraphrase those articles. Copyright infringement requires proof of copying a work (or a substantial part of it). Alleging that “you copied an enormous number of articles” typically requires at least some reference to a sample set of works or examples of outputs that reproduce distinctive content. This is notably absent from the Canadian media companies’ statement of claim.

Circumvention and Scraping Allegations: The plaintiffs also allege that OpenAI “scraped” their websites, and that since “paywalls” or “robots.txt” files were in place to prevent scraping, OpenAI must have bypassed those controls.

Yet there are no concrete facts explaining how they know OpenAI circumvented these restrictions, nor any technical details of detected intrusions. Stating the conclusion that “OpenAI circumvented our technological measures” is different from pleading the material facts establishing circumvention.
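For context, robots.txt is a voluntary exclusion protocol: a site publishes crawling rules, and a well-behaved crawler checks them before each request. The sketch below, using Python’s standard robotparser module with a hypothetical site and crawler name, shows what compliance looks like; the circumvention the plaintiffs allege would mean fetching pages the file disallows, or disguising the crawler’s identity to evade the rules.

```python
# A minimal sketch of compliant crawling, assuming a hypothetical news site
# and crawler name. robots.txt is advisory: nothing technically stops a
# crawler from ignoring it, which is what the pleadings appear to allege.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example-news-site.ca/robots.txt")  # hypothetical URL
rp.read()  # fetch and parse the site's published crawling rules

url = "https://example-news-site.ca/articles/some-story"  # hypothetical page
if rp.can_fetch("ExampleBot", url):
    print("robots.txt permits ExampleBot to fetch this page")
else:
    print("robots.txt disallows this page; a compliant crawler stops here")
```

Paywalls are different in kind: they are access controls rather than advisory files, so evidence of how they were bypassed, if they were, would look quite different from a robots.txt analysis.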

Model Outputs as "Proof": Although not specifically pled by the Canadian media companies, other plaintiffs point to the ability of AI models to reproduce outputs that are similar to their works as evidence of "memorization" and, therefore, of inclusion in the training data. For instance, the New York Times claims that GPT-4 can produce near-verbatim excerpts of some of their articles, although this alone does not prove direct, unauthorized use of their content.

The Canadian media companies make a similar claim, alleging that OpenAI broadcast, displayed, transmitted, and published their works. But unlike the New York Times, the Canadian media companies do not include any examples of this occurring, and their statement of claim is vague as to whether they have any evidence in hand.
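To illustrate the kind of evidence that distinction implies, near-verbatim reproduction can be demonstrated mechanically: prompt the model, then measure verbatim overlap between its output and the original article. The following sketch shows one simple way to quantify such overlap; the texts, the 8-word window, and all names are illustrative assumptions, not material from any filing.

```python
# An illustrative sketch of measuring near-verbatim overlap between a model
# output and a published article via shared word 8-grams. The texts are
# placeholders; real filings (e.g., the NYT's) attach actual excerpts.
def ngrams(text: str, n: int = 8) -> set:
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

article_text = "full text of the published article would go here ..."
model_output = "text the model produced in response to a prompt ..."

shared = ngrams(article_text) & ngrams(model_output)
total = len(ngrams(model_output))
overlap = len(shared) / total if total else 0.0
print(f"{overlap:.1%} of the output's 8-grams also appear in the article")
```

High overlap across long spans is the sort of output-based evidence the Canadian statement of claim, as described above, does not include.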

The Lack of Specifics: A Trial of Vague Claims

These arguments share a common thread: they demonstrate that the media companies do not possess evidence of which specific content was used, where it was obtained, or how it figured in the training process. Instead, they suggest it is likely that this was the case. They rely on a combination of availability (content was available on the internet), common practices (AI models are trained on vast amounts of data, including scraped content), and output similarity (outputs are sometimes similar to plaintiffs’ works). These points, while suggestive, fall short of the burden of proof typically required in copyright infringement cases unless they can be substantiated by direct evidence.

Beyond the circumstantial nature of the evidence, the claims often suffer from a distinct lack of specificity. Consider these issues:

No Specific Datasets: Plaintiffs rarely identify the specific datasets that contain their works, often referring to large, publicly available collections without naming their precise contents or composition. Both the New York Times and the Canadian media companies mention datasets like Common Crawl (an open corpus of scraped web data), but neither lawsuit specifies which content came from which source (a sketch following these points shows how presence in Common Crawl can, in principle, be checked). Notably, there is no suggestion that the Common Crawl dataset is itself infringing; only its use by OpenAI is alleged to be unauthorized copying.

Vague Allegations of Copying: The Canadian media companies refer to their millions of works and infer that OpenAI copied and ingested them, without concrete details about how, when, or from where this occurred. The claim states that OpenAI scraped their content directly from their websites, or from third-party datasets, but does not clarify which content was taken from which source.

Unclear or No Real-World User Infringement: Many claims fail to explain clearly how the use of the content constitutes copyright infringement. They allege that the models were trained on their content and/or can reproduce it, but do not show how this infringes copyright or results in harm.

In the New York Times lawsuit, the claims cite examples where the AI model was prompted with specific article excerpts and produced similar text. However, those prompts were crafted by the plaintiffs, not typical users. There are no allegations that real-world users behave this way, nor any explanation of why such behaviour would cause harm.

The Canadian media companies allege that by making its models available, OpenAI effectively authorized copyright infringement. However, it remains unclear whether they contend that OpenAI’s conduct meets the Canadian legal standard of authorization — that it expressly or implicitly sanctioned, approved, or countenanced the infringing activity — or whether they are arguing that merely providing the means to infringe should be sufficient to establish authorization. The burden of proof is on the plaintiff to show that the defendant intended to authorize the infringing activity, and this cannot be assumed just because the defendant has the ability to prevent it. This will be difficult to prove considering the efforts undertaken by OpenAI to specifically prevent verbatim outputs and memorization.
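Returning to the datasets point above: presence in an open corpus like Common Crawl is, in principle, checkable, because Common Crawl publishes a queryable index of the pages each crawl captured. The sketch below shows the kind of lookup a plaintiff could run against one snapshot; the domain and crawl ID are placeholders, and a hit would establish only that a page was crawled, not that OpenAI trained on it.

```python
# A sketch of querying Common Crawl's public CDX index for captures of a
# hypothetical news domain. A match shows the page is in the crawl archive;
# it does not, by itself, show the page was used to train any model.
import json
from urllib.parse import urlencode
from urllib.request import urlopen

CRAWL_ID = "CC-MAIN-2024-10"  # one snapshot; a real search would span many
query = urlencode({
    "url": "example-news-site.ca/*",  # hypothetical plaintiff domain
    "output": "json",
    "limit": "5",
})
index_url = f"https://index.commoncrawl.org/{CRAWL_ID}-index?{query}"

with urlopen(index_url) as resp:
    for line in resp:  # one JSON record per captured page
        record = json.loads(line)
        print(record["timestamp"], record["url"], record.get("status"))
```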

The Problem of Proving a Negative

At the pleading stage, plaintiffs are not required to provide every factual detail necessary to substantiate a legal claim, especially when those details are in the defendant’s control. The statement of claim appears to plead the material facts necessary to support its causes of action, but the pleadings at times mix factual allegations with inferences and conclusions. Whether these allegations will hold up if challenged at trial is a separate question, one that depends on whether the plaintiffs can obtain actual evidence in discovery to prove those facts: a few ‘smoking-gun’ examples, server logs, or AI outputs that strongly indicate copying.

This lack of specifics makes it difficult to establish a clear case of copyright infringement. It also creates a situation where the plaintiffs are, in effect, asking AI companies to prove a negative: to prove that their data was not used. This approach challenges the established legal principle that the burden of proof rests with the plaintiff.

Demonstrating that specific data was not used is a difficult, if not impossible, task: it is far easier to show that an action did occur than to prove that it did not. Because training datasets are complex and often opaque, the plaintiffs risk shifting the burden of proof, demanding that OpenAI demonstrate that the plaintiffs’ content was not present, or was not used in an infringing way.

This situation creates an unfair imbalance. By relying on the claim of “likely” inclusion and circumstantial evidence, the plaintiffs are essentially asking the court to compel OpenAI to disclose its training data records without first providing specific evidence of copyright infringement.

Implications and Questions

This situation raises critical questions about how copyright law applies to AI training. The lawsuits against OpenAI highlight the challenges of litigating copyright issues related to AI, particularly given the opacity of training datasets and the difficulty copyright holders face in obtaining direct evidence of infringement.

The Canadian media companies must make a more compelling case based on demonstrable facts rather than broad assertions and circumstantial arguments. The statement of claim repeatedly states that the particulars are “… solely within the knowledge and control of the OpenAI Entities and not the News Media Companies,” implying the plaintiffs lack the evidence to support their claims. Unlike the New York Times’ claim, which includes examples of specific alleged infringement, the Canadian media companies rely on inferences, assumptions, and circumstantial evidence to support their vague claims.

To be successful, the Canadian media companies will have to adduce evidence to support their claims: log or forensic evidence of scraping and clear proof of circumventing technological protection measures (TPMs); specific examples of copied works (for instance, model outputs that reproduce them); and evidence verifying that the training sets included the plaintiffs’ articles.
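The first of those categories is the most mechanical to gather. OpenAI publishes the user-agent string of its web crawler (“GPTBot”), and Common Crawl publishes its own (“CCBot”), so a first pass could simply scan the plaintiffs’ server access logs, as in the sketch below. The log path and format are assumptions; matches would be candidates for forensic follow-up rather than proof of training use, and scraping that predates GPTBot’s August 2023 announcement would not appear under that name.

```python
# A sketch of a first-pass scan of web-server access logs for requests from
# known AI-related crawlers. The log file name is hypothetical; GPTBot and
# CCBot are the published user-agents of OpenAI's and Common Crawl's bots.
import re

CRAWLER_PATTERN = re.compile(r"GPTBot|CCBot", re.IGNORECASE)

with open("access.log") as log:  # hypothetical Apache/Nginx combined log
    for line in log:
        if CRAWLER_PATTERN.search(line):
            print(line.rstrip())  # candidate evidence of crawler activity
```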

While the media companies’ statement of claim likely pleads the material facts necessary to support the causes of action, there appear to be evidentiary gaps behind those facts. Relying on OpenAI to fill these gaps in discovery is risky.

As litigation unfolds, courts may articulate new legal tests or processes for AI-based copyright claims that could guide both future plaintiffs and AI developers on how to plead, defend, or avoid allegations of large-scale data scraping. Legislatures worldwide may also intervene as several countries consider allowing text and data mining for training AI. Ultimately, these lawsuits will test whether broad, circumstantial allegations can compel AI developers to disclose their entire data pipeline, or whether courts will insist on stronger initial proof. Whichever way it goes, the result could shape AI copyright jurisprudence for years to come.

The opinion is the author's, and does not necessarily reflect CIPPIC's policy position.