I know. “Pre‑publication” sounds as thrilling as waiting for your coffee to brew. But the U.S. Copyright Office’s 108-page draft report, Copyright and Artificial Intelligence, Part 3: Generative AI Training, gives companies that train AI models and create content a peek behind the regulatory curtain. If you build marketing campaigns, train machine‑learning models, or lose sleep over whether tomorrow’s AI‑generated summaries will mimic your blog posts, this report should be on your desk.
The Copyright Office hasn’t finalized its stance yet, but the document sketches the policy lines we’ll all have to color inside. It asks three not-so-simple questions:
Behind those questions sit billion‑dollar AI labs, centuries of creative labor, and your next marketing brief. Let’s explore where the Copyright Office is headed on the regulatory front.
Interested parties, including trade organizations, individuals, and big businesses like Meta, submitted over 10,000 comments to the USCO. The Office acknowledges the intensity of the debate about AI training data and the wave of lawsuits making their way through U.S. courts. People have big feelings about AI training data and intellectual property.
The USCO realizes that it is wading into a morass. It comments:
Some warn that requiring AI companies to license copyrighted works would throttle a transformative technology, because it is not practically possible to obtain licenses for the volume and diversity of content necessary to power cutting-edge systems. Others fear that unlicensed training will corrode the creative ecosystem, with artists’ entire bodies of work used against their will to produce content that competes with them in the marketplace. The public interest requires striking an effective balance, allowing technological innovation to flourish while maintaining a thriving creative community.
Furthermore, artificial intelligence and its training systems are rapidly evolving. And fair use doctrine is nuanced. You almost get the sense that the Office is begging us for patience and assuring us that it sees both sides of the debate in Part 3 of its AI guidance.
Still, in addition to explaining the essentials of how AI training systems work, the Office does signal its general direction moving forward.
The Office’s opening move is blunt: when a developer scrapes a copyrighted novel, photograph, or song to train its model, that act makes out a prima facie infringement claim. Prima facie, Latin for “at first glance,” means the claimant has cleared the low bar of showing two facts: they own the work and you copied it. That alone gets you into court, even if stronger defenses (like fair use) might still carry the day.
But an accusation isn’t the end of the story. Think of it like a speeding ticket: the officer clocks you going 15 miles over the limit and hands you the citation. That’s the prima facie case, proof you were on the road and over the threshold. You still have a chance to show why the ticket shouldn’t stick: maybe the radar gun was faulty or you were trying to avoid an accident. If the prima facie case is the ticket, fair use may be your day in traffic court.
RELATED: Copyright and AI: The question of human authorship
In U.S. copyright law, fair use is a safety valve that lets content creators borrow snippets of someone else’s work when doing so serves a broader public interest. It isn’t a blanket permission slip; it’s a context‑driven analysis that weighs several elements before deciding whether permission was truly needed.
Courts balance four factors:

1. The purpose and character of the use, including whether it is commercial or transformative
2. The nature of the copyrighted work
3. The amount and substantiality of the portion used
4. The effect of the use on the potential market for or value of the original
The USCO report doesn’t declare winners. Instead, it offers guideposts: research‑oriented, transformative uses lean toward fair use. Outputs that mimic or substitute for the original tilt against. Each model, dataset, and business plan will get its own day in court, metaphorical or literal.
Scraping the open web feels democratic until you realize how many copyrighted works hide in plain sight. The Books3 dataset includes full novels from living authors. Common Crawl vacuumed up entire news sites.
The report’s takeaway is clear: location does not override ownership. If your pipeline relies on public URLs, audit it like you would a new vendor contract. Ignorance is not a defense; you must perform due diligence when building your training data sets.
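What does that due diligence look like in practice? Here is a minimal sketch, in Python, of a provenance log for scraped documents. The field names and the license allowlist are illustrative assumptions for the example, not anything the USCO prescribes; your counsel and your vendors will define the real policy.

```python
# Hypothetical sketch: record provenance for each document before it
# enters a training set, so you can later show where every file came
# from and whether its license cleared your policy.
import csv
import hashlib
from datetime import datetime, timezone

# Example policy only; a real allowlist comes from legal review.
ALLOWED_LICENSES = {"CC0-1.0", "CC-BY-4.0", "licensed-direct"}

def provenance_record(url: str, text: str, license_tag: str) -> dict:
    """Build one audit-trail row for a scraped document."""
    return {
        "url": url,
        "sha256": hashlib.sha256(text.encode("utf-8")).hexdigest(),
        "license": license_tag,
        "cleared": license_tag in ALLOWED_LICENSES,
        "collected_at": datetime.now(timezone.utc).isoformat(),
    }

def write_audit_log(records: list, path: str) -> None:
    """Persist the provenance log so reviewers can audit it later."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(records[0].keys()))
        writer.writeheader()
        writer.writerows(records)
```

The point of the sketch is the habit, not the code: every document gets a source URL, a content hash, and an explicit cleared/not-cleared flag, so “public is public” never has to stand in for an answer.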
Here’s the optimistic angle the USCO highlights: creative industries and AI developers are starting to talk business instead of lobbing lawsuits. Universal Music cut deals with the biggest AI song generators. Getty Images inked agreements that let model builders tap its vast photo library without stepping on landmines.
These early deals matter because they prove a market can form. The Copyright Office says, in polite government prose, “Let’s see how far voluntary licensing can take us before we impose blanket solutions.” In other words, if industry can self‑organize, Congress will keep its hands in its pockets a little longer.
Should voluntary deals stall, the report floats extended collective licensing (ECL) as a softer statutory nudge. Under ECL, creators can opt in to a collective that negotiates on their behalf, while users get predictable rates. It’s already common in Scandinavia for photocopying and streaming rights.
Compulsory licensing, the powerful tool that forces access at a set fee, remains the option of last resort. The Copyright Office warns that compulsory schemes make sense only when markets fail entirely. So far, the regulators remain hopeful that businesses can hammer out the details on their own.
The report hints at a future where creators, platforms, and tech firms co‑design a licensing fabric robust enough to support large‑scale training while ensuring artists get paid. Call it Creative Commons 2.0, an ecosystem where permissions travel with the file, royalty micro‑payments flow automatically, and attribution is baked into metadata.
We’re not there yet, but the seeds are visible: the Content Authenticity Initiative’s provenance tags, watermarking proposals from OpenAI and Anthropic, and blockchain‑based rights registries. The Copyright Office effectively says, “Keep tinkering; we’re watching.”
The Copyright Office’s report draws a clear distinction between what goes into a model (training data) and what comes out of it (generated content). Both stages carry unique obligations for marketing teams that rely on or build their own AI tools.
Unlicensed or poorly documented datasets create the highest exposure here. Common pitfalls include:
RELATED: The invisible work behind effective content
Even perfectly curated training data can still produce infringing material if guardrails are lax. Reduce risk by:
Adopt these controls early and you’ll spend more energy on creative optimization and less on cease‑and‑desist responses.
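To make one guardrail concrete: a simple overlap check can flag outputs that reproduce long verbatim runs from known source texts. This is a rough sketch under stated assumptions; the eight-word window and the function names are arbitrary choices for illustration, not a legal threshold for infringement.

```python
# Illustrative guardrail: flag a generated output if any run of n
# consecutive words also appears verbatim in a known source document.

def word_ngrams(text: str, n: int = 8) -> set:
    """Return the set of n-word windows in the text (lowercased)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def has_verbatim_overlap(output: str, sources: list, n: int = 8) -> bool:
    """True if the output shares any n-word verbatim run with a source."""
    out_grams = word_ngrams(output, n)
    return any(out_grams & word_ngrams(src, n) for src in sources)
```

A production system would normalize punctuation and scale the lookup with a proper index, but even this toy version illustrates the review step: compare before you publish, and route flagged drafts to a human.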
Picture this: Acme Inc. scrapes a million blog posts (recipes, travel diaries, legal advice) to build AcmeGPT, a consumer‑facing writing assistant. Early beta testers love it. Then authors notice paragraphs lifted wholesale from their copyrighted works.
Acme’s legal team scrambles. Their dataset included everything under the sun because “public is public,” right? Wrong. They’re hit with takedown notices and a class‑action lawsuit. Investor confidence wobbles. A nine‑figure valuation evaporates.
Now rewind. Imagine Acme had licensed content from three specialty publishers, logged its data provenance, and filtered outputs to avoid verbatim excerpts. The launch might have cost more upfront, but the legal runway would be clear, and the company’s brand equity intact.
That, in miniature, is the decision facing every modern marketer.
RELATED: How to spot a legaltech snake oil salesman
Tomorrow’s to‑do list doesn’t require a PhD, just practical steps:
At LaFleur, we live at the intersection of bold creativity and careful compliance. Our clients (law firms, healthcare innovators, financial‑service leaders) don’t have the luxury of “move fast and break things.” They need to move smart and build trust.
For us, compliance isn’t an add‑on. It’s built into every AI engagement. We vet data sources, run risk assessments, keep detailed records about our data sets, and review outputs before they go live so our clients can experiment with confidence, not worry.
If you’d like a clear, practical roadmap for compliant AI, whether you’re choosing training data, setting up review steps, or evaluating a vendor, schedule an initial consultation with our team.
Copyright and Artificial Intelligence, Part 3: Generative AI Training (Pre-Publication Version). (May 2025). U.S. Copyright Office. Retrieved from https://www.copyright.gov/ai/