I know. “Pre‑publication” sounds as thrilling as waiting for your coffee to brew. But the U.S. Copyright Office’s 108-page draft report, Copyright and Artificial Intelligence, Part 3: Generative AI Training, gives companies that train AI models and create content a peek behind the regulatory curtain. If you build marketing campaigns, train machine‑learning models, or lose sleep over whether tomorrow’s AI‑generated summaries will mimic your blog posts, this report should be on your desk.
The Copyright Office hasn’t finalized its stance yet, but the document sketches the policy lines we’ll all have to color inside. It asks three not-so-simple questions:
Behind those questions sit billion‑dollar AI labs, centuries of creative labor, and your next marketing brief. Let’s explore where the Copyright Office is headed on the regulatory front.
Interested parties, including trade organizations, individuals, and big businesses like Meta, submitted over 10,000 comments to the USCO. The Office acknowledges the intensity of the debate about AI training data and the wave of lawsuits making their way through U.S. courts. People have big feelings about AI training data and intellectual property.
The USCO realizes that it is wading into a morass. It comments:
Some warn that requiring AI companies to license copyrighted works would throttle a transformative technology, because it is not practically possible to obtain licenses for the volume and diversity of content necessary to power cutting-edge systems. Others fear that unlicensed training will corrode the creative ecosystem, with artists’ entire bodies of work used against their will to produce content that competes with them in the marketplace. The public interest requires striking an effective balance, allowing technological innovation to flourish while maintaining a thriving creative community.
Furthermore, artificial intelligence and its training systems are rapidly evolving. And fair use doctrine is nuanced. You almost get the sense that the Office is begging us for patience and assuring us that it sees both sides of the debate in Part 3 of its AI guidance.
Still, in addition to explaining the essentials of how AI training systems work, the Office does signal its general direction moving forward.
The Office’s opening move is blunt: when a developer scrapes a copyrighted novel, photograph, or song to train its model, that act makes out a prima facie infringement claim. Prima facie, Latin for “at first glance,” means the claimant has cleared the low bar of showing two facts: they own the work and you copied it. That alone gets you into court, even if stronger defenses (like fair use) might still carry the day.
But an accusation isn’t the end of the story. Think of it like a speeding ticket: the officer clocks you going 15 miles over the limit and hands you the citation. That’s the prima facie case, proof you were on the road and over the threshold. You still have a chance to show why the ticket shouldn’t stick: maybe the radar gun was faulty or you were trying to avoid an accident. If the prima facie case is the ticket, fair use may be your day in traffic court.
RELATED: Copyright and AI: The question of human authorship
In U.S. copyright law, fair use is a safety valve that lets content creators borrow snippets of someone else’s work when doing so serves a broader public interest. It isn’t a blanket permission slip; it’s a context‑driven analysis that weighs several elements before deciding whether permission was truly needed.
Courts balance four factors:

1. The purpose and character of the use, including whether it is commercial or transformative
2. The nature of the copyrighted work
3. The amount and substantiality of the portion used
4. The effect of the use on the potential market for or value of the original
The USCO report doesn’t declare winners. Instead, it offers guideposts: research‑oriented, transformative uses lean toward fair use. Outputs that mimic or substitute for the original tilt against. Each model, dataset, and business plan will get its own day in court, metaphorical or literal.
Scraping the open web feels democratic until you realize how many copyrighted works hide in plain sight. The Books3 dataset includes full novels from living authors. Common Crawl vacuumed up entire news sites.
The report’s takeaway is clear: location does not override ownership. If your pipeline relies on public URLs, audit it like you would a new vendor contract. Ignorance is not a defense; you must perform due diligence when building your training data sets.
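What does that due diligence look like in practice? Here is a minimal sketch, in Python, of a provenance log for scraped documents. The field names and the license allowlist are illustrative assumptions for the example, not anything the USCO prescribes; your counsel and your vendors will define the real policy.

```python
# Hypothetical sketch: record provenance for each document before it
# enters a training set, so you can later show where every file came
# from and whether its license cleared your policy.
import csv
import hashlib
from datetime import datetime, timezone

# Example policy only; a real allowlist comes from legal review.
ALLOWED_LICENSES = {"CC0-1.0", "CC-BY-4.0", "licensed-direct"}

def provenance_record(url: str, text: str, license_tag: str) -> dict:
    """Build one audit-trail row for a scraped document."""
    return {
        "url": url,
        "sha256": hashlib.sha256(text.encode("utf-8")).hexdigest(),
        "license": license_tag,
        "cleared": license_tag in ALLOWED_LICENSES,
        "collected_at": datetime.now(timezone.utc).isoformat(),
    }

def write_audit_log(records: list, path: str) -> None:
    """Persist the provenance log so reviewers can audit it later."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(records[0].keys()))
        writer.writeheader()
        writer.writerows(records)
```

The point of the sketch is the habit, not the code: every document gets a source URL, a content hash, and an explicit cleared/not-cleared flag, so “public is public” never has to stand in for an answer.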
Here’s the optimistic angle the USCO highlights: creative industries and AI developers are starting to talk business instead of lobbing lawsuits. Universal Music cut deals with the biggest AI song generators. Getty Images inked agreements that let model builders tap its vast photo library without stepping on landmines.
These early deals matter because they prove a market can form. The Copyright Office says, in polite government prose, “Let’s see how far voluntary licensing can take us before we impose blanket solutions.” In other words, if industry can self‑organize, Congress will keep its hands in its pockets a little longer.
Should voluntary deals stall, the report floats extended collective licensing (ECL) as a softer statutory nudge. Under ECL, creators can opt in to a collective that negotiates on their behalf, while users get predictable rates. It’s already common in Scandinavia for photocopying and streaming rights.
Compulsory licensing, the powerful tool that forces access at a set fee, remains the option of last resort. The Copyright Office warns that compulsory schemes make sense only when markets fail entirely. So far, the regulators remain hopeful that businesses can hammer out the details on their own.
The report hints at a future where creators, platforms, and tech firms co‑design a licensing fabric robust enough to support large‑scale training while ensuring artists get paid. Call it Creative Commons 2.0, an ecosystem where permissions travel with the file, royalty micro‑payments flow automatically, and attribution is baked into metadata.
We’re not there yet, but the seeds are visible: the Content Authenticity Initiative’s provenance tags, watermarking proposals from OpenAI and Anthropic, and blockchain‑based rights registries. The Copyright Office effectively says, “Keep tinkering; we’re watching.”
The Copyright Office’s report draws a clear distinction between what goes into a model (training data) and what comes out of it (generated content). Both stages carry unique obligations for marketing teams that rely on or build their own AI tools.
Unlicensed or poorly documented datasets create the highest exposure here. Common pitfalls include:
RELATED: The invisible work behind effective content
Even perfectly curated training data can still produce infringing material if guardrails are lax. Reduce risk by:
Adopt these controls early and you’ll spend more energy on creative optimization and less on cease‑and‑desist responses.
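To make one guardrail concrete: a simple overlap check can flag outputs that reproduce long verbatim runs from known source texts. This is a rough sketch under stated assumptions; the eight-word window and the function names are arbitrary choices for illustration, not a legal threshold for infringement.

```python
# Illustrative guardrail: flag a generated output if any run of n
# consecutive words also appears verbatim in a known source document.

def word_ngrams(text: str, n: int = 8) -> set:
    """Return the set of n-word windows in the text (lowercased)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def has_verbatim_overlap(output: str, sources: list, n: int = 8) -> bool:
    """True if the output shares any n-word verbatim run with a source."""
    out_grams = word_ngrams(output, n)
    return any(out_grams & word_ngrams(src, n) for src in sources)
```

A production system would normalize punctuation and scale the lookup with a proper index, but even this toy version illustrates the review step: compare before you publish, and route flagged drafts to a human.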
Picture this: Acme Inc. scrapes a million blog posts (recipes, travel diaries, legal advice) to build AcmeGPT, a consumer‑facing writing assistant. Early beta testers love it. Then authors notice paragraphs lifted wholesale from their copyrighted works.
Acme’s legal team scrambles. Their dataset included everything under the sun because “public is public,” right? Wrong. They’re hit with takedown notices and a class‑action lawsuit. Investor confidence wobbles. A nine‑figure valuation evaporates.
Now rewind. Imagine Acme had licensed content from three specialty publishers, logged its data provenance, and filtered outputs to avoid verbatim excerpts. The launch might have cost more upfront, but the legal runway would be clear, and the company’s brand equity intact.
That, in miniature, is the decision facing every modern marketer.
RELATED: How to spot a legaltech snake oil salesman
Tomorrow’s to‑do list doesn’t require a PhD, just practical steps:
At LaFleur, we live at the intersection of bold creativity and careful compliance. Our clients (law firms, healthcare innovators, financial‑service leaders) don’t have the luxury of “move fast and break things.” They need to move smart and build trust.
For us, compliance isn’t an add‑on. It’s built into every AI engagement. We vet data sources, run risk assessments, keep detailed records about our data sets, and review outputs before they go live so our clients can experiment with confidence, not worry.
If you’d like a clear, practical roadmap for compliant AI, whether you’re choosing training data, setting up review steps, or evaluating a vendor, schedule an initial consultation with our team.
Copyright and Artificial Intelligence, Part 3: Generative AI Training (Pre-Publication Version). (May 2025). U.S. Copyright Office. Retrieved from https://www.copyright.gov/ai/