Advanced Topics
Copyright and Legal Issues
The legal landscape of LLMs is rapidly evolving—training data rights, output ownership, fair use doctrine, and emerging regulatory frameworks shape how these models can be built and deployed.
- Training Data — Copyright, licensing, and data mining rights
- Output Ownership — Who owns AI-generated content?
- Regulation — EU AI Act, US executive orders, global frameworks
The law is reason, free from passion.
Definition: AI Copyright Challenge
The core legal question: Can copyrighted text be used to train a commercial LLM, and who owns the output? This question sits at the intersection of copyright law, fair use doctrine, and the nature of transformative AI systems.
Training Data and Copyright
The Data Mining Problem
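LLMs are trained on massive corpora that include copyrighted works such as books, articles, code, and web content. This raises the fundamental question of whether, and under what conditions, such use is lawful.
Definition: Text and Data Mining (TDM)
Text and data mining is the automated extraction of information from large datasets. In the EU, the DSM Directive (Directive (EU) 2019/790) permits TDM by research organisations and cultural heritage institutions for scientific research (Art. 3). It also permits TDM by anyone else, including for commercial purposes, unless the rightsholder has expressly reserved that use, for example by machine-readable means for content published online (Art. 4).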
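One way a crawler might honor a machine-readable opt-out is to consult a site's robots.txt before collecting pages. The sketch below uses Python's standard `urllib.robotparser`. The crawler name `ExampleAIBot` and the robots.txt content are hypothetical, and whether robots.txt alone counts as a valid rights reservation under Art. 4 is an open legal question, so treat this as an illustration of the mechanism only.

```python
from urllib.robotparser import RobotFileParser


def may_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text allows user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt: opts out one AI crawler, allows all other agents.
EXAMPLE_ROBOTS = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""

if __name__ == "__main__":
    page = "https://example.com/articles/1"
    print(may_crawl(EXAMPLE_ROBOTS, "ExampleAIBot", page))
    print(may_crawl(EXAMPLE_ROBOTS, "OtherBot", page))
```

A real pipeline would fetch the live robots.txt for each host and would likely also check other reservation signals, such as site terms of service.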
| Jurisdiction | Training on Copyrighted Data | Legal Basis |
|---|---|---|
| United States | Likely fair use (transformative) | 17 U.S.C. § 107 |
| European Union | Permitted with opt-out mechanism | DSM Directive Art. 3-4 |
| United Kingdom | Permitted for non-commercial research | CDPA 1988 § 29A |
| Japan | Permitted (no opt-out required) | Copyright Act Art. 30-4 |
| China | Permitted with restrictions | PRC Copyright Law Art. 24 |
Fair Use Analysis (US)
The fair use doctrine considers four factors:
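Fair Use Factors
- Purpose and character of the use: transformative vs. reproductive use
- Nature of the copyrighted work: factual vs. creative nature of original
- Amount and substantiality: portion used relative to whole work
- Effect on the market: impact on market for original work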
Many commentators argue that training LLMs on copyrighted data qualifies as fair use in the US because: (1) it is transformative (the model learns statistical patterns rather than storing works for redistribution, although models can memorize and reproduce some training passages), (2) it uses the data for a different purpose (pattern learning, not consumption), and (3) it does not substitute for the original in its market. However, courts have not definitively settled the question, and the outcome may depend on the facts of each case.
Key Legal Cases
Several landmark cases are shaping the legal landscape:
| Case | Parties | Issue | Status |
|---|---|---|---|
| NYT v. OpenAI | New York Times vs OpenAI/Microsoft | Training on copyrighted articles | Pending |
| Getty v. Stability AI | Getty Images vs Stability AI | Training on copyrighted images | Pending |
| Authors v. OpenAI | Class action by authors | Training on copyrighted books | Pending |
| Thaler v. Perlmutter | Stephen Thaler vs US Copyright Office | Copyright registration for an AI-generated artwork | Ruled that human authorship is required |
The outcomes of these cases will likely shape how AI training data is treated under copyright law in their own jurisdictions and may influence courts and legislators elsewhere. A ruling that training requires licensing could fundamentally change the economics of LLM development.
Output Ownership
Who Owns AI-Generated Content?
Definition: Authorship Requirement
Traditional copyright law requires human authorship. A work generated entirely by AI without human creative input is generally not copyrightable in most jurisdictions. The US Copyright Office has stated that "the use of AI tools does not make a work uncopyrightable, but the output of AI tools without human authorship is not copyrightable."
The spectrum of human-AI collaboration:
| Scenario | Copyrightability | Example |
|---|---|---|
| AI-generated, no human input | Not copyrightable | Raw GPT output |
| AI-generated with selection/arrangement | Possibly copyrightable | Curated AI outputs |
| AI as tool, human directs | Copyrightable | Human writes with AI suggestions |
| AI-assisted editing | Copyrightable | Human uses AI for grammar/spelling |
This area is still developing. In 2023, the US Copyright Office registered a graphic novel that included AI-generated images (Zarya of the Dawn), but the registration covers only the human-written text and the human author's creative selection and arrangement of the material, not the individual AI-generated images.
Licensing and Data Rights
Data Licensing Models
Training Data License Value
$$V = \sum_{i=1}^{N} p_i \cdot v_i \cdot \text{Quality}_i$$
Here,
- $N$ = Number of works in corpus
- $p_i$ = Probability work $i$ is used in training
- $v_i$ = Market value of work $i$
- $\text{Quality}_i$ = Quality/relevance of work $i$
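To make the license value concrete, here is a minimal sketch that assumes the value is the sum, over all works in the corpus, of usage probability times market value times quality score. That functional form, the `Work` class, and the field names are assumptions made for illustration, not a standard valuation method.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Work:
    p_used: float        # probability the work is used in training (0..1)
    market_value: float  # market value of the work, in dollars
    quality: float       # quality/relevance score (assumed to lie in 0..1)


def license_value(works: Iterable[Work]) -> float:
    """Sum p_used * market_value * quality over every work in the corpus."""
    return sum(w.p_used * w.market_value * w.quality for w in works)


if __name__ == "__main__":
    corpus = [
        Work(p_used=1.0, market_value=10.0, quality=0.5),
        Work(p_used=0.5, market_value=20.0, quality=1.0),
    ]
    print(license_value(corpus))  # 15.0
```

Because the value is additive across works, the total can be computed in a single streaming pass over even a very large corpus.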
Open-Source Data Licenses
| License | Commercial Use | Training Data | Attribution |
|---|---|---|---|
| CC-BY-4.0 | Yes | Yes | Required |
| CC-BY-NC-4.0 | No | Yes | Required |
| CC0 | Yes | Yes | None |
| ODC-By | Yes | Yes | Required |
| Llama 2 Community | Yes | Restrictions | Required |
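A data pipeline might encode such a table as a lookup and filter documents by license before training. The sketch below copies the commercial-use and attribution columns for the four data licenses above. The document records and function name are hypothetical, and any license missing from the lookup is excluded so that it can be reviewed manually.

```python
# License terms copied from the table above (commercial use, attribution).
LICENSES = {
    "CC-BY-4.0": {"commercial": True, "attribution": True},
    "CC-BY-NC-4.0": {"commercial": False, "attribution": True},
    "CC0": {"commercial": True, "attribution": False},
    "ODC-By": {"commercial": True, "attribution": True},
}


def usable_for_commercial_training(docs):
    """Keep documents whose license is known and permits commercial use."""
    kept = []
    for doc in docs:
        terms = LICENSES.get(doc["license"])
        # Unknown licenses are dropped here and should be reviewed by hand.
        if terms is not None and terms["commercial"]:
            kept.append(doc)
    return kept


if __name__ == "__main__":
    docs = [
        {"id": "a", "license": "CC0"},
        {"id": "b", "license": "CC-BY-NC-4.0"},
        {"id": "c", "license": "Unknown"},
    ]
    print([d["id"] for d in usable_for_commercial_training(docs)])  # ['a']
```

Documents that pass the filter under an attribution-requiring license still need their attribution metadata carried through the pipeline.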
Data Marketplace Economics
As demand for training data grows, new economic models are emerging:
Data Value Pricing
$$P = w_Q \cdot Q + w_R \cdot R + w_U \cdot U$$
Here,
- $Q$ = Quality score (accuracy, consistency)
- $R$ = Rarity score (how unique is this data)
- $U$ = Utility score (task relevance)
- $w_Q, w_R, w_U$ = Market-determined weights
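As a rough illustration, the price can be modeled as a weighted sum of the quality, rarity, and utility scores. The linear form, the function name, and the default weights below are assumptions. A real marketplace could combine the scores differently.

```python
def data_price(quality: float, rarity: float, utility: float,
               weights: tuple = (1.0, 1.0, 1.0)) -> float:
    """Weighted sum of the three scores; weights are (quality, rarity, utility)."""
    w_q, w_r, w_u = weights
    return w_q * quality + w_r * rarity + w_u * utility


if __name__ == "__main__":
    # Hypothetical scores and weights: rarity is weighted most heavily.
    print(data_price(0.5, 0.25, 1.0, weights=(2.0, 4.0, 1.0)))  # 3.0
```

With this form, doubling a weight doubles the contribution of its score, which makes it easy to see how market preferences shift the price.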
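Some companies are already paying for data licenses: Reddit, for example, has reportedly licensed its content to Google for AI training for about $60 million per year, and in 2023 it announced commercial API pricing of $0.24 per 1,000 API calls. This model may become the norm.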
Emerging Regulations
EU AI Act (2024)
The EU AI Act classifies AI systems by risk level:
- Unacceptable risk: Banned (social scoring, real-time remote biometric identification in public spaces, subject to narrow law-enforcement exceptions)
- High risk: Strict requirements (transparency, human oversight, documentation)
- Limited risk: Transparency obligations (chatbots must disclose AI nature)
- Minimal risk: No mandatory obligations
US Executive Order on AI (2023)
Executive Order 14110 (October 2023), which was rescinded in January 2025, set out these priorities for powerful AI systems:
- Safety testing and reporting
- Watermarking of AI-generated content
- Privacy protections for training data
- Civil rights protections
International Harmonization
The lack of international harmonization creates compliance challenges for global AI deployment:
Definition: Regulatory Fragmentation
Regulatory fragmentation refers to the divergence in AI regulations across jurisdictions. This creates compliance costs for companies operating globally and may lead to "regulatory arbitrage" where AI development shifts to less regulated jurisdictions.
Key differences:
- EU: Risk-based, prescriptive requirements (AI Act)
- US: Sector-specific, voluntary frameworks (executive orders)
- China: Content control focused, government oversight
- Japan: Innovation-focused, minimal restrictions
For production LLM applications: (1) document your training data sources, (2) implement watermarking, (3) provide clear disclosure of AI use, (4) maintain human oversight for high-stakes decisions, and (5) consult legal counsel for jurisdiction-specific compliance.
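For step (1), one lightweight approach is to keep a machine-readable manifest of training data sources. The sketch below is one possible shape for such a record. The `SourceRecord` class and its field names are hypothetical and do not follow any regulatory schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class SourceRecord:
    name: str              # human-readable dataset or site name
    url: str               # where the data was obtained
    license: str           # license identifier, e.g. "CC0"
    retrieved: str         # ISO 8601 retrieval date
    opt_out_checked: bool  # whether rights reservations were checked


def to_manifest(records) -> str:
    """Serialize source records to a stable, diff-friendly JSON string."""
    return json.dumps([asdict(r) for r in records], indent=2, sort_keys=True)


if __name__ == "__main__":
    records = [
        SourceRecord("Example corpus", "https://example.com/data",
                     "CC0", "2024-01-01", True),
    ]
    print(to_manifest(records))
```

Keeping the manifest under version control alongside the training configuration makes it easier to answer later questions about what data a given model saw.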
Practice Exercises
- Conceptual: Explain the difference between "training" and "derivative work" in the context of copyright law. Why is the classification of LLM training as one or the other legally significant?
- Mathematical: If a licensing fee of $0.001 per work is required for training, and a model is trained on 1T tokens with an average work length of 5,000 tokens, compute the total licensing cost.
- Practical: Research the current legal status of AI-generated content in three different jurisdictions. What are the key differences in approach?
- Research: Compare the EU AI Act's risk-based classification with the US approach of sector-specific regulation. Which framework better balances innovation and safety?
Key Takeaways:
- Training LLMs on copyrighted data is likely fair use in the US but varies by jurisdiction
- AI-generated content without human authorship is generally not copyrightable
- Data licensing models are emerging but face challenges with scale and enforcement
- The EU AI Act introduces risk-based classification for AI systems
- Watermarking and transparency are increasingly required by regulation
What to Learn Next
-> LLM Watermarking: Statistical watermarks for AI-generated content detection.
-> Environmental Impact of LLMs: Energy costs and sustainable AI practices.
-> Bias and Fairness: Legal frameworks governing AI discrimination.
-> Open Source LLM Ecosystem: Open-source licensing and community models.
-> Future of LLMs: Trends, predictions, and regulatory developments.
-> LLM Benchmarking Suites: Comprehensive evaluation including safety benchmarks.