Copyright and Legal Issues

The legal landscape of LLMs is evolving rapidly: training data rights, output ownership, fair use doctrine, and emerging regulatory frameworks shape how these models can be built and deployed.

  • Training Data — Copyright, licensing, and data mining rights
  • Output Ownership — Who owns AI-generated content?
  • Regulation — EU AI Act, US executive orders, global frameworks

The law is reason, free from passion.

Definition: AI Copyright Challenge

The core legal question: Can copyrighted text be used to train a commercial LLM, and who owns the output? This question sits at the intersection of copyright law, fair use doctrine, and the nature of transformative AI systems.

Training Data and Copyright

The Data Mining Problem

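LLMs are trained on massive corpora that include copyrighted works such as books, articles, code, and web content. This raises a fundamental question: under what conditions, if any, may such works be copied and analyzed for training without the rightsholder's permission?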

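Definition: Text and Data Mining (TDM)

Text and data mining is the automated analysis of large volumes of text and data to extract information such as patterns and correlations. In the EU, the DSM Directive (2019) permits TDM by research organizations for scientific research (Art. 3), and permits TDM for other purposes, including commercial ones, unless the rightsholder has expressly reserved their rights, that is, opted out (Art. 4).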

| Jurisdiction | Training on Copyrighted Data | Legal Basis |
|---|---|---|
| United States | Likely fair use (transformative) | 17 U.S.C. § 107 |
| European Union | Permitted with opt-out mechanism | DSM Directive Art. 3-4 |
| United Kingdom | Permitted for non-commercial research | CDPA 1988 § 29A |
| Japan | Permitted (no opt-out required) | Copyright Act Art. 30-4 |
| China | Permitted with restrictions | Copyright Law Art. 24 |
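
The EU opt-out mechanism only works if crawlers check for a machine-readable reservation before collecting text. The sketch below assumes, purely for illustration, that a site expresses its opt-out through `robots.txt`; whether `robots.txt` alone satisfies the DSM Directive is not settled by the text above. The crawler name `ExampleAIBot` and the helper `may_crawl_for_training` are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def may_crawl_for_training(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text does not disallow this fetch.

    Assumption: a Disallow rule for the crawler is treated as a TDM opt-out.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt: the publisher blocks one AI crawler, allows others.
EXAMPLE_ROBOTS = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""

if __name__ == "__main__":
    url = "https://example.com/articles/1"
    for agent in ("ExampleAIBot", "OtherBot"):
        allowed = may_crawl_for_training(EXAMPLE_ROBOTS, agent, url)
        print(f"{agent}: {'allowed' if allowed else 'opted out'}")
```

A production pipeline would also need to record when and where each check was made, since an opt-out can be added after the data was collected.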

Fair Use Analysis (US)

The fair use doctrine considers four factors:

Fair Use Factors

$$\text{Fair Use} = f\left(\frac{\text{Purpose}}{\text{Nature}}, \frac{\text{Amount}}{\text{Market Effect}}\right)$$

Here,

  • Purpose = Transformative vs. reproductive use
  • Nature = Factual vs. creative nature of original
  • Amount = Portion used relative to whole work
  • Market Effect = Impact on market for original work

Many legal scholars argue that training LLMs on copyrighted data constitutes fair use in the US because: (1) it is transformative (the model learns statistical patterns rather than storing the works for redistribution, although models can sometimes reproduce memorized passages), (2) it uses the data for a different purpose (pattern learning, not consumption), and (3) it does not, in most cases, substitute for the original in its market. However, courts have not definitively settled the question.

Key Legal Cases

Several landmark cases are shaping the legal landscape:

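| Case | Parties | Issue | Status |
|---|---|---|---|
| NYT v. OpenAI | The New York Times vs. OpenAI and Microsoft | Training on copyrighted articles | Pending |
| Getty v. Stability AI | Getty Images vs. Stability AI | Training on copyrighted images | Pending |
| Authors v. OpenAI | Class action by authors | Training on copyrighted books | Pending |
| Thaler v. Perlmutter | Stephen Thaler vs. US Copyright Office | Copyright registration for an AI-generated work | Ruled against registration (human authorship required) |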

The outcomes of these cases will likely shape how AI training data is treated under copyright law within their own jurisdictions, and may influence courts and legislators elsewhere. A ruling that training requires licensing could fundamentally change the economics of LLM development.

Output Ownership

Who Owns AI-Generated Content?

Definition: Authorship Requirement

Traditional copyright law requires human authorship. A work generated entirely by AI without human creative input is generally not copyrightable in most jurisdictions. The US Copyright Office has stated that "the use of AI tools does not make a work uncopyrightable, but the output of AI tools without human authorship is not copyrightable."

The spectrum of human-AI collaboration:

| Scenario | Copyrightability | Example |
|---|---|---|
| AI-generated, no human input | Not copyrightable | Raw GPT output |
| AI-generated with selection/arrangement | Possibly copyrightable | Curated AI outputs |
| AI as tool, human directs | Copyrightable | Human writes with AI suggestions |
| AI-assisted editing | Copyrightable | Human uses AI for grammar/spelling |

The boundaries are still being drawn. In 2023, the US Copyright Office registered a graphic novel that included AI-generated images, but the registration covers only the human author's text and the creative selection and arrangement of the images, not the individual AI-generated images themselves.

Licensing and Data Rights

Data Licensing Models

Training Data License Value

$$V_{\text{license}} = \sum_{i=1}^{N} P(\text{use}_i) \cdot V_{\text{work}_i} \cdot \text{Quality}_i$$

Here,

  • $N$ = Number of works in corpus
  • $P(\text{use}_i)$ = Probability work $i$ is used in training
  • $V_{\text{work}_i}$ = Market value of work $i$
  • $\text{Quality}_i$ = Quality/relevance of work $i$
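
To make the license value formula concrete, here is a minimal sketch that evaluates the sum for a toy corpus. The `Work` class, its field names, and all of the numbers are illustrative assumptions, not real market data.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Work:
    p_use: float    # P(use_i): probability the work is used in training
    value: float    # V_work_i: market value of the work
    quality: float  # Quality_i: quality/relevance score


def license_value(works: Iterable[Work]) -> float:
    """Sum P(use_i) * V_work_i * Quality_i over all works in the corpus."""
    return sum(w.p_use * w.value * w.quality for w in works)


if __name__ == "__main__":
    # Hypothetical two-work corpus; the values are made up for the example.
    corpus = [
        Work(p_use=0.5, value=100.0, quality=0.8),  # contributes about 40
        Work(p_use=1.0, value=20.0, quality=0.5),   # contributes about 10
    ]
    print(f"Estimated license value: {license_value(corpus):.2f}")
```

Because each work contributes independently, the estimate scales linearly with corpus size, which is why per-work valuation becomes hard to negotiate at web scale.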

Open-Source Data Licenses

| License | Commercial Use | Training Use | Attribution |
|---|---|---|---|
| CC-BY-4.0 | Yes | Yes | Required |
| CC-BY-NC-4.0 | No | Yes | Required |
| CC0 | Yes | Yes | None |
| ODC-By | Yes | Yes | Required |
| Llama 2 Community | Yes | Restrictions | Required |
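
In practice, a table like the one above is often turned into a filter over the training corpus. The following sketch assumes each document carries a license identifier; the `Document` class, the `LICENSE_TERMS` mapping, and the function names are hypothetical, and the terms are copied from the table rather than from the license texts, so they should be checked with counsel before use.

```python
from dataclasses import dataclass

# Terms transcribed from the table above (an assumption, not legal advice).
# Licenses with special restrictions are deliberately left out.
LICENSE_TERMS = {
    "CC-BY-4.0": {"commercial": True, "attribution": True},
    "CC-BY-NC-4.0": {"commercial": False, "attribution": True},
    "CC0": {"commercial": True, "attribution": False},
    "ODC-By": {"commercial": True, "attribution": True},
}


@dataclass
class Document:
    doc_id: str
    license: str


def filter_for_commercial_training(docs):
    """Split docs into (kept, rejected); unknown licenses are rejected."""
    kept, rejected = [], []
    for doc in docs:
        terms = LICENSE_TERMS.get(doc.license)
        if terms is not None and terms["commercial"]:
            kept.append(doc)
        else:
            rejected.append(doc)
    return kept, rejected


def attribution_list(docs):
    """Return the ids of kept documents whose license requires attribution."""
    return [d.doc_id for d in docs if LICENSE_TERMS[d.license]["attribution"]]


if __name__ == "__main__":
    docs = [Document("a", "CC-BY-4.0"), Document("b", "CC-BY-NC-4.0"),
            Document("c", "CC0"), Document("d", "Unknown-1.0")]
    kept, rejected = filter_for_commercial_training(docs)
    print("kept:", [d.doc_id for d in kept])
    print("rejected:", [d.doc_id for d in rejected])
    print("needs attribution:", attribution_list(kept))
```

Rejecting unknown licenses by default is a conservative design choice: it loses some data but avoids silently training on material whose terms were never checked.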

Data Marketplace Economics

As demand for training data grows, new economic models are emerging:

Data Value Pricing

$$P_{\text{data}} = \alpha \cdot Q + \beta \cdot R + \gamma \cdot U$$

Here,

  • $Q$ = Quality score (accuracy, consistency)
  • $R$ = Rarity score (how unique is this data)
  • $U$ = Utility score (task relevance)
  • $\alpha, \beta, \gamma$ = Market-determined weights
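
The pricing formula is a plain weighted sum, which the sketch below evaluates for two hypothetical datasets. The scores, the weights, and the function name `data_price` are assumptions chosen for illustration; the text does not specify score ranges or how the market sets the weights.

```python
def data_price(q: float, r: float, u: float,
               alpha: float, beta: float, gamma: float) -> float:
    """Compute P_data = alpha * Q + beta * R + gamma * U."""
    return alpha * q + beta * r + gamma * u


if __name__ == "__main__":
    # Hypothetical weights that favor quality over rarity and utility.
    weights = {"alpha": 0.5, "beta": 0.3, "gamma": 0.2}

    # Hypothetical scores on a 0-1 scale (the scale itself is an assumption).
    common_web_text = data_price(q=0.6, r=0.1, u=0.5, **weights)
    rare_expert_text = data_price(q=0.9, r=0.8, u=0.5, **weights)

    print(f"common web text:  {common_web_text:.2f}")
    print(f"rare expert text: {rare_expert_text:.2f}")
```

With these weights, the rarer and higher-quality dataset prices higher even though both have the same utility score, which is the behavior the formula is meant to capture.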

Some companies are already paying for data licenses. Reddit, for example, has reportedly licensed its content to Google for AI training in a deal worth about $60 million per year. This model may become the norm.

Emerging Regulations

EU AI Act (2024)

The EU AI Act classifies AI systems by risk level:

  • Unacceptable risk: Banned (e.g., social scoring, real-time remote biometric identification in public spaces, subject to narrow exceptions)
  • High risk: Strict requirements (transparency, human oversight, documentation)
  • Limited risk: Transparency obligations (chatbots must disclose AI nature)
  • Minimal risk: No mandatory obligations

US Executive Order on AI (2023)

Executive Order 14110 (October 2023), which was rescinded in January 2025, set out these key priorities for powerful AI systems:

  • Safety testing and reporting
  • Watermarking of AI-generated content
  • Privacy protections for training data
  • Civil rights protections

International Harmonization

The lack of international harmonization creates compliance challenges for global AI deployment:

Definition: Regulatory Fragmentation

Regulatory fragmentation refers to the divergence in AI regulations across jurisdictions. This creates compliance costs for companies operating globally and may lead to "regulatory arbitrage" where AI development shifts to less regulated jurisdictions.

Key differences:

  • EU: Risk-based, prescriptive requirements (AI Act)
  • US: Sector-specific, voluntary frameworks (executive orders)
  • China: Content control focused, government oversight
  • Japan: Innovation-focused, minimal restrictions

For production LLM applications: (1) document your training data sources, (2) implement watermarking, (3) provide clear disclosure of AI use, (4) maintain human oversight for high-stakes decisions, and (5) consult legal counsel for jurisdiction-specific compliance.

Practice Exercises

  1. Conceptual: Explain the difference between "training" and "derivative work" in the context of copyright law. Why is the classification of LLM training as one or the other legally significant?

  2. Mathematical: If a licensing fee of $0.001 per work is required for training, and a model is trained on 1T tokens with an average work length of 5,000 tokens, compute the total licensing cost.

  3. Practical: Research the current legal status of AI-generated content in three different jurisdictions. What are the key differences in approach?

  4. Research: Compare the EU AI Act's risk-based classification with the US approach of sector-specific regulation. Which framework better balances innovation and safety?

Key Takeaways:

  • Training LLMs on copyrighted data is likely fair use in the US but varies by jurisdiction
  • AI-generated content without human authorship is generally not copyrightable
  • Data licensing models are emerging but face challenges with scale and enforcement
  • The EU AI Act introduces risk-based classification for AI systems
  • Watermarking and transparency are increasingly required by regulation

What to Learn Next

  • LLM Watermarking: Statistical watermarks for AI-generated content detection.
  • Environmental Impact of LLMs: Energy costs and sustainable AI practices.
  • Bias and Fairness: Legal frameworks governing AI discrimination.
  • Open Source LLM Ecosystem: Open-source licensing and community models.
  • Future of LLMs: Trends, predictions, and regulatory developments.
  • LLM Benchmarking Suites: Comprehensive evaluation including safety benchmarks.
