Advanced Topics
Environmental Impact of LLMs
Training and running LLMs has significant environmental costs. Understanding and mitigating these costs is essential for sustainable AI development.
- Carbon Footprint — Training emissions, inference costs, lifecycle analysis
- Energy Efficiency — Hardware optimization, algorithmic improvements
- Sustainable AI — Renewable energy, model efficiency, responsible deployment
We do not inherit the earth from our ancestors—we borrow it from our children.
Definition: AI Carbon Footprint
The carbon footprint of an LLM includes: (1) embodied carbon from hardware manufacturing, (2) training energy consumption, (3) inference energy consumption over the model's lifetime, and (4) cooling infrastructure energy. The total lifecycle emissions are the sum of these components.
Training Carbon Emissions
Estimating Training Emissions
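Training Carbon Emissions
$$C_{\text{train}} = \frac{F}{\eta_{\text{hw}}} \times \text{PUE} \times CI$$
Here,
- $F$ = Total training compute (FLOPs)
- $\eta_{\text{hw}}$ = Delivered hardware efficiency (FLOPs per kWh), which converts compute into IT energy
- $\text{PUE}$ = Power Usage Effectiveness (typically 1.1-1.5)
- $CI$ = Carbon intensity of grid electricity (gCO2/kWh)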
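The estimate above can be sketched in a few lines of Python. This is a minimal illustration, not a measurement tool: the function names are made up for this example, and the delivered efficiency of 1e11 FLOPs per joule is an assumed placeholder, since real values depend heavily on the accelerator and on utilization.

```python
JOULES_PER_KWH = 3.6e6


def training_energy_kwh(total_flops: float, flops_per_joule: float) -> float:
    """IT energy (kWh) needed to execute total_flops at a delivered efficiency."""
    return total_flops / flops_per_joule / JOULES_PER_KWH


def training_emissions_tonnes(it_energy_kwh: float, pue: float,
                              ci_g_per_kwh: float) -> float:
    """Emissions in tonnes CO2: IT energy scaled by PUE, times grid intensity."""
    grams = it_energy_kwh * pue * ci_g_per_kwh
    return grams / 1e6  # grams -> tonnes


# Hypothetical run: 1e24 FLOPs, assumed 1e11 FLOPs/J delivered,
# PUE of 1.2, grid at 300 gCO2/kWh.
energy_kwh = training_energy_kwh(total_flops=1e24, flops_per_joule=1e11)
emissions_t = training_emissions_tonnes(energy_kwh, pue=1.2, ci_g_per_kwh=300)
print(f"{energy_kwh:,.0f} kWh IT energy -> {emissions_t:,.0f} t CO2")
```

Under these assumed inputs the run uses about 2.8 GWh of IT energy and emits about 1,000 tonnes of CO2. Halving either the PUE overhead or the grid intensity scales the result linearly.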
Empirical Estimates
| Model | Training Compute (FLOPs) | Energy (MWh) | CO₂ (tonnes) | Equivalent (at ~10 t CO₂ per car-year) |
|---|---|---|---|---|
| GPT-3 175B | 3.14 × 10²³ | ~1,287 | ~552 | ~55 cars/year |
| LLaMA 2 70B | 1.7 × 10²⁴ | ~690 | ~300 | ~30 cars/year |
| GPT-4 (est.) | 2.1 × 10²⁵ | ~8,500 | ~3,600 | ~360 cars/year |
| Gemini Ultra | ~5 × 10²⁵ | ~20,000 | ~8,500 | ~850 cars/year |
The carbon intensity of electricity varies dramatically by location: ~50 gCO2/kWh in France (nuclear), ~300 gCO2/kWh for the US average, and ~600 gCO2/kWh in China (coal-heavy). At these figures, running the same training job on a grid like France's instead of the US-average or Chinese grid cuts emissions by roughly 6-12x.
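To make the location effect concrete, the sketch below applies the three grid intensities quoted above to one fixed energy budget (the ~1,287 MWh GPT-3 figure from the table). The dictionary and function names are illustrative only, and PUE is ignored so that the comparison isolates the grid.

```python
# Grid intensities quoted in the text, in gCO2/kWh.
GRID_INTENSITY_G_PER_KWH = {"France": 50, "US average": 300, "China": 600}


def emissions_by_region(energy_mwh: float, intensities: dict) -> dict:
    """Tonnes of CO2 for the same energy on each grid.

    MWh * 1000 kWh/MWh * g/kWh / 1e6 g/tonne simplifies to MWh * g/kWh / 1000.
    """
    return {region: energy_mwh * ci / 1000 for region, ci in intensities.items()}


result = emissions_by_region(1287, GRID_INTENSITY_G_PER_KWH)
for region, tonnes in result.items():
    ratio = tonnes / result["France"]
    print(f"{region:<11} {tonnes:8.1f} t CO2  ({ratio:.0f}x France)")
```

The same job comes out at roughly 64 t, 386 t, and 772 t respectively, which is where the 6x and 12x ratios come from.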
Embodied Carbon
The carbon cost of AI goes beyond electricity—hardware manufacturing has significant embodied emissions:
Definition: Embodied Carbon
Embodied carbon refers to the total greenhouse gas emissions produced during the manufacturing, transportation, and disposal of hardware. A single GPU (e.g., NVIDIA H100) has an estimated embodied carbon of 150-300 kg CO2e, which is amortized over its 3-5 year lifespan.
| Component | Embodied Carbon | Lifespan | Annual Carbon |
|---|---|---|---|
| NVIDIA H100 GPU | 200 kg CO2e | 4 years | 50 kg/year |
| Server rack | 500 kg CO2e | 5 years | 100 kg/year |
| Data center building | 10,000+ tonnes | 20 years | 500+ tonnes/year |
| Cooling system | 2,000+ tonnes | 15 years | 133+ tonnes/year |
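The amortization in the table is a straight-line division, and the same idea can attribute a slice of embodied carbon to a single job. The sketch below is a simplified illustration with hypothetical function names; it assumes the device is busy for its entire lifespan, which understates the per-hour share for hardware that sits idle.

```python
HOURS_PER_YEAR = 8760


def annual_embodied_kg(embodied_kg: float, lifespan_years: float) -> float:
    """Straight-line amortization of embodied carbon per year of service."""
    return embodied_kg / lifespan_years


def embodied_kg_for_job(embodied_kg: float, lifespan_years: float,
                        device_hours: float) -> float:
    """Share of embodied carbon attributed to a job, by device-hours used.

    Assumes 100% utilization over the lifespan (an optimistic simplification).
    """
    kg_per_device_hour = embodied_kg / (lifespan_years * HOURS_PER_YEAR)
    return kg_per_device_hour * device_hours


# H100 row from the table: 200 kg CO2e over 4 years.
h100_annual = annual_embodied_kg(200, 4)

# Hypothetical job: 1,000 GPUs for 30 days.
job_kg = embodied_kg_for_job(200, 4, device_hours=1000 * 30 * 24)
print(f"H100: {h100_annual:.0f} kg/year; job share: {job_kg / 1000:.1f} t CO2e")
```

For this hypothetical job the GPU share is about 4.1 t CO2e, before counting servers, buildings, or cooling equipment.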
Training vs Inference Emissions
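Lifetime Emissions Ratio
$$R = \frac{C_{\text{inf}}}{C_{\text{train}}} = \frac{c_q \, N_q \, \bar{t}}{C_{\text{train}}}$$
Here,
- $C_{\text{train}}$ = One-time training emissions
- $c_q$ = Emissions per query per second of generation
- $N_q$ = Total queries over the model's lifetime
- $\bar{t}$ = Average response length (seconds)
Inference's share of lifetime emissions is then $R / (1 + R)$, so $R > 1$ means inference outweighs training.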
For popular models, inference typically dominates lifetime emissions (70-90%) because models serve millions of queries daily for months or years.
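A short sketch shows how quickly inference overtakes training. The workload numbers here (0.3 g CO2 per query, 10 million queries per day, a two-year deployment) are assumptions chosen for illustration, and the per-query figure already folds together the per-second emissions rate and the average response length. Only the 552 t training figure comes from the table above.

```python
def inference_share(training_t: float, g_per_query: float,
                    queries_per_day: float, days: float) -> float:
    """Fraction of lifetime (training + inference) emissions due to inference."""
    inference_t = g_per_query * queries_per_day * days / 1e6  # grams -> tonnes
    return inference_t / (training_t + inference_t)


def break_even_days(training_t: float, g_per_query: float,
                    queries_per_day: float) -> float:
    """Days of serving after which inference emissions equal training emissions."""
    daily_t = g_per_query * queries_per_day / 1e6
    return training_t / daily_t


share = inference_share(training_t=552, g_per_query=0.3,
                        queries_per_day=10_000_000, days=730)
days = break_even_days(training_t=552, g_per_query=0.3,
                       queries_per_day=10_000_000)
print(f"Inference share after 2 years: {share:.0%}; break-even at {days:.0f} days")
```

With these inputs inference passes training after about 184 days and accounts for roughly 80% of emissions after two years. A model that is trained once and queried rarely sits on the other side of the break-even point, which is the case where training dominates.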
Inference Carbon Footprint
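Per-Query Carbon Footprint
$$C_{\text{query}} = \frac{P_{\text{GPU}}}{Q_{\text{GPU}}} \times \frac{CI}{3.6 \times 10^{6}}$$
Here,
- $P_{\text{GPU}}$ = Power draw per GPU (W, i.e., joules per second)
- $CI$ = Grid carbon intensity (gCO2/kWh)
- $Q_{\text{GPU}}$ = Queries per GPU per second
The factor $3.6 \times 10^{6}$ converts joules to kWh, so $C_{\text{query}}$ is in gCO2 per query.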
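The per-query calculation is mostly unit conversion, which a small function makes explicit. The inputs below (a 700 W GPU serving one query every two seconds on a 300 gCO2/kWh grid) are assumed values for illustration, and the function name is hypothetical. Data center overhead (PUE) is left out, matching the three quantities defined above.

```python
JOULES_PER_KWH = 3.6e6


def grams_co2_per_query(gpu_power_w: float, queries_per_gpu_per_s: float,
                        ci_g_per_kwh: float) -> float:
    """gCO2 per query from GPU power draw, throughput, and grid intensity."""
    joules_per_query = gpu_power_w / queries_per_gpu_per_s  # W / (1/s) = J
    kwh_per_query = joules_per_query / JOULES_PER_KWH
    return kwh_per_query * ci_g_per_kwh


per_query = grams_co2_per_query(gpu_power_w=700, queries_per_gpu_per_s=0.5,
                                ci_g_per_kwh=300)
print(f"{per_query:.3f} gCO2 per query")
```

Doubling throughput at the same power draw halves the per-query figure, which is why batching and other throughput optimizations matter so much for serving.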
Inference Efficiency Comparison
| Model | Parameters | Queries/GPU/hour | Energy/1M tokens |
|---|---|---|---|
| GPT-3.5 | ~20B (est.) | ~500 | ~0.3 kWh |
| LLaMA 2 70B | 70B | ~100 | ~1.2 kWh |
| GPT-4 | ~1.8T (est.) | ~5 | ~12 kWh |
| Claude 3 Opus | ~2T (est.) | ~3 | ~18 kWh |
Energy Efficiency Improvements
Model-Level Efficiency
Energy-Performance Tradeoff
$$\eta(N) = \frac{A(N)}{E(N)}$$
Here,
- $A(N)$ = Model accuracy/capability
- $E(N)$ = Total energy for training + inference
- $N$ = Model size/compute budget
Key efficiency strategies:
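- Quantization: 4-bit inference can reduce energy by up to ~4x vs FP16, mainly by cutting memory traffic
- Pruning: Sparse models (e.g., 50% sparsity) reduce energy roughly in proportion to the weights removed, provided the hardware can skip the zeros
- Knowledge distillation: Smaller student models retain most of the teacher's capability at a fraction of the energy
- Mixture of Experts: Only the relevant expert subnetworks are activated for each token, so compute scales with active rather than total parameters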
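The quantization and pruning figures above can be combined in a first-order estimate. The sketch below assumes energy scales linearly with bits per weight and with the fraction of weights kept, which is a best-case, memory-bound simplification rather than a measured result. The function name and model are illustrative assumptions.

```python
def relative_energy(bits: int, sparsity: float = 0.0,
                    baseline_bits: int = 16) -> float:
    """Energy relative to a dense baseline (1.0 = dense FP16).

    Assumes energy is proportional to bits per weight and to the share of
    non-zero weights; real savings are usually smaller.
    """
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    return (bits / baseline_bits) * (1.0 - sparsity)


configs = {
    "FP16 dense": relative_energy(16),
    "INT8 dense": relative_energy(8),
    "INT4 dense": relative_energy(4),
    "INT4 + 50% sparsity": relative_energy(4, sparsity=0.5),
}
for name, rel in configs.items():
    print(f"{name:<20} {rel:.3f} of baseline  (~{1 / rel:.0f}x saving)")
```

Treat the outputs as upper bounds: dequantization overhead, activations kept at higher precision, and hardware without sparse kernels all eat into the savings.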
Hardware-Level Efficiency
| Technology | Energy Reduction | Maturity |
|---|---|---|
| GPU optimization (H100 vs A100) | ~2x | Production |
| Sparse tensor cores | ~1.5x | Production |
| Analog AI chips | ~10x | Research |
| Optical computing | ~100x | Early research |
The most impactful actions for reducing LLM environmental impact are not training smaller models, but rather: (1) serving more queries per GPU (throughput optimization), (2) using renewable energy for data centers, and (3) implementing aggressive caching to avoid redundant inference.
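The caching point can be illustrated with Python's standard `functools.lru_cache`. This is a minimal exact-match sketch: `run_model` is a hypothetical stand-in for a real inference call, and the whitespace and case normalization is an assumption that would be wrong for case-sensitive prompts. Exact-match caching also only makes sense when responses are meant to be deterministic.

```python
from functools import lru_cache

model_calls = 0  # counts how often the expensive path actually runs


def run_model(prompt: str) -> str:
    """Hypothetical stand-in for an expensive LLM inference call."""
    global model_calls
    model_calls += 1
    return f"response to: {prompt}"


@lru_cache(maxsize=10_000)
def cached_generate(normalized_prompt: str) -> str:
    """Run inference only on a cache miss."""
    return run_model(normalized_prompt)


def generate(prompt: str) -> str:
    # Collapse whitespace and lowercase so trivially different prompts
    # share one cache entry.
    return cached_generate(" ".join(prompt.lower().split()))


prompts = ["What is PUE?", "what is  PUE?", "Define WUE", "What is PUE?"]
for p in prompts:
    generate(p)

info = cached_generate.cache_info()
print(f"{len(prompts)} requests, {model_calls} model calls, {info.hits} cache hits")
```

Here four requests trigger only two model calls. Every cache hit avoids the full per-query energy cost, so the saving scales directly with the hit rate.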
Water Consumption
AI data centers also consume significant water for cooling:
Definition: Water Usage Effectiveness (WUE)
WUE measures the liters of water consumed per kWh of IT energy. Modern data centers have WUE of 0.5-2.0 L/kWh. A large AI training run may consume millions of liters of water for cooling.
| Facility | WUE (L/kWh) | Annual Water | Notes |
|---|---|---|---|
| Google (average) | 0.5 | 4.3B gallons | Includes all data centers |
| Microsoft (average) | 0.6 | 3.5B gallons | Water-stressed regions |
| Typical AI data center | 1.0-1.5 | Varies | Air-cooled |
| Liquid-cooled facility | 0.1-0.3 | Minimal | Emerging technology |
Water consumption is often overlooked in AI environmental assessments. In water-stressed regions, AI data center water use can compete with agricultural and domestic needs.
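Because WUE is liters per kWh of IT energy, cooling water for a job is a single multiplication. The sketch below applies the 0.5-2.0 L/kWh range quoted above to the ~1,287 MWh GPT-3 training energy from the earlier table; the function name is illustrative, and treating that figure as pure IT energy is an assumption.

```python
def cooling_water_liters(it_energy_kwh: float, wue_l_per_kwh: float) -> float:
    """Water consumed for cooling, given IT energy and WUE."""
    return it_energy_kwh * wue_l_per_kwh


it_energy_kwh = 1287 * 1000  # 1,287 MWh expressed in kWh
low = cooling_water_liters(it_energy_kwh, 0.5)
high = cooling_water_liters(it_energy_kwh, 2.0)
print(f"Estimated cooling water: {low:,.0f} to {high:,.0f} liters")
```

The result, roughly 0.6 to 2.6 million liters, is consistent with the "millions of liters" scale mentioned for large training runs.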
Sustainable AI Practices
Carbon-Aware Computing
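Carbon-Aware Scheduling
$$t^{*} = \arg\min_{t_0} \int_{t_0}^{t_0 + D} CI(t) \, E(t) \, dt$$
Here,
- $t^{*}$ = Optimal start time for the compute job
- $CI(t)$ = Grid carbon intensity at time $t$
- $E(t)$ = Energy consumption (power draw) at time $t$
- $D$ = Duration of the job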
Scheduling compute jobs when the grid is cleanest can reduce emissions by 30-50% without changing the compute itself.
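In discrete form, carbon-aware scheduling is a sliding-window search over a carbon intensity forecast. The sketch below is a minimal version with made-up hourly forecast values and a hypothetical function name; a real scheduler would pull forecasts from a grid data provider and respect deadlines.

```python
def best_start_hour(ci_forecast, energy_kwh_per_hour):
    """Return (start_index, total_grams) minimizing emissions for the job.

    ci_forecast: hourly grid carbon intensity in gCO2/kWh.
    energy_kwh_per_hour: energy the job uses in each hour it runs.
    """
    duration = len(energy_kwh_per_hour)
    if duration == 0 or duration > len(ci_forecast):
        raise ValueError("job must be non-empty and fit inside the forecast")
    best = None
    for start in range(len(ci_forecast) - duration + 1):
        grams = sum(ci_forecast[start + k] * energy_kwh_per_hour[k]
                    for k in range(duration))
        if best is None or grams < best[1]:
            best = (start, grams)
    return best


# Made-up 12-hour forecast with a midday low-carbon window.
forecast = [450, 430, 400, 350, 280, 220, 200, 210, 260, 340, 420, 460]
job = [100, 100, 100]  # a 3-hour job using 100 kWh per hour

start, grams = best_start_hour(forecast, job)
run_now = sum(ci * e for ci, e in zip(forecast, job))
print(f"Start at hour {start}: {grams / 1000:.0f} kg CO2 "
      f"vs {run_now / 1000:.0f} kg if started immediately")
```

With this illustrative forecast, waiting five hours roughly halves emissions (63 kg vs 128 kg). The cost is latency, which is acceptable for batch jobs but usually not for interactive inference.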
Reporting Standards
| Metric | Description | Standard |
|---|---|---|
| PUE | Power Usage Effectiveness | ISO/IEC 30134-2 |
| CUE | Carbon Usage Effectiveness | ISO/IEC 30134-8 |
| WUE | Water Usage Effectiveness | ISO/IEC 30134-9 |
| SCI | Software Carbon Intensity (emissions per functional unit of software) | Green Software Foundation |
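The three data center metrics in the table share one denominator, IT equipment energy, so they can be computed together from facility totals. The sketch below uses the commonly cited ratio definitions; the facility figures are invented for illustration and the function names are hypothetical.

```python
def pue(total_facility_kwh: float, it_kwh: float) -> float:
    """Power Usage Effectiveness: total facility energy / IT energy."""
    return total_facility_kwh / it_kwh


def cue(total_co2_kg: float, it_kwh: float) -> float:
    """Carbon Usage Effectiveness: kg CO2 per kWh of IT energy."""
    return total_co2_kg / it_kwh


def wue(water_liters: float, it_kwh: float) -> float:
    """Water Usage Effectiveness: liters of water per kWh of IT energy."""
    return water_liters / it_kwh


# Invented annual totals for a small facility on a 300 gCO2/kWh grid.
it_kwh = 1_000_000
facility_kwh = 1_200_000
co2_kg = facility_kwh * 300 / 1000  # g -> kg
water_l = 500_000

print(f"PUE={pue(facility_kwh, it_kwh):.2f}  "
      f"CUE={cue(co2_kg, it_kwh):.2f} kgCO2/kWh  "
      f"WUE={wue(water_l, it_kwh):.2f} L/kWh")
```

When all energy comes from the grid, CUE equals PUE times the grid carbon intensity (here 1.2 × 0.3 = 0.36), which makes a quick consistency check on reported numbers.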
Practice Exercises
- Conceptual: Explain why inference emissions typically dominate training emissions for production LLMs. Under what circumstances would training emissions dominate?
- Mathematical: If a 70B parameter model is served 1M queries/day at 500 tokens average length, and each query uses 0.001 kWh of energy, compute the annual inference carbon footprint assuming a grid intensity of 200 gCO2/kWh.
- Practical: Compare the energy consumption of running inference on the same model using FP16, INT8, and INT4 quantization. What is the energy savings at each level?
- Research: Design a carbon-aware scheduling system for LLM inference. What are the tradeoffs between latency and carbon footprint?
Key Takeaways:
- Training emissions depend on compute, PUE, and grid carbon intensity
- Inference typically dominates lifetime emissions (70-90%) for popular models
- Quantization, pruning, and distillation reduce energy, at best roughly in proportion to the reduction in bits, weights, or parameters
- Carbon-aware scheduling can reduce emissions by 30-50%
- Reporting standards (PUE, CUE, WUE) enable transparent environmental accounting
What to Learn Next
-> Model Compression Pipeline: End-to-end compression: quantization + pruning + distillation.
-> Quantization Techniques: Deep dive into quantization for energy efficiency.
-> Future of LLMs: Trends toward more efficient AI systems.
-> Hardware-Aware LLM Design: Co-designing models and hardware for efficiency.
-> Copyright and Legal Issues: Legal frameworks governing AI sustainability.
-> LLM Optimization for Mobile: Edge deployment for energy-efficient inference.