Mathematics for Data Science & AI

Information Theory — Measuring Information

Master information theory for AI: entropy, cross-entropy, KL divergence, and their applications in machine learning loss functions.




ℹ️ Why It Matters

Information theory quantifies information, measures surprise, and forms the basis for many ML loss functions (cross-entropy, KL divergence).


What is Information?

Definition: Information

Surprise = Information

If something very unlikely happens → HIGH information
If something very likely happens → LOW information

Information Content

$$I(x) = -\log_2(P(x))$$

Here,

  • $P(x)$ = probability of event $x$
  • $I(x)$ = information content of $x$ (in bits)

📝 Example: Information Content

  • $P(\text{heads}) = 0.5 \rightarrow I(\text{heads}) = -\log_2(0.5) = 1$ bit
  • $P(\text{rolling a 6}) = 1/6 \rightarrow I(\text{rolling a 6}) = -\log_2(1/6) \approx 2.58$ bits
  • $P(\text{sun rises}) = 0.999 \rightarrow I(\text{sun rises}) = -\log_2(0.999) \approx 0.00144$ bits
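The three values above can be reproduced with a few lines of Python. This is a minimal sketch; the function name `information_content` is chosen here for illustration and is not part of the lesson.

```python
import math

def information_content(p: float) -> float:
    """Self-information -log2(p), in bits, of an event with probability p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return -math.log2(p)

# The events used in the example above.
for label, p in [("heads", 0.5), ("rolling a 6", 1 / 6), ("sun rises", 0.999)]:
    print(f"{label}: {information_content(p):.5f} bits")
```

The rarer the event, the larger the value; an event with probability 1 carries zero bits.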

Entropy

Definition: Entropy

The average information (surprise) of a random variable.

Entropy

$$H(X) = -\sum_i P(x_i) \times \log_2(P(x_i))$$

Here,

  • $H(X)$ = entropy of random variable $X$ (in bits)

📝 Example: Entropy

  • Fair coin: $H = -(0.5 \times \log_2(0.5) + 0.5 \times \log_2(0.5)) = 1$ bit
  • Loaded coin (90% heads): $H = -(0.9 \times \log_2(0.9) + 0.1 \times \log_2(0.1)) \approx 0.469$ bits
  • Certain outcome: $H = 0$ bits

Properties:

  • $H \geq 0$ (entropy is never negative)
  • $H$ is maximized when all outcomes are equally likely; for $n$ outcomes the maximum is $\log_2 n$ bits
  • $H$ measures uncertainty: more entropy = more uncertainty
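The coin examples and the properties listed above can be checked numerically. The sketch below assumes the distribution is given as a plain list of probabilities; the helper name `entropy` and the four-outcome distributions are illustrative choices, not taken from the lesson.

```python
import math

def entropy(probs):
    """Shannon entropy in bits; zero-probability outcomes contribute 0 by convention."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(-p * math.log2(p) for p in probs if p > 0)

fair = entropy([0.5, 0.5])        # fair coin
loaded = entropy([0.9, 0.1])      # loaded coin, 90% heads
certain = entropy([1.0, 0.0])     # outcome known in advance

# Hypothetical four-outcome distributions to illustrate the maximum.
uniform4 = entropy([0.25] * 4)
skewed4 = entropy([0.7, 0.1, 0.1, 0.1])

print(f"fair={fair:.3f}, loaded={loaded:.3f}, certain={certain:.3f}")
print(f"uniform4={uniform4:.3f}, skewed4={skewed4:.3f}")
```

The uniform distribution over four outcomes reaches $\log_2 4 = 2$ bits, and any skewed distribution over the same outcomes falls below it.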

Cross-Entropy

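Definition: Cross-Entropy

The average number of bits needed to encode outcomes drawn from the true distribution $P$ using a code optimized for the distribution $Q$. It is smallest when $Q = P$, where it equals the entropy $H(P)$, and it grows as $Q$ moves away from $P$.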

Cross-Entropy

$$H(P, Q) = -\sum_i P(x_i) \times \log_2(Q(x_i))$$

Here,

  • $P$ = true distribution
  • $Q$ = predicted distribution

ℹ️ In ML

$$\text{Loss} = -\sum_i y_i \times \log(\hat{y}_i)$$

  • $y$ = true labels (one-hot encoded)
  • $\hat{y}$ = predicted probabilities

Because $y$ is one-hot, only the true class $c$ contributes, so the loss reduces to $-\log(\hat{y}_c)$.

💡 Why cross-entropy is used in classification

  • If the model predicts the correct class with high probability → low loss
  • If the model assigns low probability to the correct class → high loss
  • It heavily penalizes confident wrong predictions
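The three points above can be seen on a single three-class example. This is a sketch under stated assumptions: it uses the natural logarithm (common in ML libraries, so the result is in nats rather than bits), clips predictions with a small `eps` to avoid `log(0)`, and the function name and probability vectors are made up for illustration.

```python
import math

def cross_entropy_loss(y_true, y_pred, eps=1e-12):
    """-sum(y_i * ln(y_hat_i)) in nats; eps guards against log(0)."""
    return sum(-y * math.log(max(q, eps)) for y, q in zip(y_true, y_pred))

y = [0, 1, 0]  # one-hot label: the true class is index 1

confident_right = cross_entropy_loss(y, [0.05, 0.90, 0.05])
unsure = cross_entropy_loss(y, [0.30, 0.40, 0.30])
confident_wrong = cross_entropy_loss(y, [0.90, 0.05, 0.05])

print(f"confident and right: {confident_right:.3f}")
print(f"unsure:              {unsure:.3f}")
print(f"confident and wrong: {confident_wrong:.3f}")
```

The loss depends only on the probability given to the true class, so moving that probability from 0.90 to 0.05 raises the loss from about 0.105 to about 2.996.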

KL Divergence

Definition: KL Divergence

Measures how much information is lost when $Q$ is used to approximate $P$.

KL Divergence

$$KL(P \| Q) = \sum_i P(x_i) \times \log\left(\frac{P(x_i)}{Q(x_i)}\right)$$

Here,

  • $KL(P \| Q)$ = KL divergence from $Q$ to $P$

When the same logarithm base is used in all three quantities (base 2 gives bits):

$$KL(P \| Q) = H(P, Q) - H(P) = \text{Cross-Entropy} - \text{Entropy}$$
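The relation between KL divergence, cross-entropy and entropy follows from splitting the logarithm of the ratio into a difference of logarithms. A short derivation, using the same symbols as the definitions above and assuming one logarithm base throughout:

```latex
\begin{aligned}
KL(P \| Q) &= \sum_i P(x_i) \log\left(\frac{P(x_i)}{Q(x_i)}\right) \\
           &= \sum_i P(x_i) \log P(x_i) - \sum_i P(x_i) \log Q(x_i) \\
           &= -H(P) + H(P, Q) \\
           &= H(P, Q) - H(P)
\end{aligned}
```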

Properties:

  • $KL(P \| Q) \geq 0$ (always non-negative)
  • $KL(P \| Q) = 0$ if and only if $P = Q$
  • KL is NOT symmetric: in general $KL(P \| Q) \neq KL(Q \| P)$
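These properties can be checked on the fair and loaded coins from the entropy example. The sketch below works in bits (base 2) and assumes $Q(x_i) > 0$ wherever $P(x_i) > 0$; the helper names are illustrative.

```python
import math

def entropy(p):
    """H(P) in bits."""
    return sum(-pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(P, Q) in bits; assumes q_i > 0 wherever p_i > 0."""
    return sum(-pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """KL(P || Q) in bits; assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]  # fair coin
q = [0.9, 0.1]  # loaded coin

kl_pq = kl_divergence(p, q)
kl_qp = kl_divergence(q, p)

print(f"KL(P||Q) = {kl_pq:.3f} bits")
print(f"KL(Q||P) = {kl_qp:.3f} bits")
print(f"H(P,Q) - H(P) = {cross_entropy(p, q) - entropy(p):.3f} bits")
```

The two directions give roughly 0.737 and 0.531 bits, which illustrates the asymmetry, and the first value matches cross-entropy minus entropy.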

Applications:

  • Variational Autoencoders (VAE): KL divergence between learned distribution and prior
  • Knowledge distillation: Student model mimics teacher model
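  • GAN training: The original GAN objective is closely related to the Jensen-Shannon divergence, a symmetrized quantity built from KL divergence, between the real and generated distributions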

Mutual Information

Definition: Mutual Information

How much knowing one variable tells you about another.

Mutual Information

$$I(X; Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)$$

Here,

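  • $I(X; Y)$ = mutual information between $X$ and $Y$
  • $H(X|Y)$ = conditional entropy: the uncertainty left in $X$ once $Y$ is known

Equivalently, in terms of the joint and marginal distributions:

$$I(X; Y) = \sum_{x,y} P(x,y) \times \log\left(\frac{P(x,y)}{P(x) \times P(y)}\right)$$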

Interpretation:

  • $I(X; Y) = 0$: $X$ and $Y$ are independent
  • $I(X; Y) = H(X)$: knowing $Y$ tells you everything about $X$
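Both extremes, and a case in between, can be computed directly from a joint probability table using the sum over $P(x, y)$. This is a minimal sketch in bits; the function name and the three 2×2 joint tables are hypothetical examples.

```python
import math

def mutual_information(joint):
    """I(X;Y) in bits from a table where joint[x][y] = P(x, y)."""
    px = [sum(row) for row in joint]        # marginal P(x)
    py = [sum(col) for col in zip(*joint)]  # marginal P(y)
    return sum(
        pxy * math.log2(pxy / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, pxy in enumerate(row)
        if pxy > 0
    )

independent = [[0.25, 0.25], [0.25, 0.25]]  # P(x, y) = P(x) * P(y)
identical = [[0.5, 0.0], [0.0, 0.5]]        # Y is an exact copy of X
noisy = [[0.4, 0.1], [0.1, 0.4]]            # Y agrees with X 80% of the time

for name, table in [("independent", independent),
                    ("identical", identical),
                    ("noisy", noisy)]:
    print(f"{name}: {mutual_information(table):.3f} bits")
```

Here $X$ is a fair binary variable with $H(X) = 1$ bit, so the identical case reaches the full 1 bit and the noisy case lands between 0 and 1.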

Applications:

  • Feature selection: Choose features with high mutual information with the target
  • Information bottleneck: Compress input while retaining relevant information
  • Clustering: Evaluate cluster quality

Data Compression Connection

Shannon's Source Coding Theorem:

Theorem: Source Coding Theorem

You cannot losslessly compress data to fewer than $H(X)$ bits per symbol on average. $H(X)$ is the theoretical minimum average number of bits per symbol.

Huffman Coding:

  • Assign shorter codes to more frequent symbols
  • Produces an optimal prefix-free code when each symbol is encoded separately and the symbol probabilities are known
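The link between Huffman coding and the entropy bound can be sketched by computing code lengths only, without building the actual bit strings. The symbol probabilities below are a made-up example, and `huffman_code_lengths` is an illustrative helper built on the standard `heapq` module.

```python
import heapq
import math

def huffman_code_lengths(probs):
    """Return {symbol: code length in bits} for a Huffman code over probs."""
    if len(probs) == 1:
        return {next(iter(probs)): 1}
    # Heap entries: (probability, unique tie-breaker, {symbol: depth so far}).
    heap = [(p, i, {s: 0}) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        # Merge the two least probable subtrees; every symbol in them
        # moves one level deeper, so its code grows by one bit.
        p1, _, d1 = heapq.heappop(heap)
        p2, _, d2 = heapq.heappop(heap)
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]

# Hypothetical source: frequent symbols should get shorter codes.
probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
lengths = huffman_code_lengths(probs)
avg_length = sum(probs[s] * lengths[s] for s in probs)
entropy = sum(-p * math.log2(p) for p in probs.values())

print(lengths)
print(f"average length = {avg_length} bits, entropy = {entropy} bits")
```

For this source the probabilities are all powers of 1/2, so the average code length equals the entropy exactly; for other distributions the average length stays at or above the entropy.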

Applications in AI:

  • Neural network pruning: Remove redundant weights
  • Knowledge distillation: Compress large model into small one
  • Model compression for edge devices

📋 Key Takeaways

  • Information measures surprise. $I(x) = -\log_2(P(x))$: rare events carry more information; certain events carry none. The bit is the fundamental unit of information theory.

  • Entropy is average surprise. $H(X) = -\sum_i P(x_i) \log_2(P(x_i))$ measures the uncertainty of a random variable: maximum when all outcomes are equally likely, zero when the outcome is certain.

  • Cross-Entropy is THE classification loss. $H(P, Q) = -\sum_i P(x_i) \log_2(Q(x_i))$ measures how bad your predicted distribution $Q$ is compared to the true distribution $P$, and is used directly as the loss function in neural network classification.

  • KL Divergence measures information loss. $KL(P \| Q) = H(P, Q) - H(P)$ quantifies how much information is lost when $Q$ approximates $P$; it is used in VAEs, knowledge distillation, and GAN training. Note: $KL(P \| Q) \neq KL(Q \| P)$.

  • Mutual Information finds dependencies. $I(X; Y) = H(X) - H(X|Y)$ measures how much knowing one variable tells you about another; it is powerful for feature selection and for evaluating clustering quality.

  • Shannon's theorem sets compression limits. You cannot losslessly compress data below $H(X)$ bits per symbol on average, which connects information theory to practical model compression and pruning in neural networks.
