← Math|68 of 100
Optimization

Metaheuristic Optimization

Understand genetic algorithms, simulated annealing, particle swarm optimization, ant colony optimization, tabu search, and other metaheuristic methods for solving complex optimization problems.

πŸ“‚ GlobalπŸ“– Lesson 68 of 100πŸŽ“ Free Course

Advertisement

Metaheuristic Optimization

πŸ’‘ Why It Matters

Many real-world optimization problems β€” scheduling routes, training neural networks, designing circuits β€” involve non-convex landscapes with thousands of local minima, discrete variables, or no computable gradient. Classical gradient descent gets trapped or cannot even be applied. Metaheuristics are problem-independent strategies that navigate such rugged landscapes efficiently, finding high-quality (if not provably optimal) solutions in reasonable time. They are the backbone of modern AI hyperparameter tuning, logistics planning, and combinatorial optimization.


What Are Metaheuristics?

DfMetaheuristic

A metaheuristic is a high-level optimization framework that guides a underlying heuristic search procedure. Unlike exact methods (branch-and-bound, simplex), metaheuristics do not guarantee finding the global optimum but are designed to escape local optima through mechanisms like randomization, population-based search, memory structures, or temperature schedules. The term "meta" means "about" β€” a metaheuristic is a strategy for managing heuristics.

Key characteristics shared by most metaheuristics:

  • No gradient required β€” only function evaluations are needed
  • Balances exploration (searching new regions) and exploitation (refining known good regions)
  • Stochastic β€” randomness helps escape local minima but means results vary between runs
  • Applicable to black-box functions β€” the objective function can be a simulation, neural network, or oracle

Simulated Annealing

DfSimulated Annealing

Simulated annealing (SA) is a single-solution metaheuristic inspired by the metallurgical process of annealing, where a material is heated and slowly cooled to reduce defects. At high temperatures, the algorithm freely accepts worse solutions (exploration); as temperature decreases, it becomes more greedy (exploitation), converging to a local β€” hopefully global β€” optimum.

Acceptance Probability (Metropolis Criterion)

P(acceptΒ worse)=exp⁑ ⁣(βˆ’Ξ”ET)P(\text{accept worse}) = \exp\!\left(-\frac{\Delta E}{T}\right)

Here,

  • Ξ”E\Delta E=Energy difference: f(new) βˆ’ f(current)
  • TT=Temperature parameter (controls acceptance rate)

The temperature schedule T(k)T(k) typically follows:

Cooling Schedule

T(k)=T0β‹…Ξ±kT(k) = T_0 \cdot \alpha^k

Here,

  • T0T_0=Initial temperature
  • Ξ±\alpha=Cooling rate (e.g., 0.95–0.99)
  • kk=Iteration counter

ThConvergence Property

With a logarithmic cooling schedule T(k)=cln⁑(1+k)T(k) = \frac{c}{\ln(1+k)} for sufficiently large constant cc, simulated annealing converges to the global optimum with probability 1 as kβ†’βˆžk \to \infty. In practice, faster geometric schedules are used and convergence is not guaranteed, but solutions are usually good.


Genetic Algorithms

DfGenetic Algorithm

A genetic algorithm (GA) maintains a population of candidate solutions that evolve over generations. Each individual is evaluated by a fitness function. The fittest individuals are selected as parents, which produce offspring through crossover (combining parent traits) and mutation (random perturbation). Over many generations, the population converges toward high-fitness regions of the search space, mimicking natural selection.

Core operators:

  1. Selection β€” tournament, roulette wheel, or rank-based; chooses parents proportional to fitness
  2. Crossover β€” single-point, multi-point, or uniform; recombines two parents to produce children
  3. Mutation β€” bit-flip (binary), Gaussian perturbation (continuous); introduces genetic diversity
  4. Elitism β€” preserves the best individuals across generations to prevent regression

Crossover (Blend)

xchild=Ξ²β‹…xparent1+(1βˆ’Ξ²)β‹…xparent2,β∈[0,1]x_{\text{child}} = \beta \cdot x_{\text{parent1}} + (1-\beta) \cdot x_{\text{parent2}}, \quad \beta \in [0,1]

Here,

  • Ξ²\beta=Crossover coefficient (often uniform random)
  • xparentx_{\text{parent}}=Parent solution vector

Particle Swarm Optimization

DfParticle Swarm Optimization (PSO)

Particle swarm optimization models a flock of birds or school of fish. Each particle has a position xix_i and velocity viv_i in the search space. At each iteration, velocity is updated toward the particle's personal best (pip_i) and the swarm's global best (gg), producing a balance between individual experience and social information.

Velocity Update

vi(t+1)=ω vi(t)+c1r1(piβˆ’xi(t))+c2r2(gβˆ’xi(t))v_i^{(t+1)} = \omega \, v_i^{(t)} + c_1 r_1 \bigl(p_i - x_i^{(t)}\bigr) + c_2 r_2 \bigl(g - x_i^{(t)}\bigr)

Here,

  • Ο‰\omega=Inertia weight (controls momentum)
  • c1c_1=Cognitive coefficient (personal best pull)
  • c2c_2=Social coefficient (global best pull)
  • r1,r2r_1, r_2=Uniform random numbers in [0,1]

Position Update

xi(t+1)=xi(t)+vi(t+1)x_i^{(t+1)} = x_i^{(t)} + v_i^{(t+1)}

Here,

  • xix_i=Position of particle i
  • viv_i=Velocity of particle i

πŸ’‘ Inertia Weight Tuning

Start with Ο‰=0.9\omega = 0.9 (exploration) and linearly decrease to Ο‰=0.4\omega = 0.4 (exploitation) over iterations. This adaptive inertia balances global search early and local refinement later.


Ant Colony Optimization

DfAnt Colony Optimization (ACO)

Ant colony optimization models how real ants find shortest paths using pheromone trails. Artificial ants construct solutions incrementally by traversing a construction graph. Edges carry pheromone levels Ο„\tau that are reinforced by good solutions and evaporate over time. Probabilistic selection balances pheromone intensity (exploitation) and heuristic information (exploration).

Transition Probability

Pij=[Ο„ij]Ξ±β‹…[Ξ·ij]Ξ²βˆ‘k∈Ni[Ο„ik]Ξ±β‹…[Ξ·ik]Ξ²P_{ij} = \frac{[\tau_{ij}]^\alpha \cdot [\eta_{ij}]^\beta}{\sum_{k \in \mathcal{N}_i} [\tau_{ik}]^\alpha \cdot [\eta_{ik}]^\beta}

Here,

  • Ο„ij\tau_{ij}=Pheromone on edge (i,j)
  • Ξ·ij\eta_{ij}=Heuristic desirability (e.g., 1/distance)
  • Ξ±\alpha=Pheromone influence exponent
  • Ξ²\beta=Heuristic influence exponent
  • Ni\mathcal{N}_i=Feasible neighbors of node i

Pheromone update:

Pheromone Update

Ο„ij←(1βˆ’Ο) τij+βˆ‘k=1mΔτijk\tau_{ij} \leftarrow (1-\rho)\,\tau_{ij} + \sum_{k=1}^{m} \Delta\tau_{ij}^k

Here,

  • ρ\rho=Evaporation rate (0 < ρ ≀ 1)
  • Δτijk\Delta\tau_{ij}^k=Pheromone deposited by ant k on edge (i,j)

Tabu Search

DfTabu Search

Tabu search is a local-search metaheuristic that uses memory to prevent revisiting recently explored solutions. A tabu list stores recent moves that are forbidden for a certain number of iterations (the tabu tenure). This forces the search into unexplored regions even when all neighboring moves worsen the objective. Aspiration criteria allow overriding tabu status if a move produces a solution better than the best known.

Key components:

  • Tabu list β€” FIFO queue of forbidden moves (size = tabu tenure)
  • Aspiration criteria β€” override tabu if the move yields the best-known solution
  • Neighborhood structure β€” defines what constitutes a single move (swap, insert, reverse)
  • Diversification β€” periodically reset search to unexplored areas if stagnation is detected
  • Intensification β€” revisit the neighborhood of the best-known solution periodically

Random Restart Hill Climbing

DfRandom Restart Hill Climbing

A simple baseline: run hill climbing (always move to the best neighbor) multiple times from random initial points. Each restart is independent. The best result across all restarts is returned. Though naive, this method is surprisingly effective for problems with a moderate number of local optima and cheap function evaluations.

ThCompleteness via Restarts

If the probability of reaching the global optimum basin from a random start is p>0p > 0, the probability that at least one of nn independent restarts finds it is 1βˆ’(1βˆ’p)n1 - (1-p)^n, which approaches 1 as nβ†’βˆžn \to \infty.


Comparison of Methods

MethodSolution TypeMemoryParallelismBest ForWeakness
Simulated AnnealingSingleNoneLowContinuous/combinatorialSlow cooling schedule needed
Genetic AlgorithmPopulationNone (elitism optional)HighMulti-modal, discretePremature convergence risk
Particle SwarmPopulationPersonal + global bestHighContinuous parameter tuningCan stagnate in high dimensions
Ant ColonyPopulationPheromone matrixMediumGraph routing, TSPSlow convergence on large instances
Tabu SearchSingleTabu listMediumCombinatorial local searchSensitive to tabu tenure tuning
Random Restart HCSingleNoneVery HighSimple landscapesExponential restarts for hard problems

Python Implementation

import numpy as np
from typing import Callable, List, Tuple

# ============================================================
# Simulated Annealing
# ============================================================
def simulated_annealing(
    f: Callable,
    x0: np.ndarray,
    T0: float = 100.0,
    alpha: float = 0.95,
    n_iter: int = 5000,
    step_size: float = 0.5,
    seed: int = None
) -> Tuple[np.ndarray, float]:
    """Simulated annealing for continuous optimization."""
    if seed is not None:
        np.random.seed(seed)

    current = x0.copy()
    current_val = f(current)
    best = current.copy()
    best_val = current_val
    T = T0

    for i in range(n_iter):
        neighbor = current + np.random.randn(len(current)) * step_size
        neighbor_val = f(neighbor)
        delta = neighbor_val - current_val

        if delta < 0 or np.random.rand() < np.exp(-delta / T):
            current = neighbor
            current_val = neighbor_val

        if current_val < best_val:
            best = current.copy()
            best_val = current_val

        T *= alpha

    return best, best_val

# ============================================================
# Genetic Algorithm
# ============================================================
def genetic_algorithm(
    f: Callable,
    n_vars: int,
    bounds: Tuple[float, float] = (-5.0, 5.0),
    n_pop: int = 100,
    n_gen: int = 200,
    crossover_rate: float = 0.8,
    mutation_rate: float = 0.1,
    seed: int = None
) -> Tuple[np.ndarray, float]:
    """Genetic algorithm with tournament selection and blend crossover."""
    if seed is not None:
        np.random.seed(seed)

    lo, hi = bounds
    pop = np.random.uniform(lo, hi, (n_pop, n_vars))
    fitness = np.array([f(ind) for ind in pop])
    best_idx = np.argmin(fitness)
    best = pop[best_idx].copy()
    best_val = fitness[best_idx]

    for gen in range(n_gen):
        # Tournament selection (size 3)
        def tournament():
            idx = np.random.choice(n_pop, 3, replace=False)
            return pop[idx[np.argmin(fitness[idx])]]

        new_pop = []
        # Elitism: keep best
        new_pop.append(best.copy())

        while len(new_pop) < n_pop:
            p1, p2 = tournament(), tournament()
            if np.random.rand() < crossover_rate:
                beta = np.random.rand(n_vars)
                child = beta * p1 + (1 - beta) * p2
            else:
                child = p1.copy()

            if np.random.rand() < mutation_rate:
                child += np.random.randn(n_vars) * 0.3
                child = np.clip(child, lo, hi)

            new_pop.append(child)

        pop = np.array(new_pop[:n_pop])
        fitness = np.array([f(ind) for ind in pop])
        gen_best = np.argmin(fitness)
        if fitness[gen_best] < best_val:
            best = pop[gen_best].copy()
            best_val = fitness[gen_best]

    return best, best_val

# ============================================================
# Particle Swarm Optimization
# ============================================================
def particle_swarm_optimization(
    f: Callable,
    n_vars: int,
    bounds: Tuple[float, float] = (-5.0, 5.0),
    n_particles: int = 30,
    n_iter: int = 200,
    omega: float = 0.7,
    c1: float = 1.5,
    c2: float = 1.5,
    seed: int = None
) -> Tuple[np.ndarray, float]:
    """Particle swarm optimization for continuous functions."""
    if seed is not None:
        np.random.seed(seed)

    lo, hi = bounds
    x = np.random.uniform(lo, hi, (n_particles, n_vars))
    v = np.random.uniform(-1, 1, (n_particles, n_vars))
    p_best = x.copy()
    p_best_val = np.array([f(p) for p in x])
    g_best = p_best[np.argmin(p_best_val)].copy()
    g_best_val = np.min(p_best_val)

    for t in range(n_iter):
        r1 = np.random.rand(n_particles, n_vars)
        r2 = np.random.rand(n_particles, n_vars)
        v = (omega * v
             + c1 * r1 * (p_best - x)
             + c2 * r2 * (g_best - x))
        x = x + v
        x = np.clip(x, lo, hi)

        vals = np.array([f(xi) for xi in x])
        for i in range(n_particles):
            if vals[i] < p_best_val[i]:
                p_best[i] = x[i].copy()
                p_best_val[i] = vals[i]

        best_idx = np.argmin(p_best_val)
        if p_best_val[best_idx] < g_best_val:
            g_best = p_best[best_idx].copy()
            g_best_val = p_best_val[best_idx]

    return g_best, g_best_val

# ============================================================
# Demo
# ============================================================
if __name__ == "__main__":
    rosenbrock = lambda x: (1 - x[0])**2 + 100*(x[1] - x[0]**2)**2
    x0 = np.array([-2.0, 3.0])

    sa_best, sa_val = simulated_annealing(rosenbrock, x0, seed=42)
    ga_best, ga_val = genetic_algorithm(rosenbrock, 2, seed=42)
    pso_best, pso_val = particle_swarm_optimization(rosenbrock, 2, seed=42)

    print(f"SA:  f({sa_best}) = {sa_val:.6f}")
    print(f"GA:  f({ga_best}) = {ga_val:.6f}")
    print(f"PSO: f({pso_best}) = {pso_val:.6f}")

Applications in AI/ML

πŸ’‘ Neural Architecture Search (NAS)

The design of neural network architectures β€” number of layers, filter sizes, skip connections β€” is itself a combinatorial optimization problem. Metaheuristics like genetic algorithms and evolutionary strategies have been used to discover architectures that rival hand-crafted designs. Real-world NAS systems (e.g., AmoebaNet, EfficientNet evolution) use large-scale evolutionary search over architecture encoding graphs, evaluating thousands of candidate architectures on proxy tasks before scaling the best to full training. This replaces months of human expert tuning with automated search.

Other applications:

  • Hyperparameter tuning β€” PSO and genetic algorithms optimize learning rate, batch size, regularization strength
  • Feature selection β€” binary GA selects subsets of features; reduces dimensionality without gradient computation
  • Clustering β€” ACO and GA solve k-medoids and graph partitioning problems
  • Reinforcement learning β€” evolutionary strategies (ES) optimize policy parameters without backpropagation (OpenAI ES)
  • GAN training β€” NSGA-II balances generator and discriminator objectives in multi-objective GAN design

Common Mistakes

πŸ“Common Mistakes

MistakeWhy It FailsFix
Setting initial temperature too low in SAAlgorithm acts like greedy hill climbing; no explorationSet T0T_0 so initial acceptance rate is ~80%: T0β‰ˆβˆ’Ξ”Eβ€Ύln⁑(0.8)T_0 \approx \frac{-\overline{\Delta E}}{\ln(0.8)}
Using too-small population in GAPremature convergence; loss of diversityUse npopβ‰₯20Γ—nvarsn_{\text{pop}} \geq 20 \times n_{\text{vars}} as a rule of thumb
Ignoring mutation in GAPopulation collapses to identical individualsMaintain mutation rate of 5–20% depending on problem
Fixed step size in SAToo small β†’ slow convergence; too large β†’ poor precisionDecrease step size with temperature: Οƒk=Οƒ0β‹…Ξ±k\sigma_k = \sigma_0 \cdot \alpha^k
No velocity clamping in PSOParticles fly to infinity; numerical overflowClamp velocity to [βˆ’vmax⁑,vmax⁑][-v_{\max}, v_{\max}] where vmax⁑=0.2Γ—(xmaxβ‘βˆ’xmin⁑)v_{\max} = 0.2 \times (x_{\max} - x_{\min})
Forgetting pheromone evaporation in ACOPheromone accumulates without bound; early paths dominateAlways include evaporation term (1βˆ’Ο)(1-\rho) each iteration
Setting tabu tenure too short in Tabu SearchAlgorithm cycles between the same few solutionsUse tenure β‰ˆβˆ£neighborhood∣\approx \sqrt{|\text{neighborhood}|} or tune empirically
Running too few restarts in hill climbingHigh probability of missing the global optimumUse nβ‰₯50n \geq 50 restarts for moderate problems

Interview Questions

Q1: When would you choose a genetic algorithm over simulated annealing?

A: GAs are preferred when the search space has many disjoint promising regions (multi-modal), because the population naturally explores multiple basins in parallel. SA, being single-solution, may get trapped unless the temperature is carefully managed. GAs also naturally handle discrete/combinatorial encodings (e.g., permutations for scheduling) via specialized crossover operators.

Q2: How does PSO differ from gradient descent?

A: PSO requires no gradient β€” it uses velocity updates based on personal and global best positions. Gradient descent follows the steepest local slope and converges faster on smooth convex functions. PSO excels when the objective is non-differentiable, noisy, or has many local minima where gradient methods stall.

Q3: What is the role of the cooling schedule in simulated annealing?

A: The cooling schedule T(k)T(k) controls the balance between exploration and exploitation. High TT accepts worse solutions (escaping local minima); low TT behaves greedily (refining). Too fast β†’ stuck in local optimum. Too slow β†’ wasteful computation. The geometric schedule Tk=T0Ξ±kT_k = T_0 \alpha^k with α∈[0.95,0.99]\alpha \in [0.95, 0.99] is a practical default.

Q4: Why is elitism important in genetic algorithms?

A: Without elitism, the best solution found so far can be lost due to crossover or mutation in the next generation. Elitism copies the top kk individuals directly to the next generation, ensuring monotonic improvement of the best fitness. Most practical GA implementations use elitism of 1–5% of the population.

Q5: How do you handle constraints in metaheuristics?

A: Common approaches: (1) penalty functions β€” add a cost for constraint violation to the objective; (2) repair operators β€” map infeasible solutions to the nearest feasible one; (3) Decoder-based β€” encode solutions such that they are always feasible (e.g., for TSP, use permutation encoding); (4) Lagrangian relaxation β€” incorporate constraints into the objective with multipliers updated adaptively.


Practice Problems

πŸ“Problem 1: SA on the Traveling Salesman Problem

Implement simulated annealing for a 20-city TSP instance. Use 2-opt neighborhood (reverse a sub-tour). Start with T0=100T_0 = 100, Ξ±=0.995\alpha = 0.995, and 10,000 iterations. Compare the result with a random restart 2-opt local search.

πŸ’‘Solution

The key decisions are: (1) neighborhood = 2-opt swap (pick two edges, reverse the segment between them); (2) temperature schedule must be slow enough for the discrete TSP landscape; (3) step size is implicitly the neighborhood structure. Random restart 2-opt with 100 restarts typically finds a solution within 5–15% of the optimal, while SA with 10,000 iterations often reaches comparable quality with a single run. SA's advantage is no restart management; the disadvantage is sensitivity to cooling schedule.

πŸ“Problem 2: GA for Feature Selection

Use a binary-encoded GA to select features for a classification task. Encode each individual as a binary string of length dd (one bit per feature). Fitness = classification accuracy on a validation set using only selected features. Run for 50 generations with population 100, crossover rate 0.8, mutation rate 0.01.

πŸ’‘Solution

Encoding: binary vector x∈{0,1}dx \in \{0,1\}^d. Selection: tournament size 3. Crossover: uniform crossover. Mutation: flip each bit independently with probability 0.01. The fitness function is expensive (trains a classifier each evaluation), so use a fast model (logistic regression or small tree). Elitism of top 2 prevents losing good feature sets. After convergence, examine the frequency of each bit being 1 to identify consistently selected features.

πŸ“Problem 3: PSO for Neural Network Weights

Treat the weight vector of a small feedforward network as the particle position. Use PSO with 50 particles to optimize weights for a regression task. Compare with standard backpropagation in terms of final MSE and convergence speed.

πŸ’‘Solution

Each particle's position is the flattened weight vector θ∈Rn\theta \in \mathbb{R}^n. Fitness = MSE on training set. PSO avoids vanishing gradient problems but is significantly slower than backpropagation for networks with many weights (curse of dimensionality). PSO with 50 particles on a network with 1000 weights requires 50 forward passes per iteration versus 1 backward pass for gradient descent. Use PSO only for small networks or when gradients are unavailable (e.g., non-differentiable activation functions, discrete weights).

πŸ“Problem 4: ACO for Vehicle Routing

Apply ant colony optimization to a vehicle routing problem with 15 customers, 3 vehicles, and capacity constraints. Each ant constructs a complete route by probabilistically selecting the next customer based on pheromone and distance heuristic.

πŸ’‘Solution

Construction graph: nodes = customers + depot. Ants start at the depot, greedily probabilistically select next unvisited customer until vehicle capacity is reached, then return to depot and start a new route. Pheromone is deposited proportional to 1/total route length1/\text{total route length}. Use α=1\alpha=1, β=3\beta=3 (favor short edges), ρ=0.5\rho=0.5. After 100 iterations with 20 ants, ACO typically finds routes within 10% of the optimal for small instances. The key advantage over pure greedy is that early pheromone trails encode partial route structure that guides later ants.


Quick Reference

πŸ“‹Quick Reference β€” Metaheuristic Methods

Simulated Annealing

  • Single-solution; temperature controls exploration vs. exploitation
  • Acceptance: P=eβˆ’Ξ”E/TP = e^{-\Delta E / T}; cooling: Tk=T0Ξ±kT_k = T_0 \alpha^k
  • Best for: continuous and combinatorial; simple to implement

Genetic Algorithm

  • Population-based; selection + crossover + mutation
  • Key parameters: pop size, crossover rate, mutation rate
  • Best for: multi-modal, discrete/combinatorial, parallelizable

Particle Swarm Optimization

  • Population-based; velocity toward personal best + global best
  • Velocity: v=Ο‰v+c1r1(pbestβˆ’x)+c2r2(gbestβˆ’x)v = \omega v + c_1 r_1 (p_{\text{best}} - x) + c_2 r_2 (g_{\text{best}} - x)
  • Best for: continuous parameter tuning, neural network weights

Ant Colony Optimization

  • Population-based; pheromone-guided probabilistic construction
  • Transition: PijβˆΟ„ijΞ±β‹…Ξ·ijΞ²P_{ij} \propto \tau_{ij}^\alpha \cdot \eta_{ij}^\beta
  • Best for: graph problems, TSP, vehicle routing

Tabu Search

  • Single-solution; tabu list prevents cycling
  • Aspiration criteria override tabu when beneficial
  • Best for: combinatorial local search with structure

Random Restart Hill Climbing

  • Simple baseline; multiple independent runs
  • Probability of finding optimum: 1βˆ’(1βˆ’p)n1 - (1-p)^n
  • Best for: cheap objectives, few local optima

Cross-References

  • Gradient Descent (Topic: First-Order Methods) β€” use when gradients are available and the problem is smooth
  • Newton's Method (Topic: Second-Order Methods) β€” faster convergence but requires Hessian computation
  • Convex Optimization (Topic: Convexity) β€” when the problem is convex, exact methods are preferred
  • Linear Programming (Topic: Linear Programming) β€” simplex or interior-point methods for linear objectives with linear constraints
  • Constraint Programming (Topic: Constraint Satisfaction) β€” exact method for discrete combinatorial problems with complex constraints
  • Bayesian Optimization (Topic: Sequential Model-Based Optimization) β€” sample-efficient alternative to grid/random search for hyperparameter tuning
  • Reinforcement Learning (Topic: Policy Optimization) β€” evolutionary strategies (ES) as a gradient-free alternative to policy gradient methods
  • Combinatorial Optimization (Topic: Graph Algorithms) β€” branch-and-bound, LP relaxation, and approximation algorithms as exact or near-exact alternatives to metaheuristics
Lesson Progress68 / 100