Metaheuristic Optimization

💡 Why It Matters

Many real-world optimization problems — scheduling routes, training neural networks, designing circuits — involve non-convex landscapes with thousands of local minima, discrete variables, or no computable gradient. Classical gradient descent gets trapped or cannot even be applied. Metaheuristics are problem-independent strategies that navigate such rugged landscapes efficiently, finding high-quality (if not provably optimal) solutions in reasonable time. They are the backbone of modern AI hyperparameter tuning, logistics planning, and combinatorial optimization.

What Are Metaheuristics?

DfMetaheuristic

A metaheuristic is a high-level optimization framework that guides a underlying heuristic search procedure. Unlike exact methods (branch-and-bound, simplex), metaheuristics do not guarantee finding the global optimum but are designed to escape local optima through mechanisms like randomization, population-based search, memory structures, or temperature schedules. The term "meta" means "about" — a metaheuristic is a strategy for managing heuristics.

Key characteristics shared by most metaheuristics:

No gradient required — only function evaluations are needed
Balances exploration (searching new regions) and exploitation (refining known good regions)
Stochastic — randomness helps escape local minima but means results vary between runs
Applicable to black-box functions — the objective function can be a simulation, neural network, or oracle

Simulated Annealing

DfSimulated Annealing

Simulated annealing (SA) is a single-solution metaheuristic inspired by the metallurgical process of annealing, where a material is heated and slowly cooled to reduce defects. At high temperatures, the algorithm freely accepts worse solutions (exploration); as temperature decreases, it becomes more greedy (exploitation), converging to a local — hopefully global — optimum.

Acceptance Probability (Metropolis Criterion)

P(\text{accept worse}) = \exp\!\left(-\frac{\Delta E}{T}\right)

Here,

$\Delta E$ =Energy difference: f(new) − f(current)
$T$ =Temperature parameter (controls acceptance rate)

The temperature schedule $T(k)$ typically follows:

Cooling Schedule

T(k) = T_0 \cdot \alpha^k

Here,

$T_0$ =Initial temperature
$\alpha$ =Cooling rate (e.g., 0.95–0.99)
$k$ =Iteration counter

ThConvergence Property

With a logarithmic cooling schedule $T(k) = \frac{c}{\ln(1+k)}$ for sufficiently large constant $c$ , simulated annealing converges to the global optimum with probability 1 as $k \to \infty$ . In practice, faster geometric schedules are used and convergence is not guaranteed, but solutions are usually good.

Genetic Algorithms

DfGenetic Algorithm

A genetic algorithm (GA) maintains a population of candidate solutions that evolve over generations. Each individual is evaluated by a fitness function. The fittest individuals are selected as parents, which produce offspring through crossover (combining parent traits) and mutation (random perturbation). Over many generations, the population converges toward high-fitness regions of the search space, mimicking natural selection.

Core operators:

Selection — tournament, roulette wheel, or rank-based; chooses parents proportional to fitness
Crossover — single-point, multi-point, or uniform; recombines two parents to produce children
Mutation — bit-flip (binary), Gaussian perturbation (continuous); introduces genetic diversity
Elitism — preserves the best individuals across generations to prevent regression

Crossover (Blend)

x_{\text{child}} = \beta \cdot x_{\text{parent1}} + (1-\beta) \cdot x_{\text{parent2}}, \quad \beta \in [0,1]

Here,

$\beta$ =Crossover coefficient (often uniform random)
$x_{\text{parent}}$ =Parent solution vector

Particle Swarm Optimization

DfParticle Swarm Optimization (PSO)

Particle swarm optimization models a flock of birds or school of fish. Each particle has a position $x_i$ and velocity $v_i$ in the search space. At each iteration, velocity is updated toward the particle's personal best ( $p_i$ ) and the swarm's global best ( $g$ ), producing a balance between individual experience and social information.

Velocity Update

v_i^{(t+1)} = \omega \, v_i^{(t)} + c_1 r_1 \bigl(p_i - x_i^{(t)}\bigr) + c_2 r_2 \bigl(g - x_i^{(t)}\bigr)

Here,

$\omega$ =Inertia weight (controls momentum)
$c_1$ =Cognitive coefficient (personal best pull)
$c_2$ =Social coefficient (global best pull)
$r_1, r_2$ =Uniform random numbers in [0,1]

Position Update

x_i^{(t+1)} = x_i^{(t)} + v_i^{(t+1)}

Here,

$x_i$ =Position of particle i
$v_i$ =Velocity of particle i

💡 Inertia Weight Tuning

Start with $\omega = 0.9$ (exploration) and linearly decrease to $\omega = 0.4$ (exploitation) over iterations. This adaptive inertia balances global search early and local refinement later.

Ant Colony Optimization

DfAnt Colony Optimization (ACO)

Ant colony optimization models how real ants find shortest paths using pheromone trails. Artificial ants construct solutions incrementally by traversing a construction graph. Edges carry pheromone levels $\tau$ that are reinforced by good solutions and evaporate over time. Probabilistic selection balances pheromone intensity (exploitation) and heuristic information (exploration).

Transition Probability

P_{ij} = \frac{[\tau_{ij}]^\alpha \cdot [\eta_{ij}]^\beta}{\sum_{k \in \mathcal{N}_i} [\tau_{ik}]^\alpha \cdot [\eta_{ik}]^\beta}

Here,

$\tau_{ij}$ =Pheromone on edge (i,j)
$\eta_{ij}$ =Heuristic desirability (e.g., 1/distance)
$\alpha$ =Pheromone influence exponent
$\beta$ =Heuristic influence exponent
$\mathcal{N}_i$ =Feasible neighbors of node i

Pheromone update:

Pheromone Update

\tau_{ij} \leftarrow (1-\rho)\,\tau_{ij} + \sum_{k=1}^{m} \Delta\tau_{ij}^k

Here,

$\rho$ =Evaporation rate (0 < ρ ≤ 1)
$\Delta\tau_{ij}^k$ =Pheromone deposited by ant k on edge (i,j)

Tabu Search

DfTabu Search

Tabu search is a local-search metaheuristic that uses memory to prevent revisiting recently explored solutions. A tabu list stores recent moves that are forbidden for a certain number of iterations (the tabu tenure). This forces the search into unexplored regions even when all neighboring moves worsen the objective. Aspiration criteria allow overriding tabu status if a move produces a solution better than the best known.

Key components:

Tabu list — FIFO queue of forbidden moves (size = tabu tenure)
Aspiration criteria — override tabu if the move yields the best-known solution
Neighborhood structure — defines what constitutes a single move (swap, insert, reverse)
Diversification — periodically reset search to unexplored areas if stagnation is detected
Intensification — revisit the neighborhood of the best-known solution periodically

Random Restart Hill Climbing

DfRandom Restart Hill Climbing

A simple baseline: run hill climbing (always move to the best neighbor) multiple times from random initial points. Each restart is independent. The best result across all restarts is returned. Though naive, this method is surprisingly effective for problems with a moderate number of local optima and cheap function evaluations.

ThCompleteness via Restarts

If the probability of reaching the global optimum basin from a random start is $p > 0$ , the probability that at least one of $n$ independent restarts finds it is $1 - (1-p)^n$ , which approaches 1 as $n \to \infty$ .

Comparison of Methods

Method	Solution Type	Memory	Parallelism	Best For	Weakness
Simulated Annealing	Single	None	Low	Continuous/combinatorial	Slow cooling schedule needed
Genetic Algorithm	Population	None (elitism optional)	High	Multi-modal, discrete	Premature convergence risk
Particle Swarm	Population	Personal + global best	High	Continuous parameter tuning	Can stagnate in high dimensions
Ant Colony	Population	Pheromone matrix	Medium	Graph routing, TSP	Slow convergence on large instances
Tabu Search	Single	Tabu list	Medium	Combinatorial local search	Sensitive to tabu tenure tuning
Random Restart HC	Single	None	Very High	Simple landscapes	Exponential restarts for hard problems

Python Implementation

import numpy as np
from typing import Callable, List, Tuple

# ============================================================
# Simulated Annealing
# ============================================================
def simulated_annealing(
    f: Callable,
    x0: np.ndarray,
    T0: float = 100.0,
    alpha: float = 0.95,
    n_iter: int = 5000,
    step_size: float = 0.5,
    seed: int = None
) -> Tuple[np.ndarray, float]:
    """Simulated annealing for continuous optimization."""
    if seed is not None:
        np.random.seed(seed)

    current = x0.copy()
    current_val = f(current)
    best = current.copy()
    best_val = current_val
    T = T0

    for i in range(n_iter):
        neighbor = current + np.random.randn(len(current)) * step_size
        neighbor_val = f(neighbor)
        delta = neighbor_val - current_val

        if delta < 0 or np.random.rand() < np.exp(-delta / T):
            current = neighbor
            current_val = neighbor_val

        if current_val < best_val:
            best = current.copy()
            best_val = current_val

        T *= alpha

    return best, best_val

# ============================================================
# Genetic Algorithm
# ============================================================
def genetic_algorithm(
    f: Callable,
    n_vars: int,
    bounds: Tuple[float, float] = (-5.0, 5.0),
    n_pop: int = 100,
    n_gen: int = 200,
    crossover_rate: float = 0.8,
    mutation_rate: float = 0.1,
    seed: int = None
) -> Tuple[np.ndarray, float]:
    """Genetic algorithm with tournament selection and blend crossover."""
    if seed is not None:
        np.random.seed(seed)

    lo, hi = bounds
    pop = np.random.uniform(lo, hi, (n_pop, n_vars))
    fitness = np.array([f(ind) for ind in pop])
    best_idx = np.argmin(fitness)
    best = pop[best_idx].copy()
    best_val = fitness[best_idx]

    for gen in range(n_gen):
        # Tournament selection (size 3)
        def tournament():
            idx = np.random.choice(n_pop, 3, replace=False)
            return pop[idx[np.argmin(fitness[idx])]]

        new_pop = []
        # Elitism: keep best
        new_pop.append(best.copy())

        while len(new_pop) < n_pop:
            p1, p2 = tournament(), tournament()
            if np.random.rand() < crossover_rate:
                beta = np.random.rand(n_vars)
                child = beta * p1 + (1 - beta) * p2
            else:
                child = p1.copy()

            if np.random.rand() < mutation_rate:
                child += np.random.randn(n_vars) * 0.3
                child = np.clip(child, lo, hi)

            new_pop.append(child)

        pop = np.array(new_pop[:n_pop])
        fitness = np.array([f(ind) for ind in pop])
        gen_best = np.argmin(fitness)
        if fitness[gen_best] < best_val:
            best = pop[gen_best].copy()
            best_val = fitness[gen_best]

    return best, best_val

# ============================================================
# Particle Swarm Optimization
# ============================================================
def particle_swarm_optimization(
    f: Callable,
    n_vars: int,
    bounds: Tuple[float, float] = (-5.0, 5.0),
    n_particles: int = 30,
    n_iter: int = 200,
    omega: float = 0.7,
    c1: float = 1.5,
    c2: float = 1.5,
    seed: int = None
) -> Tuple[np.ndarray, float]:
    """Particle swarm optimization for continuous functions."""
    if seed is not None:
        np.random.seed(seed)

    lo, hi = bounds
    x = np.random.uniform(lo, hi, (n_particles, n_vars))
    v = np.random.uniform(-1, 1, (n_particles, n_vars))
    p_best = x.copy()
    p_best_val = np.array([f(p) for p in x])
    g_best = p_best[np.argmin(p_best_val)].copy()
    g_best_val = np.min(p_best_val)

    for t in range(n_iter):
        r1 = np.random.rand(n_particles, n_vars)
        r2 = np.random.rand(n_particles, n_vars)
        v = (omega * v
             + c1 * r1 * (p_best - x)
             + c2 * r2 * (g_best - x))
        x = x + v
        x = np.clip(x, lo, hi)

        vals = np.array([f(xi) for xi in x])
        for i in range(n_particles):
            if vals[i] < p_best_val[i]:
                p_best[i] = x[i].copy()
                p_best_val[i] = vals[i]

        best_idx = np.argmin(p_best_val)
        if p_best_val[best_idx] < g_best_val:
            g_best = p_best[best_idx].copy()
            g_best_val = p_best_val[best_idx]

    return g_best, g_best_val

# ============================================================
# Demo
# ============================================================
if __name__ == "__main__":
    rosenbrock = lambda x: (1 - x[0])**2 + 100*(x[1] - x[0]**2)**2
    x0 = np.array([-2.0, 3.0])

    sa_best, sa_val = simulated_annealing(rosenbrock, x0, seed=42)
    ga_best, ga_val = genetic_algorithm(rosenbrock, 2, seed=42)
    pso_best, pso_val = particle_swarm_optimization(rosenbrock, 2, seed=42)

    print(f"SA:  f({sa_best}) = {sa_val:.6f}")
    print(f"GA:  f({ga_best}) = {ga_val:.6f}")
    print(f"PSO: f({pso_best}) = {pso_val:.6f}")

Applications in AI/ML

💡 Neural Architecture Search (NAS)

The design of neural network architectures — number of layers, filter sizes, skip connections — is itself a combinatorial optimization problem. Metaheuristics like genetic algorithms and evolutionary strategies have been used to discover architectures that rival hand-crafted designs. Real-world NAS systems (e.g., AmoebaNet, EfficientNet evolution) use large-scale evolutionary search over architecture encoding graphs, evaluating thousands of candidate architectures on proxy tasks before scaling the best to full training. This replaces months of human expert tuning with automated search.

Other applications:

Hyperparameter tuning — PSO and genetic algorithms optimize learning rate, batch size, regularization strength
Feature selection — binary GA selects subsets of features; reduces dimensionality without gradient computation
Clustering — ACO and GA solve k-medoids and graph partitioning problems
Reinforcement learning — evolutionary strategies (ES) optimize policy parameters without backpropagation (OpenAI ES)
GAN training — NSGA-II balances generator and discriminator objectives in multi-objective GAN design

Common Mistakes

📝Common Mistakes

Mistake	Why It Fails	Fix
Setting initial temperature too low in SA	Algorithm acts like greedy hill climbing; no exploration	Set $T_0$ so initial acceptance rate is ~80%: $T_0 \approx \frac{-\overline{\Delta E}}{\ln(0.8)}$
Using too-small population in GA	Premature convergence; loss of diversity	Use $n_{\text{pop}} \geq 20 \times n_{\text{vars}}$ as a rule of thumb
Ignoring mutation in GA	Population collapses to identical individuals	Maintain mutation rate of 5–20% depending on problem
Fixed step size in SA	Too small → slow convergence; too large → poor precision	Decrease step size with temperature: $\sigma_k = \sigma_0 \cdot \alpha^k$
No velocity clamping in PSO	Particles fly to infinity; numerical overflow	Clamp velocity to $[-v_{\max}, v_{\max}]$ where $v_{\max} = 0.2 \times (x_{\max} - x_{\min})$
Forgetting pheromone evaporation in ACO	Pheromone accumulates without bound; early paths dominate	Always include evaporation term $(1-\rho)$ each iteration
Setting tabu tenure too short in Tabu Search	Algorithm cycles between the same few solutions	Use tenure $\approx \sqrt{\|\text{neighborhood}\|}$ or tune empirically
Running too few restarts in hill climbing	High probability of missing the global optimum	Use $n \geq 50$ restarts for moderate problems

Interview Questions

Q1: When would you choose a genetic algorithm over simulated annealing?

A: GAs are preferred when the search space has many disjoint promising regions (multi-modal), because the population naturally explores multiple basins in parallel. SA, being single-solution, may get trapped unless the temperature is carefully managed. GAs also naturally handle discrete/combinatorial encodings (e.g., permutations for scheduling) via specialized crossover operators.

Q2: How does PSO differ from gradient descent?

A: PSO requires no gradient — it uses velocity updates based on personal and global best positions. Gradient descent follows the steepest local slope and converges faster on smooth convex functions. PSO excels when the objective is non-differentiable, noisy, or has many local minima where gradient methods stall.

Q3: What is the role of the cooling schedule in simulated annealing?

A: The cooling schedule $T(k)$ controls the balance between exploration and exploitation. High $T$ accepts worse solutions (escaping local minima); low $T$ behaves greedily (refining). Too fast → stuck in local optimum. Too slow → wasteful computation. The geometric schedule $T_k = T_0 \alpha^k$ with $\alpha \in [0.95, 0.99]$ is a practical default.

Q4: Why is elitism important in genetic algorithms?

A: Without elitism, the best solution found so far can be lost due to crossover or mutation in the next generation. Elitism copies the top $k$ individuals directly to the next generation, ensuring monotonic improvement of the best fitness. Most practical GA implementations use elitism of 1–5% of the population.

Q5: How do you handle constraints in metaheuristics?

A: Common approaches: (1) penalty functions — add a cost for constraint violation to the objective; (2) repair operators — map infeasible solutions to the nearest feasible one; (3) Decoder-based — encode solutions such that they are always feasible (e.g., for TSP, use permutation encoding); (4) Lagrangian relaxation — incorporate constraints into the objective with multipliers updated adaptively.

Practice Problems

📝Problem 1: SA on the Traveling Salesman Problem

Implement simulated annealing for a 20-city TSP instance. Use 2-opt neighborhood (reverse a sub-tour). Start with $T_0 = 100$ , $\alpha = 0.995$ , and 10,000 iterations. Compare the result with a random restart 2-opt local search.

💡Solution

The key decisions are: (1) neighborhood = 2-opt swap (pick two edges, reverse the segment between them); (2) temperature schedule must be slow enough for the discrete TSP landscape; (3) step size is implicitly the neighborhood structure. Random restart 2-opt with 100 restarts typically finds a solution within 5–15% of the optimal, while SA with 10,000 iterations often reaches comparable quality with a single run. SA's advantage is no restart management; the disadvantage is sensitivity to cooling schedule.

📝Problem 2: GA for Feature Selection

Use a binary-encoded GA to select features for a classification task. Encode each individual as a binary string of length $d$ (one bit per feature). Fitness = classification accuracy on a validation set using only selected features. Run for 50 generations with population 100, crossover rate 0.8, mutation rate 0.01.

💡Solution

Encoding: binary vector $x \in \{0,1\}^d$ . Selection: tournament size 3. Crossover: uniform crossover. Mutation: flip each bit independently with probability 0.01. The fitness function is expensive (trains a classifier each evaluation), so use a fast model (logistic regression or small tree). Elitism of top 2 prevents losing good feature sets. After convergence, examine the frequency of each bit being 1 to identify consistently selected features.

📝Problem 3: PSO for Neural Network Weights

Treat the weight vector of a small feedforward network as the particle position. Use PSO with 50 particles to optimize weights for a regression task. Compare with standard backpropagation in terms of final MSE and convergence speed.

💡Solution

Each particle's position is the flattened weight vector $\theta \in \mathbb{R}^n$ . Fitness = MSE on training set. PSO avoids vanishing gradient problems but is significantly slower than backpropagation for networks with many weights (curse of dimensionality). PSO with 50 particles on a network with 1000 weights requires 50 forward passes per iteration versus 1 backward pass for gradient descent. Use PSO only for small networks or when gradients are unavailable (e.g., non-differentiable activation functions, discrete weights).

📝Problem 4: ACO for Vehicle Routing

Apply ant colony optimization to a vehicle routing problem with 15 customers, 3 vehicles, and capacity constraints. Each ant constructs a complete route by probabilistically selecting the next customer based on pheromone and distance heuristic.

💡Solution

Construction graph: nodes = customers + depot. Ants start at the depot, greedily probabilistically select next unvisited customer until vehicle capacity is reached, then return to depot and start a new route. Pheromone is deposited proportional to $1/\text{total route length}$ . Use $\alpha=1$ , $\beta=3$ (favor short edges), $\rho=0.5$ . After 100 iterations with 20 ants, ACO typically finds routes within 10% of the optimal for small instances. The key advantage over pure greedy is that early pheromone trails encode partial route structure that guides later ants.

Quick Reference

📋Quick Reference — Metaheuristic Methods

Simulated Annealing

Single-solution; temperature controls exploration vs. exploitation
Acceptance: $P = e^{-\Delta E / T}$ ; cooling: $T_k = T_0 \alpha^k$
Best for: continuous and combinatorial; simple to implement

Genetic Algorithm

Population-based; selection + crossover + mutation
Key parameters: pop size, crossover rate, mutation rate
Best for: multi-modal, discrete/combinatorial, parallelizable

Particle Swarm Optimization

Population-based; velocity toward personal best + global best
Velocity: $v = \omega v + c_1 r_1 (p_{\text{best}} - x) + c_2 r_2 (g_{\text{best}} - x)$
Best for: continuous parameter tuning, neural network weights

Ant Colony Optimization

Population-based; pheromone-guided probabilistic construction
Transition: $P_{ij} \propto \tau_{ij}^\alpha \cdot \eta_{ij}^\beta$
Best for: graph problems, TSP, vehicle routing

Tabu Search

Single-solution; tabu list prevents cycling
Aspiration criteria override tabu when beneficial
Best for: combinatorial local search with structure

Random Restart Hill Climbing

Simple baseline; multiple independent runs
Probability of finding optimum: $1 - (1-p)^n$
Best for: cheap objectives, few local optima

Cross-References

Gradient Descent (Topic: First-Order Methods) — use when gradients are available and the problem is smooth
Newton's Method (Topic: Second-Order Methods) — faster convergence but requires Hessian computation
Convex Optimization (Topic: Convexity) — when the problem is convex, exact methods are preferred
Linear Programming (Topic: Linear Programming) — simplex or interior-point methods for linear objectives with linear constraints
Constraint Programming (Topic: Constraint Satisfaction) — exact method for discrete combinatorial problems with complex constraints
Bayesian Optimization (Topic: Sequential Model-Based Optimization) — sample-efficient alternative to grid/random search for hyperparameter tuning
Reinforcement Learning (Topic: Policy Optimization) — evolutionary strategies (ES) as a gradient-free alternative to policy gradient methods
Combinatorial Optimization (Topic: Graph Algorithms) — branch-and-bound, LP relaxation, and approximation algorithms as exact or near-exact alternatives to metaheuristics