Metaheuristic Optimization
π‘ Why It Matters
Many real-world optimization problems β scheduling routes, training neural networks, designing circuits β involve non-convex landscapes with thousands of local minima, discrete variables, or no computable gradient. Classical gradient descent gets trapped or cannot even be applied. Metaheuristics are problem-independent strategies that navigate such rugged landscapes efficiently, finding high-quality (if not provably optimal) solutions in reasonable time. They are the backbone of modern AI hyperparameter tuning, logistics planning, and combinatorial optimization.
What Are Metaheuristics?
DfMetaheuristic
A metaheuristic is a high-level optimization framework that guides a underlying heuristic search procedure. Unlike exact methods (branch-and-bound, simplex), metaheuristics do not guarantee finding the global optimum but are designed to escape local optima through mechanisms like randomization, population-based search, memory structures, or temperature schedules. The term "meta" means "about" β a metaheuristic is a strategy for managing heuristics.
Key characteristics shared by most metaheuristics:
- No gradient required β only function evaluations are needed
- Balances exploration (searching new regions) and exploitation (refining known good regions)
- Stochastic β randomness helps escape local minima but means results vary between runs
- Applicable to black-box functions β the objective function can be a simulation, neural network, or oracle
Simulated Annealing
DfSimulated Annealing
Simulated annealing (SA) is a single-solution metaheuristic inspired by the metallurgical process of annealing, where a material is heated and slowly cooled to reduce defects. At high temperatures, the algorithm freely accepts worse solutions (exploration); as temperature decreases, it becomes more greedy (exploitation), converging to a local β hopefully global β optimum.
Acceptance Probability (Metropolis Criterion)
Here,
- =Energy difference: f(new) β f(current)
- =Temperature parameter (controls acceptance rate)
The temperature schedule typically follows:
Cooling Schedule
Here,
- =Initial temperature
- =Cooling rate (e.g., 0.95β0.99)
- =Iteration counter
ThConvergence Property
With a logarithmic cooling schedule for sufficiently large constant , simulated annealing converges to the global optimum with probability 1 as . In practice, faster geometric schedules are used and convergence is not guaranteed, but solutions are usually good.
Genetic Algorithms
DfGenetic Algorithm
A genetic algorithm (GA) maintains a population of candidate solutions that evolve over generations. Each individual is evaluated by a fitness function. The fittest individuals are selected as parents, which produce offspring through crossover (combining parent traits) and mutation (random perturbation). Over many generations, the population converges toward high-fitness regions of the search space, mimicking natural selection.
Core operators:
- Selection β tournament, roulette wheel, or rank-based; chooses parents proportional to fitness
- Crossover β single-point, multi-point, or uniform; recombines two parents to produce children
- Mutation β bit-flip (binary), Gaussian perturbation (continuous); introduces genetic diversity
- Elitism β preserves the best individuals across generations to prevent regression
Crossover (Blend)
Here,
- =Crossover coefficient (often uniform random)
- =Parent solution vector
Particle Swarm Optimization
DfParticle Swarm Optimization (PSO)
Particle swarm optimization models a flock of birds or school of fish. Each particle has a position and velocity in the search space. At each iteration, velocity is updated toward the particle's personal best () and the swarm's global best (), producing a balance between individual experience and social information.
Velocity Update
Here,
- =Inertia weight (controls momentum)
- =Cognitive coefficient (personal best pull)
- =Social coefficient (global best pull)
- =Uniform random numbers in [0,1]
Position Update
Here,
- =Position of particle i
- =Velocity of particle i
π‘ Inertia Weight Tuning
Start with (exploration) and linearly decrease to (exploitation) over iterations. This adaptive inertia balances global search early and local refinement later.
Ant Colony Optimization
DfAnt Colony Optimization (ACO)
Ant colony optimization models how real ants find shortest paths using pheromone trails. Artificial ants construct solutions incrementally by traversing a construction graph. Edges carry pheromone levels that are reinforced by good solutions and evaporate over time. Probabilistic selection balances pheromone intensity (exploitation) and heuristic information (exploration).
Transition Probability
Here,
- =Pheromone on edge (i,j)
- =Heuristic desirability (e.g., 1/distance)
- =Pheromone influence exponent
- =Heuristic influence exponent
- =Feasible neighbors of node i
Pheromone update:
Pheromone Update
Here,
- =Evaporation rate (0 < Ο β€ 1)
- =Pheromone deposited by ant k on edge (i,j)
Tabu Search
DfTabu Search
Tabu search is a local-search metaheuristic that uses memory to prevent revisiting recently explored solutions. A tabu list stores recent moves that are forbidden for a certain number of iterations (the tabu tenure). This forces the search into unexplored regions even when all neighboring moves worsen the objective. Aspiration criteria allow overriding tabu status if a move produces a solution better than the best known.
Key components:
- Tabu list β FIFO queue of forbidden moves (size = tabu tenure)
- Aspiration criteria β override tabu if the move yields the best-known solution
- Neighborhood structure β defines what constitutes a single move (swap, insert, reverse)
- Diversification β periodically reset search to unexplored areas if stagnation is detected
- Intensification β revisit the neighborhood of the best-known solution periodically
Random Restart Hill Climbing
DfRandom Restart Hill Climbing
A simple baseline: run hill climbing (always move to the best neighbor) multiple times from random initial points. Each restart is independent. The best result across all restarts is returned. Though naive, this method is surprisingly effective for problems with a moderate number of local optima and cheap function evaluations.
ThCompleteness via Restarts
If the probability of reaching the global optimum basin from a random start is , the probability that at least one of independent restarts finds it is , which approaches 1 as .
Comparison of Methods
| Method | Solution Type | Memory | Parallelism | Best For | Weakness |
|---|---|---|---|---|---|
| Simulated Annealing | Single | None | Low | Continuous/combinatorial | Slow cooling schedule needed |
| Genetic Algorithm | Population | None (elitism optional) | High | Multi-modal, discrete | Premature convergence risk |
| Particle Swarm | Population | Personal + global best | High | Continuous parameter tuning | Can stagnate in high dimensions |
| Ant Colony | Population | Pheromone matrix | Medium | Graph routing, TSP | Slow convergence on large instances |
| Tabu Search | Single | Tabu list | Medium | Combinatorial local search | Sensitive to tabu tenure tuning |
| Random Restart HC | Single | None | Very High | Simple landscapes | Exponential restarts for hard problems |
Python Implementation
import numpy as np
from typing import Callable, List, Tuple
# ============================================================
# Simulated Annealing
# ============================================================
def simulated_annealing(
f: Callable,
x0: np.ndarray,
T0: float = 100.0,
alpha: float = 0.95,
n_iter: int = 5000,
step_size: float = 0.5,
seed: int = None
) -> Tuple[np.ndarray, float]:
"""Simulated annealing for continuous optimization."""
if seed is not None:
np.random.seed(seed)
current = x0.copy()
current_val = f(current)
best = current.copy()
best_val = current_val
T = T0
for i in range(n_iter):
neighbor = current + np.random.randn(len(current)) * step_size
neighbor_val = f(neighbor)
delta = neighbor_val - current_val
if delta < 0 or np.random.rand() < np.exp(-delta / T):
current = neighbor
current_val = neighbor_val
if current_val < best_val:
best = current.copy()
best_val = current_val
T *= alpha
return best, best_val
# ============================================================
# Genetic Algorithm
# ============================================================
def genetic_algorithm(
f: Callable,
n_vars: int,
bounds: Tuple[float, float] = (-5.0, 5.0),
n_pop: int = 100,
n_gen: int = 200,
crossover_rate: float = 0.8,
mutation_rate: float = 0.1,
seed: int = None
) -> Tuple[np.ndarray, float]:
"""Genetic algorithm with tournament selection and blend crossover."""
if seed is not None:
np.random.seed(seed)
lo, hi = bounds
pop = np.random.uniform(lo, hi, (n_pop, n_vars))
fitness = np.array([f(ind) for ind in pop])
best_idx = np.argmin(fitness)
best = pop[best_idx].copy()
best_val = fitness[best_idx]
for gen in range(n_gen):
# Tournament selection (size 3)
def tournament():
idx = np.random.choice(n_pop, 3, replace=False)
return pop[idx[np.argmin(fitness[idx])]]
new_pop = []
# Elitism: keep best
new_pop.append(best.copy())
while len(new_pop) < n_pop:
p1, p2 = tournament(), tournament()
if np.random.rand() < crossover_rate:
beta = np.random.rand(n_vars)
child = beta * p1 + (1 - beta) * p2
else:
child = p1.copy()
if np.random.rand() < mutation_rate:
child += np.random.randn(n_vars) * 0.3
child = np.clip(child, lo, hi)
new_pop.append(child)
pop = np.array(new_pop[:n_pop])
fitness = np.array([f(ind) for ind in pop])
gen_best = np.argmin(fitness)
if fitness[gen_best] < best_val:
best = pop[gen_best].copy()
best_val = fitness[gen_best]
return best, best_val
# ============================================================
# Particle Swarm Optimization
# ============================================================
def particle_swarm_optimization(
f: Callable,
n_vars: int,
bounds: Tuple[float, float] = (-5.0, 5.0),
n_particles: int = 30,
n_iter: int = 200,
omega: float = 0.7,
c1: float = 1.5,
c2: float = 1.5,
seed: int = None
) -> Tuple[np.ndarray, float]:
"""Particle swarm optimization for continuous functions."""
if seed is not None:
np.random.seed(seed)
lo, hi = bounds
x = np.random.uniform(lo, hi, (n_particles, n_vars))
v = np.random.uniform(-1, 1, (n_particles, n_vars))
p_best = x.copy()
p_best_val = np.array([f(p) for p in x])
g_best = p_best[np.argmin(p_best_val)].copy()
g_best_val = np.min(p_best_val)
for t in range(n_iter):
r1 = np.random.rand(n_particles, n_vars)
r2 = np.random.rand(n_particles, n_vars)
v = (omega * v
+ c1 * r1 * (p_best - x)
+ c2 * r2 * (g_best - x))
x = x + v
x = np.clip(x, lo, hi)
vals = np.array([f(xi) for xi in x])
for i in range(n_particles):
if vals[i] < p_best_val[i]:
p_best[i] = x[i].copy()
p_best_val[i] = vals[i]
best_idx = np.argmin(p_best_val)
if p_best_val[best_idx] < g_best_val:
g_best = p_best[best_idx].copy()
g_best_val = p_best_val[best_idx]
return g_best, g_best_val
# ============================================================
# Demo
# ============================================================
if __name__ == "__main__":
rosenbrock = lambda x: (1 - x[0])**2 + 100*(x[1] - x[0]**2)**2
x0 = np.array([-2.0, 3.0])
sa_best, sa_val = simulated_annealing(rosenbrock, x0, seed=42)
ga_best, ga_val = genetic_algorithm(rosenbrock, 2, seed=42)
pso_best, pso_val = particle_swarm_optimization(rosenbrock, 2, seed=42)
print(f"SA: f({sa_best}) = {sa_val:.6f}")
print(f"GA: f({ga_best}) = {ga_val:.6f}")
print(f"PSO: f({pso_best}) = {pso_val:.6f}")
Applications in AI/ML
π‘ Neural Architecture Search (NAS)
The design of neural network architectures β number of layers, filter sizes, skip connections β is itself a combinatorial optimization problem. Metaheuristics like genetic algorithms and evolutionary strategies have been used to discover architectures that rival hand-crafted designs. Real-world NAS systems (e.g., AmoebaNet, EfficientNet evolution) use large-scale evolutionary search over architecture encoding graphs, evaluating thousands of candidate architectures on proxy tasks before scaling the best to full training. This replaces months of human expert tuning with automated search.
Other applications:
- Hyperparameter tuning β PSO and genetic algorithms optimize learning rate, batch size, regularization strength
- Feature selection β binary GA selects subsets of features; reduces dimensionality without gradient computation
- Clustering β ACO and GA solve k-medoids and graph partitioning problems
- Reinforcement learning β evolutionary strategies (ES) optimize policy parameters without backpropagation (OpenAI ES)
- GAN training β NSGA-II balances generator and discriminator objectives in multi-objective GAN design
Common Mistakes
πCommon Mistakes
| Mistake | Why It Fails | Fix |
|---|---|---|
| Setting initial temperature too low in SA | Algorithm acts like greedy hill climbing; no exploration | Set so initial acceptance rate is ~80%: |
| Using too-small population in GA | Premature convergence; loss of diversity | Use as a rule of thumb |
| Ignoring mutation in GA | Population collapses to identical individuals | Maintain mutation rate of 5β20% depending on problem |
| Fixed step size in SA | Too small β slow convergence; too large β poor precision | Decrease step size with temperature: |
| No velocity clamping in PSO | Particles fly to infinity; numerical overflow | Clamp velocity to where |
| Forgetting pheromone evaporation in ACO | Pheromone accumulates without bound; early paths dominate | Always include evaporation term each iteration |
| Setting tabu tenure too short in Tabu Search | Algorithm cycles between the same few solutions | Use tenure or tune empirically |
| Running too few restarts in hill climbing | High probability of missing the global optimum | Use restarts for moderate problems |
Interview Questions
Q1: When would you choose a genetic algorithm over simulated annealing?
A: GAs are preferred when the search space has many disjoint promising regions (multi-modal), because the population naturally explores multiple basins in parallel. SA, being single-solution, may get trapped unless the temperature is carefully managed. GAs also naturally handle discrete/combinatorial encodings (e.g., permutations for scheduling) via specialized crossover operators.
Q2: How does PSO differ from gradient descent?
A: PSO requires no gradient β it uses velocity updates based on personal and global best positions. Gradient descent follows the steepest local slope and converges faster on smooth convex functions. PSO excels when the objective is non-differentiable, noisy, or has many local minima where gradient methods stall.
Q3: What is the role of the cooling schedule in simulated annealing?
A: The cooling schedule controls the balance between exploration and exploitation. High accepts worse solutions (escaping local minima); low behaves greedily (refining). Too fast β stuck in local optimum. Too slow β wasteful computation. The geometric schedule with is a practical default.
Q4: Why is elitism important in genetic algorithms?
A: Without elitism, the best solution found so far can be lost due to crossover or mutation in the next generation. Elitism copies the top individuals directly to the next generation, ensuring monotonic improvement of the best fitness. Most practical GA implementations use elitism of 1β5% of the population.
Q5: How do you handle constraints in metaheuristics?
A: Common approaches: (1) penalty functions β add a cost for constraint violation to the objective; (2) repair operators β map infeasible solutions to the nearest feasible one; (3) Decoder-based β encode solutions such that they are always feasible (e.g., for TSP, use permutation encoding); (4) Lagrangian relaxation β incorporate constraints into the objective with multipliers updated adaptively.
Practice Problems
πProblem 1: SA on the Traveling Salesman Problem
Implement simulated annealing for a 20-city TSP instance. Use 2-opt neighborhood (reverse a sub-tour). Start with , , and 10,000 iterations. Compare the result with a random restart 2-opt local search.
π‘Solution
The key decisions are: (1) neighborhood = 2-opt swap (pick two edges, reverse the segment between them); (2) temperature schedule must be slow enough for the discrete TSP landscape; (3) step size is implicitly the neighborhood structure. Random restart 2-opt with 100 restarts typically finds a solution within 5β15% of the optimal, while SA with 10,000 iterations often reaches comparable quality with a single run. SA's advantage is no restart management; the disadvantage is sensitivity to cooling schedule.
πProblem 2: GA for Feature Selection
Use a binary-encoded GA to select features for a classification task. Encode each individual as a binary string of length (one bit per feature). Fitness = classification accuracy on a validation set using only selected features. Run for 50 generations with population 100, crossover rate 0.8, mutation rate 0.01.
π‘Solution
Encoding: binary vector . Selection: tournament size 3. Crossover: uniform crossover. Mutation: flip each bit independently with probability 0.01. The fitness function is expensive (trains a classifier each evaluation), so use a fast model (logistic regression or small tree). Elitism of top 2 prevents losing good feature sets. After convergence, examine the frequency of each bit being 1 to identify consistently selected features.
πProblem 3: PSO for Neural Network Weights
Treat the weight vector of a small feedforward network as the particle position. Use PSO with 50 particles to optimize weights for a regression task. Compare with standard backpropagation in terms of final MSE and convergence speed.
π‘Solution
Each particle's position is the flattened weight vector . Fitness = MSE on training set. PSO avoids vanishing gradient problems but is significantly slower than backpropagation for networks with many weights (curse of dimensionality). PSO with 50 particles on a network with 1000 weights requires 50 forward passes per iteration versus 1 backward pass for gradient descent. Use PSO only for small networks or when gradients are unavailable (e.g., non-differentiable activation functions, discrete weights).
πProblem 4: ACO for Vehicle Routing
Apply ant colony optimization to a vehicle routing problem with 15 customers, 3 vehicles, and capacity constraints. Each ant constructs a complete route by probabilistically selecting the next customer based on pheromone and distance heuristic.
π‘Solution
Construction graph: nodes = customers + depot. Ants start at the depot, greedily probabilistically select next unvisited customer until vehicle capacity is reached, then return to depot and start a new route. Pheromone is deposited proportional to . Use , (favor short edges), . After 100 iterations with 20 ants, ACO typically finds routes within 10% of the optimal for small instances. The key advantage over pure greedy is that early pheromone trails encode partial route structure that guides later ants.
Quick Reference
πQuick Reference β Metaheuristic Methods
Simulated Annealing
- Single-solution; temperature controls exploration vs. exploitation
- Acceptance: ; cooling:
- Best for: continuous and combinatorial; simple to implement
Genetic Algorithm
- Population-based; selection + crossover + mutation
- Key parameters: pop size, crossover rate, mutation rate
- Best for: multi-modal, discrete/combinatorial, parallelizable
Particle Swarm Optimization
- Population-based; velocity toward personal best + global best
- Velocity:
- Best for: continuous parameter tuning, neural network weights
Ant Colony Optimization
- Population-based; pheromone-guided probabilistic construction
- Transition:
- Best for: graph problems, TSP, vehicle routing
Tabu Search
- Single-solution; tabu list prevents cycling
- Aspiration criteria override tabu when beneficial
- Best for: combinatorial local search with structure
Random Restart Hill Climbing
- Simple baseline; multiple independent runs
- Probability of finding optimum:
- Best for: cheap objectives, few local optima
Cross-References
- Gradient Descent (Topic: First-Order Methods) β use when gradients are available and the problem is smooth
- Newton's Method (Topic: Second-Order Methods) β faster convergence but requires Hessian computation
- Convex Optimization (Topic: Convexity) β when the problem is convex, exact methods are preferred
- Linear Programming (Topic: Linear Programming) β simplex or interior-point methods for linear objectives with linear constraints
- Constraint Programming (Topic: Constraint Satisfaction) β exact method for discrete combinatorial problems with complex constraints
- Bayesian Optimization (Topic: Sequential Model-Based Optimization) β sample-efficient alternative to grid/random search for hyperparameter tuning
- Reinforcement Learning (Topic: Policy Optimization) β evolutionary strategies (ES) as a gradient-free alternative to policy gradient methods
- Combinatorial Optimization (Topic: Graph Algorithms) β branch-and-bound, LP relaxation, and approximation algorithms as exact or near-exact alternatives to metaheuristics