Model Merging and Fusion: Combining Knowledge Without Training
What if you could combine the strengths of multiple specialized models into one generalist model without additional training? Model merging achieves this by averaging, interpolating, or strategically combining model weights.
- Model Soups: Averaging weights from fine-tuned models
- TIES-Merging: Resolving interference through trim, elect, and disjoint merge
- DARE: Drop and rescale for massive model merging
- Task Arithmetic: Treating fine-tuning as task vectors
The whole can be greater than the sum of its parts if you know how to combine them.
Model Merging and Fusion
When you fine-tune a base model on different tasks, each specialized model encodes task-specific knowledge in its weights. Model merging combines these specialized models into a single model that aims to inherit the capabilities of all of them, without requiring training data or gradient updates; the only cost is arithmetic on the weights.
Definition: Model Merging
Model merging is the process of combining the parameters of multiple models into a single model. The goal is to create a model that performs well on all the tasks the source models were specialized for.
Model Soups
Linear Interpolation
Definition: Model Soups
Model soups (Wortsman et al., 2022) combine models by averaging their weights. The key insight is that fine-tuned models from the same pre-trained initialization often lie in the same loss basin, making weight averaging effective.
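Weight Averaging
$$\theta_{\text{merged}} = \sum_{i=1}^{n} \alpha_i \theta_i, \qquad \sum_{i=1}^{n} \alpha_i = 1$$
Here,
- $\theta_{\text{merged}}$ = Merged model weights
- $\alpha_i$ = Weight for model $i$
- $\theta_i$ = Weights of model $i$
- $n$ = Number of models to merge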
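import torch
import copy  # used by the merger classes later in this lesson
from typing import Dict, List, Optional


class ModelSoupMerger:
    """Merge models by weight averaging."""

    def __init__(self, models: List[Dict[str, torch.Tensor]],
                 weights: Optional[List[float]] = None):
        # Each entry of `models` is a state dict with identical keys and shapes.
        self.models = models
        self.n_models = len(models)
        if weights is None:
            # Uniform soup: every model contributes equally.
            self.weights = [1.0 / self.n_models] * self.n_models
        else:
            # Normalize so the coefficients sum to 1.
            total = sum(weights)
            self.weights = [w / total for w in weights]

    def merge(self) -> Dict[str, torch.Tensor]:
        """Perform linear weight averaging."""
        # Accumulate in float32 to limit rounding error with half-precision weights.
        merged = {key: torch.zeros_like(v, dtype=torch.float32)
                  for key, v in self.models[0].items()}
        for model, weight in zip(self.models, self.weights):
            for key in merged:
                merged[key] += weight * model[key].float()
        return merged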
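As a quick sanity check of weight averaging, the sketch below averages two tiny `torch.nn.Linear` layers and loads the result into a third. The layer sizes, the seed, and the helper name `average_state_dicts` are illustrative assumptions, not part of the lesson's own code. For a single linear layer, the output of the averaged model equals the average of the two outputs; for deep nonlinear networks this holds only approximately, and only when the models sit in the same loss basin.

```python
import torch
from torch import nn


def average_state_dicts(state_dicts, weights=None):
    """Weighted average of state dicts with identical keys (hypothetical helper)."""
    n = len(state_dicts)
    if weights is None:
        weights = [1.0 / n] * n
    else:
        total = sum(weights)
        weights = [w / total for w in weights]
    return {
        key: sum(w * sd[key].float() for w, sd in zip(weights, state_dicts))
        for key in state_dicts[0]
    }


torch.manual_seed(0)
model_a, model_b, soup = nn.Linear(4, 2), nn.Linear(4, 2), nn.Linear(4, 2)

# Load the averaged weights into a third module with the same architecture.
soup.load_state_dict(
    average_state_dicts([model_a.state_dict(), model_b.state_dict()])
)

x = torch.randn(3, 4)
with torch.no_grad():
    mean_of_outputs = 0.5 * (model_a(x) + model_b(x))
    output_of_soup = soup(x)

# For a linear layer the two agree up to floating-point error.
print(torch.allclose(mean_of_outputs, output_of_soup, atol=1e-6))
```

Loading the merged dictionary with `load_state_dict` is also how a merged checkpoint would normally be put back into a model for evaluation.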
Types of Model Soups
| Type | Description | When to Use |
|---|---|---|
| Uniform Soup | Equal weights for all models | Models are equally good |
| Greedy Soup | Add models only if they improve validation | Have validation data |
| Learned Soup | Optimize weights on validation set | Have enough data |
Greedy soup iteratively adds models to the merge if they improve performance on a validation set. This often outperforms uniform averaging because not all fine-tuned models contribute positively.
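The greedy procedure can be sketched as follows. This is a minimal illustration, assuming a caller-supplied `evaluate_fn(state_dict) -> float` where higher is better; that function, the helper names, and the toy "validation score" (negative squared distance to a target vector) are all assumptions made for the example.

```python
import torch


def uniform_average(state_dicts):
    """Plain average of state dicts with identical keys."""
    n = len(state_dicts)
    return {k: sum(sd[k].float() for sd in state_dicts) / n for k in state_dicts[0]}


def greedy_soup(state_dicts, evaluate_fn):
    """Add models in order of individual score; keep each only if the soup does not get worse."""
    order = sorted(range(len(state_dicts)),
                   key=lambda i: evaluate_fn(state_dicts[i]), reverse=True)
    chosen = [order[0]]
    best_score = evaluate_fn(state_dicts[order[0]])
    for i in order[1:]:
        candidate = uniform_average([state_dicts[j] for j in chosen + [i]])
        score = evaluate_fn(candidate)
        if score >= best_score:  # accept the ingredient
            chosen.append(i)
            best_score = score
    return uniform_average([state_dicts[j] for j in chosen]), chosen


# Toy setup: the "validation score" is how close the weights are to a target.
target = torch.tensor([1.0, 2.0])
candidates = [
    {"w": torch.tensor([1.1, 2.0])},  # good
    {"w": torch.tensor([0.8, 2.0])},  # good, errs in the opposite direction
    {"w": torch.tensor([3.0, 5.0])},  # poor
]


def toy_score(sd):
    return -float(((sd["w"] - target) ** 2).sum())


soup, chosen = greedy_soup(candidates, toy_score)
print(chosen)     # [0, 1]  (the poor model is rejected)
print(soup["w"])  # approximately tensor([0.95, 2.00])
```

In the toy run, the first two models average to a point closer to the target than either alone, while adding the third would pull the soup away, so it is skipped.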
Task Arithmetic
Task Vectors
Definition: Task Vector
A task vector is the difference between fine-tuned and pre-trained weights: $\tau = \theta_{\text{finetuned}} - \theta_{\text{pretrained}}$. This vector captures the knowledge needed for a specific task.
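Task Arithmetic
$$\theta_{\text{merged}} = \theta_{\text{pretrained}} + \lambda \sum_{i=1}^{n} \tau_i$$
Here,
- $\theta_{\text{pretrained}}$ = Base pre-trained model weights
- $\tau_i$ = Task vector for task $i$
- $\lambda$ = Scaling factor
- $n$ = Number of tasks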
class TaskArithmetic:
    """Combine models using task vectors."""

    def __init__(self, base_model, fine_tuned_models):
        # All arguments are state dicts sharing the same keys and shapes.
        self.base_model = base_model
        self.fine_tuned_models = fine_tuned_models

    def compute_task_vectors(self):
        """Compute the task vector (fine-tuned minus base) for each model."""
        task_vectors = []
        for model in self.fine_tuned_models:
            tv = {key: model[key] - self.base_model[key]
                  for key in self.base_model}
            task_vectors.append(tv)
        return task_vectors

    def merge_with_task_arithmetic(self, scaling_factor=1.0):
        """Add every task vector, scaled by the same factor, to the base weights."""
        merged = copy.deepcopy(self.base_model)
        for tv in self.compute_task_vectors():
            for key in merged:
                merged[key] += scaling_factor * tv[key]
        return merged

    def merge_with_negation(self, task_to_negate, scaling_factor=-1.0):
        """Remove one task's knowledge by negating its task vector.

        `task_to_negate` is the index of that model in `fine_tuned_models`;
        all other task vectors are added with factor 1.0.
        """
        merged = copy.deepcopy(self.base_model)
        task_vectors = self.compute_task_vectors()
        for i, tv in enumerate(task_vectors):
            factor = scaling_factor if i == task_to_negate else 1.0
            for key in merged:
                merged[key] += factor * tv[key]
        return merged
Task arithmetic enables model algebra: you can add tasks (combine capabilities), negate tasks (remove unwanted behavior), or scale tasks (adjust importance).
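A small worked example makes the three operations concrete. The tensors and the names `ft_math` and `ft_code` are made up for illustration; real task vectors span every tensor in a checkpoint.

```python
import torch

base = {"w": torch.tensor([1.0, 1.0, 1.0])}
ft_math = {"w": torch.tensor([1.5, 1.0, 0.8])}  # hypothetical fine-tune A
ft_code = {"w": torch.tensor([1.0, 2.0, 1.1])}  # hypothetical fine-tune B


def task_vector(finetuned, pretrained):
    """tau = theta_finetuned - theta_pretrained, per tensor."""
    return {k: finetuned[k] - pretrained[k] for k in pretrained}


def apply_task_vectors(pretrained, vectors, coefficients):
    """theta_pretrained + sum_i coeff_i * tau_i."""
    merged = {k: v.clone() for k, v in pretrained.items()}
    for tv, coeff in zip(vectors, coefficients):
        for k in merged:
            merged[k] += coeff * tv[k]
    return merged


tau_math = task_vector(ft_math, base)  # [0.5, 0.0, -0.2]
tau_code = task_vector(ft_code, base)  # [0.0, 1.0,  0.1]

added = apply_task_vectors(base, [tau_math, tau_code], [1.0, 1.0])   # [1.5, 2.0, 0.9]
negated = apply_task_vectors(base, [tau_math], [-1.0])               # [0.5, 1.0, 1.2]
scaled = apply_task_vectors(base, [tau_math, tau_code], [0.5, 0.5])  # [1.25, 1.5, 0.95]
```

Addition moves the base along both task directions at once, negation moves it away from one task, and scaling shrinks each step, which is often useful when the summed vectors overshoot.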
TIES-Merging
The Interference Problem
Definition: Weight Interference
Weight interference occurs when different models modify the same weights in opposite directions, causing conflicts during merging. TIES-Merging addresses this by identifying and resolving these conflicts.
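To see interference on a toy case, consider two made-up task vectors that disagree in sign on their two largest entries. The values below are illustrative only.

```python
import torch

tau_a = torch.tensor([0.8, -0.6, 0.0, 0.3])
tau_b = torch.tensor([-0.7, 0.5, 0.4, 0.2])

# Naive averaging: the large, opposing updates nearly cancel.
naive_mean = (tau_a + tau_b) / 2  # approximately [0.05, -0.05, 0.20, 0.25]

# A parameter is in conflict when the two updates have opposite signs.
conflict = (tau_a * tau_b) < 0    # [True, True, False, False]
conflict_rate = conflict.float().mean().item()  # 0.5
```

The first two entries carried the strongest updates in both models, yet after averaging they are the smallest. Neither task's change survives, which is the failure mode that sign-aware merging targets.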
TIES Algorithm
Definition: TIES-Merging
TIES-Merging (Yadav et al., 2023) resolves interference through three steps: (1) Trim small updates, (2) Elect sign consensus, (3) Disjoint merge of non-conflicting parameters.
class TIESMerger:
    """TIES-Merging implementation."""

    def __init__(self, base_model, fine_tuned_models, top_k=0.2):
        self.base_model = base_model
        self.fine_tuned_models = fine_tuned_models
        self.top_k = top_k  # fraction of entries kept per tensor (0 < top_k <= 1)

    def merge(self):
        """Perform TIES merging: trim, elect signs, disjoint merge."""
        task_vectors = self._compute_task_vectors()
        trimmed = self._trim_updates(task_vectors)
        consensus = self._elect_signs(trimmed)
        return self._disjoint_merge(trimmed, consensus)

    def _compute_task_vectors(self):
        """Compute task vectors for each model."""
        return [{key: model[key] - self.base_model[key]
                 for key in self.base_model}
                for model in self.fine_tuned_models]

    def _trim_updates(self, task_vectors):
        """Keep only the top-k fraction of updates by magnitude; zero the rest."""
        trimmed = []
        for tv in task_vectors:
            trimmed_tv = {}
            for key in tv:
                flat = tv[key].flatten()
                # Keep at least one entry so topk never receives k=0.
                n_keep = max(1, int(flat.numel() * self.top_k))
                threshold = flat.abs().topk(n_keep).values[-1]
                mask = tv[key].abs() >= threshold
                trimmed_tv[key] = tv[key] * mask.to(tv[key].dtype)
            trimmed.append(trimmed_tv)
        return trimmed

    def _elect_signs(self, trimmed_vectors):
        """Elect one sign per parameter.

        The elected sign is the sign of the summed trimmed updates, i.e. the
        direction with the larger total magnitude across models.
        """
        consensus = {}
        for key in trimmed_vectors[0]:
            total = torch.zeros_like(trimmed_vectors[0][key])
            for tv in trimmed_vectors:
                total += tv[key]
            consensus[key] = total.sign()
        return consensus

    def _disjoint_merge(self, trimmed_vectors, consensus_signs):
        """Average, per parameter, the updates whose sign matches the elected sign."""
        merged = copy.deepcopy(self.base_model)
        for key in merged:
            sum_values = torch.zeros_like(merged[key])
            count = torch.zeros_like(merged[key])
            for tv in trimmed_vectors:
                # A model contributes only where it is non-zero and agrees in sign.
                agrees = ((tv[key] != 0) &
                          (tv[key].sign() == consensus_signs[key])).to(merged[key].dtype)
                sum_values += tv[key] * agrees
                count += agrees
            # Parameters with no agreeing update keep their base value.
            merged[key] += sum_values / count.clamp(min=1)
        return merged
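The three steps can be traced by hand on two short, made-up task vectors. The function below is a compact sketch operating on 1-D tensors rather than full state dicts; the name `ties_merge_vectors` and the `density` argument (the fraction kept in the trim step) are assumptions for this example.

```python
import torch


def ties_merge_vectors(task_vectors, density=0.5):
    """Trim, elect signs, and disjoint-merge a list of 1-D task vectors."""
    stacked = torch.stack(task_vectors)  # shape: (n_tasks, n_params)

    # 1) Trim: keep the `density` fraction of largest-magnitude entries per task.
    k = max(1, int(stacked.shape[1] * density))
    thresholds = stacked.abs().topk(k, dim=1).values[:, -1:]
    trimmed = torch.where(stacked.abs() >= thresholds, stacked,
                          torch.zeros_like(stacked))

    # 2) Elect: sign of the summed trimmed updates for each parameter.
    elected = trimmed.sum(dim=0).sign()

    # 3) Disjoint merge: average only the non-zero entries that match that sign.
    agrees = (trimmed != 0) & (trimmed.sign() == elected)
    total = (trimmed * agrees).sum(dim=0)
    count = agrees.sum(dim=0).clamp(min=1)
    return total / count


tau_a = torch.tensor([0.9, -0.1, 0.5, 0.05])
tau_b = torch.tensor([-0.4, 0.6, 0.3, -0.02])

merged_tau = ties_merge_vectors([tau_a, tau_b], density=0.5)
naive_tau = (tau_a + tau_b) / 2

print(merged_tau)  # approximately tensor([0.90, 0.60, 0.50, 0.00])
print(naive_tau)   # approximately tensor([0.250, 0.250, 0.400, 0.015])
```

After trimming, `tau_a` keeps `[0.9, 0, 0.5, 0]` and `tau_b` keeps `[-0.4, 0.6, 0, 0]`. The first parameter is contested; the positive direction has more total magnitude, so only `0.9` is kept there instead of being diluted to `0.25`.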
DARE (Drop And REscale)
Theory
Definition: DARE
DARE (Yu et al., 2024) randomly drops elements from task vectors and rescales the remaining ones so that each task vector is preserved in expectation. Sparser task vectors overlap less, which reduces interference when many models are merged.
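DARE Drop and Rescale
$$\tilde{\tau} = \frac{m \odot \tau}{p}, \qquad m \sim \text{Bernoulli}(p)$$
Here,
- $\tau$ = Original task vector
- $m$ = Binary mask (Bernoulli($p$))
- $p$ = Keep probability
- $\tilde{\tau}$ = Dropped and rescaled task vector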
class DAREMerger:
    """DARE (Drop And REscale) merging."""

    def __init__(self, base_model, fine_tuned_models, drop_rate=0.9):
        if not 0.0 <= drop_rate < 1.0:
            # drop_rate == 1 would make keep_prob zero and divide by zero below.
            raise ValueError("drop_rate must be in [0, 1)")
        self.base_model = base_model
        self.fine_tuned_models = fine_tuned_models
        self.keep_prob = 1 - drop_rate

    def merge(self, n_samples=10):
        """Average several randomly masked merges to reduce variance.

        Note: the averaged result is denser than any single DARE merge.
        """
        merged_models = [self._single_merge() for _ in range(n_samples)]
        final = copy.deepcopy(merged_models[0])
        for key in final:
            for m in merged_models[1:]:
                final[key] += m[key]
            final[key] /= n_samples
        return final

    def _single_merge(self):
        """Single DARE merge with random dropping."""
        merged = copy.deepcopy(self.base_model)
        for model in self.fine_tuned_models:
            for key in merged:
                task_vec = model[key] - self.base_model[key]
                # Keep each element with probability keep_prob ...
                mask = torch.bernoulli(
                    torch.full_like(task_vec, self.keep_prob)
                )
                # ... and rescale survivors so the expected update is unchanged.
                dropped = task_vec * mask / self.keep_prob
                merged[key] += dropped
        return merged
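DARE's key insight is that most elements in a task vector are redundant. By dropping 90% of elements and rescaling the rest by 1/(1 - 0.9) = 10, each task vector keeps its expected value while becoming far sparser, so vectors from different models overlap less. This is what makes merging many models (10+) practical, although the merged model should still be validated on every task.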
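The rescaling step is what keeps the update unbiased: each mask entry is 1 with probability p, so the expected value of m/p is 1 and the expected dropped vector equals the original. The sketch below checks this numerically on an all-ones task vector; the function name, vector size, and seed are assumptions for the example.

```python
import torch


def dare_drop_and_rescale(task_vec, keep_prob, generator=None):
    """Keep each element with probability keep_prob, then divide by keep_prob."""
    mask = torch.bernoulli(torch.full_like(task_vec, keep_prob), generator=generator)
    return task_vec * mask / keep_prob


gen = torch.Generator().manual_seed(0)
tau = torch.ones(100_000)  # toy task vector

sparse_tau = dare_drop_and_rescale(tau, keep_prob=0.1, generator=gen)

kept_fraction = (sparse_tau != 0).float().mean().item()  # close to 0.1
mean_after = sparse_tau.mean().item()                    # close to 1.0, the original mean
max_value = sparse_tau.max().item()                      # survivors are scaled to about 10
```

Roughly one element in ten survives, each scaled up tenfold, so the average update stays near its original value even though 90% of the entries are now exactly zero.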
SLERP (Spherical Linear Interpolation)
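SLERP
$$\text{SLERP}(\theta_1, \theta_2; t) = \frac{\sin((1 - t)\,\Omega)}{\sin \Omega}\,\theta_1 + \frac{\sin(t\,\Omega)}{\sin \Omega}\,\theta_2, \qquad \cos \Omega = \frac{\theta_1 \cdot \theta_2}{\lVert \theta_1 \rVert \, \lVert \theta_2 \rVert}$$
Here,
- $\theta_1, \theta_2$ = Two model weight vectors
- $t$ = Interpolation parameter (0 to 1)
- $\Omega$ = Angle between vectors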
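SLERP interpolates along the surface of the hypersphere rather than along the straight line between the two weight vectors. Linear interpolation between vectors that point in different directions shrinks the norm of the result; SLERP avoids that shrinkage, which often produces higher-quality merges.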
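A minimal SLERP sketch for a single pair of tensors is shown below. Applying it tensor by tensor across two state dicts is one common way to use it for merging, but that choice, the function name, and the fallback threshold are assumptions of this example.

```python
import torch


def slerp(t, v0, v1, eps=1e-8):
    """Spherical linear interpolation between two tensors of the same shape."""
    a, b = v0.flatten().float(), v1.flatten().float()

    # Angle between the two vectors.
    cos_omega = torch.dot(a, b) / (a.norm() * b.norm()).clamp(min=eps)
    omega = torch.acos(cos_omega.clamp(-1.0, 1.0))
    sin_omega = torch.sin(omega)

    if sin_omega.abs() < 1e-6:
        # Nearly parallel vectors: the formula is ill-conditioned, use plain lerp.
        return (1 - t) * v0 + t * v1

    w0 = torch.sin((1 - t) * omega) / sin_omega
    w1 = torch.sin(t * omega) / sin_omega
    return (w0 * a + w1 * b).reshape(v0.shape)


v0 = torch.tensor([1.0, 0.0])
v1 = torch.tensor([0.0, 1.0])

mid_slerp = slerp(0.5, v0, v1)  # approximately [0.7071, 0.7071], norm 1
mid_lerp = 0.5 * v0 + 0.5 * v1  # [0.5, 0.5], norm about 0.7071
```

For these two unit vectors the linear midpoint loses about 29% of the norm, while the spherical midpoint stays on the unit circle.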
Comparison of Methods
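| Method | Complexity | Interference Handling | Quality |
|---|---|---|---|
| Uniform Soup | O(n·p) | None | Good |
| Task Arithmetic | O(n·p) | Scaling factor | Good |
| TIES-Merging | O(n·p) plus top-k selection | Sign consensus | Excellent |
| DARE | O(n·p) plus mask sampling | Random dropping | Excellent |
| SLERP | O(p) | Pairwise only | Good |
n is the number of models and p the parameter count. Every method reads each parameter of each model, so all are roughly linear in n·p. TIES and DARE cost more in practice because they add per-parameter work (magnitude selection and sign election for TIES, mask sampling for DARE), but they handle interference much better than simple averaging. The Quality column is a rough guide; actual results depend on the models and tasks.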
Practical Guidelines
When to Use Each Method
| Scenario | Recommended Method |
|---|---|
| Models fine-tuned from the same pre-trained checkpoint | Uniform Soup |
| Different tasks, same base | Task Arithmetic |
| Many models (10+) | DARE |
| Conflicting tasks | TIES-Merging |
| Two models only | SLERP |
| Need best quality | TIES + DARE |
Example Workflow
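def merge_pipeline(base_model, specialized_models, evaluate_fn=None):
    """Build candidate merges and return the best one.

    `evaluate_fn(state_dict) -> float` is a caller-supplied validation scorer
    (higher is better). Without it, the TIES merge is returned by default.
    """
    candidates = {}
    # Step 1: Uniform soup as a cheap baseline
    candidates["uniform_soup"] = ModelSoupMerger(specialized_models).merge()
    # (Greedy soup also needs a validation scorer; ModelSoupMerger does not implement it.)
    # Step 2: For many models, try DARE
    if len(specialized_models) > 5:
        dare_merger = DAREMerger(base_model, specialized_models, drop_rate=0.9)
        candidates["dare"] = dare_merger.merge(n_samples=10)
    # Step 3: For conflicting tasks, try TIES
    ties_merger = TIESMerger(base_model, specialized_models, top_k=0.2)
    candidates["ties"] = ties_merger.merge()
    # Step 4: With a validation scorer, keep whichever candidate scores highest
    if evaluate_fn is None:
        return candidates["ties"]  # Usually best quality
    best_name = max(candidates, key=lambda name: evaluate_fn(candidates[name]))
    return candidates[best_name]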
Practice Exercises
- Conceptual: Explain why fine-tuned models from the same pre-trained initialization can be averaged successfully. What assumption about the loss landscape makes this possible?
- Mathematical: For 5 models each fine-tuned on different tasks, compute the number of parameters that need to be stored for TIES-Merging vs DARE merging.
- Practical: Implement model soups by averaging weights from 3 LoRA-fine-tuned models and measure the performance on all three tasks.
- Research: Compare TIES-Merging and DARE on merging 10 task-specific models. Which method better handles conflicting task requirements?
Key Takeaways:
- Model soups average weights; effective when models share pre-trained initialization
- Task arithmetic treats fine-tuning as vectors that can be added or negated
- TIES-Merging resolves interference through sign consensus and disjoint merge
- DARE randomly drops most task vector elements (90% in the example here) and rescales the rest to preserve the expected update
- SLERP preserves weight norms better than linear averaging when merging two models
- For conflicting tasks, TIES or DARE usually outperforms simple averaging
- DARE makes merging many models (10+) practical, though quality should still be verified on each task
What to Learn Next
-> LoRA and PEFT: Efficient fine-tuning using low-rank adaptation.
-> Knowledge Distillation for LLMs: Training smaller models from larger teachers.
-> Low-Rank Factorization: SVD decomposition and weight sharing techniques.
-> Quantization Techniques Deep Dive: GPTQ, AWQ, GGUF, and INT4/INT8 methods.
-> Fine-Tuning LLMs: Customizing language models for specific tasks.
-> Mixture of Experts: Sparse architectures that scale efficiently.