
Model Merging and Fusion

Model Merging and Fusion — Combining Knowledge Without Training

What if you could combine the strengths of multiple specialized models into one generalist model without additional training? Model merging achieves this by averaging, interpolating, or strategically combining model weights.

  • Model Soups — Averaging weights from fine-tuned models
  • TIES-Merging — Resolving interference through trim, elect, and disjoint merge
  • DARE — Drop and rescale for massive model merging
  • Task Arithmetic — Treating fine-tuning as task vectors

The whole can be greater than the sum of its parts if you know how to combine them.

When you fine-tune a base model on different tasks, each specialized model encodes task-specific knowledge in its weights. Model merging combines these specialized models into a single model that aims to inherit the capabilities of all of them, without requiring training data or gradient-based training.

Definition: Model Merging

Model merging is the process of combining the parameters of multiple models into a single model. The goal is to create a model that performs well on all the tasks the source models were specialized for.

Model Soups

Linear Interpolation

Definition: Model Soups

Model soups (Wortsman et al., 2022) combine models by averaging their weights. The key insight is that fine-tuned models from the same pre-trained initialization often lie in the same loss basin, making weight averaging effective.

Weight Averaging

$$\theta_{\text{merged}} = \sum_{i=1}^{n} w_i \cdot \theta_i$$

Here,

  • $\theta_{\text{merged}}$ = Merged model weights
  • $w_i$ = Weight for model i
  • $\theta_i$ = Weights of model i
  • $n$ = Number of models to merge

import copy
from typing import Dict, List, Optional

import torch


class ModelSoupMerger:
    """Merge models by weight averaging."""

    def __init__(self, models: List[Dict[str, torch.Tensor]],
                 weights: Optional[List[float]] = None):
        self.models = models
        self.n_models = len(models)

        if weights is None:
            # Uniform soup: every model contributes equally
            self.weights = [1.0 / self.n_models] * self.n_models
        else:
            # Normalize so the weights sum to 1
            total = sum(weights)
            self.weights = [w / total for w in weights]

    def merge(self) -> Dict[str, torch.Tensor]:
        """Perform linear weight averaging over state dicts with identical keys."""
        # Accumulate in float32 to limit precision loss with fp16/bf16 checkpoints
        merged = {key: torch.zeros_like(v, dtype=torch.float32)
                  for key, v in self.models[0].items()}

        for model, weight in zip(self.models, self.weights):
            for key in merged:
                merged[key] += weight * model[key].float()

        return merged

Types of Model Soups

| Type | Description | When to Use |
|---|---|---|
| Uniform Soup | Equal weights for all models | Models are equally good |
| Greedy Soup | Add models only if they improve validation | Have validation data |
| Learned Soup | Optimize weights on validation set | Have enough data |

Greedy soup iteratively adds models to the merge if they improve performance on a validation set. This often outperforms uniform averaging because not all fine-tuned models contribute positively.
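
The greedy procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference implementation: `evaluate` is a hypothetical callback that returns a validation score (higher is better), `toy_score` is an invented stand-in for real validation, and plain Python floats stand in for weight tensors so the example runs without any dependencies.

```python
from typing import Callable, Dict, List

StateDict = Dict[str, float]  # floats stand in for tensors in this toy sketch


def average(models: List[StateDict]) -> StateDict:
    """Uniform average of state dicts that share the same keys."""
    return {k: sum(m[k] for m in models) / len(models) for k in models[0]}


def greedy_soup(models: List[StateDict],
                evaluate: Callable[[StateDict], float]) -> StateDict:
    """Add models one at a time, keeping each only if the averaged soup
    scores at least as well on validation as the current soup."""
    # Visit candidates from best to worst individual validation score.
    ranked = sorted(models, key=evaluate, reverse=True)
    soup = [ranked[0]]
    best = evaluate(average(soup))
    for candidate in ranked[1:]:
        score = evaluate(average(soup + [candidate]))
        if score >= best:  # keep the candidate only if the soup does not get worse
            soup.append(candidate)
            best = score
    return average(soup)


# Hypothetical validation score: closeness of weight "w" to a target of 1.0.
def toy_score(model: StateDict) -> float:
    return -abs(model["w"] - 1.0)


candidates = [{"w": 0.9}, {"w": 1.2}, {"w": 3.0}]
result = greedy_soup(candidates, toy_score)
print(result)  # the outlier {"w": 3.0} is rejected; the other two are averaged
```

With real checkpoints, `average` would operate on tensors and `evaluate` would load the averaged weights into a model and run it on held-out data, so each greedy step costs one validation pass.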

Task Arithmetic

Task Vectors

Definition: Task Vector

A task vector is the difference between fine-tuned and pre-trained weights: τ = θ_finetuned - θ_pretrained. This vector captures the knowledge needed for a specific task.

Task Arithmetic

$$\theta_{\text{merged}} = \theta_{\text{base}} + \lambda \sum_{i=1}^{n} \tau_i$$

Here,

  • $\theta_{\text{base}}$ = Base pre-trained model weights
  • $\tau_i$ = Task vector for task i
  • $\lambda$ = Scaling factor
  • $n$ = Number of tasks

class TaskArithmetic:
    """Combine models using task vectors."""
    
    def __init__(self, base_model, fine_tuned_models):
        self.base_model = base_model
        self.fine_tuned_models = fine_tuned_models
    
    def compute_task_vectors(self):
        """Compute task vector for each fine-tuned model."""
        task_vectors = []
        for model in self.fine_tuned_models:
            tv = {key: model[key] - self.base_model[key] 
                  for key in self.base_model}
            task_vectors.append(tv)
        return task_vectors
    
    def merge_with_task_arithmetic(self, scaling_factor=1.0):
        """Merge using task arithmetic."""
        merged = copy.deepcopy(self.base_model)
        for tv in self.compute_task_vectors():
            for key in merged:
                merged[key] += scaling_factor * tv[key]
        return merged
    
    def merge_with_negation(self, task_to_negate, scaling_factor=-1.0):
        """Remove a task's knowledge by negating its task vector."""
        merged = copy.deepcopy(self.base_model)
        task_vectors = self.compute_task_vectors()
        
        for i, tv in enumerate(task_vectors):
            factor = scaling_factor if i == task_to_negate else 1.0
            for key in merged:
                merged[key] += factor * tv[key]
        return merged

Task arithmetic enables model algebra: you can add tasks (combine capabilities), negate tasks (remove unwanted behavior), or scale tasks (adjust importance).
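
A toy example makes the three operations concrete. It is a sketch only: the model names and weight values are invented for illustration, the helper names `task_vector` and `apply_task_vectors` are hypothetical, and plain floats stand in for weight tensors.

```python
from typing import Dict, List

Weights = Dict[str, float]  # floats stand in for tensors


def task_vector(finetuned: Weights, base: Weights) -> Weights:
    """tau = theta_finetuned - theta_pretrained, computed key by key."""
    return {k: finetuned[k] - base[k] for k in base}


def apply_task_vectors(base: Weights, vectors: List[Weights],
                       coeffs: List[float]) -> Weights:
    """theta_base + sum_i coeff_i * tau_i (one coefficient per task vector)."""
    merged = dict(base)
    for tv, c in zip(vectors, coeffs):
        for k in merged:
            merged[k] += c * tv[k]
    return merged


base = {"a": 1.0, "b": 2.0}
math_model = {"a": 1.5, "b": 2.0}  # hypothetical: fine-tuning moved weight "a"
code_model = {"a": 1.0, "b": 1.0}  # hypothetical: fine-tuning moved weight "b"

tv_math = task_vector(math_model, base)  # {"a": 0.5, "b": 0.0}
tv_code = task_vector(code_model, base)  # {"a": 0.0, "b": -1.0}

added = apply_task_vectors(base, [tv_math, tv_code], [1.0, 1.0])   # add both tasks
negated = apply_task_vectors(base, [tv_math], [-1.0])              # remove the math task
scaled = apply_task_vectors(base, [tv_math, tv_code], [1.0, 0.5])  # down-weight the code task
```

Because the two toy task vectors touch different weights, adding them recovers both fine-tuned changes exactly. Real task vectors overlap, which is where the interference discussed next comes from.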

TIES-Merging

The Interference Problem

Definition: Weight Interference

Weight interference occurs when different models modify the same weights in opposite directions, causing conflicts during merging. TIES-Merging addresses this by identifying and resolving these conflicts.
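
A single-weight toy example shows the effect. The update values below are invented, and `sign_aware_average` is a hypothetical helper that only illustrates the idea of ignoring updates that disagree with the dominant direction; it is not a full merging method.

```python
from typing import List


def sign(x: float) -> int:
    """Return -1, 0, or 1."""
    return (x > 0) - (x < 0)


def naive_average(updates: List[float]) -> float:
    """Plain mean: opposing updates cancel each other."""
    return sum(updates) / len(updates)


def sign_aware_average(updates: List[float]) -> float:
    """Average only the updates whose sign matches the sign of the total."""
    elected = sign(sum(updates))
    agreeing = [u for u in updates if u != 0 and sign(u) == elected]
    return sum(agreeing) / len(agreeing) if agreeing else 0.0


# Updates proposed by three hypothetical fine-tuned models for one weight.
updates = [0.8, 0.6, -0.9]

naive = naive_average(updates)          # the -0.9 update cancels most of the other two
resolved = sign_aware_average(updates)  # averages only 0.8 and 0.6
```

Under plain averaging the merged update is much smaller than what any single model asked for, so every task loses part of its learned change. Dropping the dissenting update keeps the majority direction at close to full strength.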

TIES Algorithm

Definition: TIES-Merging

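TIES-Merging (Yadav et al., 2023) resolves interference in three steps: (1) Trim: keep only the largest-magnitude updates in each task vector and zero the rest; (2) Elect: choose a single sign per parameter by aggregating across models; (3) Disjoint merge: for each parameter, average only the updates whose sign matches the elected sign.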

class TIESMerger:
    """TIES-Merging implementation."""
    
    def __init__(self, base_model, fine_tuned_models, top_k=0.2):
        self.base_model = base_model
        self.fine_tuned_models = fine_tuned_models
        self.top_k = top_k
    
    def merge(self):
        """Perform TIES merging."""
        task_vectors = self._compute_task_vectors()
        trimmed = self._trim_updates(task_vectors)
        consensus = self._elect_signs(trimmed)
        return self._disjoint_merge(trimmed, consensus)
    
    def _compute_task_vectors(self):
        """Compute task vectors for each model."""
        return [{key: model[key] - self.base_model[key] 
                 for key in self.base_model} 
                for model in self.fine_tuned_models]
    
    def _trim_updates(self, task_vectors):
        """Keep only the top-k fraction of updates by magnitude (per tensor)."""
        trimmed = []
        for tv in task_vectors:
            trimmed_tv = {}
            for key in tv:
                flat = tv[key].flatten()
                # Keep at least one element so topk never returns an empty result
                n_keep = max(1, int(flat.numel() * self.top_k))
                threshold = flat.abs().topk(n_keep).values[-1]
                mask = tv[key].abs() >= threshold
                trimmed_tv[key] = tv[key] * mask.to(tv[key].dtype)
            trimmed.append(trimmed_tv)
        return trimmed
    
    def _elect_signs(self, trimmed_vectors):
        """Elect a sign per parameter: the sign of the summed trimmed updates,
        so the direction with the larger total magnitude wins."""
        consensus = {}
        for key in trimmed_vectors[0]:
            sum_tv = torch.zeros_like(trimmed_vectors[0][key])
            for tv in trimmed_vectors:
                sum_tv += tv[key]
            consensus[key] = sum_tv.sign()
        return consensus
    
    def _disjoint_merge(self, trimmed_vectors, consensus_signs):
        """Per parameter, average the updates whose sign matches the elected sign."""
        merged = copy.deepcopy(self.base_model)

        for key in merged:
            sum_values = torch.zeros_like(merged[key])
            count = torch.zeros_like(merged[key])

            for tv in trimmed_vectors:
                # A model contributes only where its update is non-zero and
                # agrees with the elected sign; it need not agree with every
                # other model, otherwise almost nothing survives trimming
                agrees = (tv[key] != 0) & (tv[key].sign() == consensus_signs[key])
                agrees = agrees.to(sum_values.dtype)
                sum_values += tv[key] * agrees
                count += agrees

            # Avoid division by zero where no model contributes
            count = count.clamp(min=1)
            merged[key] += sum_values / count

        return merged

DARE (Drop And REscale)

Theory

Definition: DARE

DARE (Yu et al., 2024) randomly drops elements from task vectors and rescales the remaining ones. The resulting sparse task vectors overlap less, which reduces interference when several models are merged.

DARE Drop and Rescale

$$\tau_i' = \frac{\tau_i \odot m_i}{p}$$

Here,

  • $\tau_i$ = Original task vector
  • $m_i$ = Binary mask (Bernoulli(p))
  • $p$ = Keep probability
  • $\tau_i'$ = Dropped and rescaled task vector

class DAREMerger:
    """DARE (Drop and REscale) merging."""
    
    def __init__(self, base_model, fine_tuned_models, drop_rate=0.9):
        self.base_model = base_model
        self.fine_tuned_models = fine_tuned_models
        self.keep_prob = 1 - drop_rate
    
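    def merge(self, n_samples=10):
        """Merge with DARE, averaging n_samples random masks for stability.

        Note: averaging many masks moves the result toward plain task
        arithmetic (the undropped sum); use n_samples=1 for a single mask.
        """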
        merged_models = [self._single_merge() for _ in range(n_samples)]
        
        final = copy.deepcopy(merged_models[0])
        for key in final:
            for m in merged_models[1:]:
                final[key] += m[key]
            final[key] /= n_samples
        return final
    
    def _single_merge(self):
        """Single DARE merge with random dropping."""
        merged = copy.deepcopy(self.base_model)
        
        for model in self.fine_tuned_models:
            for key in merged:
                task_vec = model[key] - self.base_model[key]
                mask = torch.bernoulli(
                    torch.full_like(task_vec, self.keep_prob)
                )
                dropped = task_vec * mask / self.keep_prob
                merged[key] += dropped
        
        return merged

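DARE's key observation is that task vectors are highly redundant: most elements can be dropped with little effect, as long as the surviving elements are rescaled by 1/p so that the expected task vector stays the same. With a 90% drop rate, the task vectors of different models overlap far less, which reduces interference and can make merging many models (10+) more practical.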
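
The rescaling step matters because dividing the surviving elements by p leaves the expected value of each element unchanged. The toy check below illustrates this on a constant vector. It is a sketch under stated assumptions: `dare_drop_and_rescale` is a hypothetical helper, a Python list stands in for a flattened task vector, and the values are invented.

```python
import random
from typing import List


def dare_drop_and_rescale(task_vec: List[float], keep_prob: float,
                          rng: random.Random) -> List[float]:
    """Keep each element with probability keep_prob and divide the
    survivors by keep_prob; dropped elements become 0."""
    return [v / keep_prob if rng.random() < keep_prob else 0.0
            for v in task_vec]


rng = random.Random(0)       # fixed seed so the run is repeatable
task_vec = [0.01] * 100_000  # toy task vector with constant entries
sparse = dare_drop_and_rescale(task_vec, keep_prob=0.1, rng=rng)

kept_fraction = sum(1 for v in sparse if v != 0.0) / len(sparse)
mean_before = sum(task_vec) / len(task_vec)
mean_after = sum(sparse) / len(sparse)

print(f"kept {kept_fraction:.3f} of elements")
print(f"mean before {mean_before:.5f}, mean after {mean_after:.5f}")
```

Roughly 10% of the elements survive, each ten times larger than before, so the mean of the vector stays close to its original value while 90% of the positions are free for other models' updates.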

SLERP (Spherical Linear Interpolation)

SLERP

$$\theta(t) = \frac{\sin((1-t)\Omega)}{\sin(\Omega)}\theta_0 + \frac{\sin(t\Omega)}{\sin(\Omega)}\theta_1$$

Here,

  • $\theta_0, \theta_1$ = Two model weight vectors
  • $t$ = Interpolation parameter (0 to 1)
  • $\Omega$ = Angle between vectors

SLERP interpolates along the surface of the hypersphere rather than in flat space. This preserves the magnitude of weights better than linear interpolation, often producing higher quality merges.
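
The formula can be sketched in a few lines. This is an illustrative version that assumes both inputs are non-zero, flattened weight vectors of equal length. The fallback to linear interpolation when the angle is close to 0 or π is a common practical guard added here, not part of the formula above, and Python lists stand in for tensors.

```python
import math
from typing import List


def slerp(theta0: List[float], theta1: List[float], t: float,
          eps: float = 1e-8) -> List[float]:
    """Spherical linear interpolation between two flattened weight vectors."""
    norm0 = math.sqrt(sum(x * x for x in theta0))
    norm1 = math.sqrt(sum(x * x for x in theta1))
    # Angle between the vectors, from the normalized dot product.
    cos_omega = sum(a * b for a, b in zip(theta0, theta1)) / (norm0 * norm1)
    cos_omega = max(-1.0, min(1.0, cos_omega))  # guard against rounding error
    omega = math.acos(cos_omega)
    if math.sin(omega) < eps:
        # Nearly parallel or anti-parallel: sin(omega) is ~0, so the formula
        # is ill-conditioned; fall back to linear interpolation.
        return [(1 - t) * a + t * b for a, b in zip(theta0, theta1)]
    w0 = math.sin((1 - t) * omega) / math.sin(omega)
    w1 = math.sin(t * omega) / math.sin(omega)
    return [w0 * a + w1 * b for a, b in zip(theta0, theta1)]


theta0 = [1.0, 0.0]
theta1 = [0.0, 1.0]
midpoint = slerp(theta0, theta1, t=0.5)
linear_midpoint = [0.5 * a + 0.5 * b for a, b in zip(theta0, theta1)]

slerp_norm = math.sqrt(sum(x * x for x in midpoint))        # stays on the unit circle
lerp_norm = math.sqrt(sum(x * x for x in linear_midpoint))  # shrinks toward the origin
```

For these two orthogonal unit vectors, the SLERP midpoint keeps unit length while the linear midpoint is shorter than either endpoint. In practice SLERP is usually applied per tensor rather than to the whole model as one vector.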

Comparison of Methods

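| Method | Complexity | Interference Handling | Quality |
|---|---|---|---|
| Uniform Soup | O(n·p) | None | Good |
| Task Arithmetic | O(n·p) | Scaling factor | Good |
| TIES-Merging | O(n·p) plus top-k selection | Sign consensus | Excellent |
| DARE | O(n·p) | Random dropping | Excellent |
| SLERP | O(p) | Pairwise only | Good |

Here n is the number of models and p is the parameter count. Every method makes a pass over each parameter of each model, so all of them scale roughly linearly in n·p (SLERP handles two models, so it is linear in p). TIES and DARE do extra per-parameter work on top of that pass (trimming and sign election for TIES, random mask sampling for DARE), and in return they handle interference better than simple averaging. The quality ratings are rough and depend on the task.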

Practical Guidelines

When to Use Each Method

| Scenario | Recommended Method |
|---|---|
| Same task, same pre-trained base (e.g. different hyperparameters) | Uniform Soup |
| Different tasks, same base | Task Arithmetic |
| Many models (10+) | DARE |
| Conflicting tasks | TIES-Merging |
| Two models only | SLERP |
| Need best quality | TIES + DARE |

Example Workflow

def merge_pipeline(base_model, specialized_models, val_data=None):
    """Complete model merging workflow."""

    # Step 1: Try uniform soup first as a cheap baseline
    merger = ModelSoupMerger(specialized_models)
    uniform_merge = merger.merge()

    # Step 2: If validation data is available, a greedy soup is worth trying.
    # ModelSoupMerger above does not implement one (it would need an
    # evaluation function for val_data), so this step is left as a placeholder.

    # Step 3: For many models, try DARE
    if len(specialized_models) > 5:
        dare_merger = DAREMerger(base_model, specialized_models, drop_rate=0.9)
        dare_merge = dare_merger.merge(n_samples=10)

    # Step 4: For conflicting tasks, try TIES
    ties_merger = TIESMerger(base_model, specialized_models, top_k=0.2)
    ties_merge = ties_merger.merge()

    # TIES is returned as the default; when val_data is available, compare
    # all candidates on it before choosing one
    return ties_merge

Practice Exercises

  1. Conceptual: Explain why fine-tuned models from the same pre-trained initialization can be averaged successfully. What assumption about the loss landscape makes this possible?

  2. Mathematical: For 5 models each fine-tuned on different tasks, compute the number of parameters that need to be stored for TIES-Merging vs DARE merging.

  3. Practical: Implement model soups by averaging weights from 3 LoRA-fine-tuned models and measure the performance on all three tasks.

  4. Research: Compare TIES-Merging and DARE on merging 10 task-specific models. Which method better handles conflicting task requirements?

Key Takeaways:

  • Model soups average weights; effective when models share pre-trained initialization
  • Task arithmetic treats fine-tuning as vectors that can be added or negated
  • TIES-Merging resolves interference through sign consensus and disjoint merge
  • DARE randomly drops most task-vector elements (90% in the code above) and rescales the rest by 1/p
  • SLERP often interpolates better than linear averaging when merging two models
  • For conflicting tasks, TIES or DARE typically outperforms simple averaging
  • DARE's sparse task vectors make merging many models (10+) more practical

What to Learn Next

-> LoRA and PEFT: Efficient fine-tuning using low-rank adaptation.

-> Knowledge Distillation for LLMs: Training smaller models from larger teachers.

-> Low-Rank Factorization: SVD decomposition and weight sharing techniques.

-> Quantization Techniques Deep Dive: GPTQ, AWQ, GGUF, and INT4/INT8 methods.

-> Fine-Tuning LLMs: Customizing language models for specific tasks.

-> Mixture of Experts: Sparse architectures that scale efficiently.
