Mixture of Experts

[Figure: Mixture of Experts architecture. Input tokens are embedded and passed to a router/gate, which computes weights and selects the top-K expert indices. The experts pool (Expert 1 through Expert 6) is made up of FFN layers, and only K experts are active per token. The output is the weighted sum of the selected experts' results.]

What is Mixture of Experts?

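A Mixture of Experts (MoE) architecture replaces a single feed-forward block with several expert subnetworks and a gating mechanism (router) that sends each token to only a small subset of them. Because only the selected experts run for a given token, the total parameter count can grow much faster than the compute spent per token, which is what makes MoE an efficient way to scale.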
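To make the routing idea concrete, here is a minimal sketch for a single token in plain Python (no PyTorch, so the arithmetic stays visible). The names `route_token`, `gate_logits`, and `experts`, and all the numbers, are made up for illustration; the toy experts are simple scaling functions rather than real FFN layers.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(x, gate_logits, experts, top_k=2):
    """Return (chosen expert indices, combined output) for one token.

    gate_logits holds one hypothetical router logit per expert;
    experts is a list of callables standing in for FFN layers.
    """
    scores = softmax(gate_logits)
    # Indices of the top_k highest gate scores
    chosen = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    # Only the chosen experts are called; the others do no work for this token
    combined = sum(scores[i] * experts[i](x) for i in chosen)
    return chosen, combined

# Toy experts: expert i multiplies its input by (i + 1)
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
gate_logits = [2.0, 0.5, 1.0, -1.0]  # made-up router logits for 4 experts

chosen, combined = route_token(10.0, gate_logits, experts)
print(chosen, round(combined, 3))
```

With these logits the router picks experts 0 and 2, and the output is their gate-weighted sum; experts 1 and 3 are never evaluated.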

Implementation

import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, d_model, num_experts, top_k=2):
        super().__init__()
        self.num_experts = num_experts
        self.top_k = top_k

        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(d_model, d_model * 4),
                nn.ReLU(),
                nn.Linear(d_model * 4, d_model)
            )
            for _ in range(num_experts)
        ])

        self.gate = nn.Linear(d_model, num_experts)

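    def forward(self, x):
        # x: (batch_size, seq_len, d_model)
        batch_size, seq_len, d_model = x.shape
        # Flatten to (num_tokens, d_model); reshape also handles non-contiguous input
        x_flat = x.reshape(-1, d_model)

        # Compute gate scores: a softmax over all experts for each token
        gate_scores = F.softmax(self.gate(x_flat), dim=-1)

        # Select top-k experts per token
        # (the k scores are used as-is, not renormalized to sum to 1)
        top_k_scores, top_k_indices = torch.topk(gate_scores, self.top_k, dim=-1)

        # Process through experts: each expert sees only the tokens routed to it
        output = torch.zeros_like(x_flat)
        for i, expert in enumerate(self.experts):
            # (num_tokens, top_k) boolean map of where expert i was selected
            selected = top_k_indices == i
            mask = selected.any(dim=-1)
            if mask.any():
                expert_output = expert(x_flat[mask])
                # topk indices are distinct per token, so each row of `selected`
                # has at most one True and the weights line up with x_flat[mask]
                weights = top_k_scores[selected].unsqueeze(-1)
                output[mask] += expert_output * weights

        return output.view(batch_size, seq_len, d_model)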
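As a rough illustration of why this saves compute, the sketch below counts total versus per-token active parameters for the layer layout used in MoELayer (two linear layers per expert with a 4x hidden width, plus a linear gate). The helper name `moe_param_counts` and the example sizes are assumptions chosen for illustration.

```python
def moe_param_counts(d_model, num_experts, top_k=2):
    """Count total and per-token active parameters for the MoELayer layout."""
    hidden = 4 * d_model
    # Linear(d_model, hidden) + Linear(hidden, d_model), weights plus biases
    per_expert = (d_model * hidden + hidden) + (hidden * d_model + d_model)
    # Linear(d_model, num_experts) gate, weights plus biases
    gate = d_model * num_experts + num_experts
    total = num_experts * per_expert + gate
    # Each token runs the gate and only top_k of the experts
    active = top_k * per_expert + gate
    return total, active

# Hypothetical sizes, picked only to make the ratio easy to read
total, active = moe_param_counts(d_model=512, num_experts=8, top_k=2)
print(f"total={total:,} active={active:,} ratio={active / total:.2f}")
```

With 8 experts and top_k=2, each token touches roughly a quarter of the layer's parameters, while the full set is still available across tokens.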

Load Balancing

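def load_balancing_loss(gate_scores, num_experts):
    """Encourage balanced expert utilization."""
    # Flatten to (num_tokens, num_experts) so any leading dims are averaged too
    gate_scores = gate_scores.reshape(-1, num_experts)

    # Average probability for each expert
    expert_probs = gate_scores.mean(dim=0)

    # Target uniform distribution
    target = torch.ones_like(expert_probs) / num_experts

    # KL(target || expert_probs). F.kl_div expects log-probabilities as input.
    # reduction='sum' gives the KL of this single distribution; 'batchmean'
    # would divide by num_experts here. The clamp avoids log(0).
    loss = F.kl_div(expert_probs.clamp_min(1e-9).log(), target, reduction='sum')
    return loss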
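To see how a KL-to-uniform penalty of the kind used in load_balancing_loss behaves, here is a small stdlib-only sketch. It computes the divergence between a uniform target and a vector of average expert probabilities; the helper name `kl_to_uniform` and the two example distributions are made up for illustration, and the values are unscaled sums, so they may differ from the PyTorch function by a constant factor depending on the reduction used.

```python
import math

def kl_to_uniform(expert_probs):
    """KL(uniform || expert_probs) for one probability vector."""
    n = len(expert_probs)
    u = 1.0 / n
    eps = 1e-9  # guards against log(0) for experts that get no probability
    return sum(u * (math.log(u) - math.log(max(p, eps))) for p in expert_probs)

balanced = [0.25, 0.25, 0.25, 0.25]  # every expert used equally
skewed = [0.85, 0.05, 0.05, 0.05]    # one expert dominates

loss_balanced = kl_to_uniform(balanced)
loss_skewed = kl_to_uniform(skewed)
print(loss_balanced, loss_skewed)
```

The balanced case gives zero penalty, and the penalty grows as routing collapses onto one expert, which is the pressure the auxiliary loss adds during training.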

MoE Models

Model               Experts   Active   Parameters
Mixtral 8x7B        8         2        47B
Switch Transformer  128       1-2      Various
GShard              2048      2        600B

Summary

MoE enables efficient model scaling by activating only a subset of parameters per input. This allows larger models with manageable compute costs.

Next: We'll explore state space models.
