Open Source LLM Ecosystem
The open-source LLM ecosystem has rapidly expanded, offering powerful alternatives to proprietary models. This tutorial covers the major model families, tools, and strategies for choosing and deploying open-source LLMs.
Major Open Source Model Families
LLaMA Family (Meta)
| Model | Params | Context | License | Key Feature |
|---|---|---|---|---|
| LLaMA 1 | 7B-65B | 2K | Non-commercial | First widely available LLM |
| LLaMA 2 | 7B-70B | 4K | LLaMA 2 Community | Commercially available |
| LLaMA 3 | 8B-405B | 8K-128K | LLaMA 3 Community | Improved multilingual |
from transformers import AutoModelForCausalLM, AutoTokenizer

# Loading LLaMA models
def load_llama(model_name, device_map="auto"):
    """Load a causal LM and its tokenizer from the Hugging Face Hub."""
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype="auto",     # use the dtype stored in the checkpoint
        device_map=device_map,  # spread layers across the available devices
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    return model, tokenizer

# Example usage (gated repository: accept Meta's license on the Hub first)
model, tokenizer = load_llama("meta-llama/Llama-3.1-8B")
Mistral Family
| Model | Params | Context | License | Key Feature |
|---|---|---|---|---|
| Mistral 7B | 7.3B | 32K | Apache 2.0 | Sliding window attention |
| Mixtral 8x7B | 46.7B (12.9B active) | 32K | Apache 2.0 | Sparse MoE |
| Mistral Large | Unknown | 128K | Proprietary | Near GPT-4 performance |
# Mixtral MoE Architecture
from transformers import MixtralForCausalLM

model = MixtralForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-Instruct-v0.1",
    torch_dtype="auto",
    device_map="auto",
)
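The gap between Mixtral's total and active parameter counts in the table follows from sparse routing, and a little arithmetic makes it concrete. The sketch below assumes the commonly described layout of 8 experts per MoE layer with 2 experts selected per token, and lumps everything outside the experts (attention, embeddings, routers) into one shared bucket. The function name and this two-bucket split are illustrative, not taken from the model code.

```python
def split_moe_params(total_b, active_b, num_experts=8, experts_per_token=2):
    """Estimate shared and per-expert parameter counts (in billions).

    Solves the two-equation system:
        total  = shared + num_experts       * expert
        active = shared + experts_per_token * expert
    """
    expert = (total_b - active_b) / (num_experts - experts_per_token)
    shared = active_b - experts_per_token * expert
    return shared, expert


# Figures from the table: 46.7B total, 12.9B active per token.
shared_b, expert_b = split_moe_params(total_b=46.7, active_b=12.9)
print(f"shared ~ {shared_b:.2f}B, per-expert ~ {expert_b:.2f}B")
```

Under these assumptions each expert accounts for roughly 5.6B parameters and about 1.6B are shared. All 46.7B must still fit in memory, even though only 12.9B take part in any single forward pass.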
Other Notable Families
| Family | Organization | Sizes | License | Specialty |
|---|---|---|---|---|
| Falcon | TII | 1.3B-180B | Apache 2.0 (7B, 40B); TII license (180B) | Multilingual |
| Qwen | Alibaba | 0.5B-110B | Apache 2.0 (most sizes) | Chinese + English |
| Yi | 01.AI | 6B-34B | Yi License | Long context |
| Phi | Microsoft | 2.7B-14B | MIT | Small but capable |
| Gemma | Google | 2B-27B | Gemma License | Research focused |
Licensing Considerations
License Comparison
| License | Commercial Use | Modification | Distribution | Restrictions |
|---|---|---|---|---|
| MIT | Yes | Yes | Yes | None |
| Apache 2.0 | Yes | Yes | Yes | Patent grant |
| LLaMA 2 | Yes (up to 700M MAU) | Yes | Yes | Acceptable use policy |
| LLaMA 3 | Yes (up to 700M MAU) | Yes | Yes | Acceptable use policy |
| Yi License | Yes | Yes | Yes | Similar to LLaMA |
| CC BY-NC | No | Yes | Yes | Non-commercial only |
license_information = {
    "mit": {
        "commercial": True,
        "restrictions": [],
        # Gemma ships under its own Gemma License, so it is not listed here
        "examples": ["phi-2", "phi-3"]
    },
    "apache_2": {
        "commercial": True,
        "restrictions": ["patent grant"],
        "examples": ["falcon", "mistral-7b", "mixtral", "qwen"]
    },
    "llama_2_community": {
        "commercial": True,
        "restrictions": ["700M monthly active users limit", "acceptable use policy"],
        "examples": ["llama-2-7b", "llama-2-13b", "llama-2-70b"]
    },
    "llama_3_community": {
        "commercial": True,
        "restrictions": ["700M monthly active users limit", "acceptable use policy"],
        "examples": ["llama-3-8b", "llama-3-70b", "llama-3.1-405b"]
    }
}
When choosing an open-source model, licensing matters as much as performance. Apache 2.0 and MIT impose the fewest restrictions. The LLaMA community licenses require compliance with Meta's acceptable use policy, and a company whose products exceed 700M monthly active users must obtain a separate license from Meta before using the models commercially.
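The rules in this section can be encoded as a small lookup for screening candidate models. The following is a minimal sketch under simplifying assumptions: the `LicenseTerms` class, the `LICENSES` table, and `allows_commercial_use` are hypothetical names, and the terms are reduced to a commercial-use flag plus an optional user-count threshold.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LicenseTerms:
    commercial: bool
    mau_limit: Optional[int] = None  # None means no user-count threshold


# Hypothetical, simplified summary of the license comparison table.
LICENSES = {
    "mit": LicenseTerms(commercial=True),
    "apache_2": LicenseTerms(commercial=True),
    "llama_2_community": LicenseTerms(commercial=True, mau_limit=700_000_000),
    "llama_3_community": LicenseTerms(commercial=True, mau_limit=700_000_000),
    "cc_by_nc": LicenseTerms(commercial=False),
}


def allows_commercial_use(license_key: str, monthly_active_users: int) -> bool:
    """Return True if the simplified terms permit commercial use at this scale."""
    terms = LICENSES[license_key]
    if not terms.commercial:
        return False
    if terms.mau_limit is not None and monthly_active_users > terms.mau_limit:
        return False  # above the threshold, a separate license is required
    return True


print(allows_commercial_use("apache_2", 1_000_000_000))           # True
print(allows_commercial_use("llama_3_community", 50_000))         # True
print(allows_commercial_use("llama_3_community", 1_000_000_000))  # False
print(allows_commercial_use("cc_by_nc", 10))                      # False
```

A check like this only narrows the candidate list; the acceptable use policy and the full license text still need to be read for any model that passes.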
HuggingFace Ecosystem
Transformers
The core library for model loading and inference:
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    pipeline,
    BitsAndBytesConfig,
)

# Load with 4-bit quantization (requires the bitsandbytes package and a CUDA GPU)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # "auto" is not a valid compute dtype
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

# Quick inference
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
output = pipe("Explain quantum computing:", max_new_tokens=200)
print(output[0]["generated_text"])
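To see why 4-bit loading matters, it helps to estimate the weight footprint at different precisions. The helper below is a back-of-the-envelope sketch: `weight_memory_gb` is a hypothetical name, 8e9 is a round figure for an 8B model, and the estimate covers weights only (no activations, KV cache, or quantization metadata such as scales).

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate memory for model weights only, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Compare common precisions for a model with roughly 8 billion parameters.
for label, bits in [("fp16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{label:>5}: {weight_memory_gb(8e9, bits):.1f} GB")
```

This prints 16.0 GB, 8.0 GB, and 4.0 GB respectively. Real usage is somewhat higher than the 4-bit figure suggests, because some layers usually stay in higher precision and the quantization constants take space of their own.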
PEFT (Parameter-Efficient Fine-Tuning)
from peft import (
    LoraConfig,
    get_peft_model,
    TaskType,
    prepare_model_for_kbit_training,
)

# Prepare the quantized model for QLoRA
model = prepare_model_for_kbit_training(model)

# LoRA configuration
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

# Apply LoRA
peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()
# Example output (the reported total can vary with quantization):
# trainable params: 13,631,488 || all params: 8,048,558,080 || 0.17%
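The trainable-parameter figure can be reproduced by hand, which is a useful sanity check when changing `r` or `target_modules`. LoRA adds two low-rank matrices to each targeted linear layer, so the extra parameters per layer are `r * (d_in + d_out)`. The sketch below assumes the Llama 3.1 8B shapes (hidden size 4096, 32 layers, grouped-query attention with 8 key/value heads of dimension 128); `lora_param_count` is a hypothetical helper, not part of PEFT.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """LoRA adds A (r x d_in) and B (d_out x r) to one linear layer."""
    return r * (d_in + d_out)


# Assumed Llama 3.1 8B shapes: hidden size 4096; k_proj and v_proj project
# down to 8 KV heads * 128 dims = 1024; 32 transformer layers.
hidden, kv_dim, num_layers, r = 4096, 1024, 32, 16
projections = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, kv_dim),
    "v_proj": (hidden, kv_dim),
    "o_proj": (hidden, hidden),
}
per_layer = sum(lora_param_count(i, o, r) for i, o in projections.values())
total = per_layer * num_layers
print(f"{total:,}")  # 13,631,488
```

Doubling `r` doubles the count, and adding the MLP projections to `target_modules` raises it considerably because those layers are much wider.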
TRL (Transformer Reinforcement Learning)
from transformers import TrainingArguments
from trl import SFTTrainer, PPOTrainer, DPOTrainer
from datasets import load_dataset

dataset = load_dataset("tatsu-lab/alpaca", split="train")

# Supervised Fine-Tuning
# Argument names differ across TRL releases (newer ones move max_seq_length
# into SFTConfig and rename tokenizer); check the version you have installed.
sft_trainer = SFTTrainer(
    model=peft_model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    max_seq_length=512,
    args=TrainingArguments(
        output_dir="./sft_output",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        learning_rate=2e-4,
        fp16=True,
    ),
)
sft_trainer.train()
Datasets Library
from datasets import load_dataset, Dataset, DatasetDict

# Load existing datasets
alpaca = load_dataset("tatsu-lab/alpaca")
dolly = load_dataset("databricks/databricks-dolly-15k")

# Create custom dataset
custom_data = {
    "instruction": ["Summarize this text:", "Translate to French:"],
    "input": ["Long text here...", "Hello world"],
    "output": ["Summary here...", "Bonjour le monde"],
}
dataset = Dataset.from_dict(custom_data)

# Format for instruction tuning
def format_alpaca(example):
    # Include the "Input" section only when the example provides one
    if example["input"]:
        prompt = f"### Instruction:\n{example['instruction']}\n\n### Input:\n{example['input']}\n\n### Response:\n"
    else:
        prompt = f"### Instruction:\n{example['instruction']}\n\n### Response:\n"
    return {"text": prompt + example["output"]}

formatted = alpaca.map(format_alpaca)
Model Hubs and Quantized Models
Finding Models on HuggingFace
from huggingface_hub import HfApi

api = HfApi()

# Search for models, most downloaded first
models = api.list_models(
    search="llama",
    sort="downloads",
    direction=-1,
    filter="text-generation",
)

# Find quantized versions
quantized = api.list_models(
    search="llama-3-8b-GPTQ",
    sort="downloads",
    direction=-1,
)
Popular Quantized Model Sources
| Source | Quantization | Models | Quality |
|---|---|---|---|
| TheBloke | GPTQ, AWQ, GGUF | 500+ | High |
| NeuroMercenary | GPTQ, AWQ | 50+ | High |
| Bartowski | GGUF | 100+ | High |
| TechxGenus | GPTQ, EXL2 | 50+ | Medium-High |
Open Source vs Proprietary
Performance Comparison
| Capability | GPT-4 | LLaMA 3.1 405B | Mixtral 8x7B | LLaMA 3.1 8B |
|---|---|---|---|---|
| MMLU | 86.4% | 87.3% | 77.3% | 73.0% |
| HumanEval | 67.0% | 89.0% | 40.2% | 62.2% |
| GSM8K | 92.0% | 96.8% | 74.4% | 84.5% |
| Cost per 1M tokens | Per-token API fee | No per-token fee (self-host) | No per-token fee (self-host) | No per-token fee (self-host) |
| Latency (TTFT) | 500ms | Variable | Variable | Variable |
When to Choose Which
decision_framework = {
    "use_proprietary_when": [
        "Need best overall quality (GPT-4, Claude)",
        "Low-latency requirements without GPU infrastructure",
        "Rapid prototyping without deployment concerns",
        "Tasks requiring latest capabilities"
    ],
    "use_open_source_when": [
        "Data privacy requirements (healthcare, finance)",
        "High volume inference (cost sensitivity)",
        "Custom fine-tuning needed",
        "On-premise deployment required",
        "Full control over model behavior"
    ],
    "use_smallest_model_when": [
        "Edge deployment",
        "Latency-critical applications",
        "Limited compute budget",
        "Simple tasks (classification, extraction)"
    ]
}
Start with the smallest model that meets your quality requirements. A fine-tuned 8B model often outperforms a general-purpose 70B model for specific tasks, at a fraction of the cost.
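The cost-sensitivity criterion in the framework above can be made concrete with a break-even estimate: self-hosting has a roughly fixed monthly cost, while an API charges per token. The sketch below is illustrative only. `breakeven_tokens_per_month` is a hypothetical helper, and the prices are placeholder figures, not quotes from any provider.

```python
def breakeven_tokens_per_month(api_price_per_million: float,
                               selfhost_fixed_per_month: float) -> float:
    """Monthly token volume above which a fixed-cost self-hosted deployment
    is cheaper than paying per token.

    Ignores engineering time and assumes the hardware can actually serve
    the resulting volume.
    """
    return selfhost_fixed_per_month / api_price_per_million * 1_000_000


# Hypothetical figures for illustration only: $10 per 1M tokens via an API,
# $2,000 per month for a rented GPU server.
tokens = breakeven_tokens_per_month(api_price_per_million=10.0,
                                    selfhost_fixed_per_month=2000.0)
print(f"break-even at about {tokens / 1e6:.0f}M tokens per month")
```

With these placeholder numbers the break-even is about 200M tokens per month. Below that volume the API is cheaper; above it, self-hosting starts to pay off, and a smaller model that fits on cheaper hardware lowers the threshold further.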
Model Selection Guide
def select_model(
    task_type: str,
    quality_requirement: str,
    budget: str,
    privacy: bool
) -> str:
    # Privacy or a tight budget rules out hosted APIs: pick by quality tier
    if privacy or budget == "low":
        if quality_requirement == "high":
            return "meta-llama/Llama-3.1-70B or Qwen-2.5-72B"
        elif quality_requirement == "medium":
            return "mistralai/Mixtral-8x7B or Qwen-2.5-32B"
        else:
            return "meta-llama/Llama-3.1-8B or Qwen-2.5-7B"
    if quality_requirement == "highest":
        return "gpt-4o or claude-3.5-sonnet"
    if task_type == "code":
        return "codellama/CodeLlama-34b or deepseek-coder-33b"
    if task_type == "chat":
        return "meta-llama/Llama-3.1-8B-Instruct"
    return "meta-llama/Llama-3.1-8B"
Summary
- LLaMA, Mistral, Qwen, and Falcon are leading open-source model families
- Apache 2.0 and MIT licenses offer maximum commercial freedom
- HuggingFace ecosystem (Transformers, PEFT, TRL, Datasets) covers model loading, fine-tuning, and data preparation
- Quantized models (GPTQ, AWQ, GGUF) reduce memory requirements 2-4x relative to FP16
- Open-source excels for privacy, cost, and customization; proprietary for highest quality
- Start with the smallest model that meets quality requirements
- Fine-tuned smaller models often outperform larger general-purpose models
Practice Exercises
- Model Comparison: Compare LLaMA 3.1 8B, Mixtral 8x7B, and Qwen 2.5 72B on your specific task. Which provides the best quality-cost tradeoff?
- License Analysis: Review licenses for 5 open-source models. Which ones can you use commercially in your application?
- HuggingFace Pipeline: Build a complete inference pipeline using Transformers with quantization and batching.
- PEFT Fine-tuning: Fine-tune LLaMA 3.1 8B with LoRA on a custom dataset. Compare performance with full fine-tuning.
- Deployment Comparison: Deploy the same model using vLLM, TGI, and llama.cpp. Compare latency, throughput, and resource usage.