What Are Large Language Models?

Large Language Models (LLMs) are deep neural networks trained on massive text corpora that show strong abilities in language understanding, generation, and reasoning. This tutorial explains what LLMs are, how they scale, and why they matter.

Definition: Large Language Model

A Large Language Model (LLM) is a neural network with hundreds of millions to trillions of parameters, trained on large-scale text corpora using self-supervised learning objectives (typically next-token prediction). It can perform a wide range of natural language tasks through in-context learning or fine-tuning.

The Scale of LLMs

The defining characteristic of LLMs is scale—not just in parameters, but across three dimensions:

Definition: Computing Scale

The performance of language models is governed by three scaling factors:

  • Parameters (N): The number of learnable weights in the model
  • Data (D): The number of tokens in the training corpus
  • Compute (C): The total FLOPs used during training

The relationship between these factors is captured by neural scaling laws, which show that test loss decreases as a power law with respect to each factor.

Chinchilla Scaling Law

L(N, D) = \left(\frac{N_c}{N}\right)^{\alpha_N} + \left(\frac{D_c}{D}\right)^{\alpha_D} + L_\infty

Here,

  • L = Test loss (cross-entropy)
  • N = Number of model parameters
  • D = Number of training tokens
  • N_c, D_c = Characteristic parameter/token counts
  • \alpha_N, \alpha_D = Scaling exponents
  • L_\infty = Irreducible loss
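
As a minimal sketch, the loss formula above can be evaluated directly to see how each term behaves. The constants below (`N_C`, `D_C`, `ALPHA_N`, `ALPHA_D`, `L_INF`) are illustrative placeholders chosen for this example, not fitted values from any paper, and the function name `scaling_law_loss` is hypothetical.

```python
# Illustrative placeholder constants: NOT fitted values from any paper.
N_C = 1e13      # characteristic parameter count (assumed)
D_C = 1e13      # characteristic token count (assumed)
ALPHA_N = 0.3   # scaling exponent for parameters (assumed)
ALPHA_D = 0.3   # scaling exponent for data (assumed)
L_INF = 1.7     # irreducible loss (assumed)


def scaling_law_loss(n_params: float, n_tokens: float) -> float:
    """Evaluate L(N, D) = (N_c/N)^alpha_N + (D_c/D)^alpha_D + L_inf."""
    param_term = (N_C / n_params) ** ALPHA_N
    data_term = (D_C / n_tokens) ** ALPHA_D
    return param_term + data_term + L_INF


# Grow the model while keeping 20 tokens per parameter.
for n_params in (1e9, 1e10, 1e11):
    n_tokens = 20 * n_params
    loss = scaling_law_loss(n_params, n_tokens)
    print(f"N={n_params:.0e}, D={n_tokens:.0e}: L={loss:.3f}")
```

With this functional form, growing only N leaves the data term untouched, so the loss can never fall below that term plus `L_INF`. The same holds in reverse for D.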

Example: Scaling Law Calculation

Given a model with N = 7B parameters, compute the Chinchilla-optimal number of training tokens D. Using the Chinchilla ratio of approximately 20 tokens per parameter:

D = 20 × 7 × 10^9 = 1.4 × 10^11 tokens (140 billion)

This is the compute-optimal data budget for a 7B model under the Chinchilla rule of thumb.
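
The calculation above can be sketched in a few lines. The function name `chinchilla_optimal_tokens` is hypothetical, and the default ratio of 20 tokens per parameter is the approximate rule of thumb quoted in the example, not an exact constant.

```python
def chinchilla_optimal_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Return the approximate compute-optimal token count for a model size.

    Uses the ~20 tokens-per-parameter rule of thumb (an approximation).
    """
    return n_params * tokens_per_param


tokens = chinchilla_optimal_tokens(7e9)
print(f"{tokens / 1e9:.0f}B tokens")  # prints "140B tokens"
```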

Scale in Practice

| Model | Parameters | Training Tokens | Compute (FLOPs) |
|---|---|---|---|
| GPT-2 | 1.5B | 40B | ~3.1×10²⁰ |
| GPT-3 | 175B | 300B | ~3.1×10²³ |
| LLaMA 2 70B | 70B | 2T | ~1.7×10²⁴ |
| GPT-4 | ~1.8T (estimated) | ~13T | ~2.1×10²⁵ |
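
A common back-of-the-envelope approximation, which this tutorial does not itself state, estimates the training compute of a dense Transformer as C ≈ 6 · N · D. The sketch below applies it to the GPT-3 row. The function names are hypothetical, and figures reported elsewhere may rest on different accounting (for example sparse architectures or different token counts), so this estimate need not reproduce every entry in the table.

```python
def estimate_training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute as C ~ 6 * N * D (dense-model assumption)."""
    return 6.0 * n_params * n_tokens


def tokens_per_parameter(n_params: float, n_tokens: float) -> float:
    """Ratio of training tokens to parameters, for comparison with ~20."""
    return n_tokens / n_params


gpt3_flops = estimate_training_flops(175e9, 300e9)
print(f"GPT-3 estimate: {gpt3_flops:.2e} FLOPs")  # prints 3.15e+23
print(f"GPT-3 tokens per parameter: {tokens_per_parameter(175e9, 300e9):.1f}")
```

The tokens-per-parameter ratio gives a quick way to compare a model against the roughly 20:1 Chinchilla rule of thumb.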

The Chinchilla paper (Hoffmann et al., 2022) demonstrated that for optimal compute allocation, the number of training tokens should scale roughly proportionally with the number of parameters—challenging the prior paradigm of under-training large models.

Emergent Capabilities

One of the most fascinating properties of LLMs is emergence: abilities that appear suddenly as model scale increases, which are absent in smaller models.

Definition: Emergent Capabilities

Emergent capabilities are abilities that are not present in small-scale models but appear in large-scale models. These capabilities are not explicitly trained for but arise from the combination of scale, architecture, and training data. Examples include chain-of-thought reasoning, in-context learning, and few-shot analogy transfer.

Key emergent capabilities include:

  • In-context learning: Learning from examples provided in the prompt without gradient updates
  • Chain-of-thought reasoning: Breaking complex problems into intermediate reasoning steps
  • Instruction following: Generalizing to unseen instruction formats
  • Code generation: Writing and debugging programs from natural language descriptions
  • Multilingual transfer: Performing tasks in languages that are only sparsely represented in the training data
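
To make in-context learning concrete, here is a minimal sketch of assembling a few-shot prompt. The `Input:`/`Output:` layout and the helper name `build_few_shot_prompt` are assumptions made for illustration, not a required format. No model is called here: the point is only that the "training examples" live in the prompt text and no weights are updated.

```python
def build_few_shot_prompt(examples, query, instruction=""):
    """Format (input, output) demonstration pairs followed by an unanswered query."""
    lines = []
    if instruction:
        lines.append(instruction)
        lines.append("")
    for text, label in examples:
        lines.append(f"Input: {text}")
        lines.append(f"Output: {label}")
        lines.append("")
    # The final query is left open for the model to complete.
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)


demo_examples = [
    ("The film was wonderful.", "positive"),
    ("I want my money back.", "negative"),
]
prompt = build_few_shot_prompt(
    demo_examples,
    query="An instant classic.",
    instruction="Classify the sentiment of each input.",
)
print(prompt)
```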

Emergence is contested. Some researchers argue that these abilities are not truly emergent but reflect smooth underlying improvements that only look abrupt when measured with discontinuous metrics such as exact-match accuracy. Regardless, the practical observation holds: larger models exhibit qualitatively different behaviors than smaller ones.

A Brief History of LLMs

The Pre-Transformer Era

Language modeling has roots in statistical methods (n-gram models) and early neural approaches (RNNs, LSTMs). The key limitation was the inability to capture long-range dependencies effectively.

The Transformer Revolution (2017)

The "Attention Is All You Need" paper (Vaswani et al., 2017) introduced the Transformer architecture, replacing recurrence with self-attention. This enabled:

  • Parallelized training on modern hardware
  • Better capture of long-range dependencies
  • Scalability to billions of parameters

GPT-2 and the Zero-Shot Era (2019)

OpenAI's GPT-2 (1.5B parameters) demonstrated that a language model trained on a large corpus could perform many tasks zero-shot, simply by conditioning on a natural language prompt.

GPT-3 and In-Context Learning (2020)

GPT-3 (175B parameters) introduced the concept of in-context learning—providing a few examples in the prompt enables the model to learn new tasks without fine-tuning. This was a paradigm shift from task-specific fine-tuning to general-purpose prompting.

ChatGPT and the Instruction Following Era (2022)

ChatGPT combined GPT-3.5 with Reinforcement Learning from Human Feedback (RLHF), creating a model that could follow complex instructions and engage in multi-turn dialogue. This was the first LLM to achieve mainstream consumer adoption.

GPT-4 and Multimodal Models (2023)

GPT-4 introduced multimodal capabilities (text + images) and demonstrated significant improvements in reasoning, coding, and factual accuracy.

The Open-Source Revolution (2023-2024)

Models like LLaMA, Mistral, and Qwen demonstrated that open-source models could compete with proprietary ones. This democratized LLM research and enabled rapid innovation.

For a detailed treatment of the GPT architecture, see our module on GPT Architecture.

When to Use LLMs vs Traditional ML

Understanding when LLMs are appropriate is crucial for practical applications:

LLMs Excel At:

  • Few-shot or zero-shot tasks: When you have limited labeled data
  • Open-ended generation: Creative writing, brainstorming, summarization
  • Complex reasoning: Tasks requiring multi-step logical deduction
  • Code generation: Writing, debugging, and explaining code
  • Natural language interfaces: Chatbots, assistants, search

Traditional ML Is Often Better For:

  • Structured prediction: When you have large labeled datasets
  • Latency-critical applications: When inference speed is paramount
  • Resource-constrained environments: Edge devices, mobile
  • Well-defined classification/regression: When a simple model suffices

A practical heuristic: if your task can be framed as "given this text, generate the next text," an LLM is likely appropriate. If your task is "given these 1000 features, predict this number," traditional ML may be more efficient.

Mathematical Foundations

The core objective of language modeling is to estimate the probability distribution over tokens in a sequence:

Autoregressive Language Modeling

P(x_1, x_2, \ldots, x_T) = \prod_{t=1}^{T} P(x_t | x_1, \ldots, x_{t-1}; \theta)

Here,

  • x_t = Token at position t
  • T = Sequence length
  • \theta = Model parameters
  • P(x_t | x_1, \ldots, x_{t-1}) = Conditional probability of token t given previous tokens
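
The factorization above can be illustrated with a toy model. As a simplifying assumption, the sketch below conditions each token only on the single previous token (a bigram model), whereas an LLM conditions on the full prefix. The probability table `TOY_MODEL` is made up for illustration.

```python
import math

# Hypothetical toy conditionals P(next | previous); "<s>" marks sequence start.
TOY_MODEL = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a": {"cat": 0.3, "dog": 0.7},
    "cat": {"sleeps": 1.0},
    "dog": {"sleeps": 1.0},
}


def sequence_log_prob(tokens):
    """Sum log-conditionals left to right (the chain rule in log space)."""
    log_prob = 0.0
    previous = "<s>"
    for token in tokens:
        log_prob += math.log(TOY_MODEL[previous][token])
        previous = token
    return log_prob


log_p = sequence_log_prob(["the", "cat", "sleeps"])
# The joint probability is the product 0.6 * 0.5 * 1.0, i.e. about 0.3.
print(f"log P = {log_p:.4f}, P = {math.exp(log_p):.4f}")
```

Summing log-probabilities rather than multiplying raw probabilities avoids numerical underflow on long sequences.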

The model is trained by minimizing the negative log-likelihood:

\mathcal{L}(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \log P(x_t | x_1, \ldots, x_{t-1}; \theta)
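
As a small worked sketch of this objective, the function below computes the mean negative log-likelihood from the probabilities a model assigned to the correct tokens. The probabilities are made-up values for illustration. Perplexity, the exponential of this loss, is included as the usual companion metric.

```python
import math


def mean_negative_log_likelihood(token_probs):
    """Average of -log P(x_t | prefix) over the T positions of a sequence."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


# Hypothetical probabilities assigned to the true token at each position.
probs = [0.5, 0.25, 0.125]
loss = mean_negative_log_likelihood(probs)
perplexity = math.exp(loss)

# loss = (ln 2 + 2 ln 2 + 3 ln 2) / 3 = 2 ln 2
print(f"loss = {loss:.4f}")              # prints 1.3863
print(f"perplexity = {perplexity:.2f}")  # prints 4.00
```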

Practical Example: Loading an LLM

Here's a minimal example of loading and using an LLM with HuggingFace Transformers:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

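# Load a pre-trained LLM (this checkpoint is gated: accept the license on the
# Hugging Face Hub and authenticate before downloading)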
model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Create a prompt
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain the concept of emergence in LLMs."}
]

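# Apply the model's chat template and tokenize; return_dict=True also
# returns the attention mask alongside the input IDs
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True
).to(model.device)

# Sample a completion of up to 512 new tokens
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.9,
    do_sample=True
)

# Decode only the newly generated tokens, skipping the prompt
prompt_length = inputs["input_ids"].shape[-1]
response = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
print(response)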

When working with LLMs in practice, always consider: (1) memory requirements, (2) inference latency, (3) cost per token, and (4) whether fine-tuning or prompting is more appropriate for your use case.
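
For the first consideration, memory, a rough lower bound is the space needed to hold the weights alone: parameter count times bytes per parameter. The sketch below is an estimate only. It ignores activations, the KV cache, and framework overhead, and the helper name `weight_memory_gb` is hypothetical.

```python
# Bytes used to store one parameter in common numeric formats.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_gb(n_params: float, dtype: str = "float16") -> float:
    """Memory for model weights only, in gigabytes (1 GB = 10^9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in ("float32", "float16", "int8"):
    print(f"7B parameters in {dtype}: {weight_memory_gb(7e9, dtype):.1f} GB")
```

For a 7B-parameter model in float16 this gives about 14 GB for the weights, which is why half precision or quantization is often needed to fit such a model on a single consumer GPU.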

Practice Exercises

  1. Conceptual: Explain why scaling laws suggest that both model size and data size must increase together for optimal performance. What happens if you scale only one?

  2. Mathematical: Given a vocabulary of size V = 50,000 and a sequence of length T = 1024, calculate the total number of parameters in the output projection layer of a model with hidden dimension d = 4096.

  3. Practical: Using the HuggingFace Transformers library, load GPT-2 (124M) and GPT-2 Medium (355M), then compare their perplexity on a text sample of your choice.

  4. Research: Compare the training compute requirements of GPT-3 (175B) and LLaMA 2 70B. Which model is more compute-efficient and why?

Key Takeaways:

  • LLMs are neural networks with hundreds of millions to trillions of parameters trained on massive text corpora
  • Performance scales predictably with parameters, data, and compute (Chinchilla scaling laws)
  • Emergent capabilities appear at scale: in-context learning, chain-of-thought reasoning, instruction following
  • LLMs excel at few-shot tasks, open-ended generation, and complex reasoning
  • The core training objective is next-token prediction (autoregressive language modeling)
