LLM Reference

LLM Glossary — Essential Terms and Concepts

A comprehensive glossary of terms used throughout the field of Large Language Models. This glossary provides clear, concise definitions for both foundational and advanced concepts.

Foundational Terms — Core concepts for understanding LLMs
Architecture Terms — Model structures and components
Training Terms — Methods and techniques for building LLMs
Evaluation Terms — Metrics and assessment approaches

Knowledge is power; definitions are the foundation of knowledge.

This glossary provides definitions for key terms used in the field of Large Language Models. Terms are organized alphabetically within categories for easy reference.

Foundational Terms

A

Activation Function: A mathematical function applied to neuron outputs to introduce non-linearity. Common examples include ReLU, sigmoid, and tanh.

Attention Mechanism: A neural network component that computes weighted sums of input representations, allowing the model to focus on relevant parts of the input.

Attention Score: The weight assigned to a key-value pair in attention computation, determining how much focus to place on that part of the input.
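As a minimal sketch of the two entries above, the following computes scaled dot-product attention for a single query using only the standard library. The function names and toy vectors are illustrative assumptions, and scaling by the square root of the key dimension follows the common transformer convention rather than anything stated in this glossary.

```python
import math

def softmax(xs):
    # Subtract the max for numerical stability before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for one query vector (illustrative)."""
    d_k = len(query)
    # Raw compatibility of the query with each key, scaled by sqrt(d_k).
    raw = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in keys]
    weights = softmax(raw)  # attention scores: non-negative, sum to 1
    dim = len(values[0])
    # Output is the weighted sum of the value vectors.
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output

weights, output = attention(
    query=[1.0, 0.0],
    keys=[[1.0, 0.0], [0.0, 1.0]],
    values=[[10.0, 0.0], [0.0, 10.0]],
)
```

The first key matches the query more closely, so it receives the larger weight and its value dominates the output.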

Autoregressive Model: A model that generates outputs sequentially, where each output token depends on previously generated tokens.

B

Backpropagation: The algorithm for computing gradients in neural networks, used to update model parameters during training.

Beam Search: A decoding algorithm that explores multiple possible sequences simultaneously, keeping the top-k most probable sequences at each step.
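A small sketch of beam search over a toy next-token table may make the idea concrete. `TOY_MODEL` is a hypothetical stand-in for a real language model (its probabilities depend only on the previous token), and all names here are assumptions for illustration.

```python
import math

# Hypothetical toy "model": next-token probabilities given the last token only.
TOY_MODEL = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.4, "</s>": 0.1},
    "a": {"cat": 0.4, "dog": 0.5, "</s>": 0.1},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}

def beam_search(beam_width=2, max_steps=4):
    beams = [(0.0, ["<s>"])]  # each beam is (log-probability, tokens so far)
    for _ in range(max_steps):
        candidates = []
        for logp, tokens in beams:
            if tokens[-1] == "</s>":
                candidates.append((logp, tokens))  # finished sequences carry over
                continue
            for token, prob in TOY_MODEL[tokens[-1]].items():
                candidates.append((logp + math.log(prob), tokens + [token]))
        # Keep only the top-k most probable partial sequences.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
    return beams

best_logp, best_tokens = beam_search()[0]
```

With `beam_width=1` this reduces to greedy decoding; wider beams trade compute for a broader search.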

BERT (Bidirectional Encoder Representations from Transformers): A transformer-based model that uses bidirectional context for natural language understanding tasks.

BPE (Byte-Pair Encoding): A tokenization algorithm that builds a vocabulary by iteratively merging the most frequent pair of adjacent symbols, starting from individual characters or bytes.
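The merge loop can be sketched as follows. The toy word-frequency corpus and the helper names are assumptions for illustration; production tokenizers add pre-tokenization, byte-level handling, and special tokens that are omitted here.

```python
from collections import Counter

def most_frequent_pair(words):
    """Return the most frequent adjacent symbol pair in {symbol tuple: frequency}."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Toy corpus: each word split into characters, with its frequency.
corpus = {tuple("hug"): 10, tuple("pug"): 5, tuple("pun"): 12,
          tuple("bun"): 4, tuple("hugs"): 5}
merges = []
for _ in range(3):
    pair = most_frequent_pair(corpus)
    merges.append(pair)
    corpus = merge_pair(corpus, pair)
```

Each learned merge becomes a new vocabulary entry, and the ordered list of merges is what a tokenizer replays when encoding new text.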

Buffer Pool: Memory management structure for efficient attention computation in transformers.

C

Chain-of-Thought (CoT): A prompting technique that encourages models to show intermediate reasoning steps before providing final answers.

Chinchilla Scaling Law: An empirical result stating that, for a fixed compute budget, model parameters and training tokens should be scaled in roughly equal proportion; it is often summarized as about 20 training tokens per parameter.

CLM (Causal Language Modeling): Language modeling that predicts the next token in a sequence, also known as autoregressive language modeling.

Context Window: The maximum number of tokens a model can process in a single forward pass.

Contrastive Learning: Training methods that learn representations by contrasting positive and negative examples.

Constitutional AI: An alignment approach where AI systems learn to follow principles or "constitutions" through self-critique and revision.

Cosine Similarity: A measure of similarity between two vectors, computed as the cosine of the angle between them.
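A minimal sketch of the computation, using only the standard library. The function name is an assumption, and the sketch assumes neither vector is all zeros.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (both must be non-zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])   # close to 1
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])       # 0
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])        # -1
```

Because the result depends only on direction, not magnitude, it is a common choice for comparing embeddings of different norms.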

D

Deduplication: The process of removing duplicate data points from training datasets.

Decoder: The part of a transformer that generates output sequences, typically used in autoregressive models.

Deep Learning: A subset of machine learning using neural networks with multiple layers to learn hierarchical representations.

Denoising Objective: A training objective that requires the model to reconstruct clean inputs from corrupted versions.

Diffusion Model: A generative model that learns to generate data by reversing a gradual noising process.

DPO (Direct Preference Optimization): A method for aligning language models with human preferences by training directly on pairs of preferred and rejected responses, without a separate reward model or reinforcement learning loop.

Dropout: A regularization technique that randomly sets a fraction of neurons to zero during training.

DSPy: A framework for programming with language models, emphasizing composable modules and automatic optimization.

E

Embedding: A dense vector representation of a discrete object (word, token, document) in a continuous vector space.

Encoder: The part of a transformer that processes input sequences into representations, typically used for understanding tasks.

Epoch: One complete pass through the entire training dataset.

Emergent Capabilities: Abilities that appear suddenly in large models but are absent in smaller models.

EOS Token (End of Sequence): A special token indicating the end of a sequence.

Eval Set: A dataset used to evaluate model performance during training.

F

Few-Shot Learning: Learning from a small number of examples provided in the prompt, without gradient updates.

Fine-Tuning: Further training a pre-trained model on task-specific data to adapt it to new tasks.

Flash Attention: An exact attention implementation that computes attention in tiles without materializing the full attention matrix, reducing memory usage and improving speed.

Float16 (FP16): A 16-bit floating-point format used for efficient model training and inference.

FLOPs (Floating Point Operations): A measure of computational complexity, used to estimate training and inference costs.

Function Calling: The ability of LLMs to call external functions or APIs based on user requests.

G

GPT (Generative Pre-trained Transformer): An autoregressive language model architecture using transformer decoders.

Gradient Accumulation: A technique for simulating larger batch sizes by accumulating gradients over several forward and backward passes before taking a single optimizer step.
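The equivalence behind gradient accumulation can be checked on a toy problem. The sketch below assumes a one-parameter model `y = w * x` with mean squared error and equally sized micro-batches; the data and names are illustrative.

```python
def grad_mse(w, batch):
    """Gradient of mean squared error for y = w * x over (x, y) pairs."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
w = 0.5

# Gradient computed on the full batch in one pass.
full_batch_grad = grad_mse(w, data)

# Same gradient accumulated over two micro-batches of equal size.
micro_batches = [data[0:2], data[2:4]]
accumulated_grad = 0.0
for micro_batch in micro_batches:
    # Scale each contribution so the sum equals the full-batch mean.
    accumulated_grad += grad_mse(w, micro_batch) / len(micro_batches)

# A single optimizer step would then use accumulated_grad.
learning_rate = 0.01
w_next = w - learning_rate * accumulated_grad
```

Only one micro-batch needs to fit in memory at a time, which is the point of the technique.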

Gradient Checkpointing: A memory optimization technique that recomputes activations during backpropagation instead of storing them.

Grounding: Connecting model outputs to verifiable facts or external sources.

Grouped-Query Attention (GQA): An attention variant where multiple query heads share key-value heads, reducing memory usage.

H

Hallucination: When a model generates information that is factually incorrect or not grounded in the input.

Hugging Face: A platform providing tools and resources for working with machine learning models, especially NLP.

Human Evaluation: Assessment of model outputs by human judges, considered the gold standard for quality evaluation.

Hugging Face Transformers: An open-source Python library from Hugging Face that provides pre-trained models and tools for NLP and other machine learning tasks.

I

ICL (In-Context Learning): Learning from examples provided in the prompt without updating model parameters.

Instruction Tuning: Fine-tuning models on instruction-response pairs to improve instruction following.

Inference: The process of using a trained model to generate outputs from inputs.

Interpolation: Generating outputs that fall within the range of training data, as opposed to extrapolation.

J

Jailbreaking: Attempting to bypass model safety restrictions through adversarial prompts.

K

KV Cache (Key-Value Cache): Cached key and value tensors from attention layers for previously processed tokens, reused to speed up autoregressive generation.

Knowledge Distillation: Training a smaller model to mimic a larger model's behavior.

Knowledge Graph: A structured representation of facts as entities and relationships.

L

Language Model: A probabilistic model of text that assigns probabilities to sequences of tokens.

Latent Space: The learned representation space where similar inputs are mapped to nearby points.

Layer Normalization: A normalization technique applied across features for each input sample.
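As a rough sketch, layer normalization of a single feature vector looks like this. The parameter names `gamma`, `beta`, and `eps` follow common convention but are assumptions here, not part of the definition above.

```python
import math

def layer_norm(x, gamma=None, beta=None, eps=1e-5):
    """Normalize one sample across its features, then apply scale and shift."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    normed = [(v - mean) / math.sqrt(var + eps) for v in x]
    if gamma is None:
        gamma = [1.0] * n  # learnable scale, identity by default
    if beta is None:
        beta = [0.0] * n   # learnable shift, zero by default
    return [g * v + b for g, v, b in zip(gamma, normed, beta)]

y = layer_norm([1.0, 2.0, 3.0, 4.0])
```

Unlike batch normalization, the statistics are computed per sample, so the result does not depend on the other items in the batch.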

LoRA (Low-Rank Adaptation): A parameter-efficient fine-tuning method that adds low-rank matrices to existing weights.
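A minimal sketch of the low-rank update, written with plain lists to stay self-contained. The names `w`, `a`, `b`, `alpha`, and `rank` are assumptions, and the `alpha / rank` scaling follows the convention commonly used with LoRA rather than anything stated in this glossary.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_effective_weight(w, a, b, alpha, rank):
    """Return W + (alpha / rank) * B @ A, with W frozen and A, B trainable."""
    delta = matmul(b, a)  # (d_out x r) @ (r x d_in) -> d_out x d_in
    scale = alpha / rank
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

d_out, d_in, rank = 4, 4, 1
w = [[1.0 if i == j else 0.0 for j in range(d_in)] for i in range(d_out)]  # frozen weights
a = [[1.0, 0.0, 0.0, 0.0]]          # r x d_in
b = [[0.5], [0.0], [0.0], [0.0]]    # d_out x r

w_eff = lora_effective_weight(w, a, b, alpha=2.0, rank=rank)

full_params = d_out * d_in            # parameters in W
lora_params = rank * (d_in + d_out)   # parameters in A and B
```

Here the adapter trains 8 values instead of 16; the savings grow quickly for realistic layer sizes where the rank is far smaller than the layer dimensions.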

LLaMA (Large Language Model Meta AI): A family of decoder-only language models from Meta, released with openly available weights.

LM Head: The output layer of a language model that projects hidden states to vocabulary logits, which softmax then converts to probabilities.

M

Masked Language Modeling (MLM): A training objective where random tokens are masked and the model predicts them.

Machine Translation: Automatically translating text from one language to another using computational methods.

MMLU (Massive Multitask Language Understanding): A benchmark evaluating knowledge across 57 subjects.

Multi-Head Attention: Attention mechanism with multiple parallel attention heads capturing different patterns.

N

Named Entity Recognition (NER): Identifying and classifying named entities in text.

Neural Scaling Laws: Power-law relationships between model performance and factors like parameters, data, and compute.

NLL (Negative Log-Likelihood): A common loss function for training probabilistic models.

NLP (Natural Language Processing): The field of AI focused on understanding and generating human language.

Non-Parametric: Methods that don't learn fixed parameters but instead use memory or lookup tables.

O

OOD (Out-of-Distribution): Data that differs significantly from the training distribution.

Open-Source Model: A model whose weights and often training code are publicly available.

OpenAI: A research organization developing and deploying AI systems, including GPT models.

Outlier Detection: Identifying unusual data points that deviate significantly from normal patterns.

P

Packing: Combining multiple short sequences into a single longer sequence to improve training efficiency.

PaLM (Pathways Language Model): A large language model from Google using the Pathways system.

Parameters: Learnable weights in a neural network that are updated during training.

Perplexity: A measure of how well a probability model predicts a sample, computed as the exponentiated average negative log-likelihood.
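The definition above translates directly into a few lines of code. In this sketch, `token_probs` is an assumed input: the probability the model assigned to each correct token in a sequence.

```python
import math

def perplexity(token_probs):
    """Exponentiated average negative log-likelihood of the observed tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

uniform_over_four = perplexity([0.25, 0.25, 0.25, 0.25])
perfect = perplexity([1.0, 1.0])
```

A model that always spreads probability evenly over four choices has perplexity 4, which is why perplexity is often read as an effective branching factor.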

Positional Encoding: Adding position information to tokens since transformers lack inherent sequence order.
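One well-known variant is the sinusoidal encoding from the original transformer paper; the sketch below assumes that scheme. Learned and rotary position embeddings are common alternatives not shown here, and the function name is illustrative.

```python
import math

def sinusoidal_encoding(position, d_model):
    """Sinusoidal positional encoding for one position (one common variant)."""
    encoding = []
    for i in range(0, d_model, 2):
        # Each pair of dimensions uses a different wavelength.
        angle = position / (10000 ** (i / d_model))
        encoding.append(math.sin(angle))      # even dimension
        if i + 1 < d_model:
            encoding.append(math.cos(angle))  # odd dimension
    return encoding

first = sinusoidal_encoding(0, 4)
fifth = sinusoidal_encoding(4, 4)
```

The resulting vector is added to the token embedding at that position, so identical tokens at different positions receive different inputs.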

Pre-training: Initial training on large datasets to learn general representations before fine-tuning.

Prompt Engineering: Designing input prompts to elicit desired outputs from language models.

Q

QLoRA: A variant of LoRA that trains low-rank adapters on top of a frozen base model quantized to 4 bits, reducing memory requirements for fine-tuning.

Quantization: Reducing the precision of model weights to reduce memory usage and improve inference speed.
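As an illustration, the sketch below applies symmetric per-tensor 8-bit quantization, which is only one of several schemes. The function names and sample weights are assumptions for illustration.

```python
def quantize_int8(weights):
    """Symmetric quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float values from the stored integers."""
    return [q * scale for q in quantized]

weights = [0.5, -1.27, 0.02, 1.0]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each weight is stored in one byte plus a shared scale, and the rounding error per weight is bounded by half the scale.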

R

RAG (Retrieval-Augmented Generation): Combining retrieval of relevant documents with generation for more accurate outputs.

Recall: The fraction of relevant items that are successfully retrieved or identified.

Reinforcement Learning from Human Feedback (RLHF): Training models to align with human preferences using reinforcement learning.

Replay Buffer: Stored experiences used for training in reinforcement learning.

Residual Connection: Skip connections that add input directly to layer output, enabling deeper networks.

S

Sampling: Generating tokens by randomly selecting from the probability distribution.

Scaling Law: Mathematical relationships describing how model performance improves with scale.

Self-Attention: Attention mechanism where queries, keys, and values all come from the same sequence.

Sentence Embedding: A vector representation of a sentence or paragraph.

Softmax: A function that converts logits to probabilities summing to 1.

Speculative Decoding: A technique that uses a smaller model to generate draft tokens verified by a larger model.

SFT (Supervised Fine-Tuning): Fine-tuning on labeled data with input-output pairs.

T

Temperature: A parameter controlling randomness in sampling; higher values increase diversity.
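The effect of temperature is easiest to see by dividing the logits by it before applying softmax. This is a minimal sketch; the function signature and sample logits are assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, with optional temperature scaling."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
sharp = softmax(logits, temperature=0.5)    # more peaked
default = softmax(logits, temperature=1.0)
flat = softmax(logits, temperature=2.0)     # closer to uniform
```

Lower temperatures concentrate probability on the top token, while higher temperatures spread it across more tokens.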

Tensor: A multi-dimensional array, the fundamental data structure in deep learning.

TF-IDF: Term Frequency-Inverse Document Frequency, a weighting scheme that scores how important a term is to a document relative to a corpus.

Token: A unit of text, which can be a word, subword, or character.

Tokenization: Splitting text into tokens for model processing.

Top-p (Nucleus) Sampling: Sampling from the smallest set of tokens whose cumulative probability exceeds p.
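A minimal sketch of nucleus filtering followed by sampling. The toy distribution and function names are assumptions, and this sketch stops as soon as the cumulative probability reaches p.

```python
import random

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    # Renormalize so the kept probabilities sum to 1.
    return {token: prob / total for token, prob in kept}

def sample_top_p(probs, p, rng):
    nucleus = top_p_filter(probs, p)
    tokens = list(nucleus)
    weights = list(nucleus.values())
    return rng.choices(tokens, weights=weights, k=1)[0]

probs = {"the": 0.5, "a": 0.3, "an": 0.15, "zebra": 0.05}
nucleus = top_p_filter(probs, p=0.9)
token = sample_top_p(probs, p=0.9, rng=random.Random(0))
```

Unlike top-k, the number of candidate tokens adapts to the shape of the distribution: a confident model yields a small nucleus, an uncertain one a larger nucleus.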

Transformer: A neural network architecture based on self-attention, introduced in "Attention Is All You Need."

U

Unembedding: The reverse of embedding, mapping from vector space back to discrete tokens.

Unigram: A single token or word.

Unsupervised Learning: Learning patterns from unlabeled data.

V

VAE (Variational Autoencoder): A generative model that learns a latent space for data generation.

Vocabulary: The set of all tokens a model can process.

Vocabulary Size: The total number of unique tokens in the model's vocabulary.

W

Weight Decay: A regularization technique that penalizes large weights during training.

Word Embedding: A dense vector representing a word's meaning.

X

XLA (Accelerated Linear Algebra): An optimizing compiler for linear algebra computations, used by frameworks such as TensorFlow, JAX, and PyTorch.

Z

Zero-Shot Learning: Performing tasks without any task-specific examples, relying solely on instructions.

Zipf's Law: The empirical observation that word frequency is inversely proportional to rank.

Evaluation Metrics

Metric	Description	Use Case
BLEU	N-gram precision against reference texts	Translation, summarization
ROUGE	Recall-oriented n-gram overlap with reference texts	Summarization
BERTScore	Semantic similarity computed from contextual embeddings	General quality
Perplexity	Exponentiated average negative log-likelihood (lower is better)	Language modeling
Exact Match (EM)	Binary score: 1 if the output matches the reference exactly, otherwise 0	QA, classification
F1 Score	Harmonic mean of precision and recall	Classification
AUC-ROC	Area under the ROC curve	Classification
NDCG	Normalized Discounted Cumulative Gain	Ranking
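As a small worked example for the F1 row in the table, the sketch below computes precision, recall, and F1 over sets of predicted and relevant items. The function name and sample items are illustrative assumptions.

```python
def precision_recall_f1(predicted, relevant):
    """Set-based precision, recall, and F1 (their harmonic mean)."""
    predicted, relevant = set(predicted), set(relevant)
    true_positives = len(predicted & relevant)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

precision, recall, f1 = precision_recall_f1(
    predicted=["a", "b", "c", "d"],
    relevant=["a", "b", "e"],
)
```

Two of the four predictions are correct (precision 0.5) and two of the three relevant items are found (recall 2/3), giving an F1 of 4/7.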

Abbreviations

Abbreviation	Full Form
LLM	Large Language Model
NLP	Natural Language Processing
GPT	Generative Pre-trained Transformer
BERT	Bidirectional Encoder Representations from Transformers
RLHF	Reinforcement Learning from Human Feedback
DPO	Direct Preference Optimization
LoRA	Low-Rank Adaptation
RAG	Retrieval-Augmented Generation
CoT	Chain-of-Thought
ICL	In-Context Learning
SFT	Supervised Fine-Tuning
PEFT	Parameter-Efficient Fine-Tuning
BPE	Byte-Pair Encoding
EOS	End of Sequence
BOS	Beginning of Sequence
UNK	Unknown Token

This glossary is a living document. As the field evolves, new terms will be added and existing definitions updated. Contributions and suggestions are welcome.

Practice Exercises

Term Definition: Define the following terms in your own words: attention mechanism, transformer, fine-tuning, RLHF.
Term Relationships: How do the following terms relate to each other: GPT, transformer, autoregressive, language modeling?
Term Application: When would you use the following: RAG, LoRA, DPO, chain-of-thought prompting?
Term Evolution: How have the definitions of "large language model" and "alignment" evolved over time?

Key Takeaways:

This glossary provides definitions for essential LLM terms and concepts
Terms are organized alphabetically within categories for easy reference
The field evolves rapidly; new terms emerge and definitions change
Understanding terminology is fundamental to mastering LLM concepts
Use this glossary as a quick reference while reading papers and documentation

What to Learn Next

-> LLM Tool Ecosystem: Overview of Hugging Face, LangChain, LlamaIndex, and other tools.

-> LLM Best Practices: Best practices for common LLM tasks and applications.

-> LLM Roadmap: Learning roadmap, skill progression, and career paths in LLMs.

-> LLM Research Paper Guide: Key papers, reading guides, and research methodology for LLMs.
