R Factors — Categorical Data Mastery
Learning Objectives
By the end of this tutorial, you will be able to:
- Create ordered and unordered factors
- Manipulate factor levels with levels(), relevel(), and droplevels()
- Convert between factors and other types
- Understand when to use factors vs characters
- Handle common factor pitfalls in data analysis
What Is a Factor?
A factor is R's way of representing categorical data — variables that take on a limited number of possible values (levels). Internally, factors are stored as integers with labels.
# Create a factor
colors <- factor(c("red", "blue", "green", "red", "blue"))
colors
# [1] red blue green red blue
# Levels: blue green red
# Check structure
str(colors)
# Factor w/ 3 levels "blue","green","red": 3 1 2 3 1
# Internal representation
as.integer(colors)
# [1] 3 1 2 3 1
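To make the "integers with labels" description concrete, the sketch below rebuilds the original strings from the two pieces R stores. The variable names codes, labs, and rebuilt are chosen here for illustration only.

```r
# Same vector as in the example above
colors <- factor(c("red", "blue", "green", "red", "blue"))

codes <- as.integer(colors)  # one integer code per element
labs  <- levels(colors)      # one label per level

# Indexing the labels by the codes recovers the original values
rebuilt <- labs[codes]
rebuilt
# [1] "red"   "blue"  "green" "red"   "blue"

identical(rebuilt, as.character(colors))
# [1] TRUE
```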
Creating Factors
Using factor()
# Basic factor
status <- factor(c("active", "inactive", "active", "pending"))
status
# [1] active inactive active pending
# Levels: active inactive pending
# With specific levels
status <- factor(
c("active", "inactive", "active", "pending"),
levels = c("pending", "active", "inactive")
)
status
# [1] active inactive active pending
# Levels: pending active inactive
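# With labels (labels must be a vector built with c(), not [ ])
status <- factor(
  c(1, 2, 1, 3),
  levels = c(1, 2, 3),
  labels = c("Pending", "Active", "Inactive")
)
status
# [1] Pending Active Pending Inactive
# Levels: Pending Active Inactive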
Ordered Factors
# Ordinal data (has meaningful order)
education <- factor(
c("high_school", "bachelors", "masters", "phd", "bachelors"),
levels = c("high_school", "bachelors", "masters", "phd"),
ordered = TRUE
)
education
# [1] high_school bachelors masters phd bachelors
# Levels: high_school < bachelors < masters < phd
# Comparison operators work on ordered factors
education[1] < education[2] # [1] TRUE
education[3] > education[1] # [1] TRUE
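Because comparisons are vectorized, an ordered factor can also be filtered against a level name. The following is a minimal sketch that reuses the education vector from above; at_least_masters is a name introduced only for this example.

```r
education <- factor(
  c("high_school", "bachelors", "masters", "phd", "bachelors"),
  levels = c("high_school", "bachelors", "masters", "phd"),
  ordered = TRUE
)

# Compare every element against a level name
at_least_masters <- education >= "masters"
at_least_masters
# [1] FALSE FALSE  TRUE  TRUE FALSE

# Use the logical vector to subset
as.character(education[at_least_masters])
# [1] "masters" "phd"

# min() and max() follow the level order, not the alphabet
as.character(max(education))  # [1] "phd"
as.character(min(education))  # [1] "high_school"
```

The same comparison on an unordered factor would not be meaningful, which is one practical reason to set ordered = TRUE for ordinal data.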
Quick Factor Creation
# factor() with character vector
factor(c("M", "F", "M", "F", "M"))
# gl() — generate factor levels
gl(3, 2, labels = c("Low", "Medium", "High"))
# [1] Low Low Medium Medium High High
# Levels: Low Medium High
# gl(n, k) — n levels, each repeated k times
gl(2, 3, labels = c("Control", "Treatment"))
# [1] Control Control Control Treatment Treatment Treatment
# Levels: Control Treatment
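gl() also accepts a length argument, which recycles the pattern until the requested length is reached. As a small sketch, the names blocked and alternating below are illustrative only: k = 3 gives blocks, while k = 1 with length = 6 alternates the levels.

```r
# Blocks: each level repeated 3 times in a row
blocked <- gl(2, 3, labels = c("Control", "Treatment"))
as.character(blocked)
# [1] "Control"   "Control"   "Control"   "Treatment" "Treatment" "Treatment"

# Alternating: each level once, pattern recycled up to length 6
alternating <- gl(2, 1, length = 6, labels = c("Control", "Treatment"))
as.character(alternating)
# [1] "Control"   "Treatment" "Control"   "Treatment" "Control"   "Treatment"
```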
Factor Properties
x <- factor(c("a", "b", "a", "c", "b", "a"))
# Levels
levels(x) # [1] "a" "b" "c"
# Number of levels
nlevels(x) # [1] 3
# Class
class(x) # [1] "factor"
# Internal storage
typeof(x) # [1] "integer"
as.integer(x) # [1] 1 2 1 3 2 1
# Length
length(x) # [1] 6
Manipulating Levels
Adding Levels
x <- factor(c("a", "b"))
levels(x) # [1] "a" "b"
# Add a level (even if not used yet)
levels(x) <- c(levels(x), "c", "d")
x
# [1] a b
# Levels: a b c d
# Treat missing values as their own level
x <- addNA(x) # Adds NA to the set of levels
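Adding a level before it is used matters because a value outside the current levels cannot be assigned. The sketch below shows both outcomes; y and z are copies made only for illustration, and the warning R normally emits is suppressed so the example runs quietly.

```r
x <- factor(c("a", "b", "a"))

# Assigning a value that is not a level produces NA (with a warning)
y <- x
suppressWarnings(y[2] <- "z")
is.na(y[2])
# [1] TRUE

# Register the level first, then the assignment works
z <- x
levels(z) <- c(levels(z), "z")
z[2] <- "z"
as.character(z)
# [1] "a" "z" "a"
levels(z)
# [1] "a" "b" "z"
```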
Reordering Levels
# By frequency (most common first)
x <- factor(c("c", "a", "b", "a", "c", "a", "b"))
table(x)
# x
# a b c
# 3 2 2
x <- factor(x, levels = names(sort(table(x), decreasing = TRUE)))
levels(x)
# [1] "a" "b" "c"
# By character sort
x <- factor(c("banana", "apple", "cherry"))
x <- factor(x, levels = sort(levels(x)))
levels(x)
# [1] "apple" "banana" "cherry"
# Using relevel() — make a specific level first
x <- factor(c("low", "medium", "high", "medium"))
x <- relevel(x, ref = "medium")
levels(x)
# [1] "medium" "high" "low"
# (the remaining levels keep their previous, alphabetical order)
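One place where the first level matters is model fitting: with R's default treatment contrasts, lm() uses the first level as the baseline. The sketch below uses hypothetical toy data (dose and response are invented here) and only inspects the coefficient names.

```r
# Hypothetical toy data, invented for illustration
dose     <- factor(rep(c("low", "medium", "high"), times = 2))
response <- c(1, 2, 3, 4, 5, 6)

# Default: the alphabetically first level ("high") is the baseline
names(coef(lm(response ~ dose)))
# [1] "(Intercept)" "doselow"     "dosemedium"

# After relevel(), "medium" becomes the baseline
dose2 <- relevel(dose, ref = "medium")
names(coef(lm(response ~ dose2)))
# [1] "(Intercept)" "dose2high"   "dose2low"
```

The baseline level has no coefficient of its own; every other coefficient is a difference from it.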
Dropping Levels
x <- factor(c("a", "b", "c", "a", "b"))
levels(x) # [1] "a" "b" "c"
# Keep only levels that appear in data
x <- droplevels(x)
levels(x) # [1] "a" "b" "c" (same here, but would drop unused)
# Subset then drop
x_sub <- x[c(1, 2)]
levels(droplevels(x_sub)) # [1] "a" "b"
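The effect is easiest to see in a frequency table. In this sketch (kept is an illustrative name), filtering removes the values but the empty level still shows up with a count of zero until droplevels() is applied.

```r
x <- factor(c("a", "b", "c", "a", "b"))

# Filtering out "c" removes the values but keeps the level
kept <- x[x != "c"]
levels(kept)
# [1] "a" "b" "c"
as.vector(table(kept))
# [1] 2 2 0

# droplevels() removes the empty category
levels(droplevels(kept))
# [1] "a" "b"
```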
Factor Operations
Table and Frequency
x <- factor(c("red", "blue", "red", "green", "blue", "red"))
# Frequency table
table(x)
# x
# blue green red
# 2 1 3
# Proportions
prop.table(table(x))
# x
# blue green red
# 0.3333333 0.1666667 0.5000000
# Cross-tabulation
gender <- factor(c("M", "F", "M", "F", "M"))
color <- factor(c("red", "blue", "red", "blue", "red"))
table(gender, color)
# color
# gender blue red
# F 2 0
# M 0 3
Summary
x <- factor(c("a", "b", "a", "c", "b", "a"))
summary(x)
# a b c
# 3 2 1
Factor Math
# Arithmetic on factors is not meaningful
x <- factor(c(10, 20, 30, 40))
# x + 1 # Warning: '+' not meaningful for factors; returns NA NA NA NA
# Convert the labels to numbers first
as.numeric(as.character(x)) + 1
# [1] 11 21 31 41
Converting Between Types
# Factor to character
x <- factor(c("a", "b", "c"))
as.character(x) # [1] "a" "b" "c"
# Character to factor
x <- c("a", "b", "c", "a", "b")
factor(x)
# Factor to numeric
x <- factor(c(10, 20, 30, 40))
as.numeric(x) # [1] 1 2 3 4 (wrong: these are the level indices)
as.numeric(as.character(x)) # [1] 10 20 30 40 (correct)
# Numeric to factor
x <- c(1, 2, 3, 1, 2)
factor(x)
# Data frame column
df <- data.frame(
color = factor(c("red", "blue", "green")),
value = 1:3
)
str(df)
# 'data.frame': 3 obs. of 2 variables:
# $ color: Factor w/ 3 levels "blue","green","red": 3 1 2
# $ value: int 1 2 3
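For the factor-to-numeric conversion, a commonly used alternative converts the level labels once and then indexes them by the factor's codes. The sketch below compares the three approaches on a vector whose values differ from their level indices.

```r
x <- factor(c(10, 20, 30, 20))

as.numeric(x)                # level indices, not the values
# [1] 1 2 3 2

as.numeric(as.character(x))  # original values
# [1] 10 20 30 20

# Convert the level labels once, then index by the integer codes
as.numeric(levels(x))[x]
# [1] 10 20 30 20
```

Both correct forms give the same result; the second only converts one string per level instead of one per element.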
Common Factor Pitfalls
1. Strings as Factors
# R < 4.0: strings are factors by default
df <- data.frame(x = c("a", "b", "c"))
str(df)
# 'data.frame': 3 obs. of 1 variable:
# $ x: Factor w/ 3 levels "a","b","c": 1 2 3
# R >= 4.0: strings stay character by default
# On older versions, set stringsAsFactors = FALSE explicitly
df <- data.frame(x = c("a", "b", "c"), stringsAsFactors = FALSE)
str(df)
# 'data.frame': 3 obs. of 1 variable:
# $ x: chr "a" "b" "c"
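One way to get the same result on any R version is to keep strings as characters on creation and convert only the columns that are really categorical. This is a sketch with made-up column names (name, group), not a required workflow.

```r
df <- data.frame(
  name  = c("a", "b", "c"),
  group = c("x", "y", "x"),
  stringsAsFactors = FALSE  # explicit, so behavior does not depend on the R version
)

# Convert only the categorical column
df$group <- factor(df$group)

sapply(df, class)
#        name       group
# "character"    "factor"
```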
2. Unexpected Ordering
# Alphabetical order may not be what you want
sizes <- factor(c("small", "large", "medium"))
levels(sizes)
# [1] "large" "medium" "small" # Alphabetical!
# Fix: specify levels explicitly
sizes <- factor(c("small", "large", "medium"),
levels = c("small", "medium", "large"))
levels(sizes)
# [1] "small" "medium" "large" # Correct order!
3. Math on Factors
# This does not do what you want
x <- factor(c(1, 2, 3))
# x + 1 # Warning ('+' not meaningful for factors); the result is all NA
# Convert first
as.numeric(as.character(x)) + 1
# [1] 2 3 4
Practical Examples
Example 1: Survey Data
# Likert scale data
satisfaction <- factor(
c("Satisfied", "Neutral", "Very Satisfied", "Dissatisfied", "Satisfied"),
levels = c("Very Dissatisfied", "Dissatisfied", "Neutral", "Satisfied", "Very Satisfied"),
ordered = TRUE
)
table(satisfaction)
# satisfaction
# Very Dissatisfied Dissatisfied Neutral
# 0 1 1
# Satisfied Very Satisfied
# 2 1
# Summary
summary(satisfaction)
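Because the scale is ordered, simple ordinal summaries are also possible. The sketch below reuses the satisfaction vector; share_satisfied and median_code are names introduced here. The median lookup assumes an odd number of responses, since an even count can produce a fractional code.

```r
satisfaction <- factor(
  c("Satisfied", "Neutral", "Very Satisfied", "Dissatisfied", "Satisfied"),
  levels = c("Very Dissatisfied", "Dissatisfied", "Neutral",
             "Satisfied", "Very Satisfied"),
  ordered = TRUE
)

# Share of respondents at "Satisfied" or above
share_satisfied <- mean(satisfaction >= "Satisfied")
share_satisfied
# [1] 0.6

# Median response: median of the integer codes, mapped back to a label
median_code <- median(as.integer(satisfaction))
levels(satisfaction)[median_code]
# [1] "Satisfied"
```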
Example 2: Group Comparisons
# Treatment groups
group <- factor(c("Control", "Treatment A", "Treatment B",
"Control", "Treatment A", "Treatment B"))
# Response
response <- c(10, 15, 20, 12, 18, 22)
# By group
tapply(response, group, mean)
# Control Treatment A Treatment B
# 11.0 16.5 21.0
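When the grouped result is needed as a data frame rather than a named vector, aggregate() is one alternative to tapply(). The sketch below rebuilds the same data; trial and means are illustrative names.

```r
group <- factor(c("Control", "Treatment A", "Treatment B",
                  "Control", "Treatment A", "Treatment B"))
response <- c(10, 15, 20, 12, 18, 22)
trial <- data.frame(group, response)

# One row per factor level, with the mean response
means <- aggregate(response ~ group, data = trial, FUN = mean)
means
#         group response
# 1     Control     11.0
# 2 Treatment A     16.5
# 3 Treatment B     21.0

# split() returns the raw values per level as a named list
split(trial$response, trial$group)$Control
# [1] 10 12
```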
Practice Exercises
Exercise 1: Grade Factor
Create a factor for grades (A, B, C, D, F) with 15 student grades. Find the most common grade and the proportion of students with A or B.
Solution
set.seed(42)
grades <- factor(sample(c("A", "B", "C", "D", "F"), 15, replace = TRUE))
grades
# Most common grade
names(which.max(table(grades)))
# Proportion with A or B
mean(grades %in% c("A", "B"))
Exercise 2: Ordered Factor
Create an ordered factor for temperature ranges (Cold, Cool, Warm, Hot) with 20 observations. Verify that the ordering is correct.
Solution
temp <- factor(
sample(c("Cold", "Cool", "Warm", "Hot"), 20, replace = TRUE),
levels = c("Cold", "Cool", "Warm", "Hot"),
ordered = TRUE
)
# Check ordering (does not depend on the random sample)
is.ordered(temp) # [1] TRUE
levels(temp) # [1] "Cold" "Cool" "Warm" "Hot"
# Table
table(temp)
Key Takeaways
- Factors represent categorical data — stored as integers with labels
- Use ordered = TRUE for ordinal data (has meaningful order)
- Always specify levels explicitly to control ordering
- droplevels() removes unused levels after subsetting
- Never do math on factors; convert with as.numeric(as.character(x)) first
- R 4.0+ keeps strings as characters by default; on older versions, pass stringsAsFactors = FALSE
- table() gives frequencies, prop.table() gives proportions
- Use relevel() to change the reference level
Next: Learn about R Conditionals — if/else, switch, and vectorized conditionals.