R Factors — Categorical Data Mastery

Learning Objectives

By the end of this tutorial, you will be able to:

  • Create ordered and unordered factors
  • Manipulate factor levels with levels(), relevel(), and droplevels()
  • Convert between factors and other types
  • Understand when to use factors vs characters
  • Handle common factor pitfalls in data analysis

What Is a Factor?

A factor is R's way of representing categorical data — variables that take on a limited number of possible values (levels). Internally, factors are stored as integers with labels.

# Create a factor
colors <- factor(c("red", "blue", "green", "red", "blue"))
colors
# [1] red   blue  green red   blue
# Levels: blue green red

# Check structure
str(colors)
#  Factor w/ 3 levels "blue","green","red": 3 1 2 3 1

# Internal representation
as.integer(colors)
# [1] 3 1 2 3 1
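
The "integers with labels" description can be checked directly. The short sketch below reuses the colors vector from above (the names codes, lv, and rebuilt are only illustrative) and rebuilds the labels by indexing the levels with the integer codes:

```r
colors <- factor(c("red", "blue", "green", "red", "blue"))

codes <- as.integer(colors)   # integer codes: 3 1 2 3 1
lv <- levels(colors)          # "blue" "green" "red"

# Indexing the levels by the codes rebuilds the original labels
rebuilt <- lv[codes]
rebuilt
# [1] "red"   "blue"  "green" "red"   "blue"

identical(rebuilt, as.character(colors))
# [1] TRUE
```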

Creating Factors

Using factor()

# Basic factor
status <- factor(c("active", "inactive", "active", "pending"))
status
# [1] active   inactive active   pending
# Levels: active inactive pending

# With specific levels
status <- factor(
  c("active", "inactive", "active", "pending"),
  levels = c("pending", "active", "inactive")
)
status
# [1] active   inactive active   pending
# Levels: pending active inactive

# With labels
status <- factor(
  c(1, 2, 1, 3),
  levels = c(1, 2, 3),
  labels = c("Pending", "Active", "Inactive")
)
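
As a small illustration of how levels and labels pair up, the sketch below adds a hypothetical invalid code (9) to the input. Any value that is not listed in levels becomes NA instead of raising an error:

```r
# 9 is a made-up invalid code, used only to show the NA behaviour
codes <- c(1, 2, 1, 3, 9)

status <- factor(
  codes,
  levels = c(1, 2, 3),
  labels = c("Pending", "Active", "Inactive")
)

as.character(status)
# [1] "Pending"  "Active"   "Pending"  "Inactive" NA

# Count the values that did not match any level
sum(is.na(status))
# [1] 1
```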

Ordered Factors

# Ordinal data (has meaningful order)
education <- factor(
  c("high_school", "bachelors", "masters", "phd", "bachelors"),
  levels = c("high_school", "bachelors", "masters", "phd"),
  ordered = TRUE
)
education
# [1] high_school bachelors   masters     phd         bachelors
# Levels: high_school < bachelors < masters < phd

# Comparison operators work on ordered factors
education[1] < education[2]   # [1] TRUE
education[3] > education[1]   # [1] TRUE
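
Ordering also enables a few other operations. The sketch below rebuilds the education factor from above and shows a vectorized comparison against a level name, max(), and sort(), all of which follow the declared level order:

```r
education <- factor(
  c("high_school", "bachelors", "masters", "phd", "bachelors"),
  levels = c("high_school", "bachelors", "masters", "phd"),
  ordered = TRUE
)

# Compare every element against a level name
education >= "masters"
# [1] FALSE FALSE  TRUE  TRUE FALSE

# max() is defined for ordered factors
as.character(max(education))
# [1] "phd"

# sort() follows the level order, not alphabetical order
as.character(sort(education))
# [1] "high_school" "bachelors"   "bachelors"   "masters"     "phd"
```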

Quick Factor Creation

# factor() with character vector
factor(c("M", "F", "M", "F", "M"))

# gl() — generate factor levels
gl(3, 2, labels = c("Low", "Medium", "High"))
# [1] Low    Low    Medium Medium High   High
# Levels: Low Medium High

# gl(n, k) — n levels, each repeated k times
gl(2, 3, labels = c("Control", "Treatment"))
# [1] Control   Control   Control   Treatment Treatment Treatment
# Levels: Control Treatment
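
gl() also accepts a length argument, which recycles the pattern. As a sketch, setting k = 1 with length = 6 alternates the levels instead of grouping them (the name alternating is illustrative):

```r
# k = 1 with length = 6: the two levels alternate
alternating <- gl(2, 1, length = 6, labels = c("Control", "Treatment"))
as.character(alternating)
# [1] "Control"   "Treatment" "Control"   "Treatment" "Control"   "Treatment"

levels(alternating)
# [1] "Control"   "Treatment"
```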

Factor Properties

x <- factor(c("a", "b", "a", "c", "b", "a"))

# Levels
levels(x)     # [1] "a" "b" "c"

# Number of levels
nlevels(x)    # [1] 3

# Class
class(x)      # [1] "factor"

# Internal storage
typeof(x)     # [1] "integer"
as.integer(x) # [1] 1 2 1 3 2 1

# Length
length(x)     # [1] 6

Manipulating Levels

Adding Levels

x <- factor(c("a", "b"))
levels(x)     # [1] "a" "b"

# Add a level (even if not used yet)
levels(x) <- c(levels(x), "c", "d")
x
# [1] a b
# Levels: a b c d

# Treat missing values as their own level
x <- addNA(x)  # Adds NA to the set of levels
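
Adding a level first matters because assigning a value that is not already a level produces NA. The sketch below uses a hypothetical new value "z" to show both the failure and the fix:

```r
x <- factor(c("a", "b"))

# Assigning an unknown value yields NA (R also emits a warning,
# suppressed here to keep the output clean)
y <- x
suppressWarnings(y[2] <- "z")
is.na(y[2])
# [1] TRUE

# Register the level first, then assign
levels(x) <- c(levels(x), "z")
x[2] <- "z"
as.character(x)
# [1] "a" "z"
```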

Reordering Levels

# By frequency (most common first)
x <- factor(c("c", "a", "b", "a", "c", "a", "b"))
table(x)
# x
# a b c
# 3 2 2

x <- factor(x, levels = names(sort(table(x), decreasing = TRUE)))
levels(x)
# [1] "a" "b" "c"

# By character sort
x <- factor(c("banana", "apple", "cherry"))
x <- factor(x, levels = sort(levels(x)))
levels(x)
# [1] "apple"  "banana" "cherry"

# Using relevel() — make a specific level first
x <- factor(c("low", "medium", "high", "medium"))
x <- relevel(x, ref = "medium")
levels(x)
# [1] "medium" "high"   "low"
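
Two further reordering patterns, sketched under the assumption of a made-up companion vector score: reversing the current level order with rev(), and ordering levels by a summary of another variable with reorder() from the stats package.

```r
size <- factor(c("low", "medium", "high", "medium"))
levels(size)
# [1] "high"   "low"    "medium"

# Reverse the current level order
size_rev <- factor(size, levels = rev(levels(size)))
levels(size_rev)
# [1] "medium" "low"    "high"

# Order levels by the mean of a companion numeric vector
# (score is hypothetical data: low = 1, medium = 5 and 6, high = 9)
score <- c(1, 5, 9, 6)
size_by_score <- reorder(size, score, FUN = mean)
levels(size_by_score)
# [1] "low"    "medium" "high"
```

reorder() sorts levels in increasing order of the summary value, so the level with the smallest mean comes first.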

Dropping Levels

x <- factor(c("a", "b", "c", "a", "b"))
levels(x)     # [1] "a" "b" "c"

# Keep only levels that appear in data
x <- droplevels(x)
levels(x)     # [1] "a" "b" "c" (same here, but would drop unused)

# Subset then drop
x_sub <- x[c(1, 2)]
levels(droplevels(x_sub))  # [1] "a" "b"
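
The effect of unused levels is easiest to see in table() output. In the sketch below, filtering out "c" leaves a zero-count entry until the level is dropped, either with droplevels() or with drop = TRUE while subsetting:

```r
x <- factor(c("a", "b", "c", "a", "b"))

# Filtering keeps all three levels, so "c" shows up with a zero count
x_sub <- x[x != "c"]
nlevels(x_sub)
# [1] 3
table(x_sub)
# x_sub
# a b c
# 2 2 0

# droplevels() removes the empty level
table(droplevels(x_sub))
#
# a b
# 2 2

# drop = TRUE does the same while subsetting
nlevels(x[x != "c", drop = TRUE])
# [1] 2
```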

Factor Operations

Table and Frequency

x <- factor(c("red", "blue", "red", "green", "blue", "red"))

# Frequency table
table(x)
# x
#  blue green   red
#     2     1     3

# Proportions
prop.table(table(x))
# x
#      blue     green       red
# 0.3333333 0.1666667 0.5000000

# Cross-tabulation
gender <- factor(c("M", "F", "M", "F", "M"))
color <- factor(c("red", "blue", "red", "blue", "red"))
table(gender, color)
#        color
# gender blue red
#      F    2   0
#      M    0   3

Summary

x <- factor(c("a", "b", "a", "c", "b", "a"))
summary(x)
# a b c
# 3 2 1

Factor Math

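# Arithmetic is not meaningful on factors
x <- factor(c(10, 20, 30, 40))
# x + 1  # Warning: '+' not meaningful for factors; returns NA NA NA NA

# Convert the labels (not the integer codes) to numbers first
as.numeric(as.character(x)) + 1
# [1] 11 21 31 41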

Converting Between Types

# Factor to character
x <- factor(c("a", "b", "c"))
as.character(x)    # [1] "a" "b" "c"

# Character to factor
x <- c("a", "b", "c", "a", "b")
factor(x)

# Factor to numeric
x <- factor(c(10, 20, 30, 40))
as.numeric(x)                # [1] 1 2 3 4 (wrong: returns the level codes)
as.numeric(as.character(x))  # [1] 10 20 30 40 (correct)

# Numeric to factor
x <- c(1, 2, 3, 1, 2)
factor(x)

# Data frame column
df <- data.frame(
  color = factor(c("red", "blue", "green")),
  value = 1:3
)
str(df)
# 'data.frame':	3 obs. of  2 variables:
#  $ color: Factor w/ 3 levels "blue","green","red": 3 1 2
#  $ value: int  1 2 3
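
For the factor-to-numeric conversion, the sketch below puts the wrong call next to two working ones. The second working form, as.numeric(levels(x))[x], converts each level once and then indexes by the codes, which can be cheaper on long vectors with few levels:

```r
x <- factor(c("10", "20", "30", "20"))

# Wrong: returns the integer codes
as.numeric(x)
# [1] 1 2 3 2

# Correct: go through the labels
via_character <- as.numeric(as.character(x))
via_character
# [1] 10 20 30 20

# Also correct: convert the levels once, then index by the codes
via_levels <- as.numeric(levels(x))[x]
via_levels
# [1] 10 20 30 20
```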

Common Factor Pitfalls

1. Strings as Factors

# R < 4.0: strings are factors by default
df <- data.frame(x = c("a", "b", "c"))
str(df)
# 'data.frame':	3 obs. of  1 variable:
#  $ x: Factor w/ 3 levels "a","b","c": 1 2 3

# R >= 4.0: strings are characters by default
# Or use stringsAsFactors = FALSE
df <- data.frame(x = c("a", "b", "c"), stringsAsFactors = FALSE)
str(df)
# 'data.frame':	3 obs. of  1 variable:
#  $ x: chr  "a" "b" "c"
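
A pattern that behaves the same on any R version is to read strings as characters and convert only the columns that are truly categorical. The sketch below assumes a made-up data frame with an identifier column (id) and a categorical column (plan):

```r
# Hypothetical data: id is free text, plan is categorical
df <- data.frame(
  id = c("u1", "u2", "u3"),
  plan = c("free", "pro", "free"),
  stringsAsFactors = FALSE
)

# Convert only the categorical column, with explicit levels
df$plan <- factor(df$plan, levels = c("free", "pro"))

class(df$id)
# [1] "character"
class(df$plan)
# [1] "factor"
levels(df$plan)
# [1] "free" "pro"
```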

2. Unexpected Ordering

# Alphabetical order may not be what you want
sizes <- factor(c("small", "large", "medium"))
levels(sizes)
# [1] "large"  "medium" "small"   # Alphabetical!

# Fix: specify levels explicitly
sizes <- factor(c("small", "large", "medium"),
                levels = c("small", "medium", "large"))
levels(sizes)
# [1] "small"  "medium" "large"  # Correct order!
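
The same alphabetical default affects numbers stored as strings, where "10" sorts before "2". The sketch below uses a made-up vector of numeric strings and derives the levels from the numeric values instead:

```r
# Hypothetical input: numbers that arrived as strings
x <- c("10", "2", "1", "2")

# Default levels are sorted as text
levels(factor(x))
# [1] "1"  "10" "2"

# Derive the level order from the numeric values
num_levels <- as.character(sort(unique(as.numeric(x))))
x_fixed <- factor(x, levels = num_levels)
levels(x_fixed)
# [1] "1"  "2"  "10"
```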

3. Math on Factors

# Arithmetic on a factor returns NA with a warning
x <- factor(c(1, 2, 3))
# x + 1  # Warning: '+' not meaningful for factors

# Convert via character first
as.numeric(as.character(x)) + 1
# [1] 2 3 4

Practical Examples

Example 1: Survey Data

# Likert scale data
satisfaction <- factor(
  c("Satisfied", "Neutral", "Very Satisfied", "Dissatisfied", "Satisfied"),
  levels = c("Very Dissatisfied", "Dissatisfied", "Neutral", "Satisfied", "Very Satisfied"),
  ordered = TRUE
)

table(satisfaction)
# satisfaction
# Very Dissatisfied       Dissatisfied             Neutral
#                  0                  1                  1
#          Satisfied     Very Satisfied
#                  2                  1

# Summary
summary(satisfaction)
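
Because the survey factor is ordered, threshold questions can be answered directly. The sketch below rebuilds the satisfaction factor from above, computes the share of respondents at "Satisfied" or better, and looks up the median response through the integer codes (with an odd number of responses the median falls on a single level):

```r
satisfaction <- factor(
  c("Satisfied", "Neutral", "Very Satisfied", "Dissatisfied", "Satisfied"),
  levels = c("Very Dissatisfied", "Dissatisfied", "Neutral", "Satisfied", "Very Satisfied"),
  ordered = TRUE
)

# Share of respondents who are at least "Satisfied"
share_satisfied <- mean(satisfaction >= "Satisfied")
share_satisfied
# [1] 0.6

# Median response, looked up through the integer codes
median_level <- levels(satisfaction)[median(as.integer(satisfaction))]
median_level
# [1] "Satisfied"
```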

Example 2: Group Comparisons

# Treatment groups
group <- factor(c("Control", "Treatment A", "Treatment B",
                  "Control", "Treatment A", "Treatment B"))

# Response
response <- c(10, 15, 20, 12, 18, 22)

# By group
tapply(response, group, mean)
#    Control Treatment A Treatment B
#        11.0        16.5        21.0
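
The reference level matters once the factor goes into a model. As a sketch (using lm() from the stats package on the same six observations; fit, fit2, and group2 are illustrative names), the first level becomes the baseline that the intercept describes, and relevel() changes which group plays that role:

```r
group <- factor(c("Control", "Treatment A", "Treatment B",
                  "Control", "Treatment A", "Treatment B"))
response <- c(10, 15, 20, 12, 18, 22)

# Default baseline is the first level, "Control"
fit <- lm(response ~ group)
names(coef(fit))
# [1] "(Intercept)"      "groupTreatment A" "groupTreatment B"

# Make "Treatment A" the baseline instead
group2 <- relevel(group, ref = "Treatment A")
fit2 <- lm(response ~ group2)
names(coef(fit2))
# [1] "(Intercept)"       "group2Control"     "group2Treatment B"
```

In each fit the intercept equals the mean of the baseline group, so it moves from the Control mean to the Treatment A mean after releveling.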

Practice Exercises

Exercise 1: Grade Factor

Create a factor for grades (A, B, C, D, F) with 15 student grades. Find the most common grade and the proportion of students with A or B.

Solution

set.seed(42)
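grades <- factor(
  sample(c("A", "B", "C", "D", "F"), 15, replace = TRUE),
  levels = c("A", "B", "C", "D", "F")  # keep all five grades as levels, even if one is never sampled
)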
grades

# Most common grade
names(which.max(table(grades)))

# Proportion with A or B
mean(grades %in% c("A", "B"))

Exercise 2: Ordered Factor

Create an ordered factor for temperature ranges (Cold, Cool, Warm, Hot) with 20 observations. Verify that the ordering is correct.

Solution

temp <- factor(
  sample(c("Cold", "Cool", "Warm", "Hot"), 20, replace = TRUE),
  levels = c("Cold", "Cool", "Warm", "Hot"),
  ordered = TRUE
)

# Check ordering
is.ordered(temp)   # [1] TRUE
levels(temp)       # [1] "Cold" "Cool" "Warm" "Hot"

# Table
table(temp)

Key Takeaways

  • Factors represent categorical data — stored as integers with labels
  • Use ordered = TRUE for ordinal data (has meaningful order)
  • Always specify levels explicitly to control ordering
  • droplevels() removes unused levels after subsetting
  • Never do arithmetic on factors directly: convert with as.numeric(as.character(x)) first
  • R 4.0+ creates character columns by default in data.frame(); on older versions, pass stringsAsFactors = FALSE
  • table() gives frequencies, prop.table() gives proportions
  • Use relevel() to change the reference level

Next: Learn about R Conditionals — if/else, switch, and vectorized conditionals.
