R Strings — Text Manipulation Masterclass
Learning Objectives
By the end of this tutorial, you will be able to:
- Create and format strings using paste, sprintf, and glue
- Manipulate strings with substring extraction, replacement, and case conversion
- Apply regular expressions for pattern matching and extraction
- Use the stringr package for consistent, readable string operations
- Handle common string problems like encoding, trimming, and splitting
Creating Strings
In R, a string is an element of a character vector and is written in single or double quotes.
# Single or double quotes (both work)
name <- "Alice"
name <- 'Alice'
# Quotes inside strings
cat("She said \"hello\"\n")
# She said "hello"
cat('He said \'hello\'\n')
# He said 'hello'
# Raw strings (R 4.0+) — no escape processing
path <- r"(C:\Users\Documents\file.txt)"
cat(path)
# C:\Users\Documents\file.txt
# Multi-line strings
multi <- "Line 1
Line 2
Line 3"
cat(multi)
# Line 1
# Line 2
# Line 3
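The examples above use cat() to display strings. As a small sketch of why that matters: print() shows a string with its escape sequences, while cat() writes the characters themselves. The variable names here are illustrative only.

```r
# A string containing escaped double quotes and a trailing newline
quote_str <- "She said \"hi\"\n"

# nchar() counts the characters stored in the string, not the backslashes typed
n_chars <- nchar(quote_str)

# print() shows the escaped form; cat() writes the raw characters
print(quote_str)
cat(quote_str)

# Raw strings (R >= 4.0) keep backslashes literally,
# so they equal the double-backslash spelling
raw_path <- r"(C:\temp\new)"
same_path <- identical(raw_path, "C:\\temp\\new")
```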
String Functions
paste() and paste0()
# paste() — concatenate with separator
paste("Hello", "World")
# [1] "Hello World"
paste("Hello", "World", sep = "-")
# [1] "Hello-World"
paste("a", "b", "c", sep = "")
# [1] "abc"
# paste0() — no separator (faster)
paste0("Hello", " ", "World")
# [1] "Hello World"
paste0("ID", 1:5)
# [1] "ID1" "ID2" "ID3" "ID4" "ID5"
# Vectorized
paste(c("a", "b", "c"), 1:3)
# [1] "a 1" "b 2" "c 3"
# collapse — join vector into single string
paste(c("a", "b", "c"), collapse = ", ")
# [1] "a, b, c"
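The sep and collapse arguments can be combined: sep joins the arguments element by element, and collapse then joins the resulting vector. A minimal sketch, with illustrative variable names:

```r
keys <- c("a", "b", "c")
vals <- 1:3

# sep joins corresponding elements
pairs <- paste(keys, vals, sep = "=")
pairs
# [1] "a=1" "b=2" "c=3"

# collapse then joins the results into one string
query <- paste(keys, vals, sep = "=", collapse = "&")
query
# [1] "a=1&b=2&c=3"

# Shorter vectors are recycled to the length of the longest one
labels <- paste0("item", 1:4, c("x", "y"))
labels
# [1] "item1x" "item2y" "item3x" "item4y"
```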
sprintf() — Formatted Strings
# Basic formatting
sprintf("Hello, %s!", "Alice")
# [1] "Hello, Alice!"
sprintf("Value: %d", 42)
# [1] "Value: 42"
sprintf("Pi is approximately %.2f", pi)
# [1] "Pi is approximately 3.14"
sprintf("Octal: %o, Hex: %x", 255L, 255L)
# [1] "Octal: 377, Hex: ff"
# Note: R's sprintf() has no %b (binary) conversion
# Padding
sprintf("%10s", "right") # [1] "     right"
sprintf("%-10s", "left") # [1] "left      "
sprintf("%05d", 42) # [1] "00042"
# Multiple values
sprintf("Name: %s, Age: %d, Score: %.1f", "Bob", 25, 95.5)
# [1] "Name: Bob, Age: 25, Score: 95.5"
sprintf Format Specifiers
| Specifier | Description | Example |
|---|---|---|
| %s | String | "hello" |
| %d | Integer | 42 |
| %f | Float | 3.141593 |
| %.2f | Float, 2 decimals | 3.14 |
| %e | Scientific notation | 3.141593e+00 |
| %g | Auto format (6 significant digits by default) | 3.14159 |
| %x | Hexadecimal | ff |
| %o | Octal | 377 |
| %10s | Right-aligned | "     hello" |
| %-10s | Left-aligned | "hello     " |
| %05d | Zero-padded | "00042" |
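Two details the table does not show: sprintf() is vectorized over its arguments, and %% produces a literal percent sign. R's sprintf() also has no binary conversion, so the sketch below includes a hypothetical helper, to_binary() (not a base R function), built on intToBits(). It assumes a single non-negative integer input.

```r
# sprintf() recycles its arguments, producing one string per element
students <- c("Ann", "Bob")
pct <- c(91.5, 78)
report <- sprintf("%s: %5.1f%%", students, pct)
report
# [1] "Ann:  91.5%" "Bob:  78.0%"

# Hypothetical helper: binary digits of one non-negative integer
to_binary <- function(n) {
  # intToBits() returns 32 bits, least significant first
  bits <- rev(as.integer(intToBits(as.integer(n)))[1:32])
  digits <- paste(bits, collapse = "")
  # Drop leading zeros but keep at least one digit
  sub("^0+(?=.)", "", digits, perl = TRUE)
}

to_binary(255)
# [1] "11111111"
to_binary(5)
# [1] "101"
```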
glue Package
library(glue)
# Template strings with {}
name <- "Alice"
age <- 30
glue("My name is {name} and I am {age} years old.")
# My name is Alice and I am 30 years old.
# Expressions in {}
glue("Next year I will be {age + 1}.")
# Next year I will be 31.
# Multi-line
glue("
Name: {name}
Age: {age}
")
# Collapsing
glue_collapse(c("a", "b", "c"), sep = ", ")
# a, b, c
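glue() is also vectorized, and doubled braces produce literal braces. A short sketch, assuming the glue package is installed; the variable names are illustrative:

```r
library(glue)

person <- c("Alice", "Bob")
score <- c(90.5, 72)

# One output string per element of the input vectors
msg <- glue("{person} scored {score}")
as.character(msg)
# [1] "Alice scored 90.5" "Bob scored 72"

# Doubling the braces escapes them
literal <- glue("Use {{name}} as a placeholder")
as.character(literal)
# [1] "Use {name} as a placeholder"
```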
String Manipulation
Case Conversion
x <- "Hello, World!"
toupper(x) # [1] "HELLO, WORLD!"
tolower(x) # [1] "hello, world!"
Substring Extraction
x <- "Hello, World!"
# Extract substring
substr(x, 1, 5) # [1] "Hello"
substr(x, 8, 12) # [1] "World"
# Replace using substr (the string length never changes, so only
# the first 5 characters of the new value are used)
substr(x, 1, 5) <- "Hi, there"
x
# [1] "Hi, t, World!"
# substring() is vectorized over start positions; the end defaults to the end of the string
substring("abcdef", 1:6)
# [1] "abcdef" "bcdef" "cdef" "def" "ef" "f"
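Because assignment through substr() keeps the string length fixed, swapping in text of a different length means rebuilding the string. One possible sketch uses a hypothetical helper, replace_range(), which is not part of base R:

```r
x <- "Hello, World!"

# Assignment via substr() only overwrites the 5 selected characters
y <- x
substr(y, 1, 5) <- "Hi, there"
y
# [1] "Hi, t, World!"

# Hypothetical helper: replace characters start..stop with value of any length
replace_range <- function(s, start, stop, value) {
  paste0(substr(s, 1, start - 1), value, substr(s, stop + 1, nchar(s)))
}

replace_range(x, 1, 5, "Hi, there")
# [1] "Hi, there, World!"
```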
String Splitting
# strsplit() — returns a list
x <- "apple,banana,cherry"
strsplit(x, ",")
# [[1]]
# [1] "apple" "banana" "cherry"
# Unlist to get vector
unlist(strsplit(x, ","))
# [1] "apple" "banana" "cherry"
# Split on runs of spaces (a leading match yields an empty first element; a trailing match does not)
strsplit("  hello   world  ", " +")
# [[1]]
# [1] "" "hello" "world"
# Split and take first element
strsplit("a-b-c", "-")[[1]][1]
# [1] "a"
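strsplit() returns a list because it accepts a whole character vector. A minimal sketch of pulling one field out of every element, or binding the pieces into a matrix when each element splits into the same number of parts:

```r
dates <- c("2024-01-15", "2023-12-31")

# One list element per input string
parts <- strsplit(dates, "-", fixed = TRUE)

# Take the first piece of each element
years <- vapply(parts, `[`, character(1), 1)
years
# [1] "2024" "2023"

# Stack the pieces row by row (requires equal lengths)
mat <- do.call(rbind, parts)
dim(mat)
# [1] 2 3
```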
String Replacement
# gsub() — global replacement
x <- "hello world hello r hello"
gsub("hello", "hi", x)
# [1] "hi world hi r hi"
# Replace with regex
gsub("\\s+", "_", "hello world")
# [1] "hello_world"
# sub() — first occurrence only
sub("hello", "hi", x)
# [1] "hi world hello r hello"
# Fixed replacement (no regex)
gsub("hello", "hi", x, fixed = TRUE)
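The fixed = TRUE option matters most when the pattern contains regex metacharacters, and the replacement string can refer back to captured groups. A short sketch of both points:

```r
# "." is a regex wildcard, so every character is replaced
all_dashes <- gsub(".", "-", "a.b.c")
all_dashes
# [1] "-----"

# With fixed = TRUE the dot is matched literally
dots_only <- gsub(".", "-", "a.b.c", fixed = TRUE)
dots_only
# [1] "a-b-c"

# \\1 and \\2 refer to the first and second captured groups
swapped <- gsub("(\\w+)@(\\w+)", "\\2 at \\1", "user@example")
swapped
# [1] "example at user"

# With perl = TRUE, \\U upper-cases what follows until \\E
capitalized <- sub("^(\\w)(\\w*)$", "\\U\\1\\E\\2", "hello", perl = TRUE)
capitalized
# [1] "Hello"
```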
String Trimming and Padding
# Trim whitespace
x <- " hello "
trimws(x) # [1] "hello"
trimws(x, "left") # [1] "hello "
trimws(x, "right") # [1] " hello"
# Padding
sprintf("%10s", "hi") # [1] "        hi" (right-aligned)
sprintf("%-10s", "hi") # [1] "hi        " (left-aligned)
format("hi", width = 10, justify = "centre") # [1] "    hi    " (centered; sprintf() has no centering flag)
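Padding is usually applied to whole vectors, for example to line up columns of text. A sketch using base R's formatC() and format(); the data is made up for illustration:

```r
items <- c("apple", "fig", "banana")

# flag = "-" pads on the right (left-aligned)
left <- formatC(items, width = 8, flag = "-")
left
# [1] "apple   " "fig     " "banana  "

# The default pads on the left (right-aligned)
right <- formatC(items, width = 8)

# format() can centre character data
centre <- format(items, width = 8, justify = "centre")

# Zero-padded integer IDs
ids <- formatC(c(7L, 42L, 123L), width = 5, format = "d", flag = "0")
ids
# [1] "00007" "00042" "00123"
```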
String Width and Characters
# Number of characters
nchar("hello") # [1] 5
# Character vector
strsplit("hello", "")[[1]]
# [1] "h" "e" "l" "l" "o"
# Reversing
paste(rev(strsplit("hello", "")[[1]]), collapse = "")
# [1] "olleh"
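nchar() counts characters by default, which can differ from the number of bytes for non-ASCII text, and it treats missing values in a way that is easy to overlook. A small sketch; enc2utf8() is used so the byte count does not depend on the session's native encoding:

```r
# "café", written with a Unicode escape so the source stays ASCII
word <- enc2utf8("caf\u00e9")

n_chars <- nchar(word)                  # counts characters
n_bytes <- nchar(word, type = "bytes")  # "é" takes two bytes in UTF-8

# Vectorized over its input
lens <- nchar(c("a", "bb", ""))
lens
# [1] 1 2 0

# A logical NA is coerced to the text "NA"; a character NA stays missing
na_logical <- nchar(NA)
na_char <- nchar(NA_character_)
```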
Regular Expressions
Basic Regex in Base R
# grep() — find matches (indices)
x <- c("apple", "banana", "cherry", "apricot")
grep("^a", x)
# [1] 1 4
# grepl() — logical vector
grepl("an", x)
# [1] FALSE TRUE FALSE FALSE
# sub() — replace first match
sub("a", "@", x)
# [1] "@pple" "b@nana" "cherry" "@pricot"
# gsub() — replace all matches
gsub("a", "@", x)
# [1] "@pple" "b@n@n@" "cherry" "@pricot"
# regmatches() — extract matches
x <- "My phone is 555-1234 and zip is 12345"
regmatches(x, regexpr("[0-9]{3}-[0-9]{4}", x))
# [1] "555-1234"
regmatches(x, gregexpr("[0-9]+", x))
# [[1]]
# [1] "555" "1234" "12345"
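A few options of the same functions are worth knowing: value = TRUE returns the matching strings rather than their indices, ignore.case relaxes case, and regexpr() reports where the first match starts. A minimal sketch with illustrative data:

```r
fruits <- c("apple", "Banana", "cherry", "apricot")

# Return the matching elements instead of their positions
starts_with_a <- grep("^a", fruits, value = TRUE)
starts_with_a
# [1] "apple" "apricot"

# Case-insensitive matching
has_b <- grepl("^b", fruits, ignore.case = TRUE)
has_b
# [1] FALSE  TRUE FALSE FALSE

# Position and length of the first match
pos <- regexpr("an", "banana")
match_start <- as.integer(pos)
match_len <- attr(pos, "match.length")
```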
Common Regex Patterns
| Pattern | Description | Example Match |
|---|---|---|
| . | Any character | "a" in "abc" |
| ^ | Start of string | ^h matches "h" in "hello" |
| $ | End of string | o$ matches "o" in "hello" |
| \\d | Digit | "5" in "a5b" |
| \\w | Word character | "h" in "hello" |
| \\s | Whitespace | " " in "a b" |
| [abc] | Character class | "a" in "apple" |
| [^abc] | Negated class | "d" in "abcd" |
| * | Zero or more | el* matches "ell" in "hello" |
| + | One or more | l+ matches "ll" in "hello" |
| ? | Zero or one | he? matches "he" in "hello" |
| {n} | Exactly n | l{2} matches "ll" in "hello" |
| {n,m} | Between n and m | l{1,2} matches "ll" in "hello" |
| \| | Alternation | cat\|dog matches "cat" or "dog" |
| () | Grouping | (ab)+ matches "abab"; the group is captured |
| \\b | Word boundary | \\bh matches "h" in "hello" |
# Email validation (simplified)
email <- "user@example.com"
grepl("^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$", email)
# [1] TRUE
# Phone number
phone <- "(555) 123-4567"
grepl("^\\(\\d{3}\\) \\d{3}-\\d{4}$", phone)
# [1] TRUE
# IP address (simplified)
ip <- "192.168.1.1"
grepl("^\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}$", ip)
# [1] TRUE
# Extract all numbers
text <- "I have 3 cats and 5 dogs"
as.integer(regmatches(text, gregexpr("\\d+", text))[[1]])
# [1] 3 5
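The simplified IP pattern above checks only the shape of the string, so it also accepts values such as 999.999.999.999. One way to tighten it, sketched here with a hypothetical helper is_ipv4() that handles a single string, is to check the numeric range of each octet after the regex passes:

```r
# Hypothetical helper: TRUE if ip looks like a dotted-quad IPv4 address
is_ipv4 <- function(ip) {
  # Step 1: shape check, four groups of 1-3 digits
  if (!grepl("^\\d{1,3}(\\.\\d{1,3}){3}$", ip)) {
    return(FALSE)
  }
  # Step 2: range check, each octet must be 0..255
  octets <- as.integer(strsplit(ip, ".", fixed = TRUE)[[1]])
  all(octets >= 0 & octets <= 255)
}

is_ipv4("192.168.1.1")
# [1] TRUE
is_ipv4("999.1.1.1")
# [1] FALSE
is_ipv4("1.2.3")
# [1] FALSE
```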
stringr Package
The stringr package provides a consistent, modern interface for string manipulation.
library(stringr)
Core Functions
x <- "Hello, World!"
# String length
str_length(x) # [1] 13
# Case conversion
str_to_upper(x) # [1] "HELLO, WORLD!"
str_to_lower(x) # [1] "hello, world!"
str_to_title(x) # [1] "Hello, World!"
# Substring
str_sub(x, 1, 5) # [1] "Hello"
str_sub(x, -6, -1) # [1] "World!"
# Replacement
str_replace(x, "World", "R")
# [1] "Hello, R!"
str_replace_all(x, "l", "L")
# [1] "HeLLo, WorLd!"
# Splitting
str_split("a,b,c", ",")
# [[1]]
# [1] "a" "b" "c"
str_split("a,b,c", ",", simplify = TRUE)
#      [,1] [,2] [,3]
# [1,] "a"  "b"  "c"
# Trimming
str_trim(" hello ") # [1] "hello"
str_pad("hi", width = 10, side = "left") # [1] "        hi"
# Detection
str_detect(c("apple", "banana", "cherry"), "an")
# [1] FALSE TRUE FALSE
# Count
str_count("mississippi", "s")
# [1] 4
# Locate
str_locate("hello world", "world")
#      start end
# [1,]     7  11
# Extract
str_extract("Order #12345 placed", "\\d+")
# [1] "12345"
str_extract_all("Order #12345 shipped #67890", "\\d+")
# [[1]]
# [1] "12345" "67890"
stringr Regex Functions
# str_match() — extract groups
str_match("2024-01-15", "(\\d{4})-(\\d{2})-(\\d{2})")
#      [,1]         [,2]   [,3] [,4]
# [1,] "2024-01-15" "2024" "01" "15"
# str_match_all() — all matches
str_match_all("a1b2c3", "([a-z])(\\d)")
# [[1]]
#      [,1] [,2] [,3]
# [1,] "a1" "a"  "1"
# [2,] "b2" "b"  "2"
# [3,] "c3" "c"  "3"
# str_replace with groups
str_replace("2024-01-15", "(\\d{4})-(\\d{2})-(\\d{2})", "\\3/\\2/\\1")
# [1] "15/01/2024"
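Because str_match() returns one matrix row per input string, its capture groups map naturally onto data frame columns. A sketch, assuming stringr is installed; the input vector and column names are illustrative:

```r
library(stringr)

dates <- c("2024-01-15", "not a date", "2023-12-31")

# Column 1 is the full match; columns 2-4 are the capture groups.
# Strings that do not match produce a row of NA.
m <- str_match(dates, "^(\\d{4})-(\\d{2})-(\\d{2})$")

parsed <- data.frame(
  year  = as.integer(m[, 2]),
  month = as.integer(m[, 3]),
  day   = as.integer(m[, 4])
)
parsed$year
# [1] 2024   NA 2023
```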
stringr Convenience Functions
# Word wrapping
str_wrap("This is a long text that needs to be wrapped at 20 characters.", width = 20)
# [1] "This is a long text\nthat needs to be\nwrapped at 20\ncharacters."
# Truncation
str_trunc("This is a very long string", width = 20)
# [1] "This is a very lo..."
# Duplication
str_dup("ab", 3) # [1] "ababab"
# Padding
str_pad("hi", width = 6, pad = "0") # [1] "0000hi"
str_pad("hi", width = 6, side = "right", pad = ".") # [1] "hi...."
Practical Examples
Example 1: Clean Data
library(stringr)
# Messy data
raw <- c(" Alice Smith ", "BOB JONES", "charlie brown")
# Clean up
clean <- raw |>
  str_trim() |>
  str_to_title()
clean
# [1] "Alice Smith" "Bob Jones" "Charlie Brown"
Example 2: Parse CSV Line
line <- "John,Doe,30,New York"
fields <- str_split(line, ",", simplify = TRUE)
data.frame(
  first = fields[1],
  last = fields[2],
  age = as.integer(fields[3]),
  city = fields[4]
)
#   first last age     city
# 1  John  Doe  30 New York
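Splitting on commas breaks as soon as a field itself contains a comma inside quotes. For real CSV input, base R's read.csv() with its text argument handles quoting. A sketch with a made-up record and illustrative column names:

```r
line <- 'Jane,Doe,30,"New York, NY"'

# Naive splitting cuts the quoted city in two
naive <- strsplit(line, ",", fixed = TRUE)[[1]]
length(naive)
# [1] 5

# read.csv() respects the quotes and converts column types
rec <- read.csv(
  text = line,
  header = FALSE,
  stringsAsFactors = FALSE,
  col.names = c("first", "last", "age", "city")
)
rec$city
# [1] "New York, NY"
```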
Example 3: Extract Domain from Email
emails <- c("alice@gmail.com", "bob@yahoo.com", "charlie@company.org")
domains <- str_extract(emails, "(?<=@)[^.]+")
domains
# [1] "gmail" "yahoo" "company"
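The same kind of extraction can be done in base R with sub(), by deleting everything that is not wanted. A minimal sketch that keeps the full domain and the top-level domain:

```r
emails <- c("alice@gmail.com", "bob@yahoo.com", "charlie@company.org")

# Remove everything up to and including the "@"
full_domain <- sub("^[^@]*@", "", emails)
full_domain
# [1] "gmail.com"   "yahoo.com"   "company.org"

# Remove everything up to and including the last "."
tld <- sub("^.*\\.", "", emails)
tld
# [1] "com" "com" "org"
```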
Example 4: Format Numbers
big_numbers <- c(1234, 56789, 1234567)
# Add commas
formatC(big_numbers, format = "f", big.mark = ",", digits = 0)
# [1] "1,234" "56,789" "1,234,567"
# Or use scales package
# scales::comma(big_numbers)
Common Mistakes
1. Forgetting escape characters
# Wrong: a single backslash starts an escape sequence
# path <- "C:\Users\Documents" # parse error: \U expects hex digits and \D is not a valid escape
# Right — double backslash or raw string
path <- "C:\\Users\\Documents"
path <- r"(C:\Users\Documents)"
2. Comparing floating-point strings
# Wrong
sprintf("%.1f", 0.1 + 0.2) == "0.3"
# [1] TRUE (works here, but fragile)
# Better — use numeric comparison
abs((0.1 + 0.2) - 0.3) < 1e-10
# [1] TRUE
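Base R also provides all.equal() for tolerance-based comparison. A short sketch contrasting it with exact equality; near_equal() is a hypothetical helper and its default tolerance is an arbitrary choice:

```r
a <- 0.1 + 0.2

# Exact comparison fails because of binary floating-point rounding
exact <- a == 0.3
exact
# [1] FALSE

# all.equal() compares within a small tolerance; wrap it in isTRUE()
# because it returns a description string when the values differ
close_enough <- isTRUE(all.equal(a, 0.3))
close_enough
# [1] TRUE

# Hypothetical helper with an explicit absolute tolerance
near_equal <- function(x, y, tol = 1e-9) {
  abs(x - y) < tol
}
near_equal(a, 0.3)
# [1] TRUE
```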
3. Not handling NA in string operations
x <- c("hello", NA, "world")
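# toupper() propagates NA silently (no warning is raised)
toupper(x)
# [1] "HELLO" NA "WORLD"
# The pitfall is functions such as paste(), which turn NA into the text "NA";
# transform only the non-missing elements when that matters
x[!is.na(x)] <- toupper(x[!is.na(x)])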
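The place where missing values most often cause surprises is concatenation, since paste() and paste0() convert NA to the literal text "NA". A minimal sketch of the problem and one possible guard:

```r
x <- c("hello", NA, "world")

# NA becomes the string "NA" when pasted
pasted <- paste0(x, "!")
pasted
# [1] "hello!" "NA!"    "world!"

# One option: keep missing values missing
guarded <- ifelse(is.na(x), NA_character_, paste0(x, "!"))
guarded
# [1] "hello!" NA       "world!"

# nchar() keeps a character NA as NA
lens <- nchar(x)
```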
Practice Exercises
Exercise 1: String Reverser
Write a function that reverses a string.
Solution
reverse_string <- function(x) {
  paste(rev(strsplit(x, "")[[1]]), collapse = "")
}
reverse_string("hello") # [1] "olleh"
reverse_string("R") # [1] "R"
reverse_string("") # [1] ""
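The solution above handles one string at a time because it uses only the first element of the strsplit() result. A possible vectorized variant, sketched as a hypothetical reverse_strings() that also keeps NA as NA:

```r
# Hypothetical vectorized version of the exercise solution
reverse_strings <- function(x) {
  out <- vapply(
    strsplit(x, ""),
    function(chars) paste(rev(chars), collapse = ""),
    character(1)
  )
  # strsplit() passes NA through and paste() would turn it into "NA"
  out[is.na(x)] <- NA_character_
  out
}

reverse_strings(c("hello", "R", ""))
# [1] "olleh" "R"     ""
```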
Exercise 2: Email Validator
Write a function that checks if a string is a valid email format.
Solution
is_valid_email <- function(email) {
  grepl("^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$", email)
}
is_valid_email("user@example.com") # [1] TRUE
is_valid_email("invalid@") # [1] FALSE
is_valid_email("@no-local.com") # [1] FALSE
Exercise 3: Word Counter
Write a function that counts the number of words in a string.
Solution
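count_words <- function(x) {
  # Count runs of non-whitespace characters (requires stringr);
  # unlike splitting on whitespace, this returns 0 for an empty string
  str_count(x, "\\S+")
}
count_words("hello world") # [1] 2
count_words("  one   two   three  ") # [1] 3
count_words("") # [1] 0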
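The same count can be obtained without stringr. This sketch defines a hypothetical count_words_base() that extracts runs of non-whitespace characters with gregexpr() and counts them. It is vectorized and returns 0 for an empty string.

```r
# Hypothetical base R word counter: one count per input string
count_words_base <- function(x) {
  matches <- regmatches(x, gregexpr("[^[:space:]]+", x))
  lengths(matches)
}

count_words_base(c("hello world", "  one   two   three  ", ""))
# [1] 2 3 0
```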
Key Takeaways
- Strings are character vectors in R; paste() and sprintf() are your friends
- paste0() is shorthand for paste(..., sep = "") and is slightly more efficient
- glue makes formatting readable: use {variable} syntax
- stringr provides consistency: all functions start with str_
- Regular expressions are powerful: learn \\d, \\w, \\s, [], ()
- Always handle NA in string operations to avoid unexpected results
- Use raw strings r"(...)" for file paths to avoid escape hell
- str_detect() is vectorized: it works on character vectors, not just single strings
Next: Learn about R Vectors — R's fundamental data structure.