R Strings — Text Manipulation Masterclass

Learning Objectives

By the end of this tutorial, you will be able to:

Create and format strings using paste, sprintf, and glue
Manipulate strings with substring extraction, replacement, and case conversion
Apply regular expressions for pattern matching and extraction
Use the stringr package for consistent, readable string operations
Handle common string problems like encoding, trimming, and splitting

Creating Strings

In R, strings live in character vectors, and string literals are written in single or double quotes.
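
To make "character vector" concrete, here is a small sketch (the variable names `single` and `several` are illustrative, not from the original). A single string is a character vector of length 1, so length() counts elements while nchar() counts characters.

```r
# One string is a character vector of length 1
single <- "Alice"
several <- c("Alice", "Bob", "Charlie")

is.character(single)   # TRUE
length(single)         # 1 (one element)
nchar(single)          # 5 (characters in that element)

length(several)        # 3 (three elements)
nchar(several)         # 5 3 7 (nchar is vectorized)
```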

# Single or double quotes (both work)
name <- "Alice"
name <- 'Alice'

# Quotes inside strings
cat("She said \"hello\"\n")
# She said "hello"

cat('He said \'hello\'\n')
# He said 'hello'

# Raw strings (R 4.0+) — no escape processing
path <- r"(C:\Users\Documents\file.txt)"
cat(path)
# C:\Users\Documents\file.txt

# Multi-line strings
multi <- "Line 1
Line 2
Line 3"
cat(multi)
# Line 1
# Line 2
# Line 3

String Functions

paste() and paste0()

# paste() — concatenate with separator
paste("Hello", "World")
# [1] "Hello World"

paste("Hello", "World", sep = "-")
# [1] "Hello-World"

paste("a", "b", "c", sep = "")
# [1] "abc"

# paste0(): same as paste(..., sep = ""), slightly more efficient
paste0("Hello", " ", "World")
# [1] "Hello World"

paste0("ID", 1:5)
# [1] "ID1" "ID2" "ID3" "ID4" "ID5"

# Vectorized
paste(c("a", "b", "c"), 1:3)
# [1] "a 1" "b 2" "c 3"

# collapse — join vector into single string
paste(c("a", "b", "c"), collapse = ", ")
# [1] "a, b, c"
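
The difference between sep and collapse is easy to mix up, so here is a short sketch with made-up values (`ids`, `pairs`, and `query` are hypothetical names): sep joins the arguments element by element, and collapse then joins the resulting vector into one string.

```r
# sep joins the arguments element-wise; collapse then joins the results
ids   <- paste0("ID", 1:3)                          # "ID1" "ID2" "ID3"
pairs <- paste(ids, c("a", "b", "c"), sep = "=")    # "ID1=a" "ID2=b" "ID3=c"
query <- paste(pairs, collapse = "&")               # "ID1=a&ID2=b&ID3=c"

# Both steps in one call
one_call <- paste(ids, c("a", "b", "c"), sep = "=", collapse = "&")
identical(query, one_call)                          # TRUE

# Shorter vectors are recycled to the longest length
recycled <- paste0("x", 1:4, c("a", "b"))           # "x1a" "x2b" "x3a" "x4b"
```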

sprintf() — Formatted Strings

# Basic formatting
sprintf("Hello, %s!", "Alice")
# [1] "Hello, Alice!"

sprintf("Value: %d", 42)
# [1] "Value: 42"

sprintf("Pi is approximately %.2f", pi)
# [1] "Pi is approximately 3.14"

# sprintf() has no binary (%b) conversion; octal and hex are supported
sprintf("Octal: %o, Hex: %x", 255L, 255L)
# [1] "Octal: 377, Hex: ff"

# Padding
sprintf("%10s", "right")    # [1] "     right"
sprintf("%-10s", "left")    # [1] "left      "
sprintf("%05d", 42)         # [1] "00042"

# Multiple values
sprintf("Name: %s, Age: %d, Score: %.1f", "Bob", 25, 95.5)
# [1] "Name: Bob, Age: 25, Score: 95.5"

sprintf Format Specifiers

Specifier	Description	Example
`%s`	String	`"hello"`
`%d`	Integer	`42`
`%f`	Float	`3.141593`
`%.2f`	Float, 2 decimals	`3.14`
`%e`	Scientific notation	`3.141593e+00`
`%g`	Auto format (shorter of `%e` and `%f`)	`3.14159`
`%x`	Hexadecimal	`ff`
`%o`	Octal	`377`
`%10s`	Right-aligned, width 10	`"     hello"`
`%-10s`	Left-aligned, width 10	`"hello     "`
`%05d`	Zero-padded, width 5	`"00042"`
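
The specifiers above combine with the fact that sprintf() is vectorized over its arguments. A minimal sketch, using invented `items` and `prices` vectors, that formats several aligned rows with one call:

```r
# One format string, many rows
items  <- c("apple", "kiwi", "watermelon")
prices <- c(1.5, 0.25, 12)

# Left-align names in 12 columns, right-align prices in 8 with 2 decimals
rows <- sprintf("%-12s %8.2f", items, prices)
cat(rows, sep = "\n")

# A literal percent sign is written as %%
pct <- sprintf("%.1f%%", 45.678)   # "45.7%"

# %d needs an integer-valued number
int_text <- sprintf("%d", 3L)      # "3"
# sprintf("%d", 3.5)               # error: 3.5 is not integer-valued
```

Every element of `rows` has the same width (12 + 1 + 8 characters), which is what makes the columns line up when printed with cat().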

glue Package

library(glue)

# Template strings with {}
name <- "Alice"
age <- 30
glue("My name is {name} and I am {age} years old.")
# My name is Alice and I am 30 years old.

# Expressions in {}
glue("Next year I will be {age + 1}.")
# Next year I will be 31.

# Multi-line
glue("
  Name: {name}
  Age: {age}
")

# Collapsing
glue_collapse(c("a", "b", "c"), sep = ", ")
# a, b, c

String Manipulation

Case Conversion

x <- "Hello, World!"

toupper(x)      # [1] "HELLO, WORLD!"
tolower(x)      # [1] "hello, world!"

Substring Extraction

x <- "Hello, World!"

# Extract substring
substr(x, 1, 5)        # [1] "Hello"
substr(x, 8, 12)       # [1] "World"

# Replace in place using substr(); the string keeps its length,
# so only the first 5 characters of the replacement are used
substr(x, 1, 5) <- "Hi, there"
x
# [1] "Hi, t, World!"

# substring() is vectorized over start and stop positions
substring("abcdef", 1, 1:6)
# [1] "a"      "ab"     "abc"    "abcd"   "abcde"  "abcdef"
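
Because substring() accepts vectors of start and stop positions, it can cut a fixed-width record into fields in one call. The record layout below is invented for illustration:

```r
# Hypothetical fixed-width record: year(4) month(2) day(2) city(3) code(3)
record <- "20240115NYC042"
starts <- c(1, 5, 7, 9, 12)
stops  <- c(4, 6, 8, 11, 14)

fields <- substring(record, starts, stops)
fields
# [1] "2024" "01"   "15"   "NYC"  "042"

# substr() assignment never changes the length of the string
code <- "abcdef"
substr(code, 1, 3) <- "XYZ123"   # only "XYZ" is used
code
# [1] "XYZdef"
```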

String Splitting

# strsplit() — returns a list
x <- "apple,banana,cherry"
strsplit(x, ",")
# [[1]]
# [1] "apple"  "banana" "cherry"

# Unlist to get vector
unlist(strsplit(x, ","))
# [1] "apple"  "banana" "cherry"

# Split on runs of spaces (a leading match yields an empty first element;
# a trailing match does not add one)
strsplit("  hello   world  ", " +")
# [[1]]
# [1] ""      "hello" "world"

# Split and take first element
strsplit("a-b-c", "-")[[1]][1]
# [1] "a"
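
Two details matter when splitting in practice: the split argument is a regular expression unless fixed = TRUE is set, and strsplit() returns one list element per input string. A short sketch with an invented `versions` vector:

```r
versions <- c("4.3.1", "4.2.0", "3.6.3")

# "." is a regex metacharacter (any character), so every character is
# treated as a delimiter and only empty pieces remain
regex_split <- strsplit("4.3.1", ".")[[1]]
regex_split
# [1] "" "" "" "" ""

# fixed = TRUE treats "." as a literal dot
parts <- strsplit(versions, ".", fixed = TRUE)   # list, one vector per input

# Take the first piece of each element; vapply() guarantees a character vector
major <- vapply(parts, `[`, character(1), 1)
major
# [1] "4" "4" "3"
```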

String Replacement

# gsub() — global replacement
x <- "hello world hello r hello"

gsub("hello", "hi", x)
# [1] "hi world hi r hi"

# Replace with regex
gsub("\\s+", "_", "hello   world")
# [1] "hello_world"

# sub() — first occurrence only
sub("hello", "hi", x)
# [1] "hi world hello r hello"

# Fixed replacement (no regex)
gsub("hello", "hi", x, fixed = TRUE)
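
A few more sub()/gsub() behaviours come up often: backreferences to capture groups, the effect of fixed = TRUE on metacharacters, and case-insensitive matching. The inputs below are illustrative only:

```r
# Backreferences: \\1 and \\2 refer to the first and second capture group
swapped <- gsub("(\\w+), (\\w+)", "\\2 \\1", "Smith, Alice")
swapped
# [1] "Alice Smith"

# Without fixed = TRUE the dot matches any character
dots_regex <- gsub(".", "-", "a.b.c")
dots_regex
# [1] "-----"

dots_fixed <- gsub(".", "-", "a.b.c", fixed = TRUE)
dots_fixed
# [1] "a-b-c"

# Case-insensitive replacement
greeting <- gsub("hello", "hi", "Hello hello HELLO", ignore.case = TRUE)
greeting
# [1] "hi hi hi"
```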

String Trimming and Padding

# Trim whitespace
x <- "  hello  "
trimws(x)        # [1] "hello"
trimws(x, "left")  # [1] "hello  "
trimws(x, "right") # [1] "  hello"

# Padding
sprintf("%10s", "hi")     # [1] "        hi" (right-aligned)
sprintf("%-10s", "hi")    # [1] "hi        " (left-aligned)
format("hi", width = 10, justify = "centre")  # [1] "    hi    " (centered; sprintf() cannot center)

String Width and Characters

# Number of characters
nchar("hello")        # [1] 5

# Character vector
strsplit("hello", "")[[1]]
# [1] "h" "e" "l" "l" "o"

# Reversing
paste(rev(strsplit("hello", "")[[1]]), collapse = "")
# [1] "olleh"
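
Character counts and byte counts differ once text contains non-ASCII characters, which is where most encoding problems start. A minimal sketch, assuming the example word "café" written with a Unicode escape so the code itself stays ASCII:

```r
# "\u00e9" is the letter e with an acute accent
word <- enc2utf8("caf\u00e9")

n_chars <- nchar(word)                    # 4 characters
n_bytes <- nchar(word, type = "bytes")    # 5 bytes in UTF-8 (the accented letter takes two)

Encoding(word)                            # "UTF-8"

# iconv() converts between encodings; here UTF-8 to Latin-1
latin1 <- iconv(word, from = "UTF-8", to = "latin1")
n_bytes_latin1 <- nchar(latin1, type = "bytes")   # 4 bytes (one byte per character)
```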

Regular Expressions

Basic Regex in Base R

# grep() — find matches (indices)
x <- c("apple", "banana", "cherry", "apricot")
grep("^a", x)
# [1] 1 4

# grepl() — logical vector
grepl("an", x)
# [1] FALSE  TRUE FALSE FALSE

# sub() — replace first match
sub("a", "@", x)
# [1] "@pple" "b@nana" "cherry" "@pricot"

# gsub() — replace all matches
gsub("a", "@", x)
# [1] "@pple" "b@n@n@" "cherry" "@pricot"

# regmatches() — extract matches
x <- "My phone is 555-1234 and zip is 12345"
regmatches(x, regexpr("[0-9]{3}-[0-9]{4}", x))
# [1] "555-1234"

regmatches(x, gregexpr("[0-9]+", x))
# [[1]]
# [1] "555" "1234" "12345"
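
regexpr() and gregexpr() return whole matches only. To pull out capture groups in base R, regexec() can be paired with regmatches(); the sketch below reuses the phone number string from above and adds parentheses to the pattern:

```r
x <- "My phone is 555-1234 and zip is 12345"

# regexec() records the full match plus each parenthesised group
m <- regmatches(x, regexec("([0-9]{3})-([0-9]{4})", x))
m[[1]]
# [1] "555-1234" "555"      "1234"

prefix <- m[[1]][2]   # "555"
number <- m[[1]][3]   # "1234"

# When nothing matches, the list element is character(0)
none <- regmatches("no digits here",
                   regexec("([0-9]{3})-([0-9]{4})", "no digits here"))[[1]]
length(none)
# [1] 0
```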

Common Regex Patterns

Pattern	Description	Example Match
`.`	Any character	`"a"` in `"abc"`
`^`	Start of string	`^h` matches the `"h"` in `"hello"`
`$`	End of string	`o$` matches the `"o"` in `"hello"`
`\\d`	Digit	`"5"` in `"a5b"`
`\\w`	Word character	`"h"` in `"hello"`
`\\s`	Whitespace	`" "` in `"a b"`
`[abc]`	Character class	`"a"` in `"apple"`
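`[^abc]`	Negated class	`"d"` in `"abcd"`
`*`	Zero or more	`el*` matches `"ell"` in `"hello"`
`+`	One or more	`l+` matches `"ll"` in `"hello"`
`?`	Zero or one	`colou?r` matches `"color"` and `"colour"`
`{n}`	Exactly n	`l{2}` matches `"ll"` in `"hello"`
`{n,m}`	Between n and m	`l{1,2}` matches `"ll"` in `"hello"`
`|`	Alternation	`cat|dog` matches `"cat"` or `"dog"`
`()`	Grouping	Captures the matched text for backreferences
`\\b`	Word boundary	`\\bh` matches the `"h"` in `"hello"`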

# Email validation (simplified)
email <- "user@example.com"
grepl("^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$", email)
# [1] TRUE

# Phone number
phone <- "(555) 123-4567"
grepl("^\\(\\d{3}\\) \\d{3}-\\d{4}$", phone)
# [1] TRUE

# IP address (simplified)
ip <- "192.168.1.1"
grepl("^\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}$", ip)
# [1] TRUE

# Extract all numbers
text <- "I have 3 cats and 5 dogs"
as.integer(regmatches(text, gregexpr("\\d+", text))[[1]])
# [1] 3 5
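
Quantifiers are greedy by default, which is a common source of surprising matches. The sketch below contrasts greedy and lazy matching and shows a lookahead; both use perl = TRUE, and the sample strings are invented:

```r
html <- "<b>bold</b> and <i>italic</i>"

# Greedy: .+ runs to the last ">" in the string, so everything is removed
greedy <- gsub("<.+>", "", html, perl = TRUE)
greedy
# [1] ""

# Lazy: .+? stops at the first ">" it can
lazy <- gsub("<.+?>", "", html, perl = TRUE)
lazy
# [1] "bold and italic"

# Lookahead (needs perl = TRUE): digits only when followed by "kg"
weights <- "3kg of rice, 5 apples, 12kg of flour"
kg <- regmatches(weights, gregexpr("\\d+(?=kg)", weights, perl = TRUE))[[1]]
kg
# [1] "3"  "12"
```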

stringr Package

The stringr package provides a consistent, modern interface for string manipulation.

library(stringr)

Core Functions

x <- "Hello, World!"

# String length
str_length(x)           # [1] 13

# Case conversion
str_to_upper(x)         # [1] "HELLO, WORLD!"
str_to_lower(x)         # [1] "hello, world!"
str_to_title(x)         # [1] "Hello, World!"

# Substring
str_sub(x, 1, 5)        # [1] "Hello"
str_sub(x, -6, -1)      # [1] "World!"

# Replacement
str_replace(x, "World", "R")
# [1] "Hello, R!"

str_replace_all(x, "l", "L")
# [1] "HeLLo, WorLd!"

# Splitting
str_split("a,b,c", ",")
# [[1]]
# [1] "a" "b" "c"

str_split("a,b,c", ",", simplify = TRUE)
#      [,1] [,2] [,3]
# [1,] "a"  "b"  "c"

# Trimming
str_trim("  hello  ")           # [1] "hello"
str_pad("hi", width = 10, side = "left")  # [1] "        hi"

# Detection
str_detect(c("apple", "banana", "cherry"), "an")
# [1] FALSE  TRUE FALSE

# Count
str_count("mississippi", "s")
# [1] 4

# Locate
str_locate("hello world", "world")
#      start end
# [1,]     7  11

# Extract
str_extract("Order #12345 placed", "\\d+")
# [1] "12345"

str_extract_all("Order #12345 shipped #67890", "\\d+")
# [[1]]
# [1] "12345" "67890"

stringr Regex Functions

# str_match() — extract groups
str_match("2024-01-15", "(\\d{4})-(\\d{2})-(\\d{2})")
#      [,1]         [,2]  [,3]  [,4]
# [1,] "2024-01-15" "2024" "01"  "15"

# str_match_all() — all matches
str_match_all("a1b2c3", "([a-z])(\\d)")
# [[1]]
#      [,1] [,2] [,3]
# [1,] "a1" "a"  "1"
# [2,] "b2" "b"  "2"
# [3,] "c3" "c"  "3"

# str_replace with groups
str_replace("2024-01-15", "(\\d{4})-(\\d{2})-(\\d{2})", "\\3/\\2/\\1")
# [1] "15/01/2024"
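
stringr patterns are regular expressions by default, but the pattern can be wrapped in a modifier such as regex() or fixed() to change how it is interpreted. A brief sketch, assuming stringr is installed and using invented inputs:

```r
library(stringr)  # assumes the stringr package is installed

fruits <- c("Apple", "apple pie", "PINEAPPLE", "banana")

# regex() with ignore_case = TRUE matches regardless of case
has_apple <- str_detect(fruits, regex("apple", ignore_case = TRUE))
has_apple
# [1]  TRUE  TRUE  TRUE FALSE

# fixed() treats the pattern as literal text, so "." is just a dot
dashed <- str_replace_all("a.b.c", fixed("."), "-")
dashed
# [1] "a-b-c"

# A named vector applies several replacements in one call
words <- str_replace_all("1 cat and 2 dogs", c("1" = "one", "2" = "two"))
words
# [1] "one cat and two dogs"
```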

stringr Convenience Functions

# Word wrapping
str_wrap("This is a long text that needs to be wrapped at 20 characters.", width = 20)
# [1] "This is a long text\nthat needs to be\nwrapped at 20\ncharacters."

# Truncation
str_trunc("This is a very long string", width = 20)
# [1] "This is a very lo..."

# Duplication
str_dup("ab", 3)        # [1] "ababab"

# Padding
str_pad("hi", width = 6, pad = "0")  # [1] "0000hi"
str_pad("hi", width = 6, side = "right", pad = ".")  # [1] "hi...."

Practical Examples

Example 1: Clean Data

library(stringr)

# Messy data
raw <- c("  Alice Smith  ", "BOB JONES", "charlie brown")

# Clean up
clean <- raw |>
  str_trim() |>
  str_to_title()

clean
# [1] "Alice Smith"    "Bob Jones"      "Charlie Brown"

Example 2: Parse CSV Line

line <- "John,Doe,30,New York"

fields <- str_split(line, ",", simplify = TRUE)
data.frame(
  first = fields[1],
  last = fields[2],
  age = as.integer(fields[3]),
  city = fields[4]
)
#   first last age     city
# 1  John  Doe  30 New York
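
Splitting on commas works for simple lines like the one above, but it breaks as soon as a quoted field contains a comma. As a hedged alternative, base R's read.csv() accepts a string through its text argument and handles the quoting; the input line and column names here are made up:

```r
# A quoted field containing a comma breaks a plain split on ","
line <- 'John,Doe,30,"New York, NY"'
n_pieces <- length(strsplit(line, ",", fixed = TRUE)[[1]])
n_pieces
# [1] 5

# read.csv() understands the quoting and also converts column types
parsed <- read.csv(text = line, header = FALSE,
                   col.names = c("first", "last", "age", "city"),
                   stringsAsFactors = FALSE)
parsed$city
# [1] "New York, NY"
class(parsed$age)
# [1] "integer"
```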

Example 3: Extract Domain from Email

emails <- c("alice@gmail.com", "bob@yahoo.com", "charlie@company.org")

domains <- str_extract(emails, "(?<=@)[^.]+")
domains
# [1] "gmail"    "yahoo"    "company"

Example 4: Format Numbers

big_numbers <- c(1234, 56789, 1234567)

# Add commas
formatC(big_numbers, format = "f", big.mark = ",", digits = 0)
# [1] "1,234"     "56,789"     "1,234,567"

# Or use scales package
# scales::comma(big_numbers)

Common Mistakes

1. Forgetting escape characters

# Wrong: backslash starts an escape sequence, so this line fails to parse
# path <- "C:\Users\Documents"  # Error: '\U' used without hex digits

# Right — double backslash or raw string
path <- "C:\\Users\\Documents"
path <- r"(C:\Users\Documents)"

2. Comparing floating-point strings

# Wrong
sprintf("%.1f", 0.1 + 0.2) == "0.3"
# [1] TRUE (works here, but fragile)

# Better — use numeric comparison
abs((0.1 + 0.2) - 0.3) < 1e-10
# [1] TRUE
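
Another way to express the numeric comparison is all.equal(), which compares within a small tolerance. This sketch is one option, not the only one; the variable `a` is illustrative:

```r
a <- 0.1 + 0.2

exact <- a == 0.3                    # FALSE: the two doubles differ in the last bits
sprintf("%.17f", a)                  # shows digits hidden by default printing
close <- isTRUE(all.equal(a, 0.3))   # TRUE: equal within a small tolerance

# Formatting for display is fine; do the comparison on the numbers
if (close) {
  message(sprintf("Total: %.1f", a))
}
```

Wrapping all.equal() in isTRUE() matters because all.equal() returns a descriptive string, not FALSE, when the values differ.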

3. Not handling NA in string operations

x <- c("hello", NA, "world")

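# toupper() passes NA through silently (no warning)
toupper(x)
# [1] "HELLO" NA      "WORLD"

# paste() turns NA into the literal string "NA"
paste("Hi", x)
# [1] "Hi hello" "Hi NA"    "Hi world"

# Better: handle NA explicitly
ifelse(is.na(x), NA_character_, paste("Hi", x))
# [1] "Hi hello" NA         "Hi world"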
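
Base R functions do not treat missing strings uniformly, so it helps to check each one. A small sketch of two cases that often surprise people, using an invented vector:

```r
x <- c("apple", NA, "cherry")

lens <- nchar(x)         # 5 NA 6  (a missing string stays NA)
bare <- nchar(NA)        # 2       (a bare NA is logical, not a missing string)
hits <- grepl("a", x)    # TRUE FALSE FALSE  (NA is reported as "no match")

# Keep NA distinct from "no match" by masking afterwards
has_a <- grepl("a", x)
has_a[is.na(x)] <- NA
has_a
# [1]  TRUE    NA FALSE
```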

Practice Exercises

Exercise 1: String Reverser

Write a function that reverses a string.

Solution

reverse_string <- function(x) {
  paste(rev(strsplit(x, "")[[1]]), collapse = "")
}

reverse_string("hello")    # [1] "olleh"
reverse_string("R")        # [1] "R"
reverse_string("")         # [1] ""
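
The solution above handles one string at a time because of the [[1]]. A possible vectorized variant is sketched below; `reverse_strings` is a hypothetical name, and the NA handling is an added assumption about the desired behaviour:

```r
reverse_strings <- function(x) {
  chars <- strsplit(x, "")    # list: one character vector per input string
  out <- vapply(chars, function(ch) paste(rev(ch), collapse = ""), character(1))
  out[is.na(x)] <- NA         # keep missing values missing
  out
}

reversed <- reverse_strings(c("hello", "R", "", NA))
reversed
# [1] "olleh" "R"     ""      NA
```

Like the original solution, this reverses character by character, so text with combining marks may not reverse the way a reader would expect.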

Exercise 2: Email Validator

Write a function that checks if a string is a valid email format.

Solution

is_valid_email <- function(email) {
  grepl("^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$", email)
}

is_valid_email("user@example.com")    # [1] TRUE
is_valid_email("invalid@")            # [1] FALSE
is_valid_email("@no-local.com")       # [1] FALSE

Exercise 3: Word Counter

Write a function that counts the number of words in a string.

Solution

count_words <- function(x) {
  # Count runs of non-whitespace, so an empty string has 0 words
  str_count(x, "\\S+")
}

count_words("hello world")                    # [1] 2
count_words("  one   two   three  ")          # [1] 3
count_words("")                               # [1] 0
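
For comparison, here is a base R sketch that needs no packages (`count_words_base` is a hypothetical name). It counts runs of non-whitespace characters, so an empty string yields 0, and it works on whole vectors:

```r
count_words_base <- function(x) {
  # gregexpr() finds every run of non-whitespace; regmatches() extracts them
  lengths(regmatches(x, gregexpr("\\S+", x)))
}

counts <- count_words_base(c("hello world", "  one   two   three  ", ""))
counts
# [1] 2 3 0
```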

Key Takeaways

Strings are character vectors in R — paste() and sprintf() are your friends
paste0() is shorthand for paste(..., sep = "") and is slightly more efficient
glue makes formatting readable — use {variable} syntax
stringr provides consistency — all functions start with str_
Regular expressions are powerful — learn \\d, \\w, \\s, [], ()
Always handle NA in string operations to avoid unexpected results
Use raw strings r"(...)" for file paths to avoid escape hell
str_detect() is vectorized — works on character vectors, not just single strings

Next: Learn about R Vectors — R's fundamental data structure.
