R Strings — Text Manipulation Masterclass

Learning Objectives

By the end of this tutorial, you will be able to:

  • Create and format strings using paste, sprintf, and glue
  • Manipulate strings with substring extraction, replacement, and case conversion
  • Apply regular expressions for pattern matching and extraction
  • Use the stringr package for consistent, readable string operations
  • Handle common string problems like encoding, trimming, and splitting

Creating Strings

In R, a string is an element of a character vector, written in either single or double quotes.

# Single or double quotes (both work)
name <- "Alice"
name <- 'Alice'

# Quotes inside strings
cat("She said \"hello\"\n")
# She said "hello"

cat('He said \'hello\'\n')
# He said 'hello'

# Raw strings (R 4.0+) — no escape processing
path <- r"(C:\Users\Documents\file.txt)"
cat(path)
# C:\Users\Documents\file.txt

# Multi-line strings
multi <- "Line 1
Line 2
Line 3"
cat(multi)
# Line 1
# Line 2
# Line 3
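
Raw strings are not limited to the parenthesis delimiter shown above. The sketch below, which assumes R 4.0 or later and uses made-up example text, shows the bracket and dashed variants that help when the text itself contains )".

```r
# Raw strings accept (), [] or {} as delimiters, optionally wrapped in dashes.
regex  <- r"(\d{3}-\d{4})"                    # same as "\\d{3}-\\d{4}"
quoted <- r"[He said "(hi)"]"                 # [] lets the text contain )"
dashed <- r"---(contains )" inside)---"       # dashes make the terminator unique

identical(regex, "\\d{3}-\\d{4}")
# [1] TRUE
cat(quoted, "\n")
# He said "(hi)"
```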

String Functions

paste() and paste0()

# paste() — concatenate with separator
paste("Hello", "World")
# [1] "Hello World"

paste("Hello", "World", sep = "-")
# [1] "Hello-World"

paste("a", "b", "c", sep = "")
# [1] "abc"

# paste0() — no separator (faster)
paste0("Hello", " ", "World")
# [1] "Hello World"

paste0("ID", 1:5)
# [1] "ID1" "ID2" "ID3" "ID4" "ID5"

# Vectorized
paste(c("a", "b", "c"), 1:3)
# [1] "a 1" "b 2" "c 3"

# collapse — join vector into single string
paste(c("a", "b", "c"), collapse = ", ")
# [1] "a, b, c"
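
The sep and collapse arguments can be combined, and shorter vectors are recycled to the length of the longest one. A small sketch with illustrative values:

```r
# sep joins the arguments element by element; collapse then joins the results
ids <- paste("item", 1:3, sep = "_", collapse = "; ")
ids
# [1] "item_1; item_2; item_3"

# The shorter vector c("x", "y") is recycled to match 1:4
recycled <- paste0(c("x", "y"), 1:4)
recycled
# [1] "x1" "y2" "x3" "y4"
```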

sprintf() — Formatted Strings

# Basic formatting
sprintf("Hello, %s!", "Alice")
# [1] "Hello, Alice!"

sprintf("Value: %d", 42)
# [1] "Value: 42"

sprintf("Pi is approximately %.2f", pi)
# [1] "Pi is approximately 3.14"

# Octal and hexadecimal (sprintf() has no %b binary specifier)
sprintf("Octal: %o, Hex: %x", 255, 255)
# [1] "Octal: 377, Hex: ff"

# Padding
sprintf("%10s", "right")    # [1] "     right"
sprintf("%-10s", "left")    # [1] "left      "
sprintf("%05d", 42)         # [1] "00042"

# Multiple values
sprintf("Name: %s, Age: %d, Score: %.1f", "Bob", 25, 95.5)
# [1] "Name: Bob, Age: 25, Score: 95.5"
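
sprintf() is vectorized over its arguments, and %d only accepts whole numbers. The following sketch uses hypothetical student data to show both points:

```r
students <- c("Ana", "Ben")   # illustrative data
scores   <- c(91.3, 78.5)

# One format string is applied element by element; %% prints a literal percent sign
report <- sprintf("%s: %.1f%%", students, scores)
report
# [1] "Ana: 91.3%" "Ben: 78.5%"

# %d rejects numbers that are not whole, so catch the error to inspect it
bad <- tryCatch(sprintf("%d", 2.5), error = function(e) "error")
bad
# [1] "error"
```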

sprintf Format Specifiers

Specifier   Description           Example
%s          String                "hello"
%d          Integer               42
%f          Float                 3.141593
%.2f        Float, 2 decimals     3.14
%e          Scientific notation   3.141593e+00
%g          Auto format           3.14159
%x          Hexadecimal           ff
%o          Octal                 377
%10s        Right-aligned         "     hello"
%-10s       Left-aligned          "hello     "
%05d        Zero-padded           "00042"

Note: R's sprintf() does not support a %b (binary) specifier.
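
R's sprintf() handles octal and hexadecimal but has no binary conversion, so a binary string has to be built by hand. One possible sketch, where to_binary is a hypothetical helper name and not part of base R:

```r
# Hypothetical helper: convert a single non-negative whole number to a binary string
to_binary <- function(n) {
  stopifnot(length(n) == 1, n >= 0, n == floor(n))
  if (n == 0) return("0")
  bits <- integer(0)
  while (n > 0) {
    bits <- c(n %% 2, bits)   # prepend the lowest remaining bit
    n <- n %/% 2
  }
  paste(bits, collapse = "")
}

to_binary(255)
# [1] "11111111"
to_binary(5)
# [1] "101"
```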

glue Package

library(glue)

# Template strings with {}
name <- "Alice"
age <- 30
glue("My name is {name} and I am {age} years old.")
# My name is Alice and I am 30 years old.

# Expressions in {}
glue("Next year I will be {age + 1}.")
# Next year I will be 31.

# Multi-line
glue("
  Name: {name}
  Age: {age}
")

# Collapsing
glue_collapse(c("a", "b", "c"), sep = ", ")
# a, b, c

String Manipulation

Case Conversion

x <- "Hello, World!"

toupper(x)      # [1] "HELLO, WORLD!"
tolower(x)      # [1] "hello, world!"

Substring Extraction

x <- "Hello, World!"

# Extract substring
substr(x, 1, 5)        # [1] "Hello"
substr(x, 8, 12)       # [1] "World"

# Replace using substr (overwrites in place; the string length never changes)
substr(x, 1, 5) <- "Hi, there"
x
# [1] "Hi, t, World!"

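# substring(): vectorized over start (and end) positions
substring("abcdef", 1:6)
# [1] "abcdef" "bcdef"  "cdef"   "def"    "ef"     "f"

substring("abcdef", 1, 1:6)
# [1] "a"      "ab"     "abc"    "abcd"   "abcde"  "abcdef"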
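
Assigning through substr() overwrites characters in place, so it cannot insert a replacement of a different length. A minimal sketch of the difference, and of rebuilding the string with paste0() when the lengths differ:

```r
x <- "Hello, World!"

# Same-length replacement: substr<- works as expected
y <- x
substr(y, 1, 5) <- "Howdy"
y
# [1] "Howdy, World!"

# Different-length replacement: rebuild the string from its pieces
z <- paste0("Hi", substr(x, 6, nchar(x)))
z
# [1] "Hi, World!"
```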

String Splitting

# strsplit() — returns a list
x <- "apple,banana,cherry"
strsplit(x, ",")
# [[1]]
# [1] "apple"  "banana" "cherry"

# Unlist to get vector
unlist(strsplit(x, ","))
# [1] "apple"  "banana" "cherry"

# Split on runs of spaces (a leading match yields "", a trailing match does not)
strsplit("  hello   world  ", " +")
# [[1]]
# [1] ""      "hello" "world"

# Split and take first element
strsplit("a-b-c", "-")[[1]][1]
# [1] "a"
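
Because strsplit() treats its split argument as a regular expression, a delimiter such as "." needs either fixed = TRUE or escaping. A short sketch using an illustrative version string:

```r
version <- "4.3.1"   # illustrative input

# "." is a regex that matches every character, so only empty pieces remain
wrong <- strsplit(version, ".")[[1]]
all(wrong == "")
# [1] TRUE

# Treat the delimiter literally with fixed = TRUE (or escape it as "\\.")
parts <- strsplit(version, ".", fixed = TRUE)[[1]]
parts
# [1] "4" "3" "1"
```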

String Replacement

# gsub() — global replacement
x <- "hello world hello r hello"

gsub("hello", "hi", x)
# [1] "hi world hi r hi"

# Replace with regex
gsub("\\s+", "_", "hello   world")
# [1] "hello_world"

# sub() — first occurrence only
sub("hello", "hi", x)
# [1] "hi world hello r hello"

# Fixed replacement (no regex)
gsub("hello", "hi", x, fixed = TRUE)
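
Two details of sub() and gsub() are worth a quick illustration: capture groups can be reused in the replacement with \\1, \\2, and fixed = TRUE changes the result when the pattern contains regex metacharacters. The names below are made-up sample data.

```r
# Reorder "Last, First" into "First Last" using capture groups
names_raw <- c("Smith, Alice", "Jones, Bob")
swapped <- sub("^(\\w+), (\\w+)$", "\\2 \\1", names_raw)
swapped
# [1] "Alice Smith" "Bob Jones"

# As a regex, "." matches every character; with fixed = TRUE it is a literal dot
as_regex <- gsub(".", "", "v1.2.3")
as_fixed <- gsub(".", "", "v1.2.3", fixed = TRUE)
as_regex
# [1] ""
as_fixed
# [1] "v123"
```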

String Trimming and Padding

# Trim whitespace
x <- "  hello  "
trimws(x)        # [1] "hello"
trimws(x, "left")  # [1] "hello  "
trimws(x, "right") # [1] "  hello"

# Padding
sprintf("%10s", "hi")     # [1] "        hi" (right-aligned)
sprintf("%-10s", "hi")    # [1] "hi        " (left-aligned)
# sprintf() has no centering flag; stringr::str_pad() can center text
stringr::str_pad("hi", width = 10, side = "both")  # [1] "    hi    "

String Width and Characters

# Number of characters
nchar("hello")        # [1] 5

# Character vector
strsplit("hello", "")[[1]]
# [1] "h" "e" "l" "l" "o"

# Reversing
paste(rev(strsplit("hello", "")[[1]]), collapse = "")
# [1] "olleh"

Regular Expressions

Basic Regex in Base R

# grep() — find matches (indices)
x <- c("apple", "banana", "cherry", "apricot")
grep("^a", x)
# [1] 1 4

# grepl() — logical vector
grepl("an", x)
# [1] FALSE  TRUE FALSE FALSE

# sub() — replace first match
sub("a", "@", x)
# [1] "@pple" "b@nana" "cherry" "@pricot"

# gsub() — replace all matches
gsub("a", "@", x)
# [1] "@pple" "b@n@n@" "cherry" "@pricot"

# regmatches() — extract matches
x <- "My phone is 555-1234 and zip is 12345"
regmatches(x, regexpr("[0-9]{3}-[0-9]{4}", x))
# [1] "555-1234"

regmatches(x, gregexpr("[0-9]+", x))
# [[1]]
# [1] "555" "1234" "12345"
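
regexpr() and gregexpr() return whole matches only. To pull out capture groups in base R, regexec() can be paired with regmatches(), as in this sketch with an illustrative date string:

```r
x <- "Released on 2024-01-15"

# regexec() records the full match followed by each parenthesized group
m <- regmatches(x, regexec("([0-9]{4})-([0-9]{2})-([0-9]{2})", x))[[1]]
m
# [1] "2024-01-15" "2024"       "01"         "15"

year <- as.integer(m[2])
year
# [1] 2024
```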

Common Regex Patterns

Pattern   Description       Example Match
.         Any character     "a" in "abc"
^         Start of string   "^h" matches "h" in "hello"
$         End of string     "o$" matches "o" in "hello"
\\d       Digit             "5" in "a5b"
\\w       Word character    "h" in "hello"
\\s       Whitespace        " " in "a b"
[abc]     Character class   "a" in "apple"
[^abc]    Negated class     "d" in "abcd"
*         Zero or more      "ab*" matches "a", "ab", "abb"
+         One or more       "l+" matches "ll" in "hello"
?         Zero or one       "colou?r" matches "color" and "colour"
{n}       Exactly n         "l{2}" matches "ll" in "hello"
{n,m}     Between n and m   "l{1,2}" matches "ll" in "hello"
|         Alternation       "cat|dog" matches "cat" or "dog"
()        Grouping          Captures the matched text
\\b       Word boundary     "\\bh" matches "h" in "hello"

# Email validation (simplified)
email <- "user@example.com"
grepl("^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$", email)
# [1] TRUE

# Phone number
phone <- "(555) 123-4567"
grepl("^\\(\\d{3}\\) \\d{3}-\\d{4}$", phone)
# [1] TRUE

# IP address (simplified)
ip <- "192.168.1.1"
grepl("^\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}$", ip)
# [1] TRUE

# Extract all numbers
text <- "I have 3 cats and 5 dogs"
as.integer(regmatches(text, gregexpr("\\d+", text))[[1]])
# [1] 3 5
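
Base R's default regex engine does not support lookarounds; passing perl = TRUE switches to PCRE, which does. A sketch with an invented price string:

```r
prices <- "apples $3, pears $12"   # illustrative text

# (?<=\\$) is a lookbehind: match digits only when preceded by a dollar sign
amounts <- regmatches(prices, gregexpr("(?<=\\$)\\d+", prices, perl = TRUE))[[1]]
amounts
# [1] "3"  "12"

sum(as.integer(amounts))
# [1] 15
```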

stringr Package

The stringr package provides a consistent, modern interface for string manipulation.

library(stringr)

Core Functions

x <- "Hello, World!"

# String length
str_length(x)           # [1] 13

# Case conversion
str_to_upper(x)         # [1] "HELLO, WORLD!"
str_to_lower(x)         # [1] "hello, world!"
str_to_title(x)         # [1] "Hello, World!"

# Substring
str_sub(x, 1, 5)        # [1] "Hello"
str_sub(x, -6, -1)      # [1] "World!"

# Replacement
str_replace(x, "World", "R")
# [1] "Hello, R!"

str_replace_all(x, "l", "L")
# [1] "HeLLo, WorLd!"

# Splitting
str_split("a,b,c", ",")
# [[1]]
# [1] "a" "b" "c"

str_split("a,b,c", ",", simplify = TRUE)
#      [,1] [,2] [,3]
# [1,] "a"  "b"  "c"

# Trimming
str_trim("  hello  ")           # [1] "hello"
str_pad("hi", width = 10, side = "left")  # [1] "        hi"

# Detection
str_detect(c("apple", "banana", "cherry"), "an")
# [1] FALSE  TRUE FALSE

# Count
str_count("mississippi", "s")
# [1] 4

# Locate
str_locate("hello world", "world")
#      start end
# [1,]     7  11

# Extract
str_extract("Order #12345 placed", "\\d+")
# [1] "12345"

str_extract_all("Order #12345 shipped #67890", "\\d+")
# [[1]]
# [1] "12345" "67890"

stringr Regex Functions

# str_match() — extract groups
str_match("2024-01-15", "(\\d{4})-(\\d{2})-(\\d{2})")
#      [,1]         [,2]  [,3]  [,4]
# [1,] "2024-01-15" "2024" "01"  "15"

# str_match_all() — all matches
str_match_all("a1b2c3", "([a-z])(\\d)")
# [[1]]
#      [,1] [,2] [,3]
# [1,] "a1" "a"  "1"
# [2,] "b2" "b"  "2"
# [3,] "c3" "c"  "3"

# str_replace with groups
str_replace("2024-01-15", "(\\d{4})-(\\d{2})-(\\d{2})", "\\3/\\2/\\1")
# [1] "15/01/2024"

stringr Convenience Functions

# Word wrapping
str_wrap("This is a long text that needs to be wrapped at 20 characters.", width = 20)
# [1] "This is a long text\nthat needs to be\nwrapped at 20\ncharacters."

# Truncation
str_trunc("This is a very long string", width = 20)
# [1] "This is a very lo..."

# Duplication
str_dup("ab", 3)        # [1] "ababab"

# Padding
str_pad("hi", width = 6, pad = "0")  # [1] "0000hi"
str_pad("hi", width = 6, side = "right", pad = ".")  # [1] "hi...."

Practical Examples

Example 1: Clean Data

library(stringr)

# Messy data
raw <- c("  Alice Smith  ", "BOB JONES", "charlie brown")

# Clean up
clean <- raw |>
  str_trim() |>
  str_to_title()

clean
# [1] "Alice Smith"    "Bob Jones"      "Charlie Brown"

Example 2: Parse CSV Line

line <- "John,Doe,30,New York"

fields <- str_split(line, ",", simplify = TRUE)
data.frame(
  first = fields[1],
  last = fields[2],
  age = as.integer(fields[3]),
  city = fields[4]
)
#   first last age      city
# 1  John  Doe  30 New York
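
Splitting on commas breaks as soon as a field contains a quoted comma. For that case, one option is to hand the line to read.csv() through its text argument. The sketch below uses a hypothetical record and column names:

```r
line <- 'John,Doe,30,"New York, NY"'   # illustrative record with a quoted comma

# Naive splitting cuts the quoted field in two
length(strsplit(line, ",")[[1]])
# [1] 5

# read.csv() understands the quoting and also converts the age column
rec <- read.csv(text = line, header = FALSE,
                col.names = c("first", "last", "age", "city"),
                stringsAsFactors = FALSE)
rec$city
# [1] "New York, NY"
rec$age
# [1] 30
```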

Example 3: Extract Domain from Email

emails <- c("alice@gmail.com", "bob@yahoo.com", "charlie@company.org")

domains <- str_extract(emails, "(?<=@)[^.]+")
domains
# [1] "gmail"    "yahoo"    "company"
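
The same extraction can be done without stringr. A base R sketch that strips everything up to the @ and then everything from the first dot:

```r
emails <- c("alice@gmail.com", "bob@yahoo.com", "charlie@company.org")

# Remove the local part and the @ to get the full domain
full_domain <- sub("^.*@", "", emails)
full_domain
# [1] "gmail.com"   "yahoo.com"   "company.org"

# Drop the first dot and everything after it
domain_name <- sub("\\..*$", "", full_domain)
domain_name
# [1] "gmail"   "yahoo"   "company"
```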

Example 4: Format Numbers

big_numbers <- c(1234, 56789, 1234567)

# Add commas
formatC(big_numbers, format = "f", big.mark = ",", digits = 0)
# [1] "1,234"     "56,789"     "1,234,567"

# Or use scales package
# scales::comma(big_numbers)

Common Mistakes

1. Forgetting escape characters

# Wrong: a single backslash starts an escape sequence, so this line fails to parse
path <- "C:\Users\Documents"  # Error: \U expects hex digits, \D is not a valid escape

# Right — double backslash or raw string
path <- "C:\\Users\\Documents"
path <- r"(C:\Users\Documents)"

2. Comparing numbers as formatted strings

# Fragile: the result depends on how many digits the format keeps
sprintf("%.1f", 0.1 + 0.2) == "0.3"
# [1] TRUE (works here only because of rounding to one decimal)

# Better — use numeric comparison
abs((0.1 + 0.2) - 0.3) < 1e-10
# [1] TRUE
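
Base R also ships a tolerance-based comparison, all.equal(), which avoids picking a threshold by hand. A brief sketch:

```r
a <- 0.1 + 0.2

# Exact comparison fails because of binary floating-point representation
exact <- a == 0.3
exact
# [1] FALSE

# all.equal() compares within a small tolerance; wrap it in isTRUE() for use in conditions
close_enough <- isTRUE(all.equal(a, 0.3))
close_enough
# [1] TRUE
```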

3. Not handling NA in string operations

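x <- c("hello", NA, "world")

# toupper() propagates NA without any warning
toupper(x)
# [1] "HELLO" NA      "WORLD"

# paste0() silently turns NA into the string "NA"
paste0(x, "!")
# [1] "hello!" "NA!"    "world!"

# Better: keep missing values missing
ifelse(is.na(x), NA, paste0(x, "!"))
# [1] "hello!" NA       "world!"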

Practice Exercises

Exercise 1: String Reverser

Write a function that reverses a string.

Solution

reverse_string <- function(x) {
  paste(rev(strsplit(x, "")[[1]]), collapse = "")
}

reverse_string("hello")    # [1] "olleh"
reverse_string("R")        # [1] "R"
reverse_string("")         # [1] ""
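
The solution above handles one string at a time because of the [[1]]. A possible vectorized variant is sketched below; reverse_strings is a hypothetical name, and NA inputs are not handled.

```r
# Hypothetical vectorized version: reverse every element of a character vector
reverse_strings <- function(x) {
  vapply(
    strsplit(x, ""),                                   # one character vector per string
    function(chars) paste(rev(chars), collapse = ""),  # reverse and rejoin
    character(1)
  )
}

reversed <- reverse_strings(c("hello", "R", ""))
reversed
# [1] "olleh" "R"     ""
```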

Exercise 2: Email Validator

Write a function that checks if a string is a valid email format.

Solution

is_valid_email <- function(email) {
  grepl("^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$", email)
}

is_valid_email("user@example.com")    # [1] TRUE
is_valid_email("invalid@")            # [1] FALSE
is_valid_email("@no-local.com")       # [1] FALSE

Exercise 3: Word Counter

Write a function that counts the number of words in a string.

Solution

count_words <- function(x) {
  words <- strsplit(trimws(x), "\\s+")[[1]]
  sum(nzchar(words))  # ignore empty pieces so blank input counts as zero words
}

count_words("hello world")                    # [1] 2
count_words("  one   two   three  ")          # [1] 3
count_words("")                               # [1] 0

Key Takeaways

  • Strings are character vectors in R — paste() and sprintf() are your friends
  • paste0() is equivalent to paste(..., sep = "") and slightly more efficient
  • glue makes formatting readable — use {variable} syntax
  • stringr provides consistency — all functions start with str_
  • Regular expressions are powerful — learn \\d, \\w, \\s, [], ()
  • Always handle NA in string operations to avoid unexpected results
  • Use raw strings r"(...)" for file paths to avoid escape hell
  • str_detect() is vectorized — works on character vectors, not just single strings

Next: Learn about R Vectors — R's fundamental data structure.
