How to Use sub() and gsub() in R for String Replacement
R provides sub() and gsub() as core base functions for replacing text patterns. Both functions take a search pattern, a replacement value, and a character vector (such as a data frame column) as input. By default, the pattern is interpreted as a regular expression. Their main difference is the replacement scope: sub() changes only the first match in each string element, while gsub() replaces every matching occurrence.
These functions follow conventions similar to Unix sed and have been available in base R from the beginning. Because they are part of base R, they require no additional packages and are available in every standard R installation.
In current R versions, sub() and gsub() continue to be widely used for data cleanup, text preprocessing, log processing, and column name standardization.
This tutorial explains the syntax and arguments of both functions, demonstrates regular expressions with character classes, anchors, and capture group backreferences, covers advanced options such as case-insensitive matching and the PCRE2 engine through perl = TRUE, shows how to replace multiple patterns with Reduce() and stringr::str_replace_all(), applies gsub() to data frame columns, and compares gsub() with alternatives from stringr and stringi.
Key Takeaways
- Use sub() when only the first matching occurrence in each string element should be replaced. Use gsub() when all occurrences should be replaced.
- Both functions support regular expressions by default. Use fixed = TRUE when the pattern should be interpreted as a literal string.
- Use perl = TRUE to switch from the default TRE engine to PCRE2, enabling lookaheads, lookbehinds, named capture groups, and \U/\L case modifiers in replacement strings.
- Use \\1, \\2, and similar references in the replacement string to reuse capture groups from the search pattern.
- To replace several different patterns in one workflow, use stringr::str_replace_all() with a named vector or chain multiple gsub() calls with Reduce().
- Apply gsub() directly to a data frame column with syntax such as df$col <- gsub("pattern", "replacement", df$col).
- Use ignore.case = TRUE for case-insensitive matching. Combine it with \\b for whole-word replacement, keeping in mind that \\b can behave differently on non-ASCII text under TRE and under PCRE2 (perl = TRUE), so test it on representative data.
- Both functions leave NA values unchanged and do not raise an error. If missing values must be detected or imputed, handle them before calling sub() or gsub().
Prerequisites
To follow this tutorial, you need:
- R installed on a local machine or server.
- The stringr package for the comparison section, installed with install.packages("stringr"). This package is optional.
What Are R’s sub() and gsub() Functions?
sub() and gsub() are base R functions used to search for a pattern in a character vector and replace it with another string. The pattern can be either a literal value or a regular expression. Both functions are vectorized, so each element of the input vector is processed independently within one function call.
sub() Function: Syntax and Parameters
The complete syntax for sub() is:
sub(pattern, replacement, x,
ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)
The following table describes each argument:
| Parameter | Type | Description |
|---|---|---|
| pattern | character | The string or regular expression to search for. |
| replacement | character | The text used to replace each match. Backreferences such as \\1 and \\2 are supported. |
| x | character vector | The input text to search. Data frame columns are passed as vectors. |
| ignore.case | logical | If set to TRUE, matching ignores capitalization. Default: FALSE. |
| perl | logical | If set to TRUE, R uses the PCRE2 regex engine instead of TRE. Default: FALSE. |
| fixed | logical | If set to TRUE, the pattern is treated as a literal string rather than a regex. Default: FALSE. |
| useBytes | logical | If set to TRUE, matching is performed byte by byte. This is rarely required. Default: FALSE. |
gsub() Function: Syntax and Parameters
gsub() uses the same syntax and accepts the same arguments as sub(). Its replacement string also supports the same backreference format.
gsub(pattern, replacement, x,
ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)
The only behavioral difference between the two functions is summarized below:
| Feature | sub() | gsub() |
|---|---|---|
| Replacements per element | Only the first match | All matches |
| Typical use case | Remove or modify a leading token | Perform a global search-and-replace |
| Syntax | Same as gsub() | Same as sub() |
| Regex support | Yes | Yes |
| Performance on long strings | Can be slightly faster because it stops after the first match | Scans the full string |
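The difference in replacement scope is easiest to see by running both functions on the same input. The following is a minimal sketch; the log_lines vector is a hypothetical sample made up for illustration.

```r
# Hypothetical sample input, used only for illustration
log_lines <- c("error: disk error", "warning: low memory", "error error error")

# sub() changes the first "error" in each element and leaves the rest
first_only <- sub("error", "ERR", log_lines)
first_only
# [1] "ERR: disk error"     "warning: low memory" "ERR error error"

# gsub() changes every "error" in each element
all_matches <- gsub("error", "ERR", log_lines)
all_matches
# [1] "ERR: disk ERR"       "warning: low memory" "ERR ERR ERR"
```

Elements without a match, such as the second one, are returned unchanged by both functions.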
Basic Usage Examples
The following examples show the basic behavior of sub() and gsub() in practical situations.
Note: R’s console aligns printed vector elements for readability based on the length of the longest element, so the exact spacing in output may differ slightly.
Replacing Only the First Match with sub()
sub() searches each string element and stops after replacing the first match.
# Replace the first occurrence of "R" in a tutorial description
tutorial_text <- "In this tutorial, we will install R and add packages from CRAN."
sub("R", "The R Language", tutorial_text)
Running the command returns this output:
Output
[1] "In this tutorial, we will install The R Language and add packages from CRAN."
Only the first "R" is changed. Because sub() replaces just the first match in each string element, later matches in the same string remain as they are.
Replacing All Matches with gsub()
gsub() continues scanning the full string and replaces every match it finds. Word-boundary anchors such as \\b help match only the intended whole-word token instead of a substring inside another word.
# Replace ALL standalone occurrences of "R" using word boundaries
tutorial_text <- "R is open-source. Use R for data analysis and R for visualization."
gsub("\\bR\\b", "The R Language", tutorial_text)
The output is:
Output
[1] "The R Language is open-source. Use The R Language for data analysis and The R Language for visualization."
Each standalone "R" is replaced. The \\b boundary is important because a plain "R" pattern would also match the letter inside terms such as "CRAN" or "TRE", which can cause unwanted replacements.
Using sub() and gsub() on a Data Frame Column
Both functions can receive a data frame column vector as the x argument. Pass the column into the function, assign the result back, and the data frame column is updated.
# Create a sample data frame
marine_species <- data.frame(
creature = c("Starfish", "Blue Crab", "Bluefin Tuna", "Blue Shark", "Blue Whale"),
population_millions = c(5, 6, 4, 2, 2),
stringsAsFactors = FALSE
)
# sub(): replace the first "Blue" in each element
sub("Blue", "Green", marine_species$creature)
This command returns:
Output
[1] "Starfish" "Green Crab" "Greenfin Tuna" "Green Shark" "Green Whale"
"Bluefin Tuna" becomes "Greenfin Tuna" because "Blue" appears at the beginning of "Bluefin". sub() does not automatically recognize whole words; it matches any occurrence, including substrings inside longer words. To save the result in the data frame, assign it directly to the column:
marine_species$creature <- sub("Blue", "Green", marine_species$creature)
Using Regular Expressions with sub() and gsub()
By default, both functions interpret pattern as a regular expression using R’s TRE engine, which is a modified POSIX ERE implementation. The following examples demonstrate common regex constructs.
Regex engine quick reference: Base R uses TRE by default. Setting perl = TRUE switches to PCRE2. The stringr package uses ICU regex through stringi, which differs slightly in syntax and behavior from both TRE and PCRE2. Some features are available in one engine but not another.
Matching Character Classes and Wildcards
A character class is written inside square brackets and matches any one character from the defined set. A negated class begins with ^ and matches any character that is not part of that set.
# Remove all digits from product codes
product_codes <- c("SKU-1234", "SKU-5678", "SKU-ABCD")
gsub("[0-9]", "", product_codes)
Removing the digits produces:
Output
[1] "SKU-" "SKU-" "SKU-ABCD"
The dot wildcard . matches any single character. (With perl = TRUE, the dot does not match a newline unless the (?s) modifier is set.) In the next example, the dot matches any character between "l" and "g", including letters, digits, and punctuation:
variants <- c("log", "lag", "l9g", "lg")
gsub("l.g", "[match]", variants)
The result is:
Output
[1] "[match]" "[match]" "[match]" "lg"
"lg" is not changed because no character exists between "l" and "g" for the dot to match. Quantifiers such as + for one or more and * for zero or more extend matches across multiple characters. When advanced regex constructs must behave consistently across environments, perl = TRUE can provide more predictable results.
Using Anchors (^ and $) in Patterns
The ^ anchor matches the start of a string, while $ matches the end. Anchors are useful when leading or trailing content should be removed without changing the middle of the string.
# Remove trailing whitespace from column labels
messy_labels <- c("Revenue ", "Costs ", "Profit ")
gsub("\\s+$", "", messy_labels)
The cleaned labels are:
Output
[1] "Revenue" "Costs" "Profit"
In R string literals, regex escape sequences need double backslashes. The source string \\s becomes \s inside the regex engine, where it represents whitespace. R’s TRE engine supports \s as a documented POSIX ERE extension, so this example works without perl = TRUE. For stronger portability across regex tools outside R, the strictly POSIX-compatible equivalent is [[:space:]].
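As a small illustration of the POSIX class mentioned above, the sketch below combines both anchors with an alternation to trim leading and trailing whitespace in one call. The padded_labels vector is a hypothetical sample; base R's trimws() is shown only as a cross-check.

```r
# Hypothetical labels with leading and trailing whitespace, including a tab
padded_labels <- c("  Revenue ", "\tCosts", "Profit  ")

# ^[[:space:]]+ matches whitespace at the start, [[:space:]]+$ at the end
trimmed <- gsub("^[[:space:]]+|[[:space:]]+$", "", padded_labels)
trimmed
# [1] "Revenue" "Costs"   "Profit"

# trimws() from base R gives the same result for this input
identical(trimmed, trimws(padded_labels))
# [1] TRUE
```

gsub() is needed here rather than sub() because an element can match at both ends.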
Greedy Matching and Lazy Quantifiers
By default, quantifiers in R’s regex engines are greedy, meaning they match as much text as possible while still allowing the full pattern to succeed. This can lead to surprising results when the desired match is the shortest possible substring.
# Greedy: .* consumes from the first "<" to the LAST ">"
html_tags <- c("<b>bold</b>", "<em>italic</em>")
gsub("<.*>", "", html_tags)
You should see this output:
Output
[1] "" ""
Inside each string element, the greedy .* pattern matched everything between the first < and the final >, so the entire string content was removed. To match the shortest possible range, use the lazy quantifier .*?. The example below sets perl = TRUE, although R's default TRE engine also accepts a ? after a quantifier to request minimal matching.
# Lazy: .*? stops at the NEAREST ">"
gsub("<.*?>", "", html_tags, perl = TRUE)
The result is:
Output
[1] "bold" "italic"
.*? stops at the first > it reaches, which removes each individual tag while preserving the text content.
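A negated character class is a common alternative to a lazy quantifier and does not depend on the engine. The sketch below repeats the tag-stripping task that way and then contrasts + with *; the letters vector is a hypothetical sample.

```r
html_tags <- c("<b>bold</b>", "<em>italic</em>")

# "<", then any run of characters that are not ">", then ">"
no_tags <- gsub("<[^>]*>", "", html_tags)
no_tags
# [1] "bold"   "italic"

# Hypothetical sample for comparing quantifiers
letters_in <- c("baaad", "bd")

# "+" requires at least one "a", so "bd" is untouched
plus_result <- gsub("a+", "X", letters_in)
plus_result
# [1] "bXd" "bd"

# "*" also accepts zero "a" characters, so "bd" matches as a whole
star_result <- gsub("ba*d", "X", letters_in)
star_result
# [1] "X" "X"
```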
Escaping Special Characters
Regex metacharacters such as ., *, +, ?, (, ), [, ], {, }, ^, $, |, and \ must be escaped with \\ when they should be matched literally.
# Replace literal dots in a version string with hyphens
version_string <- "R 4.6.0 released"
gsub("\\.", "-", version_string)
Replacing the dots returns:
Output
[1] "R 4-6-0 released"
Another option is to set fixed = TRUE, which disables regex parsing. gsub(".", "-", version_string, fixed = TRUE) produces the same result without requiring escaping. Use fixed = TRUE whenever the pattern contains no regex logic and should always be matched literally. When fixed = TRUE is enabled, regex metacharacters and PCRE features are disabled because the pattern is treated as plain text, so combining it with perl = TRUE has no effect.
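The short sketch below compares the escaped pattern, the fixed = TRUE form, and the unescaped pattern side by side, reusing version_string from the example above.

```r
version_string <- "R 4.6.0 released"

# Escaped regex dot and literal dot give the same result
escaped <- gsub("\\.", "-", version_string)
literal <- gsub(".", "-", version_string, fixed = TRUE)
identical(escaped, literal)
# [1] TRUE

# Unescaped regex dot matches every character, so the whole string is replaced
unescaped <- gsub(".", "-", version_string)
unescaped
# [1] "----------------"
```

The last call is a typical symptom of a forgotten escape: the output has the same length as the input but contains only the replacement character.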
Advanced Pattern Matching Techniques
The following techniques address more complex use cases, including case-insensitive replacement, capture group backreferences, and PCRE2 features available through perl = TRUE.
Case-Insensitive Replacement with ignore.case = TRUE
Pattern matching is case-sensitive by default. Set ignore.case = TRUE to match the pattern regardless of capitalization.
# Normalize mixed-case product labels to a standard form
product_labels <- c("Widget Pro", "WIDGET PRO", "widget pro", "Super Widget Pro")
gsub("\\bwidget pro\\b", "StandardWidget", product_labels, ignore.case = TRUE)
After normalization, the labels are:
Output
[1] "StandardWidget" "StandardWidget" "StandardWidget" "Super StandardWidget"
The \\b word boundary limits the match to the whole phrase and prevents matches inside longer words. ignore.case = TRUE works with both sub() and gsub() and is compatible with TRE and PCRE2.
Using Backreferences and Capture Groups
Parentheses in a pattern create capture groups. In the replacement string, \\1 references the first group, \\2 references the second group, and so on. This allows captured text to be reused in a different position.
# Reformat "First Last" names to "Last, First"
full_names <- c("Alice Johnson", "Bob Martinez", "Carol White")
gsub("(\\w+) (\\w+)", "\\2, \\1", full_names)
The reformatted names are:
Output
[1] "Johnson, Alice" "Martinez, Bob" "White, Carol"
This simple example assumes exactly two words separated by one space. It does not handle middle names, hyphenated surnames, or apostrophes in names such as "O'Brien". For production workflows, use a more specific pattern or a dedicated name-parsing library. Also note that \\w does not behave identically across engines when the input includes non-ASCII characters: under TRE it follows the current locale's definition of alphanumeric characters, while under perl = TRUE it matches only ASCII letters and digits in UTF-8 mode unless the pattern begins with (*UCP). Test on representative data before relying on it.
Backreferences are also useful for reformatting structured values such as dates.
# Reformat ISO dates (YYYY-MM-DD) to US format (MM/DD/YYYY)
iso_dates <- c("2025-03-15", "2024-11-01", "2026-06-04")
gsub("(\\d{4})-(\\d{2})-(\\d{2})", "\\2/\\3/\\1", iso_dates)
The converted US-style dates are:
Output
[1] "03/15/2025" "11/01/2024" "06/04/2026"
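Backreferences also work with sub() when the goal is to keep only part of each string. The following sketch uses hypothetical email addresses and anchors the pattern to the whole string so that everything outside the referenced group is dropped.

```r
# Hypothetical addresses, used only for illustration
emails <- c("alice@example.com", "bob@data.org")

# Group 1 is the part before "@", group 2 the part after it
domains <- sub("^([^@]+)@(.+)$", "\\2", emails)
domains
# [1] "example.com" "data.org"

# Both groups can be reused in any order in the replacement
spoken <- sub("^([^@]+)@(.+)$", "\\1 at \\2", emails)
spoken
# [1] "alice at example.com" "bob at data.org"
```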
Enabling PCRE with perl = TRUE
Setting perl = TRUE changes the regex engine from TRE to PCRE2.
PCRE2 provides stronger Unicode handling and advanced regex capabilities for UTF-8 text. It also enables features not supported by TRE, including lookaheads, lookbehinds, named capture groups, possessive quantifiers, and \U and \L case-conversion modifiers in replacement strings.
# Use \U to uppercase each word (PCRE2 case modifier, perl = TRUE required)
product_names <- c("widget pro", "super gadget", "nano device")
gsub("(\\w)(\\w*)", "\\U\\1\\L\\2", product_names, perl = TRUE)
Title-casing each word produces:
Output
[1] "Widget Pro" "Super Gadget" "Nano Device"
\\U\\1 uppercases the first character of each word, while \\L\\2 lowercases the remaining characters. These modifiers are interpreted only when perl = TRUE is set; without it, they are not treated as case-conversion operators.
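The modifiers can also be used on their own. The sketch below capitalizes only the first letter of each word and then uses \\E to end case conversion partway through a replacement; mixed_names and the handle string are hypothetical samples.

```r
# Hypothetical sample with mixed capitalization
mixed_names <- c("nano DEVICE", "super gadget")

# \\U applies to the rest of the replacement, which here is only group 1
first_up <- gsub("\\b(\\w)", "\\U\\1", mixed_names, perl = TRUE)
first_up
# [1] "Nano DEVICE"  "Super Gadget"

# \\E ends case conversion, so "@" and group 2 are copied unchanged
handle <- sub("(\\w+)@(\\w+)", "\\U\\1\\E@\\2", "alice@example", perl = TRUE)
handle
# [1] "ALICE@example"
```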
Lookaheads and Lookbehinds with perl = TRUE
Lookaheads such as (?=...) and lookbehinds such as (?<=...) are zero-width assertions. They match a position depending on what follows or precedes it, without consuming the surrounding characters. These are PCRE2-only features and require perl = TRUE.
# Insert an underscore between a letter and a digit (e.g., "Revenue2024" to "Revenue_2024")
field_names <- c("Revenue2024", "Costs2024", "Profit2024")
gsub("(?<=[A-Za-z])(?=[0-9])", "_", field_names, perl = TRUE)
The updated field names are:
Output
[1] "Revenue_2024" "Costs_2024" "Profit_2024"
The lookbehind (?<=[A-Za-z]) checks that a letter appears before the current position, and the lookahead (?=[0-9]) checks that a digit appears after it. An underscore is inserted at that position without consuming any of the surrounding characters.
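The same zero-width technique can insert thousands separators. This is a minimal sketch that assumes the input consists of unformatted digit strings; the amounts vector is hypothetical.

```r
# Hypothetical digit strings without separators
amounts <- c("1234567", "999", "1000")

# Match a position that follows a digit and is followed by
# one or more complete groups of three digits up to the end of the string
with_commas <- gsub("(?<=\\d)(?=(\\d{3})+$)", ",", amounts, perl = TRUE)
with_commas
# [1] "1,234,567" "999"       "1,000"
```

For display formatting of numeric values, base R's format() with big.mark = "," is usually the simpler tool; the regex form is mainly useful when the values are already strings.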
Replacing Multiple Patterns in R
A single gsub() call accepts one pattern only. When several different patterns must be replaced in the same operation, use chained gsub() calls with Reduce(), or use stringr::str_replace_all(), which accepts a named vector of pattern-replacement pairs.
Applying Multiple gsub() Replacements with Reduce()
Create a named character vector where each name is the search pattern and each value is the replacement. Then use Reduce() to apply each substitution one after another.
# Expand SMS-style abbreviations in survey responses
survey_responses <- c("pls send info asap", "thx for ur help", "gr8 service btw")
replacements <- c(
"pls" = "please",
"asap" = "as soon as possible",
"thx" = "thanks",
"ur" = "your",
"gr8" = "great",
"btw" = "by the way"
)
result <- Reduce(
function(text, pat) gsub(pat, replacements[pat], text, fixed = TRUE),
names(replacements),
init = survey_responses
)
result
The expanded abbreviations return:
Output
[1] "please send info as soon as possible"
[2] "thanks for your help"
[3] "great service by the way"
Reduce() applies each gsub() call in sequence, passing the result from one step into the next. The order of values in replacements matters if one substitution can create text that a later pattern also matches. The following example demonstrates this issue:
# Hazardous order: "cat" is replaced by "dog", then "dog" is replaced by "wolf"
chained_replacements <- c("cat" = "dog", "dog" = "wolf")
Reduce(
function(text, pat) gsub(pat, chained_replacements[pat], text, fixed = TRUE),
names(chained_replacements),
init = "my cat and my dog"
)
The chained replacements produce:
Output
[1] "my wolf and my wolf"
Both "cat" and "dog" become "wolf" because the first replacement turns "cat" into "dog", which is then matched by the second replacement. To avoid this, arrange the replacements so later patterns cannot match earlier outputs, or perform all substitutions in a single pass over the original text. Note that str_replace_all() with a named vector also applies its replacements one after another, so it is subject to the same ordering issue.
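One way to make a single pass in base R is to match all patterns with one alternation and look up each hit in the mapping. The sketch below is an illustration under stated assumptions: pet_map and sentences are hypothetical names, and the patterns are assumed to be plain words without regex metacharacters.

```r
# Hypothetical mapping: names are search terms, values are replacements
pet_map <- c(cat = "dog", dog = "wolf")
sentences <- c("my cat and my dog", "no pets here")

# Build one alternation pattern: "cat|dog"
alternation <- paste(names(pet_map), collapse = "|")

# Locate every match in the ORIGINAL text
hits <- gregexpr(alternation, sentences)

# Replace each matched piece by looking it up in the mapping
single_pass <- sentences
regmatches(single_pass, hits) <- lapply(
  regmatches(sentences, hits),
  function(found) unname(pet_map[found])
)
single_pass
# [1] "my dog and my wolf" "no pets here"
```

Because every match is located before any text is changed, the output of one replacement can never be picked up by another.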
Using stringr::str_replace_all() with a Named Vector
stringr::str_replace_all() can accept the same named vector directly, making multi-pattern replacement shorter and easier to read.
library(stringr)
result <- str_replace_all(survey_responses, replacements)
result
str_replace_all() returns the same result:
Output
[1] "please send info as soon as possible"
[2] "thanks for your help"
[3] "great service by the way"
Both methods produce the same output. str_replace_all() is often more readable when the replacement list grows beyond two or three entries and fits naturally into |> and %>% pipelines.
Applying gsub() and sub() to Data Frames
Applying either function to a data frame column works the same way as applying it to a character vector: pass the column as x and assign the returned value back to that column. The following sections cover replacement in one column, multiple-column replacement with lapply(), and a practical data cleaning example.
A key behavior to understand before working with real data is that both functions preserve NA values. If a vector element is NA before the function call, it remains NA afterward and does not trigger an error.
# NA values are preserved, not replaced or raised as errors
gsub("cat", "dog", c("cat", NA, "catfish"))
The NA value remains unchanged:
Output
[1] "dog" NA "dogfish"
If a cleaning pipeline depends on detecting or replacing missing values, handle NA values before running gsub() instead of expecting the replacement itself to modify them.
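A minimal sketch of that order of operations follows. The "unknown" placeholder is an arbitrary choice for illustration; a real pipeline might impute a different value or drop the rows.

```r
raw_values <- c("cat", NA, "catfish")

# Record which elements are missing before any replacement runs
was_missing <- is.na(raw_values)

# gsub() leaves the NA element untouched
cleaned <- gsub("cat", "dog", raw_values)

# Fill the missing positions explicitly (placeholder value is an assumption)
cleaned[was_missing] <- "unknown"
cleaned
# [1] "dog"     "unknown" "dogfish"
```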
Replacing Values in a Single Column
To use gsub() on one data frame column, pass the column as the x argument and assign the result back to the same column.
# Clean a price column from a CSV import
sales_data <- data.frame(
product = c("Laptop", "Phone", "Tablet", "Monitor"),
price = c("$1,299", "$899", "$2,450", "$349"),
stringsAsFactors = FALSE
)
# The character class [$,] matches either $ or , in a single regex pass
sales_data$price <- as.numeric(gsub("[$,]", "", sales_data$price))
sales_data
The cleaned data frame appears as:
Output
product price
1 Laptop 1299
2 Phone 899
3 Tablet 2450
4 Monitor 349
Applying gsub() Across Multiple Columns with lapply()
To apply the same transformation across several columns without repeating the same function call, use a column selector together with lapply().
# Strip formatting from phone number columns
contact_data <- data.frame(
primary_phone = c("(555) 123-4567", "(555) 987-6543"),
secondary_phone = c("(555) 111-2222", "(555) 333-4444"),
stringsAsFactors = FALSE
)
phone_cols <- c("primary_phone", "secondary_phone")
contact_data[phone_cols] <- lapply(contact_data[phone_cols], function(col) {
gsub("[^0-9]", "", col)
})
contact_data
The stripped phone numbers are:
Output
primary_phone secondary_phone
1 5551234567 5551112222
2 5559876543 5553334444
lapply() returns a list, which R can assign back automatically to the selected columns.
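When the target columns are not known by name, they can be selected by type instead. The sketch below is one possible approach, using a hypothetical contacts data frame: it applies gsub() to every character column and leaves other columns alone.

```r
# Hypothetical data frame with one integer and two character columns
contacts <- data.frame(
  id = 1:2,
  phone = c("(555) 123-4567", "(555) 987-6543"),
  fax = c("555.111.2222", "555 333 4444"),
  stringsAsFactors = FALSE
)

# Logical vector marking which columns hold character data
is_text <- vapply(contacts, is.character, logical(1))

# Strip every non-digit character from the character columns only
contacts[is_text] <- lapply(contacts[is_text], function(col) {
  gsub("[^0-9]", "", col)
})

contacts$phone
# [1] "5551234567" "5559876543"
contacts$fax
# [1] "5551112222" "5553334444"
```

Selecting by type avoids accidentally coercing numeric columns to character, which would happen if gsub() were applied to every column.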
Practical Data Cleaning Example
The next example cleans a data frame that represents a raw CSV import with inconsistent numeric formatting and mixed date separators.
# Simulate a messy CSV import
orders <- data.frame(
order_id = c("ORD-001", "ORD-002", "ORD-003"),
amount = c("$1,299.00", "$ 450.50", "$3,000.00"),
order_date = c("01-15-2025", "02/03/2025", "2025.03.10"),
stringsAsFactors = FALSE
)
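# Step 1: strip currency symbol, spaces, and commas; convert to numeric
# [[:space:]] is used because shorthand escapes such as \\s are not reliably
# interpreted inside a bracket expression by the default TRE engine
orders$amount <- as.numeric(gsub("[$,[:space:]]", "", orders$amount))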
# Step 2: normalize all date separators to hyphens
orders$order_date <- gsub("[/.]", "-", orders$order_date)
orders
After both cleaning steps, the data frame is:
Output
order_id amount order_date
1 ORD-001 1299.0 01-15-2025
2 ORD-002 450.5 02-03-2025
3 ORD-003 3000.0 2025-03-10
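Two vectorized gsub() calls perform the normalization clearly, without explicit loops or extra packages. Note that only the separators are unified: the dates still use different field orders (MM-DD-YYYY and YYYY-MM-DD), so a further step is needed before parsing them as dates.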
sub() and gsub() vs stringr: When to Use Which
gsub() and stringr::str_replace_all() both replace every pattern match in a character vector. The better choice usually depends on dependency preferences, pipeline style, and required features.
Comparison Table: gsub() vs str_replace_all()
The table below summarizes the most common decision points.
| Feature | gsub() | str_replace_all() |
|---|---|---|
| Package | base R, no installation needed | stringr, part of the tidyverse ecosystem |
| Multiple patterns in one call | No | Yes, with a named vector |
| Default regex engine | TRE, POSIX ERE | ICU through stringi |
| PCRE support | Yes, with perl = TRUE | Always ICU |
| Pipeline compatibility | Moderate | High, native with \|> or %>% |
| Unicode support | Good, excellent with perl = TRUE | Excellent, because ICU is always used |
| External dependency | None | stringr plus stringi |
Performance Considerations on Large Vectors
For single-pattern replacements on typical vectors, gsub() and str_replace_all() usually perform similarly. For large workloads with literal, non-regex patterns, stringi::stri_replace_all_fixed() is often faster because it is optimized for fixed-string matching and skips regex parsing.
str_replace_all() can be more convenient and may perform better than chaining several gsub() calls, depending on the workload. For performance-critical data pipelines, benchmark with the real data before assuming one approach is faster.
The example below uses stringi for fixed-string replacement:
library(stringi)
# High-performance fixed-string replacement using stringi
product_descriptions <- c("apple and apple pie", "apple juice", "pineapple")
stri_replace_all_fixed(product_descriptions, "apple", "pear")
The replacement returns:
Output
[1] "pear and pear pie" "pear juice" "pinepear"
Literal replacement also affects substrings inside longer words. stri_replace_all_fixed() replaces "apple" inside "pineapple", producing "pinepear". This is expected for literal-match functions. Use stri_replace_all_regex() with a word-boundary pattern such as \\bapple\\b when only whole words should be matched.
Unicode and Multibyte String Handling
Recent R versions offer much better UTF-8 support across platforms.
For data containing non-ASCII characters, emoji, or non-Latin scripts, two approaches tend to be more predictable: use perl = TRUE with gsub() to activate PCRE2 Unicode support, or use str_replace_all(), which always relies on the ICU engine through stringi.
If multibyte text does not match as expected, inspect the encoding with Encoding(x) and normalize to UTF-8 with enc2utf8(x) before calling gsub().
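The sketch below walks through that check on a small hypothetical example. It builds the same word in two encodings, inspects the declared encodings, and normalizes to UTF-8 before replacing; the strings are written with \u escapes so the code does not depend on the encoding of the source file.

```r
# The same word, once marked as UTF-8 and once converted to latin1
utf8_text <- "caf\u00e9"
latin1_text <- iconv(utf8_text, from = "UTF-8", to = "latin1")

# Inspect the declared encoding of each string
Encoding(c(utf8_text, latin1_text))

# Normalize to UTF-8 before matching
normalized <- enc2utf8(latin1_text)

# Replace the accented character with a plain "e"
ascii_text <- gsub("\u00e9", "e", normalized)
ascii_text
# [1] "cafe"
```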
Common Errors and How to Fix Them
The following sections explain four frequent reasons for unexpected behavior when using sub() and gsub(), along with practical fixes.
“invalid regular expression” Error
This error occurs when the pattern argument contains unescaped metacharacters or invalid regex syntax. Common causes include unmatched parentheses, unescaped square brackets, and unclosed quantifiers.
# Unmatched "(" causes an invalid regular expression error
gsub("(error", "warning", "connection (error) occurred")
R raises an error instead of returning a result:
Output
Error in gsub("(error", "warning", "connection (error) occurred") :
invalid regular expression '(error', reason 'Missing ')''
Fix the issue by escaping the parenthesis with \\(, or use fixed = TRUE when the search value should be treated as a literal string.
# Escaped version
gsub("\\(error", "warning", "connection (error) occurred")
With the parenthesis escaped, R returns:
Output
[1] "connection warning) occurred"
Backslash Escaping Issues in Patterns and Replacement Strings
R string literals require double backslashes, \\, to represent one regex backslash, \. A regex that needs to match a literal backslash requires four backslashes in the R source. The string "\\\\" becomes the two-character regex \\, which the engine interprets as one literal backslash.
The same double-backslash rule applies to the replacement argument. To use a backreference in a replacement string, write "\\1" in R source. The regex engine sees \1 and resolves it to the first capture group. To insert a literal backslash in the output, write "\\\\" in R source, which the engine sees as \\ and outputs as one \.
# Replace backslashes in a Windows-style file path with forward slashes
file_path <- "C:\\Users\\alice\\Documents"
gsub("\\\\", "/", file_path)
The normalized path is:
Output
[1] "C:/Users/alice/Documents"
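The reverse conversion shows the replacement-side rule. In this minimal sketch, unix_style is a hypothetical path; the four backslashes in the R source produce one literal backslash per match in the output.

```r
# Hypothetical forward-slash path
unix_style <- "C:/Users/alice"

# "\\\\" in R source is the two-character replacement \\,
# which the engine writes out as a single backslash
windows_style <- gsub("/", "\\\\", unix_style)

# cat() shows the raw characters; print() would display each backslash doubled
cat(windows_style, "\n")
# C:\Users\alice

# Same number of characters: each "/" became exactly one "\"
nchar(windows_style) == nchar(unix_style)
# [1] TRUE
```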
Unexpected Behavior with fixed = TRUE vs Regex Patterns
When fixed = TRUE is enabled, regex metacharacters no longer have special meaning and are matched literally. If a pattern such as [0-9]+ is used with fixed = TRUE, R searches for the literal text "[0-9]+" rather than digits. If fewer replacements occur than expected, check whether fixed = TRUE has been turned on unintentionally.
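The difference is easy to see on an input that contains both digits and the literal pattern text. The note string below is a hypothetical sample.

```r
# Hypothetical input containing digits and the literal text "[0-9]+"
note <- "item 42 [0-9]+"

# Regex mode: each run of digits is replaced, including the 0 and 9 in brackets
as_regex <- gsub("[0-9]+", "#", note)
as_regex
# [1] "item # [#-#]+"

# Literal mode: only the exact text "[0-9]+" is replaced
as_literal <- gsub("[0-9]+", "#", note, fixed = TRUE)
as_literal
# [1] "item 42 #"
```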
Pattern Matches More Than Intended
Patterns without anchors match anywhere in the string, including inside longer words. Use ^ and $ to anchor the match to the full string, or use \\b for word boundaries to restrict the match. Combining ignore.case = TRUE with \\b is a reliable approach for case-insensitive whole-word replacement.
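The sketch below compares the three scopes on one hypothetical vector: an unanchored pattern, a pattern anchored to the full string, and a case-insensitive whole-word pattern.

```r
# Hypothetical sample covering substring, whole-string, and mixed-case cases
animals <- c("cat", "catfish", "a cat sat", "Cat")

# Unanchored: also changes "catfish"
unanchored <- gsub("cat", "dog", animals)
unanchored
# [1] "dog"       "dogfish"   "a dog sat" "Cat"

# ^ and $: only elements that are exactly "cat"
full_string <- gsub("^cat$", "dog", animals)
full_string
# [1] "dog"       "catfish"   "a cat sat" "Cat"

# Word boundaries plus ignore.case: whole words in any capitalization
whole_word <- gsub("\\bcat\\b", "dog", animals, ignore.case = TRUE)
whole_word
# [1] "dog"       "catfish"   "a dog sat" "dog"
```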
FAQ
What Is the Difference Between sub() and gsub() in R?
The main difference is the number of matches replaced in each string element.
sub() replaces only the first occurrence of the specified pattern within each element of a character vector. This is useful when only the initial match should be changed or removed, such as stripping one prefix while leaving later occurrences untouched.
gsub() replaces every occurrence of the pattern throughout each string element. This makes it well suited for global search-and-replace tasks and thorough data cleaning.
Both functions share the same syntax and accept the same arguments, so switching between them usually only requires changing the function name. In most data cleaning and normalization workflows, gsub() is used more often because it performs global matching. Choose sub() when only the first match in each element should change.
How Do I Replace Multiple Different Patterns at Once Using gsub() in R?
Base R’s gsub() cannot replace multiple different patterns in a single call. Two common approaches are available.
Chaining with Reduce(): Create a named character vector where names are patterns and values are replacements. Then use Reduce() to apply a wrapper function to each pattern-replacement pair in sequence.
patterns <- c("foo" = "bar", "baz" = "qux")
text <- "foo and baz"
Reduce(function(x, y) gsub(y, patterns[y], x), names(patterns), init = text)
Using stringr::str_replace_all(): The stringr function str_replace_all() accepts a named vector of pattern-replacement pairs directly, allowing all substitutions to be written in one line.
library(stringr)
str_replace_all(text, patterns)
str_replace_all() is recommended when more than a few patterns are involved because it keeps the code cleaner and easier to maintain, especially in tidyverse workflows.
How Do I Make gsub() Case-Insensitive in R?
To perform case-insensitive matching with gsub(), set ignore.case to TRUE:
gsub("hello", "hi", x, ignore.case = TRUE)
This replaces all capitalization variants of "hello", such as "Hello" or "HELLO". To avoid matching the same letters inside another word, such as "hello" inside "chello", combine ignore.case = TRUE with a word boundary:
gsub("\\bhello\\b", "hi", x, ignore.case = TRUE)
This limits the replacement to whole-word matches.
What Does perl = TRUE Do in gsub() and sub()?
The perl = TRUE argument tells R to use the PCRE2 engine instead of the default TRE engine. Enabling it makes advanced regex features available, including:
- Lookaheads such as (?=...) and lookbehinds such as (?<=...)
- Named capture groups such as (?<name>...)
- Possessive quantifiers such as *+ and ++
- Case conversion operators such as \\U, \\L, and \\E in replacements
- Improved Unicode support for UTF-8 data
If the pattern or replacement depends on advanced regex functionality available only in PCRE2, set perl = TRUE.
How Do I Use Backreferences in gsub() Replacement Strings?
Backreferences allow matched groups from the pattern to be reused in the replacement string. To use them, place the part of the pattern to capture inside parentheses. Each pair of parentheses creates a numbered capture group. Then reference those groups in the replacement with \\1, \\2, and so on, based on their order.
For example:
gsub("(\\w+) (\\w+)", "\\2, \\1", "John Smith")
# Result: "Smith, John"
When perl = TRUE is set, the replacement string can also use case-modifying escapes. For example, \\U\\1 uppercases a group and \\L\\2 lowercases one, enabling more advanced string transformations.
How Do I Apply gsub() to a Column in a Data Frame?
To apply gsub() to a single column, pass the column into the function and assign the result back:
df$column <- gsub("pattern", "replacement", df$column)
For several columns, use lapply() to apply the same replacement to each selected column:
df[cols] <- lapply(df[cols], function(col) gsub("pattern", "replacement", col))
This method scales well and avoids writing the same gsub() call repeatedly for each column. It is especially helpful when cleaning and standardizing larger datasets.
Should I Use gsub() or stringr::str_replace_all() in R?
The choice depends on the project context and requirements.
Use gsub() when a solution without external dependencies is preferred, when PCRE2-specific features are needed through perl = TRUE, or when benchmarks show it is the best option for the use case. It is included in base R and does not require installation.
Use stringr::str_replace_all() for tidyverse-style workflows, when multiple patterns need to be replaced at the same time, or when consistent modern Unicode handling is desired through the ICU engine via stringi. It also integrates cleanly into data pipelines and can make replacement mappings more readable with named vectors.
The right choice depends on workflow preferences, required features, and the complexity of the replacement task.
Why Is gsub() Not Replacing My Pattern as Expected?
Several common issues can prevent gsub() from replacing a pattern as expected.
Unescaped special regex characters: Regex metacharacters such as ., *, +, [, ], (, ), |, ?, ^, and $ have special meanings in patterns. To match them literally, either set fixed = TRUE or escape each metacharacter with double backslashes. For example, use \\. to match a literal dot.
Using PCRE2-only features without perl = TRUE: Advanced regex constructs such as lookahead, lookbehind, and named capture groups require perl = TRUE. Enable it when the pattern depends on those features.
Encoding problems with non-ASCII strings: If the input includes Unicode characters, encoding mismatches can stop patterns from matching. Check the encoding with Encoding(x) and convert to UTF-8 with enc2utf8(x) if needed before running gsub().
Unanchored patterns matching too much text: A pattern without anchors may match inside larger words or substrings, causing unintended replacements. Use \\b, ^, or $ to restrict matches to the desired scope.
Tip: Test the pattern first with grepl(pattern, x) to confirm that it matches exactly what you expect before using gsub() for replacement.
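A minimal sketch of that check follows, using a hypothetical versions vector. grepl() shows which elements match, and regmatches() with regexpr() shows the exact text that would be replaced.

```r
# Hypothetical inputs and a pattern for "v<digits>.<digits>"
versions <- c("v1.2", "v12", "version")
version_pattern <- "v[0-9]+\\.[0-9]+"

# Which elements contain a match?
matches <- grepl(version_pattern, versions)
matches
# [1]  TRUE FALSE FALSE

# What text does the first match in each matching element cover?
matched_text <- regmatches(versions, regexpr(version_pattern, versions))
matched_text
# [1] "v1.2"
```

If the matched text is wider or narrower than expected, adjust the pattern before passing it to sub() or gsub().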
Conclusion
sub() and gsub() are essential base R functions for string replacement. They work with character vectors, data frame columns, and any object that can be coerced to character, without requiring external packages. sub() replaces the first match in each element, while gsub() replaces every match. Regex support through the TRE engine handles most common cases, and perl = TRUE enables PCRE2 features such as lookaheads, lookbehinds, and case modifiers for advanced formatting. When multiple patterns must be replaced in one call or tidyverse pipeline integration is important, stringr::str_replace_all() is a natural complement.


