Regular Expression (REGEX) in R

In computing, a regular expression (abbreviated regex or regexp) is a sequence of characters that forms a search pattern, mainly for use in pattern matching with strings. The patterns are often a combination of literal text, metacharacters, and wildcards. Regular expressions are used to search text, extract substrings, and perform find-and-replace operations. They are convenient and can greatly simplify cleaning and managing text data.

Functions in R

Functions in R for regular expressions include:

Function                          Description
grep(regexp, vector)              Finds the strings in the vector that contain a substring matching regexp; by default it returns their indices.
sub(regexp, replacement, vector)  Replaces the first substring matching the regular expression with the replacement (for each element of the vector).
gsub()                            Does the same thing as sub() but replaces every match in each string, not just the first.
regexpr(regexp, vector)           Returns the position of the first match within each string.
gregexpr()                        Is the same as regexpr() except that it returns the positions of all matches.
strsplit()                        Splits a string at each match of a regular expression.
glob2rx()                         Converts filename wildcard specifications to regular expressions.
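As a rough illustration of the functions listed above, the following sketch applies each one to a small character vector. The sample strings and variable names are made up for this example and are not part of the original article.

```r
# Hypothetical sample data, chosen only for illustration.
x <- c("apple pie", "banana", "cherry tart", "grape")

# grep() returns the indices of the elements that contain a match.
idx <- grep("an", x)                       # 2

# sub() replaces only the first match in each element.
first_only <- sub("a", "A", x)             # "Apple pie" "bAnana" "cherry tArt" "grApe"

# gsub() replaces every match in each element.
all_matches <- gsub("a", "A", x)           # "Apple pie" "bAnAnA" "cherry tArt" "grApe"

# regexpr() gives the position of the first match in each element.
first_pos <- as.integer(regexpr("a", x))   # 1 2 9 3

# gregexpr() gives the positions of all matches (one list entry per element).
all_pos <- as.integer(gregexpr("a", "banana")[[1]])   # 2 4 6

# strsplit() splits at every match; here, at each run of digits.
pieces <- strsplit("a1b22c", "[0-9]+")[[1]]           # "a" "b" "c"

# glob2rx() turns a filename wildcard into a regular expression.
csv_rx <- glob2rx("*.csv")
is_csv <- grepl(csv_rx, c("data.csv", "notes.txt"))   # TRUE FALSE
```

Note that in R the pattern is the first argument and the character vector comes after it, except in strsplit(), where the string to split comes first.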

Basic Pattern Concepts

A pattern is an expression used to specify a set of strings required for a particular purpose. A simple way to specify a set of strings is complete enumeration, that is, listing every member. However, there are often more concise ways to specify the desired set. For example, the set containing the three strings “Handel”, “Händel”, and “Haendel” can be specified by the pattern H(ä|ae?)ndel. This pattern matches each of the three strings. A set of strings is never tied to a single pattern: if one regular expression matches a particular set, then other patterns, in fact infinitely many, match exactly the same set.
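The difference between enumeration and a pattern can be sketched in R as follows. The vector of names and the ^ and $ anchors (which require the whole string to match) are assumptions added for this illustration; "\u00e4" is simply a portable way of writing ä in source code.

```r
# Hypothetical candidate strings; "\u00e4" is the letter a-umlaut.
composers <- c("Handel", "H\u00e4ndel", "Haendel", "Hendel", "Haydn")

# Approach 1: complete enumeration of the wanted strings.
wanted <- c("Handel", "H\u00e4ndel", "Haendel")
by_enumeration <- composers %in% wanted

# Approach 2: a single pattern describing the same set.
# ^ and $ anchor the match to the whole string.
by_pattern <- grepl("^H(\u00e4|ae?)ndel$", composers)

# Both approaches select the same elements.
same_result <- identical(by_enumeration, by_pattern)
```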

Pattern matching frequently makes use of the following operations to construct regular expressions.

Boolean “or”

A vertical bar separates alternatives. For example, gray|grey can match “gray” or “grey”.


Grouping

Parentheses define the scope and precedence of the operators. For example, gray|grey and gr(a|e)y are equivalent patterns that both describe the set containing “gray” and “grey”.
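A short sketch of alternation and grouping in R is given below. The test words and the use of anchors are assumptions made for this example; the last two lines show how parentheses change what the anchors apply to.

```r
# Hypothetical test words.
words <- c("gray", "grey", "griy", "gry")

# Alternation over whole words and alternation inside a group
# describe the same set of strings.
alt_whole <- grepl("^(gray|grey)$", words)   # TRUE TRUE FALSE FALSE
alt_group <- grepl("^gr(a|e)y$", words)      # TRUE TRUE FALSE FALSE

# Without parentheses, | has the lowest precedence, so "^gray|grey$"
# means "starts with gray" OR "ends with grey".
loose  <- grepl("^gray|grey$", "grayish")    # TRUE
strict <- grepl("^(gray|grey)$", "grayish")  # FALSE
```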


Quantification

A quantifier after a token (such as a character) or group specifies how often that element is allowed to occur. The most common quantifiers are the question mark ?, the asterisk *, and the plus sign +.

For example:

? The question mark indicates zero or one occurrence of the preceding element. Hence, colou?r is a pattern that matches both color and colour;

* The asterisk indicates zero or more occurrences of the preceding element. Unlike the filename wildcard of the same name, it does not stand for arbitrary text on its own. Hence, ab*c matches ac, abc, abbc, abbbc, and so on; and

+ The plus sign indicates one or more occurrences of the preceding element. Thus, ab+c matches abc, abbc, abbbc, and so on, but not ac.
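The three quantifiers can be tried out in R with a sketch like the following. The candidate strings are taken from the examples above, plus one extra misspelling, and the anchors are added here so that each whole string must match.

```r
# ? : zero or one "u"
spellings <- c("color", "colour", "colouur")
optional_u <- grepl("^colou?r$", spellings)   # TRUE TRUE FALSE

candidates <- c("ac", "abc", "abbc", "abbbc")

# * : zero or more "b", so "ac" matches
star <- grepl("^ab*c$", candidates)           # TRUE TRUE TRUE TRUE

# + : one or more "b", so "ac" does not match
plus <- grepl("^ab+c$", candidates)           # FALSE TRUE TRUE TRUE
```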

These constructions can be combined to form arbitrarily complex expressions, much like one can construct arithmetical expressions from numbers. For example, H(ae?|ä)ndel and H(a|ae|ä)ndel are both valid patterns which match the same strings as the earlier example, H(ä|ae?)ndel.
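One way to check that the three patterns behave the same is to run them against a shared set of test strings, as in the sketch below. The test strings are invented for this example, so agreement on them is evidence of equivalence rather than a proof.

```r
# The three patterns from the text; "\u00e4" is the letter a-umlaut.
patterns <- c("H(\u00e4|ae?)ndel", "H(ae?|\u00e4)ndel", "H(a|ae|\u00e4)ndel")

# Hypothetical test strings: three that should match, three that should not.
tests <- c("Handel", "H\u00e4ndel", "Haendel", "Hndel", "Haaendel", "Mendel")

# One column of TRUE/FALSE per pattern, anchored to the whole string.
results <- vapply(
  patterns,
  function(p) grepl(paste0("^", p, "$"), tests),
  logical(length(tests))
)

# TRUE if every pattern gives the same answer for every test string.
all_agree <- all(results[, 1] == results[, 2]) &&
  all(results[, 1] == results[, 3])
```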

