In computing, a regular expression (abbreviated regexp) is a sequence of characters that forms a search pattern, mainly for use in pattern matching with strings. Patterns typically combine literal text, metacharacters, and wildcards. Regular expressions are used for searching, extracting substrings, and find-and-replace operations. They offer a concise and powerful way to select and transform text data.
Functions in R for regular expressions include:
|grep(regexp, vector)|Finds the elements of the vector that contain a substring matching regexp (by default it returns their indices).|
|sub(regexp, replacement, vector)|Replaces the first substring matching the regular expression with the replacement (for each element of the vector).|
|gsub()|Does the same thing as sub() but replaces every match in each string, not just the first.|
|regexpr(regexp, vector)|Returns the position of the first match within each string (-1 where there is no match).|
|gregexpr()|Is the same as regexpr() except that it returns the positions of all matches.|
|strsplit()|Splits a string at each match of a regular expression.|
|glob2rx()|Converts filename wildcard specifications to regular expressions.|
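The functions listed above can be tried on a small example. The following is a minimal sketch; the sample vector `x` and all variable names are made up for illustration. Note that in R the pattern is the first argument and the vector to search comes later.

```r
# Hypothetical sample data, chosen only for illustration.
x <- c("apple pie", "banana split", "cherry tart")

# grep() returns indices by default; value = TRUE returns the strings.
idx  <- grep("an", x)                 # 2
hits <- grep("an", x, value = TRUE)   # "banana split"

# sub() replaces the first match in each element, gsub() every match.
first_only <- sub("a", "A", x)    # "Apple pie" "bAnana split" "cherry tArt"
every_one  <- gsub("a", "A", x)   # "Apple pie" "bAnAnA split" "cherry tArt"

# regexpr() gives the start of the first match per element (-1 if none);
# gregexpr() returns a list with the starts of all matches per element.
first_pos <- as.integer(regexpr("an", x))               # -1 2 -1
all_pos   <- as.integer(gregexpr("an", "banana")[[1]])  # 2 4

# strsplit() returns a list; here the string is split at each run of digits.
parts <- strsplit("a1b22c", "[0-9]+")[[1]]   # "a" "b" "c"

# glob2rx() turns a filename wildcard into an anchored regular expression.
rx <- utils::glob2rx("*.txt")
txt_files <- grepl(rx, c("notes.txt", "notes.txt.bak"))   # TRUE FALSE
```

The result of `regexpr()` and `gregexpr()` also carries a `match.length` attribute, which is dropped here by `as.integer()` to keep the output short.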
A pattern is an expression used to specify a set of strings required for a particular purpose. A simple way to specify a set of strings is complete enumeration, that is, listing all of its members. However, there are often more concise ways to specify the desired set of strings. For example, the set containing the three strings “Handel”, “Händel”, and “Haendel” can be specified by the pattern H(ä|ae?)ndel. This pattern matches each of the three strings. If at least one regexp matches a particular set of strings, then other patterns, in fact infinitely many, match exactly the same set.
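The Handel pattern can be checked in R with a short sketch. The candidate vector and variable names are hypothetical, and two details are assumptions added for the example: the anchors `^` and `$`, which make `grepl()` test the whole string instead of any substring, and the escape `\u00e4`, which stands for “ä” so that the code does not depend on the encoding of the source file.

```r
# Hypothetical candidates: the three target spellings plus two near misses.
candidates <- c("Handel", "H\u00e4ndel", "Haendel", "Hendel", "Hndel")

# H(ä|ae?)ndel, anchored so that the entire string must match.
pattern <- "^H(\u00e4|ae?)ndel$"

is_match <- grepl(pattern, candidates)   # TRUE TRUE TRUE FALSE FALSE
matched  <- candidates[is_match]         # the three accepted spellings
```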
Pattern matching frequently makes use of the following operations to construct regular expressions.
A vertical bar separates alternatives. For example, gray|grey can match “gray” or “grey”.
Parentheses define the scope and precedence of the operators. For example, gray|grey and gr(a|e)y are equivalent patterns which both describe the set containing “gray” and “grey”.
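A small R sketch can illustrate both points: the two patterns accept the same strings, and parentheses change what an operator applies to. The test words are hypothetical, and the anchors `^` and `$` are an addition for the example, used to require a whole-string match.

```r
# Hypothetical test words.
words <- c("gray", "grey", "griy", "gry", "grayish")

# Alternation of whole words versus alternation of a single letter.
alt     <- grepl("^(gray|grey)$", words)   # TRUE TRUE FALSE FALSE FALSE
grouped <- grepl("^gr(a|e)y$", words)      # TRUE TRUE FALSE FALSE FALSE
same    <- identical(alt, grouped)         # TRUE

# Without parentheses the bar splits the whole pattern, so this reads as
# "starts with gray" OR "ends with grey" and therefore accepts "grayish".
loose <- grepl("^gray|grey$", words)       # TRUE TRUE FALSE FALSE TRUE
```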
A quantifier after a token (such as a character) or group specifies how often that element is allowed to occur. The most common quantifiers are the question mark ?, the asterisk *, and the plus sign +.
? The question mark indicates zero or one occurrence of the preceding element. Hence, colou?r is a pattern that matches both color and colour;
* The asterisk indicates zero or more occurrences of the preceding element. Unlike the asterisk in filename wildcards, it does not stand for arbitrary text on its own; it only repeats the element before it. Hence, ab*c matches ac, abc, abbc, abbbc, and so on; and
+ The plus sign indicates one or more occurrences of the preceding element. Thus, ab+c matches abc, abbc, abbbc, and so on, but not ac.
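The three quantifiers can be compared side by side in R. This is a minimal sketch with hypothetical test strings; the anchors `^` and `$` are an addition for the example so that each string must match in full.

```r
# ? : zero or one "u"
spellings <- c("color", "colour", "colouur")
q_mark <- grepl("^colou?r$", spellings)   # TRUE TRUE FALSE

# Hypothetical strings with zero to three "b" characters.
s <- c("ac", "abc", "abbc", "abbbc")

# * : zero or more "b", so "ac" is accepted
star <- grepl("^ab*c$", s)   # TRUE TRUE TRUE TRUE

# + : one or more "b", so "ac" is rejected
plus <- grepl("^ab+c$", s)   # FALSE TRUE TRUE TRUE
```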
These constructions can be combined to form arbitrarily complex expressions, much like one can construct arithmetic expressions from numbers and operators. For example, H(ae?|ä)ndel and H(a|ae|ä)ndel are both valid patterns which match the same strings as the earlier example, H(ä|ae?)ndel.
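The claimed equivalence can be spot-checked in R by running all three patterns over the same inputs. This is only a sketch on a handful of hypothetical candidates, not a proof of equivalence; the anchors `^` and `$` and the escape `\u00e4` (for “ä”) are additions for the example.

```r
# The three patterns from the text, anchored to match whole strings.
patterns <- c("^H(\u00e4|ae?)ndel$",
              "^H(ae?|\u00e4)ndel$",
              "^H(a|ae|\u00e4)ndel$")

# Hypothetical candidates: three valid spellings and two invalid ones.
candidates <- c("Handel", "H\u00e4ndel", "Haendel", "Hendel", "Haeendel")

# One column of TRUE/FALSE values per pattern, one row per candidate.
results <- sapply(patterns, function(p) grepl(p, candidates))

# The patterns agree on these inputs if every column equals the first.
agree <- all(results == results[, 1])
```

Here `agree` is `TRUE` only if the three columns of `results` are identical, which is what the equivalence of the patterns predicts for any set of inputs.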