
Introduction



Regular expressions consist of one or more branches, separated from one another by the character `'. A regular expression matches any string that matches any one among its branches.


For example, the regular expression

abcdefghijklmnopqrstuvwxyz




matches any (single) lowercase character of the English alphabet.


Each branch of a regular expression is composed of pieces, which are composed by juxtaposition. A string matches a regular expression if it contains a match for the first piece, followed (immediately) by a match for the second piece, followed (immediately) by a match for the third piece, and so on.


For instance, each literal string that doesn't contain special characters, such as "Maple", can be understood as a concatenation of the pieces M, a, p, l, and e. The string

The fastest computer algebra system is Maple, of course!




matches the regular expression Maple, because it contains a match (M) for the first piece, followed immediately by a match for the second piece (a), and so on.


Although simple regular expressions like Maple decompose further, in practice, one normally thinks of these as "morally atomic", because they always match in this simple, straightforward way. (Their decomposition into single characters is merely a technical convenience.)


A piece consists of an atom, optionally followed by one of the characters `*', `+' or `?'. The purpose of these characters is as follows:

•

A string matches a piece of the form a*, where a is an atom, if it matches zero or more consecutive occurrences of a.

•

A string matches a piece of the form a+, where a is an atom, if it matches one or more consecutive occurrences of a.

•

A string matches a piece of the form a?, where a is an atom, if it matches zero or one occurrences of a.


Since single alphabetic characters all constitute atoms, each of u, u+, u*, and u? are examples of pieces based on the atomic regular expression u consisting of the single character `u'. The first of these, u, matches a single occurrence of the letter 'u'. The regular expression u+ matches any sequence of one or more u's. For example, all of u, uu, and uuuuuuu match, but uuv does not match.


For now, assume that the string (Fred)that is, the piece Fred surrounded by parenthesesis an atom and that it matches the same strings that the regular expression Fred matches. Then each of (Fred), (Fred)?, (Fred)+, and (Fred)*, are pieces. The first of these, (Fred), matches any string containing the substring "Fred". The second example, (Fred)?, matches zero or one occurrences of the string "Fred". (On its own, therefore, it matches every string. Typically, this would be used as a part of a larger regular expression.) The regular expression (Fred)+ matches any string containing one or more adjacent occurrences of the string "Fred": "aFredz", "aFredFredz", "aFredFredFredz", and so on. The regular expression (Fred)* is similar, matching everything matched by (Fred)+, but also matching zero occurrences of the string "Fred": both "az" and "aFredz". (Again, taken by itself this regular expression matches any string.)


Special Characters



As mentioned before, single alphanumeric characters are examples of atoms. In fact, any nonspecial character is an atomthat is, any character except one among the following special characters.


Each of these special characters is subject to special interpretation in the structure of a regular expression (some of which have already been encountered). Any of these characters can be included literally in a regular expression by escaping it with a \ (backslash character). For example, a\+b matches the string "a+b", but not the string "aaaab", which is matched by the regular expression a+b. When a \ character precedes a nonspecial character, it is the same as if the \ did not appear at all. For instance, the regular expressions a\b and ab are equivalent. The backslash character cannot appear as the last character in a regular expression.


Note: In Maple, backslashes used for escaping in calls to RegMatch and RegSubs must usually be "doubled up", due to their interpretation in Maple strings.


In addition to the nonspecial characters, an atom can be the regular expression (), which matches an empty string, or any regular expression re that is enclosed in parentheses: (re). This matches any match for the contained regular expression re, and is used as a grouping device. (For instance, in the examples above, with (Fred)+.)


As an example of the use of parentheses, consider the difference between the two regular expressions (Fred)+ and Fred+. The first matches repeated occurrences of the string "Fred", while the second matches strings such as "Fred", "Fredd", "Freddd", "Fredddd", and so on.


The single character regular expression `.' (a single "dot") is an atom. It matches any character. For example, the regular expression ... matches any string of length three.


The special characters `^' and `$' are also atomic regular expressions. The regular expression ^ matches the empty string at the beginning of any string, while $ matches the empty string at the end of a string. Thus, for example, the regular expression ^Fred$ matches the exact string "Fred", but not "aFredz" (which Fred matches). Regular expressions that employ ^ or $ are sometimes called anchored regular expressions.



Bracket Expressions



Atoms also include bracket expressions. A bracket expression is a sequence of characters enclosed in square brackets ([ and ]). In the simplest case, a bracket expression matches any one of the characters between the brackets. For this reason, these are sometimes also called character classes. For example, the regular expression [aeiouAEIOU] matches any vowel, and [aeiouAEIOU]+ matches any sequence of one or more vowels, in any order. However, several special constructs and abbreviations are permitted within bracket expressions.


If the first character after the opening bracket of a bracket expression is the character ^, then the meaning of the bracket expression is reversedthat is, it matches any character not from the set of characters described by the (rest of the) bracket expression. For example, the regular expression [^aeiouAEIOU] matches the consonants and either case of the letter 'y'.


A range of characters can be included in a bracket expression. Ranges are specified by separating two characters by the character `'. Thus, the lowercase letters of the alphabet can be abbreviated as the regular expression [az].


Ranges can be embedded anywhere in a bracket expression and there may be multiple ranges in a single bracket expression. Each of [azAZ], [0357AG], [^az09], and [aabbccddee] are valid bracket expressions. The first matches any upper or lowercase alphabet character, the second is equivalent to [0123567ABCDEFG], the third matches any character except a decimal digit or lowercase alphabet character, and the last example is just another way of saying [ae]. It is illegal for two character ranges to share an endpoint. For example, [ace] is syntactically invalid as a regular expression.


Note: Character ranges are highly dependent upon collating sequences, and are thus localedependent. As such, they are not portable. The Maple implementation recovers this lack of portability, at the cost of localization, by specifying that the POSIX locale be used for all computations involving regular expressions.


The right bracket character `]' can be included in a bracket expression literally by making it the first character (except for ^, if present). For example, []] matches the string "]".


A hyphen `' can be included in a bracket expression by making it either the first or last character, or by making it the endpoint of a character range. Any of [], [], and [a] will match a hyphen (but are not mutually equivalent). Note that [a] is a syntax error, because the character 'a' follows the character '' in the collating sequence of the POSIX locale.


Inside a bracket expression, most special characters lose their significance.




References



Friedl, Jeffrey E. F. Mastering Regular Expressions. O'Reilly and Associates Inc., 1997.



