A Brief Introduction To Regular Expressions

What is a Regular Expression?

Purpose

A regular expression is a flexible way of defining patterns of text. It is a formal language which is interpreted by a regular expression engine (which might be part of an application or a programming language) that parses input text and compares it to the regular expression, and then performs operations on text that matches the regular expression.

Common uses of regular expressions include:

  • Matching text
  • Substituting text
  • Extracting text

Syntax

The basic syntax of a regular expression is /pattern/flags. The main part is the text pattern description, and the flags control the behaviour of the regular expression engine.

Different regular expression engines support different features, and also vary slightly in their syntax. After an overview of general regexp syntax we will look at some common applications and languages and how they support regular expressions.

Examples

%\d\d?/\d\d?/\d\d\d?\d?%

This will match something that looks like a date, in a format like dd/mm/yyyy or m/d/yy. Note that it does not check that it is a valid date: a string like 75/33/9876 would match. Also note that a percent sign has been used as the regexp delimiter; this can be clearer when the pattern contains slashes.
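
As a rough illustration, the same pattern can be tried out with Python's re module, used here only as a convenient stand-in engine. Python has no delimiter syntax, so the pattern is written as a raw string; the sample strings and the name date_like are made up for this sketch.

```python
import re

# The date-like pattern from above; Python needs no delimiters.
date_like = re.compile(r"\d\d?/\d\d?/\d\d\d?\d?")

samples = ["17/02/2008", "2/7/08", "75/33/9876", "2008-02-17"]

# fullmatch() requires the whole string to fit the pattern.
matches = {s: bool(date_like.fullmatch(s)) for s in samples}

for text, ok in matches.items():
    print(f"{text!r}: {'matches' if ok else 'no match'}")
```

The nonsense date 75/33/9876 is accepted because the pattern only describes the shape of the text, not the values.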

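/<p( [^>]*)?>.*?<\/p>/s

This regular expression will match a paragraph element and its contents in an HTML document. The slash in the closing tag is escaped so that it is not mistaken for the closing delimiter, and the s flag (as it is spelt in Perl) lets the dot match line breaks so that a paragraph can span several lines.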
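
A small sketch of the same idea in Python, assuming its re module as the engine. In Python the option that lets the dot match line breaks is called re.DOTALL, and no delimiter escaping is needed; the HTML fragment is invented for illustration.

```python
import re

html = '<p class="intro">First\nparagraph.</p>\n<p>Second.</p>'

# re.DOTALL lets the dot match line breaks; .*? is lazy, so each
# match stops at the first closing tag it can reach.
para = re.compile(r"<p( [^>]*)?>.*?</p>", re.DOTALL)

paragraphs = [m.group(0) for m in para.finditer(html)]
print(paragraphs)
```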

/
  (?:
    (?:(-?\d{1,3})m {0,3}(-?\d{1,4})y (?:\(( {0,2}-?\d{1,2})\))?)
    |
    (?: {0,2} (?:(-?\d{1,3})\/(\d)|(Junct))[ ]
      - {0,2} (?:(-?\d{1,3})\/(\d)|(Junct))
    )
  )
  \s+
  (C|C&A&T|C & [AT]|L[234]|\*L4|SD[12])?
  \s+
  (?:\(((?:[ -]\d{2})|(?:\d\.\d))\))?
  \s+
  (-?\d+)?
  \s+
  (ALIG35|AL70|GAUGE|MT70|[LR]TOP
   | TW[35]M|CYC(?:[69]_|1[38])(?:BO|[LR]T))
  \s*=?\s* (-?\d{1,3}\.\d+)mm
  (?:\(1: ?(\d{2,3})\))?
  \s+
  \[ {0,2}(\d{1,3})\]
  \s*(.*)
  (?: +> :)_+ *_+:_+\/_+\/_+
  \s+
  (?:to +(?:(-?\d{1,3})m {0,3}(-?\d{1,4})y[ ]
     (?:\(( {0,2}-?\d{1,2})\))?): *(\d+)cycles)?
  \s+
  ((?:P )?(?:IN)?VALID (?: BUT OFF ROUTE)?|OFF ROUTE|UNVERIFIED)?
/gmx

This is a much more complex example. It is a regular expression that was written to match text in reports produced by a legacy system. These reports had been designed to be printed and read; by using a regular expression it was possible to parse the report and extract the important information from it. This regular expression matches a group of lines in the report and captures the pieces of data that we are interested in. It would be possible to use other methods to parse this report, but the flexibility of regular expressions makes them well suited to coping with the quirks of the report formatting produced by different versions of the legacy software: the use of alternation and variable-length matches means that this regexp can match all formats of the report, so the parsing code does not have to be rewritten for each version.

Regexp elements

Characters

Normal characters

Normal characters match themselves only.

  • a b c X Y Z
  • 0 1 2 3 4 5 6 7 8 9
  • " _ = #

Special characters

More exotic characters are matched using character sequences.

.
The dot character will match almost any single character. It does not usually match line break characters unless the engine is told otherwise, for example by setting the s flag in Perl.
\* \? \} \[ \] \/ \\ \^ \$
If you need to match a literal character that has a special meaning in regular expressions then it needs to be escaped using a backslash.
\n \t \e \a
There are several predefined sequences for non-printable characters. \n is a new line character, \t is a tab, \e is an escape character and \a is a bell. These will be familiar to anyone who has used C or many other programming languages.
\xB0 \u0260
Some regular expression engines allow arbitrary hexadecimal or Unicode code points to be represented using a \x00 or \u0000 syntax.
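
The difference between a special character and its escaped form can be sketched in Python (an assumption of this example; other engines behave similarly for these cases). re.escape is a Python helper that adds the backslashes for you; the sample strings are invented.

```python
import re

# Unescaped dot: any single character except a line break.
dot_matches = [s for s in ["abc", "a.c", "a\nc"] if re.fullmatch(r"a.c", s)]

# Escaped dot: only a literal full stop.
literal_matches = [s for s in ["abc", "a.c"] if re.fullmatch(r"a\.c", s)]

# re.escape() escapes every special character in a piece of text.
price = "$9.99 (approx.)"
found = re.search(re.escape(price), "Costs $9.99 (approx.) today")

# \t is a tab and \xB0 is the character with hexadecimal code B0.
temperature = re.search(r"\t\d+\xB0", "temp:\t21\xB0C")

print(dot_matches, literal_matches, found.group(0), temperature.group(0))
```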

Character classes

By using [square brackets] you can match any of several different characters.

Collections of characters

[abc]
The simplest form is a list of characters in square brackets; this will match any one of those characters.
[0-9]
[a-z]
[0-9a-zA-Z]
To make it simpler to match a large number of possible characters you can specify ranges.
[-+0-9]
Simple characters and ranges can be combined as shown above. Note that the hyphen has a special meaning for ranges, so to match a literal hyphen you should place it at the start of the character class (alternatively you can escape it with a backslash).
[^abc]
Negation is done by having a caret at the start of a character class. The above example will match any character apart from a, b, or c.
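
A short sketch of these character classes, assuming Python's re module; findall returns every match, which makes it easy to see which characters each class accepts. The sample text is invented.

```python
import re

text = "x = -42 + 7a"

# A range: any one digit.
digits = re.findall(r"[0-9]", text)

# A leading hyphen is literal, so this class is: -, + or a digit.
signs_and_digits = re.findall(r"[-+0-9]", text)

# A leading caret negates the class: anything except a, b or c.
not_abc = re.findall(r"[^abc]", "aXbYc")

print(digits, signs_and_digits, not_abc)
```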

Pre-defined character classes

There are many predefined shorthand sequences for commonly used character classes.

\d
Any digit.
\D
Any character other than a digit.
\s
Any white space character, e.g. space, tab.
\S
Any non-space character.
\w
Any word character. The definition of word characters can vary, but it usually means any letter, any digit, or an underscore.
\W
Any non-word character.
[[:alpha:]]
Any letter character. This is an example of a POSIX character class. Note the double square brackets used here; the POSIX character class is [:alpha:] which can only be used inside the normal square brackets for character classes. POSIX character classes can be combined with other elements within a character class, e.g. [[:alpha:]ab[:digit:]].
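
The shorthand classes can be tried out as follows. This sketch assumes Python's re module, which supports \d, \s and \w and their negations but not the POSIX [:alpha:] style classes, so those are left out. The sample text is invented.

```python
import re

text = "Room 101,\tfloor_2"

digits = re.findall(r"\d", text)                 # each digit
non_digit_count = len(re.findall(r"\D", text))   # everything else
spaces = re.findall(r"\s", text)                 # a space and a tab
words = re.findall(r"\w+", text)                 # runs of word characters
non_words = re.findall(r"\W", text)              # separators between them

print(digits, non_digit_count, spaces, words, non_words)
```

Note that the underscore counts as a word character, so floor_2 is found as a single word.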

Repetition

Quantifiers are used to control repetitive matching. Greedy quantifiers will try to match as much text as possible; lazy quantifiers will try to match as little as possible. Lazy quantifiers are used much less frequently than greedy quantifiers.

Normal quantifiers

ab?c
The question mark character will match either zero or one occurrence of the preceding expression. The above example will match either ac or abc, preferring the latter if possible.
ab*c
The asterisk character matches zero or more occurrences. The above example will match ac, abc, abbc, abbbc, …
ab+c
The plus character matches one or more occurrences. The above example will match abc, abbc, abbbc, …

Range quantifiers

ab{3}c
A number inside braces indicates an exact number of occurrences. The above will only match abbbc.
ab{2,4}c
Two numbers inside braces, separated by a comma, indicate a range of occurrences. This example will match abbc, abbbc, or abbbbc.
ab{2,}c
Omitting the second number, but keeping the comma, gives a minimum number of occurrences. This example will match abbc, abbbc, abbbbc, …
ab{,3}c
Omitting the first number gives a maximum number of occurrences. This example will match ac, abc, abbc, or abbbc. Not every engine accepts this form, so ab{0,3}c is the more portable spelling.
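
The quantifier examples above can be checked against a list of candidate strings. This is a minimal sketch assuming Python's re module; the helper name accepted is hypothetical, and the maximum-only form is written as {0,3}, which means the same as {,3}.

```python
import re

candidates = ["ac", "abc", "abbc", "abbbc", "abbbbc", "abbbbbc"]

def accepted(pattern):
    """Return the candidates that the pattern matches in full."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

patterns = ["ab?c", "ab*c", "ab+c", "ab{3}c", "ab{2,4}c", "ab{2,}c", "ab{0,3}c"]
results = {p: accepted(p) for p in patterns}

for pattern, hits in results.items():
    print(f"{pattern:10} {hits}")
```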

Lazy quantifiers

ab??c
Like ab?c this will match either ac or abc, but the double question mark will make it prefer to match the former if this is possible.
ab*?c
This will match the same set of possibilities as ab*c, but if there are several possible matches then it will match as few b characters as it can.
ab+?c
This is the lazy equivalent of ab+c.
ab{1,2}?c
Similarly, putting a question mark after a range quantifier makes it lazy.
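
The difference between greedy and lazy matching only shows up when more than one match is possible. A minimal sketch, assuming Python's re module and invented sample text:

```python
import re

text = "<b>one</b> and <b>two</b>"

# Greedy: .* runs to the last closing tag it can find.
greedy = re.search(r"<b>.*</b>", text).group(0)

# Lazy: .*? stops at the first closing tag.
lazy = re.search(r"<b>.*?</b>", text).group(0)

# With nothing after the quantifier, ? takes the b and ?? leaves it.
greedy_opt = re.match(r"ab?", "abc").group(0)
lazy_opt = re.match(r"ab??", "abc").group(0)

# A lazy quantifier still takes the b when the rest of the pattern needs it.
forced = re.match(r"ab??c", "abc").group(0)

print(greedy, lazy, greedy_opt, lazy_opt, forced, sep="\n")
```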

Alternation, grouping and matching

Alternation

a|b
Matching one of a set of alternatives is done by using the pipe operator. This will match either a or b.
foo|bar
The alternation operator has very low precedence, in particular lower than a sequence of characters. This means that this example will match either foo or bar, not fooar or fobar.
foo|bar|baz
Matching one of more than two possibilities is simply done by using multiple pipe operators. This will match any one of foo, bar, or baz.
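
A quick check of the precedence rule, assuming Python's re module; the word list is made up to include the two strings that would match if the pipe bound more tightly.

```python
import re

words = ["foo", "bar", "baz", "fooar", "fobar"]

# The pipe separates whole sequences: foo or bar, nothing in between.
either = [w for w in words if re.fullmatch(r"foo|bar", w)]

# More alternatives are added with more pipes.
any_of_three = [w for w in words if re.fullmatch(r"foo|bar|baz", w)]

print(either, any_of_three)
```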

Grouping and matching

    foo(bar)?
    Parentheses group a set of characters together. Here the ? quantifier applies to everything inside the brackets, so this will match either foo or foobar.
    foo(bar|foo)
    Parentheses can be combined with other operators, such as the pipe alternation operator. This will match either foobar or foofoo.
    (fooba[rz])
    As well as grouping characters together, parentheses are used to capture elements within a regular expression which can then be examined later on. This will match either foobar or foobaz and the matching text will be captured; in sed it will be stored as \1, in Perl in the variable $1.

    Grouping without matching

    (?:foo)
    If you want to group a set of characters together without capturing them, then the (?:…) operator will do this.
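
Grouping, capturing and non-capturing groups can be compared side by side. This is a sketch assuming Python's re module, where captured text is read with group(1) rather than $1, and \1 is used in replacements much as in sed. The sample strings are invented.

```python
import re

# Grouping: the ? applies to the whole of (bar).
optional = [s for s in ["foo", "foobar", "foob"]
            if re.fullmatch(r"foo(bar)?", s)]

# Capturing: group(1) holds the text matched inside the parentheses.
captured = re.search(r"(fooba[rz])", "a foobaz here").group(1)

# The captured text can be reused in a replacement as \1.
marked = re.sub(r"(fooba[rz])", r"<\1>", "foobar and foobaz")

# (?:...) groups without capturing, so only (ba[rz]) is numbered.
groups = re.fullmatch(r"(?:foo)+(ba[rz])", "foofoobar").groups()

print(optional, captured, marked, groups)
```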

    Positional markers

    As well as matching text itself, you can control where the text occurs by using positional markers. These markers do not match any text themselves, but control where the other patterns in the regular expression are able to match text.

    Beginning/end of lines

    ^foo
    ^ matches the start of a line or piece of text. This example will only match foo if it is at the start of a line.
    bar$
    $ matches the end of a line or the end of the text. This will match bar when it is at the end of a line.

    Beginning and end of words

    \b
    This matches word boundaries. In the string foo bar it will match the start of the string, between the o and the space at the end of the word foo, between the space and the b at the start of the word bar, and at the end of the string.
    \B
    This is the opposite of \b and will match anywhere other than a word boundary, i.e. in the middle of words, and within sequences of non-word characters.
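
Because positional markers match positions rather than text, it can help to list the offsets at which they match. A minimal sketch for the string foo bar, assuming Python's re module:

```python
import re

text = "foo bar"

# Anchors: where in the text the pattern is allowed to match.
starts_with_foo = bool(re.search(r"^foo", text))
ends_with_bar = bool(re.search(r"bar$", text))
starts_with_bar = bool(re.search(r"^bar", text))

# Offsets of every word boundary and every non-boundary position.
boundaries = [m.start() for m in re.finditer(r"\b", text)]
non_boundaries = [m.start() for m in re.finditer(r"\B", text)]

print(starts_with_foo, ends_with_bar, starts_with_bar)
print(boundaries, non_boundaries)
```

The boundary offsets are the four positions described above: the start of the string, either side of the space, and the end of the string.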

    Lookaround

    foo(?=bar)
    This is a positive lookahead: it will match if the text contains foobar, but will only match the foo part, and not the bar part.
    foo(?!bar)
    This is a negative lookahead: it will match foo, unless it is immediately followed by bar.
    (?<=foo)bar
    This is a positive lookbehind: it will match the text bar, but only if it occurs as foobar. The text foo will not be part of the match.
    (?<!foo)bar
    This is a negative lookbehind: it will match bar unless it is preceded by foo.
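
The four kinds of lookaround can be compared by looking at the span of text each one actually matches. This sketch assumes Python's re module and an invented sample string.

```python
import re

text = "foobar foobaz bazbar"

# Each list holds the (start, end) offsets of the matched text only;
# the lookaround part is checked but never included in the match.
ahead = [m.span() for m in re.finditer(r"foo(?=bar)", text)]
not_ahead = [m.span() for m in re.finditer(r"foo(?!bar)", text)]
behind = [m.span() for m in re.finditer(r"(?<=foo)bar", text)]
not_behind = [m.span() for m in re.finditer(r"(?<!foo)bar", text)]

print(ahead, not_ahead, behind, not_behind)
```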

    Flags

    Flags control the overall behaviour of the regular expression.

    i
    The i (insensitive) flag tells the regular expression to match in a case insensitive manner. /foo/ will only match foo, but /foo/i will also match FOO, fOo, and so on.
    g
    The g (global) flag tells the regexp engine to match all possible instances of the regular expression. Normally it will stop after the first match, but if this flag is set then it will look for any further matches.
    m
    The m (multiline) flag is for regular expressions that are applied to more than one line of text. Normally ^ and $ only match at the very start and end of the text, but if this flag is set then they also match at the start and end of each line within it. In Perl and PCRE this flag does not change the behaviour of the dot character; a separate s flag makes the dot match line end characters as well.
    x
    Unlike the other flags, this does not alter what the regular expression matches. Instead it allows you to write more legible regular expressions by splitting them across multiple lines: white space in the pattern, including line breaks and indentation, is ignored, so a literal space has to be escaped or written as a character class such as [ ]. The earlier example used this flag to break up a very long regular expression.
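
Flags are spelt differently from engine to engine. As a hedged sketch, this is how the ideas above look in Python's re module: re.IGNORECASE, re.MULTILINE, re.DOTALL and re.VERBOSE play the roles of i, m, s and x, and there is no g flag because findall returns every match. The sample text is invented.

```python
import re

text = "Foo one\nfoo two\nFOO three"

first = re.search(r"foo", text).group(0)            # case sensitive
all_ci = re.findall(r"foo", text, re.IGNORECASE)    # every match, any case

# Without MULTILINE, ^ only matches at the very start of the text.
without_m = re.findall(r"^foo \w+", text, re.I)
with_m = re.findall(r"^foo \w+", text, re.I | re.M)

# The dot only crosses a line break when DOTALL is set.
dot_default = re.search(r"one.foo", text)
dot_all = re.search(r"one.foo", text, re.S)

# VERBOSE ignores white space in the pattern and allows comments.
verbose = re.compile(r"""
    ^ foo      # the word foo at the start of a line
    \s+        # at least one white space character
    (\w+)      # capture the following word
""", re.I | re.M | re.X)
second_words = verbose.findall(text)

print(first, all_ci, without_m, with_m, second_words)
```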

    Programs and Languages

    grep

    grep is a simple program usually used to extract lines from text files that match a given pattern. It is often used to match plain character sequences, so in its basic mode very few characters are special: most regular expression operators have to be preceded by a backslash to give them their special meaning. Exceptions to this include the dot, character classes, the * quantifier and the ^ and $ anchors, which work as normal.

    Metacharacters
    . \n \t \s \S \w \W
    POSIX character classes, e.g. [[:digit:]]
    Repetition
    \? * \+ \{n,m\}
    Alternation and grouping
    \| \(…\)
    Anchoring
    ^ $ \b \B \< \>

    Examples

    grep 'FIXME\|TODO' */*.p[lm]

    This will print any lines containing either FIXME or TODO from Perl files.

    egrep

    grep also has an extended mode (grep -E, traditionally also available as egrep) which makes most of the operator characters behave as normal, so you do not need to prefix them with a backslash as in its basic mode. If you are using anything other than very simple regular expressions with grep then it is best to use this mode.

    Metacharacters
    . \n \t \s \S \w \W
    POSIX character classes, e.g. [[:digit:]]
    Repetition
    ? * + {n,m}
    Alternation and grouping
    | (…)
    Anchoring
    ^ $ \b \B \< \>

    Examples

    grep -E 'FIXME|TODO' **/*.p[lm]

    sed

    sed performs operations on streams of characters. The most common operation is to replace strings, but many more powerful things are possible. Its regexp syntax is very similar to the basic mode of grep.

    Metacharacters
    . \n \t \s \S \w \W
    POSIX character classes, e.g. [[:digit:]]
    Repetition
    * \? \+ \{m,n\}
    Alternation and grouping
    \| \( \)
    Anchors
    ^ $ \b \B \` \'

    Examples

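    sed 's%\([0-9][0-9]\)/\([0-9][0-9]\)/\([0-9][0-9]\)%20\3-\1-\2%'

    This will transform dates from the format mm/dd/yy to the format yyyy-mm-dd, assuming that the date is in the 21st century. The digits are written as [0-9] because sed does not understand the \d shorthand.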
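
For comparison, the same substitution can be sketched in Python, assuming its re module; the function name us_to_iso and the sample sentence are made up for this example. Python does understand \d, and groups are written without backslashes.

```python
import re

def us_to_iso(text):
    """Rewrite mm/dd/yy dates as 20yy-mm-dd, like the sed command above."""
    return re.sub(r"(\d\d)/(\d\d)/(\d\d)", r"20\3-\1-\2", text)

converted = us_to_iso("Talk given on 02/17/08.")
print(converted)
```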

    sed '/^__END__$/,$d' foo.pl

    This will strip the POD documentation, and anything else that follows an __END__ line, from a Perl file.

    sed 's/^\s\+//;s/\s\+$//'

    This strips all leading and trailing white space from each line of text.

    sed 's/^\s*\(.*\S\)\?\s*$/\1/'

    This also strips leading and trailing spaces from text. The previous example uses two statements, one for leading space and one for trailing space; this one uses a single statement with a backreference. This approach is much less efficient and can be several orders of magnitude slower than the previous example, because of the extra work and memory needed to capture the text for the backreference.
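
Both approaches can be written down and checked against each other. This is only a sketch, assuming Python's re module; the function names are hypothetical, and it compares results rather than speed, which will depend on the engine and the input.

```python
import re

def strip_two_steps(line):
    """Remove leading white space, then trailing white space."""
    line = re.sub(r"^\s+", "", line)
    return re.sub(r"\s+$", "", line)

def strip_one_step(line):
    """Capture everything up to the last non-space character and keep it."""
    return re.sub(r"^\s*(.*\S)?\s*$", r"\1", line)

samples = ["  hello world \t", "x", "   ", ""]
two_step = [strip_two_steps(s) for s in samples]
one_step = [strip_one_step(s) for s in samples]

print(two_step, one_step)
```

In everyday Python code the built-in str.strip() would normally be used instead of either regular expression.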

    perl

    Perl has by far the most comprehensive support for regular expression features. Many features appear first in Perl before being copied by other languages and programs.

    The Perl regular expression syntax is used in many applications and other programming languages through the PCRE library. This library is used by PHP, the Apache webserver, the Exim mailserver, and many others.

    Metacharacters
    All metacharacters are supported.
    Repetition
    ? * + {n,m} ?? *? +? {n,m}?
    Alternation and grouping
    | (…) (?:…)
    Anchors
    ^ $ \b \B (?=…) (?!…) (?<=…) (?<!…)

    Perl Example 1

    if ( /^ *ELR : +([A-Z]{3}\d?|[A-Z]{2}\d{2})/o ) {
        $elr = $1;
    } elsif ( /^ *Track Id : +(\d{4})/o ) {
        $tid = $1;
    } elsif ( /^ *\d{1,3}\.\d{4}/o ) {
        # $template is an unpack() template defined elsewhere (not shown)
        my @data = unpack($template, $_);
        for (my $i = $#data; $i >= 0; --$i) {
            if ($i % 2 == 0) {
                # every other element is a separator -- delete these
                splice(@data, $i, 1);
            } else {
                # remove leading/trailing spaces
                $data[$i] =~ s/^ +//;
                $data[$i] =~ s/ +$//;
            }
        }
        print $elr, $tid, @data;
    }

    Perl Example 2

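    # decodes a standard deviation value
    my %errcodes = ( NA => -1, NF => -2, NV => -3,
                     SS => -4, ST => -5 );
    # alternation of the error codes: NA|NF|NV|SS|ST
    my $errcode_re = join '|', sort keys %errcodes;

    sub sdval {
        my $val = $_[0];
        if ($val =~ m/\d\.\d/) {
            # an ordinary decimal value is returned unchanged
            return $val;
        } elsif ($val =~ m/\*\*/) {
            # a pair of asterisks is decoded as 10
            return 10;
        } elsif ($val =~ m/($errcode_re)/) {
            # a known error code is mapped to its numeric value
            return $errcodes{$1};
        } else {
            return "";
        }
    }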
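
For readers more at home in Python, the decoding function can be restated roughly as below. This is a sketch rather than a translation of record: it assumes the intent is to look up whichever error code appears in the value, and the names ERRCODES and ERRCODE_RE are introduced here for illustration.

```python
import re

# Error codes and the values they decode to (taken from the Perl example).
ERRCODES = {"NA": -1, "NF": -2, "NV": -3, "SS": -4, "ST": -5}

# Build the alternation NA|NF|NV|SS|ST from the table's keys.
ERRCODE_RE = re.compile("|".join(map(re.escape, ERRCODES)))

def sdval(val):
    """Decode a standard deviation field from the report."""
    if re.search(r"\d\.\d", val):
        return val                      # an ordinary decimal value
    if re.search(r"\*\*", val):
        return 10                       # a pair of asterisks
    code = ERRCODE_RE.search(val)
    if code:
        return ERRCODES[code.group(0)]  # a known error code
    return ""

print([sdval(v) for v in ["1.5", "**", "NV", "??"]])
```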

    Further reading

    • perldoc perlretut
    • Mastering Regular Expressions by Jeffrey E F Friedl.
      2nd edition published by O'Reilly, 2002.

    Based on a talk presented by Oliver Burnett-Hall at Durham LUG on 17 February 2008.
