A regular expression matcher

26 Feb 2013

Sections

Regular expressions (regexps) are a both a theoretical model of computation and a practical basis for language processing (e.g. the Unix command-line tools `grep` and `awk`, scripting languages such as `bash` and `perl` and the `lex` and `alex` compiler-building tools).

This tutorial we are going to build a simple regexp matcher in Haskell. The key to expressing matching in an elegant and compositional way is to use a combination of algebraic datatypes and programming with continuations. Note, however, that the emphasis of this tutorial is on clarity, not optimization; real-world code should rely on an industrial-strengh library such as `regexp-pcre`.

This tutorial was inspired by Olivier Danvy's Defunctionalization at Work (a BRICS technical report).

Regular expressions

I define UNIX as "30 definitions of regular expressions living under one roof." --- Don Knuth

Although practical tools adopt extended definitions of regular expressions, we will consider only simple regexps built as follows:

• the 0 regexp that matches nothing;
• the 1 regexp (or epsilon) that matches the empty string;
• a single character c matching itself;
• the union (+) of two regexps;
• the concatenation (.) of two regexps;
• the zero-or-more repetition (*) of a regexp (also called Kleene closure).

Concatenation, union and repetition are standard in practical tools such as `grep`; 0 and 1 are necessary solely to ensure good algebraic properties (e.g. that every regular language can be represented by a regexp) and sometimes ommitted in practice (e.g. in `grep`, the empty string is represented by `^\$` but there is no representation for nothing).

Instead of re-using some existing type to encode regexps (e.g. strings) we begin by defining a custom recursive datatype. This will make it easier to process regexps and define the matching algorithm.

``````data Regexp = Zero                  -- empty
| One                   -- epsilon
| Lit Char              -- single character
| Plus Regexp Regexp    -- union (+)
| Cat  Regexp Regexp    -- concatenation (.)
| Star Regexp           -- repetition (*)``````

Some examples of regexp together with a description of what they match:

``````Lit 'a'                        -- an 'a'
Plus (Lit 'a') (Lit 'b')       -- an 'a' or a 'b'
Cat (Star (Lit 'a')) (Lit 'b') -- b, ab, aab, aaab, ...``````

Note that any value of the `Regexp` data type (except for non-terminating ones) corresponds to a valid regular expression. This means that ill-formed regexps are prevented by the compiler simply by performing type checking!

However, writting complex regexps this way is verbose and error-prone. We therefore introduce a few some shortcuts.

First, we define two infix operators for union and concatenation of regexps. We can also use some algebraic properties such as `0+e = e+0 = e` and `1.e = 1.e = e` to do some automatic simplification. We also define a "smart" constructor for repetition.

``````infixl 6 <+>
infixl 7 <>

(<+>) :: Regexp -> Regexp -> Regexp
Zero <+> e = e
e <+> Zero = e
e1 <+> e2  = Plus e1 e2

(<>) :: Regexp -> Regexp -> Regexp
Zero <> _   = Zero
_ <> Zero   = Zero
One <> e    = e
e <> One    = e
e1 <> e2    = Cat e1 e2

star :: Regexp -> Regexp
star Zero     = One
star One       = One
star (Star e)  = Star e
star e         = Star e``````

Second, we employ the `OverloadedStrings` GHC extension to automatically convert any string literal to a regexp. For example, the string `"abc"` is converted to a concatenation of characters:

``Cat (Lit 'a') (Cat (Lit 'b') (Lit 'c'))``

(To see how this is achived open the full code window on the runnable example at the end.)

Using these operators we can write complex regexps quite succintly:

``````"ab"
"a"<+>"b"
"ab" <> star ("a"<+>"b")``````

Matching

Our objective is to define a matching function, i.e. a function that takes a regexp and a string and yields a boolean.

``match :: Regexp -> String -> Bool``

However, if we try to define the above function directly by recursion over the regular expression we will run into problems. For example, to match the concatenation of two regexps, we would have to split the input string:

``````match (Cat e1 e2) cs = match e1 cs1 && match e2 cs2
-- incomplete: missing some cs1, cs2 such that cs=cs1++cs2``````

The problem is that `match` does not yield how much of the input string was matched.

One solution would be to have `match` yield a pair e.g. of a boolean and a string. But another more elegant solution is to define a "worker" function taking an extra parameter called the continuation that specifies what to do with the non-matched part of the string; in this case, the continuation is a function from strings to booleans (the result of matching). For redability we define a type synomym for continuations;

``type Cont = String -> Bool   -- type of continuations``

We can now define the worker function `accept` by case-analysis on the regexp; note that `accept` (unlike `match`) is higher-order because the continuation argument `k` is a function. The top-level function `match` simply calls `accept` with a continuation that checks for the empty string (i.e. the `null` function from the standard Prelude).

``````accept :: Regexp -> String -> Cont -> Bool  -- worker function
accept Zero    cs      k = False
accept One     cs      k = k cs
accept (Lit c) (c':cs) k = c==c' && k cs
accept (Lit c) []      k = False
accept (Cat e1 e2) cs  k = accept e1 cs (\cs' -> accept e2 cs' k)
accept (Plus e1 e2) cs k = accept e1 cs k || accept e2 cs k
accept (Star e) cs k     = accept_star e cs k
where
accept_star e cs k
= k cs || accept e cs (\cs' -> cs'/=cs && accept_star e cs' k)``````

The case of `Zero` always fails while `One` success and applies the continuation to the remaining string. The case for single character checks the start of the string and applies the continuation. The case for union is trivial. Concatenation is more interesting: we simply call `accept` for the first regexp e1 with a continuation that calls `accept` for e2 (and then proceeds to the original continuation). Finally, we use an auxiliary definition `accept_star` for matching the repetition.

The following example allows experimenting with the matcher; try changing the string or the regexp and re-running the program.

``````{-# LANGUAGE OverloadedStrings #-}
import GHC.Exts (IsString(..))

data Regexp = Zero                  -- empty
| One                   -- epsilon
| Lit Char              -- single character
| Plus Regexp Regexp    -- union (+)
| Cat  Regexp Regexp    -- concatenation (.)
| Star Regexp           -- repetition (*)
deriving Show

infixl 6 <+>
infixl 7 <>

(<+>) :: Regexp -> Regexp -> Regexp
Zero <+> e = e
e <+> Zero = e
e1 <+> e2  = Plus e1 e2

(<>) :: Regexp -> Regexp -> Regexp
Zero <> _   = Zero
_ <> Zero   = Zero
One <> e    = e
e <> One    = e
e1 <> e2    = Cat e1 e2

star :: Regexp -> Regexp
star Zero     = One
star One       = One
star (Star e)  = Star e
star e         = Star e

type Cont= String -> Bool

accept :: Regexp -> String -> Cont -> Bool  -- worker function
accept Zero    cs      k = False
accept One     cs      k = k cs
accept (Lit c) (c':cs) k = c==c' && k cs
accept (Lit c) []      k = False
accept (Cat e1 e2) cs  k = accept e1 cs (\cs' -> accept e2 cs' k)
accept (Plus e1 e2) cs k = accept e1 cs k || accept e2 cs k
accept (Star e) cs k     = accept_star e cs k
where
accept_star e cs k
= k cs || accept e cs (\cs' -> cs'/=cs && accept_star e cs' k)

match :: Regexp -> String -> Bool
match re s = accept re s null

instance IsString Regexp where
fromString cs = foldr ((<>) . Lit) One cs

-- show
main = let re = "ab" <> star "ba"
in print (match re "abbaba")
-- /show``````

Exercises

1. Define auxiliary functions `plus` and `option` for the `+` and `?` operators in `grep`, i.e. one-or-more times repetition and optional matching.

2. The `accept_star` function for handling repetition includes a condition `cs'/=cs` to check that characters are consumed in the recursive case. However, checking list inequality requires traversing the full list in the worst-case. Find a way to avoid this inefficient check.