A regular expression matcher

Regular expressions (regexps) are a both a theoretical model of computation and a practical basis for language processing (e.g. the Unix command-line tools grep and awk, scripting languages such as bash and perl and the lex and alex compiler-building tools).

This tutorial we are going to build a simple regexp matcher in Haskell. The key to expressing matching in an elegant and compositional way is to use a combination of algebraic datatypes and programming with continuations. Note, however, that the emphasis of this tutorial is on clarity, not optimization; real-world code should rely on an industrial-strengh library such as regexp-pcre.

This tutorial was inspired by Olivier Danvy's Defunctionalization at Work (a BRICS technical report).

Regular expressions

I define UNIX as "30 definitions of regular expressions living under one roof." --- Don Knuth

Although practical tools adopt extended definitions of regular expressions, we will consider only simple regexps built as follows:

  • the 0 regexp that matches nothing;
  • the 1 regexp (or epsilon) that matches the empty string;
  • a single character c matching itself;
  • the union (+) of two regexps;
  • the concatenation (.) of two regexps;
  • the zero-or-more repetition (*) of a regexp (also called Kleene closure).

Concatenation, union and repetition are standard in practical tools such as grep; 0 and 1 are necessary solely to ensure good algebraic properties (e.g. that every regular language can be represented by a regexp) and sometimes ommitted in practice (e.g. in grep, the empty string is represented by ^$ but there is no representation for nothing).

Instead of re-using some existing type to encode regexps (e.g. strings) we begin by defining a custom recursive datatype. This will make it easier to process regexps and define the matching algorithm.

data Regexp = Zero                  -- empty
            | One                   -- epsilon
            | Lit Char              -- single character
            | Plus Regexp Regexp    -- union (+)
            | Cat  Regexp Regexp    -- concatenation (.)
            | Star Regexp           -- repetition (*)

Some examples of regexp together with a description of what they match:

Lit 'a'                        -- an 'a'
Plus (Lit 'a') (Lit 'b')       -- an 'a' or a 'b'
Cat (Star (Lit 'a')) (Lit 'b') -- b, ab, aab, aaab, ...

Note that any value of the Regexp data type (except for non-terminating ones) corresponds to a valid regular expression. This means that ill-formed regexps are prevented by the compiler simply by performing type checking!

However, writting complex regexps this way is verbose and error-prone. We therefore introduce a few some shortcuts.

First, we define two infix operators for union and concatenation of regexps. We can also use some algebraic properties such as 0+e = e+0 = e and 1.e = 1.e = e to do some automatic simplification. We also define a "smart" constructor for repetition.

infixl 6 <+>
infixl 7 <>

(<+>) :: Regexp -> Regexp -> Regexp
Zero <+> e = e
e <+> Zero = e
e1 <+> e2  = Plus e1 e2

(<>) :: Regexp -> Regexp -> Regexp
Zero <> _   = Zero
_ <> Zero   = Zero
One <> e    = e
e <> One    = e
e1 <> e2    = Cat e1 e2

star :: Regexp -> Regexp 
star Zero     = One
star One       = One
star (Star e)  = Star e
star e         = Star e

Second, we employ the OverloadedStrings GHC extension to automatically convert any string literal to a regexp. For example, the string "abc" is converted to a concatenation of characters:

Cat (Lit 'a') (Cat (Lit 'b') (Lit 'c'))

(To see how this is achived open the full code window on the runnable example at the end.)

Using these operators we can write complex regexps quite succintly:

"ab" 
"a"<+>"b"
"ab" <> star ("a"<+>"b")

Matching

Our objective is to define a matching function, i.e. a function that takes a regexp and a string and yields a boolean.

match :: Regexp -> String -> Bool

However, if we try to define the above function directly by recursion over the regular expression we will run into problems. For example, to match the concatenation of two regexps, we would have to split the input string:

match (Cat e1 e2) cs = match e1 cs1 && match e2 cs2
    -- incomplete: missing some cs1, cs2 such that cs=cs1++cs2

The problem is that match does not yield how much of the input string was matched.

One solution would be to have match yield a pair e.g. of a boolean and a string. But another more elegant solution is to define a "worker" function taking an extra parameter called the continuation that specifies what to do with the non-matched part of the string; in this case, the continuation is a function from strings to booleans (the result of matching). For redability we define a type synomym for continuations;

type Cont = String -> Bool   -- type of continuations

We can now define the worker function accept by case-analysis on the regexp; note that accept (unlike match) is higher-order because the continuation argument k is a function. The top-level function match simply calls accept with a continuation that checks for the empty string (i.e. the null function from the standard Prelude).

accept :: Regexp -> String -> Cont -> Bool  -- worker function
accept Zero    cs      k = False
accept One     cs      k = k cs
accept (Lit c) (c':cs) k = c==c' && k cs
accept (Lit c) []      k = False
accept (Cat e1 e2) cs  k = accept e1 cs (\cs' -> accept e2 cs' k)
accept (Plus e1 e2) cs k = accept e1 cs k || accept e2 cs k
accept (Star e) cs k     = accept_star e cs k
  where 
     accept_star e cs k 
       = k cs || accept e cs (\cs' -> cs'/=cs && accept_star e cs' k)

The case of Zero always fails while One success and applies the continuation to the remaining string. The case for single character checks the start of the string and applies the continuation. The case for union is trivial. Concatenation is more interesting: we simply call accept for the first regexp e1 with a continuation that calls accept for e2 (and then proceeds to the original continuation). Finally, we use an auxiliary definition accept_star for matching the repetition.

The following example allows experimenting with the matcher; try changing the string or the regexp and re-running the program.

{-# LANGUAGE OverloadedStrings #-}
import GHC.Exts (IsString(..))

data Regexp = Zero                  -- empty
            | One                   -- epsilon
            | Lit Char              -- single character
            | Plus Regexp Regexp    -- union (+)
            | Cat  Regexp Regexp    -- concatenation (.)
            | Star Regexp           -- repetition (*)
            deriving Show

infixl 6 <+>
infixl 7 <>

(<+>) :: Regexp -> Regexp -> Regexp
Zero <+> e = e
e <+> Zero = e
e1 <+> e2  = Plus e1 e2

(<>) :: Regexp -> Regexp -> Regexp
Zero <> _   = Zero
_ <> Zero   = Zero
One <> e    = e
e <> One    = e
e1 <> e2    = Cat e1 e2

star :: Regexp -> Regexp 
star Zero     = One
star One       = One
star (Star e)  = Star e
star e         = Star e

type Cont= String -> Bool

accept :: Regexp -> String -> Cont -> Bool  -- worker function
accept Zero    cs      k = False
accept One     cs      k = k cs
accept (Lit c) (c':cs) k = c==c' && k cs
accept (Lit c) []      k = False
accept (Cat e1 e2) cs  k = accept e1 cs (\cs' -> accept e2 cs' k)
accept (Plus e1 e2) cs k = accept e1 cs k || accept e2 cs k
accept (Star e) cs k     = accept_star e cs k
  where 
     accept_star e cs k 
       = k cs || accept e cs (\cs' -> cs'/=cs && accept_star e cs' k)


match :: Regexp -> String -> Bool
match re s = accept re s null

instance IsString Regexp where
  fromString cs = foldr ((<>) . Lit) One cs

-- show
main = let re = "ab" <> star "ba"
       in print (match re "abbaba")
-- /show

Exercises

  1. Define auxiliary functions plus and option for the + and ? operators in grep, i.e. one-or-more times repetition and optional matching.

  2. The accept_star function for handling repetition includes a condition cs'/=cs to check that characters are consumed in the recursive case. However, checking list inequality requires traversing the full list in the worst-case. Find a way to avoid this inefficient check.