Wednesday, October 10, 2012

Regular expression matching in Haskell

This is a short Haskell program that implements regular expression matching ( you can download the source file here: RE.hs). Two neat things about this implementation:
  1.  matching is defined by recursion over the structure of the regexp (i.e. there is no need to convert the regexp to an automata); and
  2. it uses the "overloaded strings" extension to allow treating string literals as regexps.
Matching is implemented using continuations that specify what to do with the suffix left-over after matching the string; this leads to a very elegant and self-documented solution.
There is, however, one drawback: the handling of star-closure is inefficient due to the test cs'/=cs to check that a prefix of the string was consumed. The changes needed to implement a more efficient version are left as an exercise for the reader.



{- ------------------------------------------------------------------ 
   Regular expression matching in Haskell 
 
   This is a classical example of functional programming with 
   continuations. Adapted from the SML version by Olivier Danvy 
   (from "Defunctionalization at Work", BRICS RS-01-23, 2001).
   This version uses the "overloaded strings" GHC extension for
   mixing character literals in regexps.  

   Some examples to try on the GHCi interpreter;
   (Note: use the command ":set -XOverloadedStrings"
   to enable overloaded strings for the GHCi session.)

   > :set -XOverloadedStrings
   > match (star "ab") ""
   = True
   > match (star "ab") "aba"
   = False
   > match (star "ab") "abab"
   = True
   > match (star "a" <+> star "b") "aaaa"
   = True
   > match (star "a" <+> star "b") "aabb"
   = False
   > match (star ("a" <+> "b")) "aabb"
   = True

   Pedro Vasconcelos, 2012
  -------------------------------------------------------------------
-}

-- enable the overloaded strings language extension in GHC
{-# LANGUAGE OverloadedStrings #-}
module RE where
-- import the interface for overloading strings
import GHC.Exts ( IsString(..) )

-- Abstract syntax of regular expressions;
-- literal characters, empty word, empty language
-- concatenation, union and Kleene star closure
data Regexp = Lit Char
            | Epsilon
            | Empty
            | Concat Regexp Regexp
            | Union Regexp Regexp
            | Star Regexp
            deriving (Show, Read, Eq)
                            
-- "Smart" constructors that perform some simplifications
-- Infix binary operators for concation and union
infixl 7 <>
infixl 6 <+>

(<>) :: Regexp -> Regexp -> Regexp
Empty <> _ = Empty
_ <> Empty = Empty
Epsilon <> r = r
r <> Epsilon = r
r <> r'      = Concat r r'

(<+>) :: Regexp -> Regexp -> Regexp
Empty <+> r = r
r <+> Empty = r
r <+> r'    = Union r r'

-- Kleene star closure
star :: Regexp -> Regexp 
star Empty     = Epsilon
star Epsilon   = Epsilon
star (Star r) = Star r
star r        = Star r

-- coercion of strings to regular expressions
-- e.g. fromString "abc" = Lit 'a' <> Lit 'b' <> Lit 'c' <> Epsilon
instance IsString Regexp where
  fromString cs = foldr ((<>) . Lit) Epsilon cs


-- Higher-order, continuation-based matcher for regular expressions
-- worker function to match a regexp with string
-- 3rd argument is the continuation for the non-matched suffix
accept :: Regexp -> String -> (String -> Bool) -> Bool
accept Empty cs k   = False
accept Epsilon cs k = k cs
accept (Lit c) (c':cs) k = c==c' && k cs
accept (Lit c) []  k   = False
accept (Concat r1 r2) cs k = accept r1 cs (\cs' -> accept r2 cs' k)
accept (Union r1 r2) cs k = accept r1 cs k || accept r2 cs k
accept (Star r) cs k = accept_star r cs k
  
accept_star r cs k 
  = k cs || accept r cs (\cs' -> cs'/=cs && accept_star r cs' k)

-- top-level matcher; initial continuation tests for the empty string
match :: Regexp -> String -> Bool
match r cs = accept r cs null

-- end of file --




No comments:

Post a Comment