Text Parsing tutorial

Introduction

In this tutorial you will learn how to parse log-like files and how to render a log to a file. Many applications use logs to keep track of useful information to be analysed later on. Parsing a log-like file is an easy task in comparison with parsing, say, a programming language, but it is useful practice for a beginner with Haskell parsers. Most of the code in this tutorial is editable and runnable, so take advantage and experiment with the code yourself.

While log files do not have a specific format, we are going to output them as CSV tables. A specification of CSV can be found in RFC 4180.

Among the many parser libraries in Haskell, we have chosen attoparsec for this tutorial. Why? Firstly, because it is easy to use, and secondly, because it is fast. The other popular choice is parsec. Parsec has a similar interface to attoparsec, but there are some differences. For example, a parser in parsec can be used as a monad transformer, allowing you to add custom state. Also, when a parsing error arises, parsec gives you a lot more information than attoparsec. The lack of these features in attoparsec is precisely what makes it faster.

Writing a parser

Writing a parser involves teaching our computer how to read something. If a human sees the string "25", they will quickly conclude that the string contains a number. In fact, you probably read it as "twenty-five" instead of "two five". However, for the computer it is just a string of characters. In Haskell, we would have to write a function from String (or Text or ByteString, depending on the input type) to Integer in order to use it as a number. This is what parsing means. But how do we accomplish such a task? Well, say that an application has sent us the following ByteString:

"131.45.68.123"

It is the IP of a user that just connected to our server! In our code, we have the following type definition:

import Data.Word

data IP = IP Word8 Word8 Word8 Word8 deriving Show

It is a type we defined for IP's. The Word8 type represents 8-bit unsigned integer values. Now it would be great if we could parse the input 131.45.68.123 to the value IP 131 45 68 123. The first thing we look at is how IP's are written. They follow this pattern:

  • An 8-bit integer.
  • A dot.
  • An 8-bit integer.
  • A dot.
  • An 8-bit integer.
  • A dot.
  • An 8-bit integer.

When we write a parser in Haskell, what we actually do is follow the pattern of the input format from left to right. In this case, the function parseIP defines a parser for our type IP following the pattern we just described. Note that the decimal parser reads an unsigned decimal number and can produce a value of any integral type (Word8 in this example).

{-# LANGUAGE OverloadedStrings #-}

-- This attoparsec module is intended for parsing text that is
-- represented using an 8-bit character set, e.g. ASCII or ISO-8859-15.
import Data.Attoparsec.Char8
import Data.Word

-- | Type for IP's.
data IP = IP Word8 Word8 Word8 Word8 deriving Show

parseIP :: Parser IP
parseIP = do
  d1 <- decimal
  char '.'
  d2 <- decimal
  char '.'
  d3 <- decimal
  char '.'
  d4 <- decimal
  return $ IP d1 d2 d3 d4

main :: IO ()
main = print $ parseOnly parseIP "131.45.68.123"

Note that parseOnly, the function that applies the parser parseIP to the input "131.45.68.123", returns a value of type Either String IP. This is because parsing is not a total function, meaning that not every input has an output. For example, parsing the string "foo" cannot result in any IP. As a consequence, the parser fails. Each time the parser fails, it returns Left str, where str is a value of type String describing the error (in attoparsec, not very descriptively). If the parser ends successfully, it returns Right x, where x is the parsed value.
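One caveat is worth a small experiment. Because decimal does its arithmetic in the result type, a component such as 300 is not rejected when the result type is Word8; it wraps around instead. The following is a minimal sketch of a stricter variant, not part of the original parser: the names octet and parseIPChecked are hypothetical, and the code imports Data.Attoparsec.ByteString.Char8, which provides the Char8 parsers used in this tutorial. It also shows how to pattern match on the Either result.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Attoparsec.ByteString.Char8
import Data.Word (Word8)

data IP = IP Word8 Word8 Word8 Word8 deriving (Eq, Show)

-- | Hypothetical helper: read a decimal number as an Integer first,
--   and only accept it if it fits in 8 bits.
octet :: Parser Word8
octet = do
  n <- decimal :: Parser Integer
  if n <= 255
    then return (fromIntegral n)
    else fail "octet out of range"

-- | Same pattern as parseIP, using the checked component parser.
parseIPChecked :: Parser IP
parseIPChecked = do
  d1 <- octet
  _  <- char '.'
  d2 <- octet
  _  <- char '.'
  d3 <- octet
  _  <- char '.'
  d4 <- octet
  return $ IP d1 d2 d3 d4

main :: IO ()
main = do
  -- A well-formed address is accepted as before.
  print $ parseOnly parseIPChecked "131.45.68.123"
  -- An out-of-range component now ends up in the Left branch.
  case parseOnly parseIPChecked "300.1.1.1" of
    Left err -> putStrLn ("parse error: " ++ err)
    Right ip -> print ip
```

Reading the number as an Integer before narrowing it is what makes the range check possible; the rest of the parser is unchanged.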

As you can see, the approach to defining a parser is to use simpler parsers and combine them to write parsers for more complex expressions. In the following example, you will see how to parse a log file, including IP's. We will re-use the parser we just created.

Parsing logs

In this section, we develop a parser for log files that mixes content of different types. We use an example to guide the process.

Step 1: Define types

Say we have an online shop where we sell computer items like mice, keyboards, monitors and speakers. Each time a product is sold, our application saves some information in a log file, containing the time when the product was sold, the IP of the client and the name of the product. Each log entry may be represented by the following type:

import Data.Time

data Product = Mouse | Keyboard | Monitor | Speakers deriving Show

data LogEntry =
  LogEntry { -- A local time contains the date and the time of the day.
             -- For example: 2013-06-29 11:16:23.
             entryTime :: LocalTime
           , entryIP   :: IP
           , entryProduct   :: Product
             } deriving Show

The log file will therefore contain a list of elements of type LogEntry.

-- | Type synonym of a list of log entries.
type Log = [LogEntry]

Step 2: Follow the syntax

The log file, like anything that we can parse, follows a specific syntax. For example, here is today's log:

2013-06-29 11:16:23 124.67.34.60 keyboard
2013-06-29 11:32:12 212.141.23.67 mouse
2013-06-29 11:33:08 212.141.23.67 monitor
2013-06-29 12:12:34 125.80.32.31 speakers
2013-06-29 12:51:50 101.40.50.62 keyboard
2013-06-29 13:10:45 103.29.60.13 mouse

Each line contains a log entry. The idea is to write a parser for log entries and apply it line by line to get the list of all log entries. The elements contained in each entry are of type LocalTime, IP and Product. We have to write a parser for each one and combine them. Fortunately, we already have a parser for IP's that we can re-use. Let's write a parser for the time stamps.

We notice that the format followed in our log is:

yyyy-MM-dd hh:mm:ss

Following this specification, we can easily write the parser as follows.

{-# LANGUAGE OverloadedStrings #-}

import Data.Time
import Data.Attoparsec.Char8

timeParser :: Parser LocalTime
timeParser = do
  y  <- count 4 digit
  char '-'
  mm <- count 2 digit
  char '-'
  d  <- count 2 digit
  char ' '
  h  <- count 2 digit
  char ':'
  m  <- count 2 digit
  char ':'
  s  <- count 2 digit
  return $
    LocalTime { localDay = fromGregorian (read y) (read mm) (read d)
              , localTimeOfDay = TimeOfDay (read h) (read m) (read s)
                }

main :: IO ()
main = print $ parseOnly timeParser "2013-06-30 14:33:29"

Note the use of count and digit. The parser digit consumes the next character if it is a digit, and fails otherwise. The combinator count repeats a parser a given number of times. Since a year is written with 4 characters in our format, we use count 4 digit, meaning "read 4 digits from the input". The same rationale applies to the rest of the code. At the end, we return a value of type LocalTime.
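The parser above checks the shape of the time stamp, but not whether the numbers form a real date: fromGregorian clips out-of-range months and days to the nearest valid value, and the TimeOfDay constructor performs no check at all. As a sketch of a stricter variant, the code below uses fromGregorianValid and makeTimeOfDayValid from the time package, which return Nothing for impossible values. The names digits and timeParserChecked are hypothetical and do not appear elsewhere in this tutorial.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Attoparsec.ByteString.Char8
import Data.Time

-- | Hypothetical helper: read exactly n digits as an Int.
digits :: Int -> Parser Int
digits n = read <$> count n digit

-- | Like timeParser, but fails on dates or times that do not exist.
timeParserChecked :: Parser LocalTime
timeParserChecked = do
  y  <- digits 4
  _  <- char '-'
  mm <- digits 2
  _  <- char '-'
  d  <- digits 2
  _  <- char ' '
  h  <- digits 2
  _  <- char ':'
  m  <- digits 2
  _  <- char ':'
  s  <- digits 2
  -- Both validating functions return Nothing for impossible values.
  let checked = LocalTime <$> fromGregorianValid (fromIntegral y) mm d
                          <*> makeTimeOfDayValid h m (fromIntegral s)
  case checked of
    Just t  -> return t
    Nothing -> fail "invalid date or time"

main :: IO ()
main = do
  print $ parseOnly timeParserChecked "2013-06-30 14:33:29"
  -- February 30th does not exist, so this input is rejected.
  print $ parseOnly timeParserChecked "2013-02-30 14:33:29"
```

Whether this extra strictness is desirable depends on how much you trust the program that writes the log.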

Parsing alternatives

Lastly, we need a parser for Product values. This one is even easier, but it also has something new. A product is represented by a word. Each word is different, so there is no single syntax to read. We have different choices: it is either keyboard or mouse or monitor or speakers. This "or", separating different alternatives, is represented in attoparsec by the <|> combinator. The <|> operator combines two parsers of the same type into one that first tries the first parser. If that one succeeds, it returns its result. If it fails, it tries the second one, returning whatever result it gives. This would be the Product parser:

{-# LANGUAGE OverloadedStrings #-}

import Data.Attoparsec.Char8
import Control.Applicative

data Product = Mouse | Keyboard | Monitor | Speakers deriving Show

productParser :: Parser Product
productParser =
     (string "mouse"    >> return Mouse)
 <|> (string "keyboard" >> return Keyboard)
 <|> (string "monitor"  >> return Monitor)
 <|> (string "speakers" >> return Speakers)

main :: IO ()
main = do
  print $ parseOnly productParser "mouse"
  print $ parseOnly productParser "mouze"
  print $ parseOnly productParser "monitor"
  print $ parseOnly productParser "keyboard"

Note that we have to import the Control.Applicative module to use the <|> combinator. Also note that when we try to parse mouze we get a cryptic error message (not enough bytes) that does not say much about the parsing error. This is one trade-off attoparsec makes in order to get better performance than parsec. The API of parsec is very similar to that of attoparsec, but parsec reports much more information when a parsing error arises.
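Error messages can be made somewhat more helpful by naming a parser with the <?> combinator, which attoparsec also provides. The sketch below is an illustration rather than a replacement for productParser: the name productParserLabelled and the label text are made up for this example. It also uses <$ as a shorter way of writing "run this parser and return that constant". The exact wording of the resulting message depends on the attoparsec version, so none is shown here.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Control.Applicative ((<|>))
import Data.Attoparsec.ByteString.Char8

data Product = Mouse | Keyboard | Monitor | Speakers deriving (Eq, Show)

-- | Same alternatives as productParser, with a name attached so that
--   a failure mentions what was being parsed.
productParserLabelled :: Parser Product
productParserLabelled =
  (   (Mouse    <$ string "mouse")
  <|> (Keyboard <$ string "keyboard")
  <|> (Monitor  <$ string "monitor")
  <|> (Speakers <$ string "speakers")
  ) <?> "product name"

main :: IO ()
main = do
  print $ parseOnly productParserLabelled "mouse"
  -- The failure now carries the label given above.
  print $ parseOnly productParserLabelled "mouze"
```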

Step 3: Combine small parsers to build a bigger one

It is time to combine our parsers into one that can read a whole log entry. We only have to invoke them in order.

{-# LANGUAGE OverloadedStrings #-}

import Data.Word
import Data.Time
import Data.Attoparsec.Char8
import Control.Applicative

-----------------------
-------- TYPES --------
-----------------------

-- | Type for IP's.
data IP = IP Word8 Word8 Word8 Word8 deriving Show

data Product = Mouse | Keyboard | Monitor | Speakers deriving Show

data LogEntry =
  LogEntry { entryTime :: LocalTime
           , entryIP   :: IP
           , entryProduct   :: Product
             } deriving Show

-----------------------
------- PARSING -------
-----------------------

-- | Parser of values of type 'IP'.
parseIP :: Parser IP
parseIP = do
  d1 <- decimal
  char '.'
  d2 <- decimal
  char '.'
  d3 <- decimal
  char '.'
  d4 <- decimal
  return $ IP d1 d2 d3 d4

-- | Parser of values of type 'LocalTime'.
timeParser :: Parser LocalTime
timeParser = do
  y  <- count 4 digit
  char '-'
  mm <- count 2 digit
  char '-'
  d  <- count 2 digit
  char ' '
  h  <- count 2 digit
  char ':'
  m  <- count 2 digit
  char ':'
  s  <- count 2 digit
  return $
    LocalTime { localDay = fromGregorian (read y) (read mm) (read d)
              , localTimeOfDay = TimeOfDay (read h) (read m) (read s)
                }

-- | Parser of values of type 'Product'.
productParser :: Parser Product
productParser =
     (string "mouse"    >> return Mouse)
 <|> (string "keyboard" >> return Keyboard)
 <|> (string "monitor"  >> return Monitor)
 <|> (string "speakers" >> return Speakers)
-- show
-- | Parser of log entries.
logEntryParser :: Parser LogEntry
logEntryParser = do
  -- First, we read the time.
  t <- timeParser
  -- Followed by a space.
  char ' '
  -- And then the IP of the client.
  ip <- parseIP
  -- Followed by another space.
  char ' '
  -- Finally, we read the type of product.
  p <- productParser
  -- And we return the result as a value of type 'LogEntry'.
  return $ LogEntry t ip p

----------------------
-------- TEST --------
----------------------

main :: IO ()
main = print $ parseOnly logEntryParser "2013-06-29 11:16:23 124.67.34.60 keyboard"
-- /show

In order to read the entire log file, we just need to repeat logEntryParser until the end of the file is reached. The combinator many runs a parser zero or more times, returning a list of the consecutive successful results. It stops as soon as the given parser fails. For example, many digit applied to the string "123abc" will return "123" and leave "abc" as remaining input. Also, many digit applied to the string "abc" will return the empty list without consuming any input.
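The two examples with many digit can be tried directly. In this small sketch, the hypothetical parser digitsAndRest runs many digit and then uses takeByteString to capture whatever input is left, so that both the result and the remaining input are visible.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Control.Applicative (many)
import Data.Attoparsec.ByteString.Char8
import Data.ByteString (ByteString)

-- | Run 'many digit' and also return the input it did not consume.
digitsAndRest :: Parser (String, ByteString)
digitsAndRest = do
  ds   <- many digit
  rest <- takeByteString
  return (ds, rest)

main :: IO ()
main = do
  -- Digits are collected until the first non-digit character.
  print $ parseOnly digitsAndRest "123abc"
  -- No digits at all: an empty result, and the input is untouched.
  print $ parseOnly digitsAndRest "abc"
```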

In conclusion, here is our log file parser.

type Log = [LogEntry]

logParser :: Parser Log
logParser = many $ logEntryParser <* endOfLine

The endOfLine parser succeeds only when the remaining input starts with an end of line. The <* combinator runs the parser on its left, then the parser on its right, and returns the result of the first one. We use it to get the result from logEntryParser instead of endOfLine, which returns ().
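Since many stops quietly at the first failure, a malformed line in the middle of a file does not produce an error; the result simply ends there. The following sketch shows the effect with a stand-in line parser (one decimal number per line, so the example stays short) and one possible way to detect the problem, by requiring endOfInput afterwards. The names lineParser, lenient and strict are hypothetical.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Control.Applicative (many)
import Data.Attoparsec.ByteString.Char8

-- | Stand-in for a log entry parser: one decimal number per line.
lineParser :: Parser Int
lineParser = decimal

-- | Same shape as the log parser: stops quietly at the first bad line.
lenient :: Parser [Int]
lenient = many $ lineParser <* endOfLine

-- | Additionally demands that no input is left once the lines are read.
strict :: Parser [Int]
strict = lenient <* endOfInput

main :: IO ()
main = do
  print $ parseOnly lenient "1\n2\n3\n"
  -- The third line is malformed, so only two entries are returned.
  print $ parseOnly lenient "1\n2\nx\n"
  -- The strict variant reports a failure for the same input.
  print $ parseOnly strict "1\n2\nx\n"
```

The same idea applies to a final line that is missing its end of line: the lenient version drops it, while the strict version fails.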

Full log file parser

{-# START_FILE sellings.log #-}
2013-06-29 11:16:23 124.67.34.60 keyboard
2013-06-29 11:32:12 212.141.23.67 mouse
2013-06-29 11:33:08 212.141.23.67 monitor
2013-06-29 12:12:34 125.80.32.31 speakers
2013-06-29 12:51:50 101.40.50.62 keyboard
2013-06-29 13:10:45 103.29.60.13 mouse

{-# START_FILE Main.hs #-}
{-# LANGUAGE OverloadedStrings #-}

import Data.Word
import Data.Time
import Data.Attoparsec.Char8
import Control.Applicative
-- We import ByteString qualified because the function
-- 'Data.ByteString.readFile' would clash with
-- 'Prelude.readFile'.
import qualified Data.ByteString as B

-----------------------
------ SETTINGS -------
-----------------------

-- | File where the log is stored.
logFile :: FilePath
logFile = "sellings.log"

-----------------------
-------- TYPES --------
-----------------------

-- | Type for IP's.
data IP = IP Word8 Word8 Word8 Word8 deriving Show

data Product = Mouse | Keyboard | Monitor | Speakers deriving Show

-- | Type for log entries.
--   Add, remove or modify fields to fit your own log file.
data LogEntry =
  LogEntry { entryTime :: LocalTime
           , entryIP   :: IP
           , entryProduct   :: Product
             } deriving Show

type Log = [LogEntry]

-----------------------
------- PARSING -------
-----------------------

-- | Parser of values of type 'IP'.
parseIP :: Parser IP
parseIP = do
  d1 <- decimal
  char '.'
  d2 <- decimal
  char '.'
  d3 <- decimal
  char '.'
  d4 <- decimal
  return $ IP d1 d2 d3 d4

-- | Parser of values of type 'LocalTime'.
timeParser :: Parser LocalTime
timeParser = do
  y  <- count 4 digit
  char '-'
  mm <- count 2 digit
  char '-'
  d  <- count 2 digit
  char ' '
  h  <- count 2 digit
  char ':'
  m  <- count 2 digit
  char ':'
  s  <- count 2 digit
  return $
    LocalTime { localDay = fromGregorian (read y) (read mm) (read d)
              , localTimeOfDay = TimeOfDay (read h) (read m) (read s)
                }

-- | Parser of values of type 'Product'.
productParser :: Parser Product
productParser =
     (string "mouse"    >> return Mouse)
 <|> (string "keyboard" >> return Keyboard)
 <|> (string "monitor"  >> return Monitor)
 <|> (string "speakers" >> return Speakers)

-- | Parser of log entries.
logEntryParser :: Parser LogEntry
logEntryParser = do
  -- First, we read the time.
  t <- timeParser
  -- Followed by a space.
  char ' '
  -- And then the IP of the client.
  ip <- parseIP
  -- Followed by another space.
  char ' '
  -- Finally, we read the type of product.
  p <- productParser
  -- And we return the result as a value of type 'LogEntry'.
  return $ LogEntry t ip p

logParser :: Parser Log
logParser = many $ logEntryParser <* endOfLine

----------------------
-------- MAIN --------
----------------------

main :: IO ()
main = B.readFile logFile >>= print . parseOnly logParser

Changes in the log

After some time logging our sales, we have the idea of adding a new field to each log entry. We ask each customer how they found out about us and keep this information in our log. We happily update the logger but quickly notice that the parser does not work anymore. Apart from changing the LogEntry type, we have to modify the parser to work with the new values. We allow our users to specify the following options:

data Source = Internet | Friend | NoAnswer deriving Show

We report NoAnswer when the customer did not answer. We quickly write a parser very similar to productParser.

{-# LANGUAGE OverloadedStrings #-}

import Data.Attoparsec.Char8
import Control.Applicative

data Source = Internet | Friend | NoAnswer deriving Show

sourceParser :: Parser Source
sourceParser =
      (string "internet" >> return Internet)
  <|> (string "friend" >> return Friend)
  <|> (string "noanswer" >> return NoAnswer)
  
main :: IO ()
main = print $ parseOnly sourceParser "internet"

After checking that this parser works, we add it to our logEntryParser and update the type definition of LogEntry by adding the field source.

{-# START_FILE sellings.log #-}
2013-06-29 16:40:15 154.41.32.99 monitor internet
2013-06-29 16:51:12 103.29.60.13 keyboard internet
2013-06-29 17:13:21 121.95.68.21 speakers friend
2013-06-29 18:20:10 190.80.70.60 mouse noanswer
2013-06-29 18:51:23 102.42.52.64 speakers friend
2013-06-29 19:01:11 78.46.64.23 mouse internet

{-# START_FILE Main.hs #-}
{-# LANGUAGE OverloadedStrings #-}

import Data.Word
import Data.Time
import Data.Attoparsec.Char8
import Control.Applicative
-- We import ByteString qualified because the function
-- 'Data.ByteString.readFile' would clash with
-- 'Prelude.readFile'.
import qualified Data.ByteString as B

-----------------------
------ SETTINGS -------
-----------------------

-- | File where the log is stored.
logFile :: FilePath
logFile = "sellings.log"

-----------------------
-------- TYPES --------
-----------------------

-- | Type for IP's.
data IP = IP Word8 Word8 Word8 Word8 deriving Show

data Product = Mouse | Keyboard | Monitor | Speakers deriving Show

data Source = Internet | Friend | NoAnswer deriving Show

-- show
data LogEntry =
  LogEntry { entryTime :: LocalTime
           , entryIP   :: IP
           , entryProduct   :: Product
             -- Addition of the 'Source' field
           , source    :: Source
             } deriving Show
-- /show

type Log = [LogEntry]

-----------------------
------- PARSING -------
-----------------------

-- | Parser of values of type 'IP'.
parseIP :: Parser IP
parseIP = do
  d1 <- decimal
  char '.'
  d2 <- decimal
  char '.'
  d3 <- decimal
  char '.'
  d4 <- decimal
  return $ IP d1 d2 d3 d4

-- | Parser of values of type 'LocalTime'.
timeParser :: Parser LocalTime
timeParser = do
  y  <- count 4 digit
  char '-'
  mm <- count 2 digit
  char '-'
  d  <- count 2 digit
  char ' '
  h  <- count 2 digit
  char ':'
  m  <- count 2 digit
  char ':'
  s  <- count 2 digit
  return $
    LocalTime { localDay = fromGregorian (read y) (read mm) (read d)
              , localTimeOfDay = TimeOfDay (read h) (read m) (read s)
                }

-- | Parser of values of type 'Product'.
productParser :: Parser Product
productParser =
     (string "mouse"    >> return Mouse)
 <|> (string "keyboard" >> return Keyboard)
 <|> (string "monitor"  >> return Monitor)
 <|> (string "speakers" >> return Speakers)

sourceParser :: Parser Source
sourceParser =
      (string "internet" >> return Internet)
  <|> (string "friend" >> return Friend)
  <|> (string "noanswer" >> return NoAnswer)

-- show
-- | Parser of log entries.
logEntryParser :: Parser LogEntry
logEntryParser = do
  t <- timeParser
  char ' '
  ip <- parseIP
  char ' '
  p <- productParser
  -- Addition of the 'Source' field
  char ' '
  s <- sourceParser
  --
  return $ LogEntry t ip p s
-- /show

logParser :: Parser Log
logParser = many $ logEntryParser <* endOfLine

----------------------
-------- MAIN --------
----------------------

main :: IO ()
main = B.readFile logFile >>= print . parseOnly logParser

Making the changed parser compatible with the old format

However, this parser only works on the new data, and we do not want to lose the information we gathered before. The solution is to make the field optional in the parser and, when no value is found, return a default value (like NoAnswer). The attoparsec combinator option has exactly this purpose.

{-# START_FILE sellings.log #-}
2013-06-29 11:16:23 124.67.34.60 keyboard
2013-06-29 11:32:12 212.141.23.67 mouse
2013-06-29 11:33:08 212.141.23.67 monitor
2013-06-29 12:12:34 125.80.32.31 speakers
2013-06-29 12:51:50 101.40.50.62 keyboard
2013-06-29 13:10:45 103.29.60.13 mouse
2013-06-29 16:40:15 154.41.32.99 monitor internet
2013-06-29 16:51:12 103.29.60.13 keyboard internet
2013-06-29 17:13:21 121.95.68.21 speakers friend
2013-06-29 18:20:10 190.80.70.60 mouse noanswer
2013-06-29 18:51:23 102.42.52.64 speakers friend
2013-06-29 19:01:11 78.46.64.23 mouse internet

{-# START_FILE Main.hs #-}
{-# LANGUAGE OverloadedStrings #-}

import Data.Word
import Data.Time
import Data.Attoparsec.Char8
import Control.Applicative
-- We import ByteString qualified because the function
-- 'Data.ByteString.readFile' would clash with
-- 'Prelude.readFile'.
import qualified Data.ByteString as B

-----------------------
------ SETTINGS -------
-----------------------

-- | File where the log is stored.
logFile :: FilePath
logFile = "sellings.log"

-----------------------
-------- TYPES --------
-----------------------

-- | Type for IP's.
data IP = IP Word8 Word8 Word8 Word8 deriving Show

data Product = Mouse | Keyboard | Monitor | Speakers deriving Show

data Source = Internet | Friend | NoAnswer deriving Show

data LogEntry =
  LogEntry { entryTime :: LocalTime
           , entryIP   :: IP
           , entryProduct   :: Product
           , source    :: Source
             } deriving Show

type Log = [LogEntry]

-----------------------
------- PARSING -------
-----------------------

-- | Parser of values of type 'IP'.
parseIP :: Parser IP
parseIP = do
  d1 <- decimal
  char '.'
  d2 <- decimal
  char '.'
  d3 <- decimal
  char '.'
  d4 <- decimal
  return $ IP d1 d2 d3 d4

-- | Parser of values of type 'LocalTime'.
timeParser :: Parser LocalTime
timeParser = do
  y  <- count 4 digit
  char '-'
  mm <- count 2 digit
  char '-'
  d  <- count 2 digit
  char ' '
  h  <- count 2 digit
  char ':'
  m  <- count 2 digit
  char ':'
  s  <- count 2 digit
  return $
    LocalTime { localDay = fromGregorian (read y) (read mm) (read d)
              , localTimeOfDay = TimeOfDay (read h) (read m) (read s)
                }

-- | Parser of values of type 'Product'.
productParser :: Parser Product
productParser =
     (string "mouse"    >> return Mouse)
 <|> (string "keyboard" >> return Keyboard)
 <|> (string "monitor"  >> return Monitor)
 <|> (string "speakers" >> return Speakers)

sourceParser :: Parser Source
sourceParser =
      (string "internet" >> return Internet)
  <|> (string "friend" >> return Friend)
  <|> (string "noanswer" >> return NoAnswer)

-- show
-- | Parser of log entries.
logEntryParser :: Parser LogEntry
logEntryParser = do
  t <- timeParser
  char ' '
  ip <- parseIP
  char ' '
  p <- productParser
  -- Look for the field 'Source' and return
  -- a default value ('NoAnswer') when missing.
  -- The arguments of 'option' are default value
  -- followed by the parser to try.
  s <- option NoAnswer $ char ' ' >> sourceParser
  --
  return $ LogEntry t ip p s
-- /show

logParser :: Parser Log
logParser = many $ logEntryParser <* endOfLine

----------------------
-------- MAIN --------
----------------------

main :: IO ()
main = B.readFile logFile >>= print . parseOnly logParser
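To see option in isolation, here is a minimal sketch with a made-up format: a word optionally followed by a space and a number, where 0 is the default. The parser wordWithCount is hypothetical and only meant for experimenting.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Attoparsec.ByteString.Char8

-- | A word, optionally followed by a space and a number.
--   When the optional part is missing or malformed, 0 is returned.
wordWithCount :: Parser (String, Int)
wordWithCount = do
  w <- many1 letter_ascii
  n <- option 0 (char ' ' *> decimal)
  return (w, n)

main :: IO ()
main = do
  print $ parseOnly wordWithCount "mouse 3"
  print $ parseOnly wordWithCount "mouse"
  -- The space is present but no number follows: the optional parser
  -- fails as a whole and the default value is used.
  print $ parseOnly wordWithCount "mouse x"
```

The last case is worth noting: when the optional parser fails halfway, option falls back to the default and the input it had tentatively consumed (here, the space) is left unread.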

Merging data from different logs

Our company is growing fast and we decide to open a new online shop based in France to extend our customer range to Europe. However, after some time, we notice that our engineer in France is using a different log format.

154.41.32.99 29/06/2013 15:32:23 4 internet
76.125.44.33 29/06/2013 16:56:45 3 noanswer
123.45.67.89 29/06/2013 18:44:29 4 friend
100.23.32.41 29/06/2013 19:01:09 1 internet
151.123.45.67 29/06/2013 20:30:13 2 internet

It seems that each log entry stores the information in the following order:

  • IP.
  • Date and time (in a different format).
  • A number representing the product sold.
  • The "how did you hear about us" field that we called Source before.

Therefore, our new logEntryParser2 must parse the input in that order. We note that the date is written in a different order (in most European countries it is usual to write the day before the month) and is separated by the / symbol instead of -. Also, they are using ID's to identify products instead of writing the whole name.

Step 1: Write the new parser

Firstly, we write functions to get the ID from a Product and vice versa. Deriving an Enum instance for Product gives us an automatic implementation of the methods toEnum and fromEnum. These functions define a correspondence between a subset of the integers (type Int) and our type (Product in this case). The automatic derivation associates the integer 0 with the first constructor, 1 with the second, 2 with the third, and so on. Therefore, we can define the functions productFromID and productToID as follows.

-- | Different kinds of products are numbered from 1 to 4, in the given
--   order.
data Product = Mouse | Keyboard | Monitor | Speakers deriving (Enum,Show)

productFromID :: Int -> Product
productFromID n = toEnum (n-1)

productToID :: Product -> Int
productToID p = fromEnum p + 1

main :: IO ()
main = do
  print $ productFromID 1
  print $ productFromID 3
  print $ productToID Keyboard
  print $ productToID $ productFromID 4
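Note that productFromID is a partial function: the derived toEnum raises a runtime error for an integer that has no corresponding constructor, such as 0 or 5. If the IDs come from an untrusted file, a total variant may be preferable. The following is a sketch under that assumption; the name productFromIDSafe is hypothetical, and it additionally derives Bounded to find the valid range.

```haskell
data Product = Mouse | Keyboard | Monitor | Speakers
  deriving (Bounded, Enum, Eq, Show)

-- | Hypothetical total variant of productFromID:
--   returns Nothing for an ID outside the range 1 to 4.
productFromIDSafe :: Int -> Maybe Product
productFromIDSafe n
  | n >= lo && n <= hi = Just (toEnum (n - 1))
  | otherwise          = Nothing
  where
    -- IDs are the Enum indices shifted by one.
    lo = fromEnum (minBound :: Product) + 1
    hi = fromEnum (maxBound :: Product) + 1

main :: IO ()
main = do
  print $ productFromIDSafe 1
  print $ productFromIDSafe 4
  print $ productFromIDSafe 0
  print $ productFromIDSafe 5
```

Computing the bounds from minBound and maxBound means the check stays correct if a new product is added to the type later.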

A parser of products accepts a single digit and applies productFromID to get the Product result.

{-# LANGUAGE OverloadedStrings #-}

import Data.Attoparsec.Char8
import Control.Applicative

data Product = Mouse | Keyboard | Monitor | Speakers deriving (Enum,Show)

productFromID :: Int -> Product
productFromID n = toEnum (n-1)

-- show
productParser2 :: Parser Product
productParser2 = productFromID . read . (:[]) <$> digit

main :: IO ()
main = print $ parseOnly productParser2 "4"
-- /show

The entryTime field also needs a new parser. The process, however, is equivalent to the previous one. We just need to parse the input in a different order and use the new delimiters.

{-# LANGUAGE OverloadedStrings #-}

import Data.Time
import Data.Attoparsec.Char8

timeParser2 :: Parser LocalTime
timeParser2 = do
  d  <- count 2 digit
  char '/'
  mm <- count 2 digit
  char '/'
  y  <- count 4 digit
  char ' '
  h  <- count 2 digit
  char ':'
  m  <- count 2 digit
  char ':'
  s  <- count 2 digit
  return $
    LocalTime { localDay = fromGregorian (read y) (read mm) (read d)
              , localTimeOfDay = TimeOfDay (read h) (read m) (read s)
                }

main :: IO ()
main = print $ parseOnly timeParser2 "29/06/2013 15:32:23"

The rest of the fields are unchanged, so we are ready to write the full parser of the new log entries. Again, this is just invoking the defined parsers in the correct order.

{-# LANGUAGE OverloadedStrings #-}

import Data.Word
import Data.Time
import Data.Attoparsec.Char8
import Control.Applicative

-----------------------
-------- TYPES --------
-----------------------

-- | Type for IP's.
data IP = IP Word8 Word8 Word8 Word8 deriving Show

data Product = Mouse | Keyboard | Monitor | Speakers deriving (Show,Enum)

productFromID :: Int -> Product
productFromID n = toEnum (n-1)

data Source = Internet | Friend | NoAnswer deriving Show

data LogEntry =
  LogEntry { entryTime :: LocalTime
           , entryIP   :: IP
           , entryProduct   :: Product
           , source    :: Source
             } deriving Show

type Log = [LogEntry]

-----------------------
------- PARSING -------
-----------------------

-- | Parser of values of type 'IP'.
parseIP :: Parser IP
parseIP = do
  d1 <- decimal
  char '.'
  d2 <- decimal
  char '.'
  d3 <- decimal
  char '.'
  d4 <- decimal
  return $ IP d1 d2 d3 d4

timeParser2 :: Parser LocalTime
timeParser2 = do
  d  <- count 2 digit
  char '/'
  mm <- count 2 digit
  char '/'
  y  <- count 4 digit
  char ' '
  h  <- count 2 digit
  char ':'
  m  <- count 2 digit
  char ':'
  s  <- count 2 digit
  return $
    LocalTime { localDay = fromGregorian (read y) (read mm) (read d)
              , localTimeOfDay = TimeOfDay (read h) (read m) (read s)
                }

productParser2 :: Parser Product
productParser2 = productFromID . read . (:[]) <$> digit

sourceParser :: Parser Source
sourceParser =
      (string "internet" >> return Internet)
  <|> (string "friend" >> return Friend)
  <|> (string "noanswer" >> return NoAnswer)

-- show
logEntryParser2 :: Parser LogEntry
logEntryParser2 = do
  ip <- parseIP
  char ' '
  t <- timeParser2
  char ' '
  p <- productParser2
  char ' ' 
  s <- sourceParser
  return $ LogEntry t ip p s
  
main :: IO ()
main = print $ parseOnly logEntryParser2 "54.41.32.99 29/06/2013 15:32:23 4 internet"
-- /show

Once we have a parser for log entries, we do the same as above to run it line by line through the log file.

logParser2 :: Parser Log
logParser2 = many $ logEntryParser2 <* endOfLine

Step 2: Merge both logs conserving order

Currently we have two log files, but we want all the data together. The proposed solution is to parse one file, parse the other file, and merge the results. The merging can be done since both parsers have the same type of output (Log). A Log is a list of log entries, so we could just append both lists and we would have all the data together. However, since both files are sorted by entryTime, it would be much nicer if the merged result were also sorted by entryTime.

Given two sorted lists, it is easy to merge them into one sorted list in linear time. This is the procedure used to merge in the mergesort algorithm.

merge :: Ord a => [a] -> [a] -> [a]
merge xs [] = xs
merge [] ys = ys
merge (x:xs) (y:ys) =
  if x <= y
     then x : merge xs (y:ys)
     else y : merge (x:xs) ys

main :: IO ()
main = print $ merge [1,3,5,7] [2,4,6,8]

To use merge, the type of the list elements must be an instance of the Ord class. Log is a list of LogEntry, so we have to write an Ord instance for LogEntry (which in turn requires an Eq instance). We use entryTime to compare different log entries, since our interest is to sort log entries by time.

instance Ord LogEntry where
  le1 <= le2 = entryTime le1 <= entryTime le2
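As an alternative, the comparison can be passed to the merging function as an argument, so that no Ord instance is needed at all. This avoids a subtle mismatch: the instance above orders entries by entryTime alone, while an Eq instance derived for LogEntry would compare every field. The sketch below is not the approach used in the rest of this tutorial; the name mergeBy is hypothetical, and pairs stand in for log entries to keep the example self-contained.

```haskell
import Data.Ord (comparing)

-- | Merge two lists that are already sorted according to the given
--   comparison function. On ties, elements of the first list go first.
mergeBy :: (a -> a -> Ordering) -> [a] -> [a] -> [a]
mergeBy _ xs [] = xs
mergeBy _ [] ys = ys
mergeBy cmp (x:xs) (y:ys)
  | cmp x y /= GT = x : mergeBy cmp xs (y:ys)
  | otherwise     = y : mergeBy cmp (x:xs) ys

main :: IO ()
main =
  -- Pairs stand in for log entries; the first component plays the
  -- role of entryTime.
  print $ mergeBy (comparing fst) [(1,"a"),(3,"c")] [(2,"b"),(3,"d")]
```

With the types of this tutorial, the call would look like mergeBy (comparing entryTime) applied to the two parsed logs.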

Now we are ready to merge both log files into one single result of type Log.

{-# START_FILE sellings.log #-}
2013-06-29 11:16:23 124.67.34.60 keyboard
2013-06-29 11:32:12 212.141.23.67 mouse
2013-06-29 11:33:08 212.141.23.67 monitor
2013-06-29 12:12:34 125.80.32.31 speakers
2013-06-29 12:51:50 101.40.50.62 keyboard
2013-06-29 13:10:45 103.29.60.13 mouse
2013-06-29 16:40:15 154.41.32.99 monitor internet
2013-06-29 16:51:12 103.29.60.13 keyboard internet
2013-06-29 17:13:21 121.95.68.21 speakers friend
2013-06-29 18:20:10 190.80.70.60 mouse noanswer
2013-06-29 18:51:23 102.42.52.64 speakers friend
2013-06-29 19:01:11 78.46.64.23 mouse internet

{-# START_FILE sellings2.log #-}
154.41.32.99 29/06/2013 15:32:23 4 internet
76.125.44.33 29/06/2013 16:56:45 3 noanswer
123.45.67.89 29/06/2013 18:44:29 4 friend
100.23.32.41 29/06/2013 19:01:09 1 internet
151.123.45.67 29/06/2013 20:30:13 2 internet

{-# START_FILE Main.hs #-}
{-# LANGUAGE OverloadedStrings #-}

import Data.Word
import Data.Time
import Data.Attoparsec.Char8
import Control.Applicative
import qualified Data.ByteString as B

-- show
-----------------------
------ SETTINGS -------
-----------------------
-- | File where the log is stored.
logFile :: FilePath
logFile = "sellings.log"

-- | Second file where the log is stored.
logFile2 :: FilePath
logFile2 = "sellings2.log"
-- /show

-----------------------
-------- TYPES --------
-----------------------

-- | Type for IP's.
data IP = IP Word8 Word8 Word8 Word8 deriving (Eq,Show)

-- | Type for products.
data Product = Mouse | Keyboard | Monitor | Speakers deriving (Eq,Show,Enum)

productFromID :: Int -> Product
productFromID n = toEnum (n-1)

data Source = Internet | Friend | NoAnswer deriving (Eq,Show)

data LogEntry =
  LogEntry { entryTime :: LocalTime
           , entryIP   :: IP
           , entryProduct   :: Product
           , source    :: Source
               -- We derive Eq since it is needed in order
               -- to write an instance of Ord.
             } deriving (Eq, Show)

instance Ord LogEntry where
  le1 <= le2 = entryTime le1 <= entryTime le2

type Log = [LogEntry]

-----------------------
------- PARSING -------
-----------------------

-- | Parser of values of type 'IP'.
parseIP :: Parser IP
parseIP = do
  d1 <- decimal
  char '.'
  d2 <- decimal
  char '.'
  d3 <- decimal
  char '.'
  d4 <- decimal
  return $ IP d1 d2 d3 d4

-- | Parser of values of type 'LocalTime'.
timeParser :: Parser LocalTime
timeParser = do
  y  <- count 4 digit
  char '-'
  mm <- count 2 digit
  char '-'
  d  <- count 2 digit
  char ' '
  h  <- count 2 digit
  char ':'
  m  <- count 2 digit
  char ':'
  s  <- count 2 digit
  return $
    LocalTime { localDay = fromGregorian (read y) (read mm) (read d)
              , localTimeOfDay = TimeOfDay (read h) (read m) (read s)
                }

-- | Parser of values of type 'Product'.
productParser :: Parser Product
productParser =
     (string "mouse"    >> return Mouse)
 <|> (string "keyboard" >> return Keyboard)
 <|> (string "monitor"  >> return Monitor)
 <|> (string "speakers" >> return Speakers)

sourceParser :: Parser Source
sourceParser =
      (string "internet" >> return Internet)
  <|> (string "friend" >> return Friend)
  <|> (string "noanswer" >> return NoAnswer)

-- | Parser of log entries.
logEntryParser :: Parser LogEntry
logEntryParser = do
  t <- timeParser
  char ' '
  ip <- parseIP
  char ' '
  p <- productParser
  s <- option NoAnswer $ char ' ' >> sourceParser
  return $ LogEntry t ip p s

logParser :: Parser Log
logParser = many $ logEntryParser <* endOfLine

timeParser2 :: Parser LocalTime
timeParser2 = do
  d  <- count 2 digit
  char '/'
  mm <- count 2 digit
  char '/'
  y  <- count 4 digit
  char ' '
  h  <- count 2 digit
  char ':'
  m  <- count 2 digit
  char ':'
  s  <- count 2 digit
  return $
    LocalTime { localDay = fromGregorian (read y) (read mm) (read d)
              , localTimeOfDay = TimeOfDay (read h) (read m) (read s)
                }

productParser2 :: Parser Product
productParser2 = productFromID . read . (:[]) <$> digit

logEntryParser2 :: Parser LogEntry
logEntryParser2 = do
  ip <- parseIP
  char ' '
  t <- timeParser2
  char ' '
  p <- productParser2
  char ' ' 
  s <- sourceParser
  return $ LogEntry t ip p s

logParser2 :: Parser Log
logParser2 = many $ logEntryParser2 <* endOfLine

-----------------------
------- MERGING -------
-----------------------

merge :: Ord a => [a] -> [a] -> [a]
merge xs [] = xs
merge [] ys = ys
merge (x:xs) (y:ys) =
  if x <= y
     then x : merge xs (y:ys)
     else y : merge (x:xs) ys

-- show
----------------------
-------- MAIN --------
----------------------

main :: IO ()
main = do
  file1 <- B.readFile logFile
  file2 <- B.readFile logFile2
          -- We are using the Either monad here.
  let r = do xs <- parseOnly logParser  file1
             ys <- parseOnly logParser2 file2
             return $ merge xs ys
  case r of
   Left err -> putStrLn $ "A parsing error was found: " ++ err
   Right log -> mapM_ print log
-- /show
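
Note that merge only yields a sorted result when both of its inputs are already sorted, which is the case for logs written in chronological order. The following minimal sketch shows this on plain Ints. The helper mergeUnsorted is a hypothetical name introduced here, not part of the program above; it simply sorts each input first.

```haskell
import Data.List (sort)

-- Same definition as in the listing above.
merge :: Ord a => [a] -> [a] -> [a]
merge xs [] = xs
merge [] ys = ys
merge (x:xs) (y:ys) =
  if x <= y
     then x : merge xs (y:ys)
     else y : merge (x:xs) ys

-- Hypothetical helper: sort both inputs before merging, for sources
-- that are not guaranteed to be in chronological order.
mergeUnsorted :: Ord a => [a] -> [a] -> [a]
mergeUnsorted xs ys = merge (sort xs) (sort ys)

-- Sorted inputs give a sorted output:
--   merge [1,3,5] [2,4,6]    == [1,2,3,4,5,6]
-- An unsorted input is not repaired by merge itself:
--   merge [3,1] [2]          == [2,3,1]
--   mergeUnsorted [3,1] [2]  == [1,2,3]
```

If your log files may contain out-of-order entries, sorting them once after parsing is probably the simplest remedy.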

Extracting information from the log file

Once the log file is parsed, we can extract information from it. Following the previous example, we can check which product is sold most often or how most users found our webshop.

Let's calculate which product has been sold the most times. We can create an association list containing a pair (product,number of sales) for each product. It would have the following type:

type Sales = [(Product,Int)]

Given a list like this, we can check how many times a product has been sold.

import Data.Maybe (fromMaybe)

salesOf :: Product -> Sales -> Int
salesOf p xs = fromMaybe 0 $ lookup p xs

We can also add one more sale to the list.

addSale :: Product -> Sales -> Sales
-- If we have no sales, we add the product with 1 sale.
addSale p [] = [(p,1)]
addSale p ((x,n):xs) = if p == x then (x,n+1):xs
                                 else (x,n) : addSale p xs
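
As a small worked example, the following sketch applies addSale and salesOf to a hand-written list. The Product type is repeated so that the snippet stands alone, and the value exampleSales is made up for illustration.

```haskell
import Data.Maybe (fromMaybe)

data Product = Mouse | Keyboard | Monitor | Speakers deriving (Eq,Show,Enum)

type Sales = [(Product,Int)]

salesOf :: Product -> Sales -> Int
salesOf p xs = fromMaybe 0 $ lookup p xs

addSale :: Product -> Sales -> Sales
addSale p [] = [(p,1)]
addSale p ((x,n):xs) = if p == x then (x,n+1):xs
                                 else (x,n) : addSale p xs

-- Illustrative data, not taken from the log files.
exampleSales :: Sales
exampleSales = [(Mouse,2),(Monitor,1)]

-- An existing product has its counter incremented in place:
--   addSale Mouse exampleSales    == [(Mouse,3),(Monitor,1)]
-- A new product is appended at the end with one sale:
--   addSale Keyboard exampleSales == [(Mouse,2),(Monitor,1),(Keyboard,1)]
-- A product that was never sold counts as zero:
--   salesOf Speakers exampleSales == 0
```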

Calculating the most sold product can be done using maximumBy (from the Data.List module) to compare the elements of the list using the second component of each pair.

import Data.List (maximumBy)

-- | Given a list of sales, returns the most sold product along with
--   its number of sales.
mostSold :: Sales -> Maybe (Product,Int)
mostSold [] = Nothing
mostSold xs = Just $ maximumBy (\x y -> snd x `compare` snd y) xs

We need to use Maybe to handle the case where nothing has been sold yet.
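
The comparison function passed to maximumBy can also be written with comparing from Data.Ord, which some readers find easier to scan. This is only an equivalent sketch; the name mostSold' is introduced here for illustration, and the types are repeated so that the snippet stands alone.

```haskell
import Data.List (maximumBy)
import Data.Ord (comparing)

data Product = Mouse | Keyboard | Monitor | Speakers deriving (Eq,Show,Enum)

type Sales = [(Product,Int)]

-- Same behaviour as mostSold, with the comparison spelled
-- as "compare by the second component of the pair".
mostSold' :: Sales -> Maybe (Product,Int)
mostSold' [] = Nothing
mostSold' xs = Just $ maximumBy (comparing snd) xs
```

When several products share the highest count, only one of them is returned.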

The last remaining task is to build a list of type Sales from a value of type Log. Since each log entry contains one product, we can fold over the log list, applying addSale to the product of each entry and starting from the empty list.

sales :: Log -> Sales
sales = foldr (addSale . entryProduct) []
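
An association list is fine for four products, but addSale walks the list on every insertion. If the number of distinct products grows, a Data.Map (from the containers package) is a common alternative. The sketch below is not the implementation used in this tutorial: salesMap is a hypothetical name, and Product additionally derives Ord here so that it can be used as a key.

```haskell
import qualified Data.Map.Strict as M

data Product = Mouse | Keyboard | Monitor | Speakers
  deriving (Eq,Ord,Show,Enum)

-- Hypothetical alternative to 'sales': count how often each
-- product occurs in the given list of sold products.
salesMap :: [Product] -> M.Map Product Int
salesMap ps = M.fromListWith (+) [ (p,1) | p <- ps ]
```

With a Log at hand, this could be called as salesMap (map entryProduct log), and M.toList turns the result back into an association list.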

Using the same data as before, we now output the product with the most sales.

{-# START_FILE sellings.log #-}
2013-06-29 11:16:23 124.67.34.60 keyboard
2013-06-29 11:32:12 212.141.23.67 mouse
2013-06-29 11:33:08 212.141.23.67 monitor
2013-06-29 12:12:34 125.80.32.31 speakers
2013-06-29 12:51:50 101.40.50.62 keyboard
2013-06-29 13:10:45 103.29.60.13 mouse
2013-06-29 16:40:15 154.41.32.99 monitor internet
2013-06-29 16:51:12 103.29.60.13 keyboard internet
2013-06-29 17:13:21 121.95.68.21 speakers friend
2013-06-29 18:20:10 190.80.70.60 mouse noanswer
2013-06-29 18:51:23 102.42.52.64 speakers friend
2013-06-29 19:01:11 78.46.64.23 mouse internet

{-# START_FILE sellings2.log #-}
154.41.32.99 29/06/2013 15:32:23 4 internet
76.125.44.33 29/06/2013 16:56:45 3 noanswer
123.45.67.89 29/06/2013 18:44:29 4 friend
100.23.32.41 29/06/2013 19:01:09 1 internet
151.123.45.67 29/06/2013 20:30:13 2 internet

{-# START_FILE Main.hs #-}
{-# LANGUAGE OverloadedStrings #-}

import Data.Word
import Data.Time
import Data.Attoparsec.Char8
import Control.Applicative
import qualified Data.ByteString as B
import Data.List (maximumBy)
import Data.Maybe (fromMaybe)

-----------------------
------ SETTINGS -------
-----------------------

-- | File where the log is stored.
logFile :: FilePath
logFile = "sellings.log"

-- | Second file where the log is stored.
logFile2 :: FilePath
logFile2 = "sellings2.log"

-----------------------
-------- TYPES --------
-----------------------

-- | Type for IP's.
data IP = IP Word8 Word8 Word8 Word8 deriving (Eq,Show)

-- | Type for products.
data Product = Mouse | Keyboard | Monitor | Speakers deriving (Eq,Show,Enum)

productFromID :: Int -> Product
productFromID n = toEnum (n-1)

data Source = Internet | Friend | NoAnswer deriving (Eq,Show)

data LogEntry =
  LogEntry { entryTime :: LocalTime
           , entryIP   :: IP
           , entryProduct   :: Product
           , source    :: Source
               -- We derive Eq since it is needed to be able
               -- to write an instance of Ord.
             } deriving (Eq, Show)

instance Ord LogEntry where
  le1 <= le2 = entryTime le1 <= entryTime le2

type Log = [LogEntry]

-----------------------
------- PARSING -------
-----------------------

-- | Parser of values of type 'IP'.
parseIP :: Parser IP
parseIP = do
  d1 <- decimal
  char '.'
  d2 <- decimal
  char '.'
  d3 <- decimal
  char '.'
  d4 <- decimal
  return $ IP d1 d2 d3 d4

-- | Parser of values of type 'LocalTime'.
timeParser :: Parser LocalTime
timeParser = do
  y  <- count 4 digit
  char '-'
  mm <- count 2 digit
  char '-'
  d  <- count 2 digit
  char ' '
  h  <- count 2 digit
  char ':'
  m  <- count 2 digit
  char ':'
  s  <- count 2 digit
  return $
    LocalTime { localDay = fromGregorian (read y) (read mm) (read d)
              , localTimeOfDay = TimeOfDay (read h) (read m) (read s)
                }

-- | Parser of values of type 'Product'.
productParser :: Parser Product
productParser =
     (string "mouse"    >> return Mouse)
 <|> (string "keyboard" >> return Keyboard)
 <|> (string "monitor"  >> return Monitor)
 <|> (string "speakers" >> return Speakers)

sourceParser :: Parser Source
sourceParser =
      (string "internet" >> return Internet)
  <|> (string "friend" >> return Friend)
  <|> (string "noanswer" >> return NoAnswer)

-- | Parser of log entries.
logEntryParser :: Parser LogEntry
logEntryParser = do
  t <- timeParser
  char ' '
  ip <- parseIP
  char ' '
  p <- productParser
  s <- option NoAnswer $ char ' ' >> sourceParser
  return $ LogEntry t ip p s

logParser :: Parser Log
logParser = many $ logEntryParser <* endOfLine

timeParser2 :: Parser LocalTime
timeParser2 = do
  d  <- count 2 digit
  char '/'
  mm <- count 2 digit
  char '/'
  y  <- count 4 digit
  char ' '
  h  <- count 2 digit
  char ':'
  m  <- count 2 digit
  char ':'
  s  <- count 2 digit
  return $
    LocalTime { localDay = fromGregorian (read y) (read mm) (read d)
              , localTimeOfDay = TimeOfDay (read h) (read m) (read s)
                }

productParser2 :: Parser Product
productParser2 = productFromID . read . (:[]) <$> digit

logEntryParser2 :: Parser LogEntry
logEntryParser2 = do
  ip <- parseIP
  char ' '
  t <- timeParser2
  char ' '
  p <- productParser2
  char ' ' 
  s <- sourceParser
  return $ LogEntry t ip p s

logParser2 :: Parser Log
logParser2 = many $ logEntryParser2 <* endOfLine

-----------------------
------- MERGING -------
-----------------------

merge :: Ord a => [a] -> [a] -> [a]
merge xs [] = xs
merge [] ys = ys
merge (x:xs) (y:ys) =
  if x <= y
     then x : merge xs (y:ys)
     else y : merge (x:xs) ys

----------------------
------ COUNTING ------
----------------------

type Sales = [(Product,Int)]

salesOf :: Product -> Sales -> Int
salesOf p xs = fromMaybe 0 $ lookup p xs

addSale :: Product -> Sales -> Sales
addSale p [] = [(p,1)]
addSale p ((x,n):xs) = if p == x then (x,n+1):xs
                                 else (x,n) : addSale p xs
                        
-- | Given a list of sales, returns the most sold product along with
--   its number of sales.
mostSold :: Sales -> Maybe (Product,Int)
mostSold [] = Nothing
mostSold xs = Just $ maximumBy (\x y -> snd x `compare` snd y) xs

sales :: Log -> Sales
sales = foldr (addSale . entryProduct) []

----------------------
-------- MAIN --------
----------------------

-- show
main :: IO ()
main = do
  file1 <- B.readFile logFile
  file2 <- B.readFile logFile2
  let r = do xs <- parseOnly logParser  file1
             ys <- parseOnly logParser2 file2
             return $ merge xs ys
  case r of
   Left err -> putStrLn $ "A parsing error was found: " ++ err
   Right log ->
     case mostSold (sales log) of
       Nothing -> putStrLn "We didn't sell anything yet."
       Just (p,n) -> putStrLn $ "The product with the most sales is " ++ show p
                  ++ " with " ++ show n ++ " sales."
-- /show

From log file to CSV

CSV (Comma Separated Values) files store tabular data and can be read by a large number of applications. In fact, one of the advantages of the CSV format is that data stored in it can be exchanged between very different programs. After gathering all the log file information, we are going to render a CSV table containing it. Then, we will develop a parser to get the data back into Haskell.

Rendering to CSV

The process of rendering to CSV is straightforward. Rendering is in general simpler than parsing, and CSV rendering is no exception.

We define rendering methods for each type, as we defined parsers for each type. Sometimes, the renderer looks similar to the parser (see renderIP below).

Some functions useful when rendering:

  • <>: This operator from Data.Monoid appends two values of any type that is an instance of the Monoid class. ByteString is one such type.
  • foldMap: Applies a function to each element of a structure that is an instance of the Foldable class, producing values of a type that is an instance of the Monoid class, and then appends all the results.
  • fromString: Takes a String and returns it as a value of any type in the IsString class, defined in Data.String.
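
To get a feeling for these three functions before reading the renderer, here is a minimal sketch that uses them on small values. The names fileName and numbers are made up for this illustration.

```haskell
import Data.ByteString.Char8 (ByteString, singleton)
import Data.Foldable (foldMap)
import Data.Monoid ((<>))
import Data.String (fromString)

-- (<>) appends ByteStrings; fromString builds them from Strings.
fileName :: ByteString
fileName = fromString "log" <> singleton '.' <> fromString "csv"

-- foldMap renders each element and appends the results. Here every
-- number becomes its decimal form followed by a semicolon.
numbers :: ByteString
numbers = foldMap (\n -> fromString (show n) <> singleton ';') [1,2,3 :: Int]

-- fileName holds the bytes of "log.csv"
-- numbers  holds the bytes of "1;2;3;"
```

The renderLog function in the listing below follows the same pattern as numbers, with a log entry in place of each Int and a newline in place of the semicolon.
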
{-# START_FILE sellings.log #-}
2013-06-29 11:16:23 124.67.34.60 keyboard
2013-06-29 11:32:12 212.141.23.67 mouse
2013-06-29 11:33:08 212.141.23.67 monitor
2013-06-29 12:12:34 125.80.32.31 speakers
2013-06-29 12:51:50 101.40.50.62 keyboard
2013-06-29 13:10:45 103.29.60.13 mouse
2013-06-29 16:40:15 154.41.32.99 monitor internet
2013-06-29 16:51:12 103.29.60.13 keyboard internet
2013-06-29 17:13:21 121.95.68.21 speakers friend
2013-06-29 18:20:10 190.80.70.60 mouse noanswer
2013-06-29 18:51:23 102.42.52.64 speakers friend
2013-06-29 19:01:11 78.46.64.23 mouse internet

{-# START_FILE sellings2.log #-}
154.41.32.99 29/06/2013 15:32:23 4 internet
76.125.44.33 29/06/2013 16:56:45 3 noanswer
123.45.67.89 29/06/2013 18:44:29 4 friend
100.23.32.41 29/06/2013 19:01:09 1 internet
151.123.45.67 29/06/2013 20:30:13 2 internet

{-# START_FILE Main.hs #-}
{-# LANGUAGE OverloadedStrings #-}

import Data.Word
import Data.Time
import Data.Attoparsec.Char8
import Control.Applicative
-- show
import Data.ByteString.Char8 (ByteString,singleton)
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC
import Data.String
import Data.Char (toLower)
import Data.Monoid hiding (Product)
import Data.Foldable (foldMap)
-- /show

-----------------------
------ SETTINGS -------
-----------------------

-- | File where the log is stored.
logFile :: FilePath
logFile = "sellings.log"

-- | Second file where the log is stored.
logFile2 :: FilePath
logFile2 = "sellings2.log"

-----------------------
-------- TYPES --------
-----------------------

-- | Type for IP's.
data IP = IP Word8 Word8 Word8 Word8 deriving (Eq,Show)

-- | Type for products.
data Product = Mouse | Keyboard | Monitor | Speakers deriving (Eq,Show,Enum)

productFromID :: Int -> Product
productFromID n = toEnum (n-1)

data Source = Internet | Friend | NoAnswer deriving (Eq,Show)

data LogEntry =
  LogEntry { entryTime :: LocalTime
           , entryIP   :: IP
           , entryProduct   :: Product
           , source    :: Source
               -- We derive Eq since it is needed to be able
               -- to write an instance of Ord.
             } deriving (Eq, Show)

instance Ord LogEntry where
  le1 <= le2 = entryTime le1 <= entryTime le2

type Log = [LogEntry]

-----------------------
------- PARSING -------
-----------------------

-- | Parser of values of type 'IP'.
parseIP :: Parser IP
parseIP = do
  d1 <- decimal
  char '.'
  d2 <- decimal
  char '.'
  d3 <- decimal
  char '.'
  d4 <- decimal
  return $ IP d1 d2 d3 d4

-- | Parser of values of type 'LocalTime'.
timeParser :: Parser LocalTime
timeParser = do
  y  <- count 4 digit
  char '-'
  mm <- count 2 digit
  char '-'
  d  <- count 2 digit
  char ' '
  h  <- count 2 digit
  char ':'
  m  <- count 2 digit
  char ':'
  s  <- count 2 digit
  return $
    LocalTime { localDay = fromGregorian (read y) (read mm) (read d)
              , localTimeOfDay = TimeOfDay (read h) (read m) (read s)
                }

-- | Parser of values of type 'Product'.
productParser :: Parser Product
productParser =
     (string "mouse"    >> return Mouse)
 <|> (string "keyboard" >> return Keyboard)
 <|> (string "monitor"  >> return Monitor)
 <|> (string "speakers" >> return Speakers)

sourceParser :: Parser Source
sourceParser =
      (string "internet" >> return Internet)
  <|> (string "friend" >> return Friend)
  <|> (string "noanswer" >> return NoAnswer)

-- | Parser of log entries.
logEntryParser :: Parser LogEntry
logEntryParser = do
  t <- timeParser
  char ' '
  ip <- parseIP
  char ' '
  p <- productParser
  s <- option NoAnswer $ char ' ' >> sourceParser
  return $ LogEntry t ip p s

logParser :: Parser Log
logParser = many $ logEntryParser <* endOfLine

timeParser2 :: Parser LocalTime
timeParser2 = do
  d  <- count 2 digit
  char '/'
  mm <- count 2 digit
  char '/'
  y  <- count 4 digit
  char ' '
  h  <- count 2 digit
  char ':'
  m  <- count 2 digit
  char ':'
  s  <- count 2 digit
  return $
    LocalTime { localDay = fromGregorian (read y) (read mm) (read d)
              , localTimeOfDay = TimeOfDay (read h) (read m) (read s)
                }

productParser2 :: Parser Product
productParser2 = productFromID . read . (:[]) <$> digit

logEntryParser2 :: Parser LogEntry
logEntryParser2 = do
  ip <- parseIP
  char ' '
  t <- timeParser2
  char ' '
  p <- productParser2
  char ' ' 
  s <- sourceParser
  return $ LogEntry t ip p s

logParser2 :: Parser Log
logParser2 = many $ logEntryParser2 <* endOfLine

-----------------------
------- MERGING -------
-----------------------

merge :: Ord a => [a] -> [a] -> [a]
merge xs [] = xs
merge [] ys = ys
merge (x:xs) (y:ys) =
  if x <= y
     then x : merge xs (y:ys)
     else y : merge (x:xs) ys

-- show
-----------------------
------ RENDERING ------
-----------------------

-- | Character that will serve as field separator.
--   It should not be one of the characters that
--   appear in the fields.
sepChar :: Char
sepChar = ','

-- | Rendering of IP's to ByteString.
renderIP :: IP -> ByteString
renderIP (IP a b c d) =
     -- Function @show@ creates a String and
     -- fromString makes it a ByteString.
     fromString (show a)
  <> singleton '.'
  <> fromString (show b)
  <> singleton '.'
  <> fromString (show c)
  <> singleton '.'
  <> fromString (show d)

-- | Render a log entry to a CSV row as ByteString.
renderEntry :: LogEntry -> ByteString
renderEntry le =
     fromString (show $ entryTime le)
  <> singleton sepChar
  <> renderIP (entryIP le)
  <> singleton sepChar
     -- We use @fmap toLower@ to write the product name
     -- in lowercase letters.
  <> fromString (fmap toLower $ show $ entryProduct le)
  <> singleton sepChar
  <> fromString (fmap toLower $ show $ source le)

-- | Render a log file to CSV as ByteString.
renderLog :: Log -> ByteString
renderLog = foldMap $ \le -> renderEntry le <> singleton '\n'

----------------------
-------- MAIN --------
----------------------

main :: IO ()
main = do
  file1 <- B.readFile logFile
  file2 <- B.readFile logFile2
          -- We are using the Either monad here.
  let r = do xs <- parseOnly logParser  file1
             ys <- parseOnly logParser2 file2
             return $ merge xs ys
  case r of
   Left err -> putStrLn $ "A parsing error was found: " ++ err
   Right log -> BC.putStrLn $ renderLog log
-- /show
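
The renderer above relies on sepChar never occurring inside a field. RFC 4180 handles the general case by enclosing such fields in double quotes and writing any double quote inside them twice. A minimal sketch of that rule follows, working on String for brevity. The helpers quoteField and renderRow are hypothetical, not part of the program above, and the field parsers of this tutorial do not understand the quoted form.

```haskell
import Data.List (intercalate)

-- Hypothetical helper: quote a field only when it needs it, that is,
-- when it contains the separator, a double quote or a line break.
quoteField :: Char -> String -> String
quoteField sep field
  | any (`elem` [sep, '"', '\n', '\r']) field =
      "\"" ++ concatMap escape field ++ "\""
  | otherwise = field
  where
    -- A double quote inside a quoted field is written twice.
    escape '"' = "\"\""
    escape c   = [c]

-- Hypothetical helper: render a list of fields as one CSV row.
renderRow :: Char -> [String] -> String
renderRow sep = intercalate [sep] . map (quoteField sep)
```

For the log entries of this tutorial none of the fields can contain a comma, so the simpler renderer is sufficient.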

Parsing from CSV

Again, as with log files, we use attoparsec for parsing. Note that the CSV format is similar to the log format, except in how fields are separated. Therefore, we can re-use our field parsers.

We start by defining a parser for rows, and then we iterate it using many, exactly as before.

{-# START_FILE sellings.csv #-}
2013-06-29 11:16:23 , 124.67.34.60  , keyboard , noanswer
2013-06-29 11:32:12 , 212.141.23.67 , mouse    , noanswer
2013-06-29 11:33:08 , 212.141.23.67 , monitor  , noanswer
2013-06-29 12:12:34 , 125.80.32.31  , speakers , noanswer
2013-06-29 12:51:50 , 101.40.50.62  , keyboard , noanswer
2013-06-29 13:10:45 , 103.29.60.13  , mouse    , noanswer
2013-06-29 15:32:23 , 154.41.32.99  , speakers , internet
2013-06-29 16:40:15 , 154.41.32.99  , monitor  , internet
2013-06-29 16:51:12 , 103.29.60.13  , keyboard , internet
2013-06-29 16:56:45 , 76.125.44.33  , monitor  , noanswer
2013-06-29 17:13:21 , 121.95.68.21  , speakers , friend
2013-06-29 18:20:10 , 190.80.70.60  , mouse    , noanswer
2013-06-29 18:44:29 , 123.45.67.89  , speakers , friend
2013-06-29 18:51:23 , 102.42.52.64  , speakers , friend
2013-06-29 19:01:09 , 100.23.32.41  , mouse    , internet
2013-06-29 19:01:11 , 78.46.64.23   , mouse    , internet
2013-06-29 20:30:13 , 151.123.45.67 , keyboard , internet

{-# START_FILE Main.hs #-}
{-# LANGUAGE OverloadedStrings #-}

import Data.Word
import Data.Time
import Data.Attoparsec.Char8
import Control.Applicative
import qualified Data.ByteString as B
-- show
-----------------------
------ SETTINGS -------
-----------------------

-- | File where the CSV is stored.
csvFile :: FilePath
csvFile = "sellings.csv"

-- | Character that will serve as field separator.
--   It should not be one of the characters that
--   appear in the fields.
sepChar :: Char
sepChar = ','
-- /show

-----------------------
-------- TYPES --------
-----------------------

-- | Type for IP's.
data IP = IP Word8 Word8 Word8 Word8 deriving (Eq,Show)

-- | Type for products.
data Product = Mouse | Keyboard | Monitor | Speakers deriving (Eq,Show,Enum)

data Source = Internet | Friend | NoAnswer deriving (Eq,Show)

data LogEntry =
  LogEntry { entryTime :: LocalTime
           , entryIP   :: IP
           , entryProduct   :: Product
           , source    :: Source
               -- We derive Eq since it is needed to be able
               -- to write an instance of Ord.
             } deriving (Eq, Show)

type Log = [LogEntry]

-- show
-----------------------
------- PARSING -------
-----------------------
-- /show

-- | Parser of values of type 'IP'.
parseIP :: Parser IP
parseIP = do
  d1 <- decimal
  char '.'
  d2 <- decimal
  char '.'
  d3 <- decimal
  char '.'
  d4 <- decimal
  return $ IP d1 d2 d3 d4

-- | Parser of values of type 'LocalTime'.
timeParser :: Parser LocalTime
timeParser = do
  y  <- count 4 digit
  char '-'
  mm <- count 2 digit
  char '-'
  d  <- count 2 digit
  char ' '
  h  <- count 2 digit
  char ':'
  m  <- count 2 digit
  char ':'
  s  <- count 2 digit
  return $
    LocalTime { localDay = fromGregorian (read y) (read mm) (read d)
              , localTimeOfDay = TimeOfDay (read h) (read m) (read s)
                }
                
-- | Parser of values of type 'Product'.
productParser :: Parser Product
productParser =
     (string "mouse"    >> return Mouse)
 <|> (string "keyboard" >> return Keyboard)
 <|> (string "monitor"  >> return Monitor)
 <|> (string "speakers" >> return Speakers)

sourceParser :: Parser Source
sourceParser =
      (string "internet" >> return Internet)
  <|> (string "friend" >> return Friend)
  <|> (string "noanswer" >> return NoAnswer)

-- show
rowParser :: Parser LogEntry
rowParser = do
  -- Parser of field separators. It skips space characters before
  -- and after the CSV separator char.
  -- Only plain spaces and tabs are considered space characters here.
  let spaceSkip = many $ satisfy $ inClass [ ' ' , '\t' ]
      sepParser = spaceSkip >> char sepChar >> spaceSkip
  -- Skip spaces at the beginning of the line.
  spaceSkip
  t  <- timeParser
  sepParser
  ip <- parseIP
  sepParser
  p  <- productParser
  sepParser
  s  <- sourceParser
  -- Skip remaining spaces at the end of the line
  spaceSkip
  return $ LogEntry t ip p s

csvParser :: Parser Log
csvParser = many $ rowParser <* endOfLine

----------------------
-------- MAIN --------
----------------------

main :: IO ()
main = do
  file <- B.readFile csvFile
  case parseOnly csvParser file of
    Left err -> putStrLn $ "Error while parsing CSV file: " ++ err
    Right log -> mapM_ print log
-- /show

Using CSV across applications

Use renderLog and Data.ByteString.Char8.writeFile to write a CSV table from your log information. However, if you are using a character set other than ASCII or ISO-8859-15, you should consider using the type Text instead of ByteString. Almost the only change you have to make is to replace the import of Data.Attoparsec.Char8 with Data.Attoparsec.Text (both modules export similar interfaces and are largely interchangeable) and to adapt the types of the renderer.
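
The reason for this advice is that the functions in Data.ByteString.Char8 keep only the lowest 8 bits of each character, so any character above code point 255 is silently damaged. The following sketch makes this visible; the helper char8Bytes is made up for this demonstration.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC
import Data.Word (Word8)

-- The bytes that Data.ByteString.Char8 produces for a String.
char8Bytes :: String -> [Word8]
char8Bytes = B.unpack . BC.pack

-- ASCII survives unchanged:
--   char8Bytes "log"   == [108,111,103]
-- The euro sign is U+20AC (8364). Only its low byte 0xAC (172) is
-- kept, so the original character can no longer be recovered:
--   char8Bytes "\8364" == [172]
```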

Once you have written your data in CSV format, import it from another application. Use the table, make any changes you want, and bring the modified data back into Haskell by parsing the CSV output of your application. Make sure your application and the Haskell parser are using the same column separator.

Final App: Read several log files, merge data and render it in CSV

We now present a runnable application that reads a list of log files, merges them and returns the result as a CSV table. Each file may be read from a local path or from a URL.

{-# START_FILE sellings.log #-}
2013-06-29 11:16:23 124.67.34.60 keyboard
2013-06-29 11:32:12 212.141.23.67 mouse
2013-06-29 11:33:08 212.141.23.67 monitor
2013-06-29 12:12:34 125.80.32.31 speakers
2013-06-29 12:51:50 101.40.50.62 keyboard
2013-06-29 13:10:45 103.29.60.13 mouse
2013-06-29 16:40:15 154.41.32.99 monitor internet
2013-06-29 16:51:12 103.29.60.13 keyboard internet
2013-06-29 17:13:21 121.95.68.21 speakers friend
2013-06-29 18:20:10 190.80.70.60 mouse noanswer
2013-06-29 18:51:23 102.42.52.64 speakers friend
2013-06-29 19:01:11 78.46.64.23 mouse internet

{-# START_FILE sellings2.log #-}
2013-06-29 15:32:23 154.41.32.99 speakers internet
2013-06-29 16:56:45 76.125.44.33 monitor noanswer
2013-06-29 18:44:29 123.45.67.89 speakers friend
2013-06-29 19:01:09 100.23.32.41 mouse internet
2013-06-29 20:30:13 151.123.45.67 keyboard internet

{-# START_FILE Main.hs #-}
{-# LANGUAGE OverloadedStrings #-}

import Data.Word
import Data.Time
import Data.Attoparsec.Char8
import Control.Applicative
import Data.Either (rights)
import Data.Monoid hiding (Product)
import Data.String
import Data.Char (toLower)
import Data.Foldable (foldMap)
-- ByteString stuff
import Data.ByteString.Char8 (ByteString,singleton)
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC
import Data.ByteString.Lazy (toChunks)
-- HTTP protocol to perform downloads
import Network.HTTP.Conduit

----------------------
------- FILES --------
----------------------

data File = URL String | Local FilePath

-- | Files where the logs are stored.
--   Modify this value to read logs from
--   other sources.
logFiles :: [File]
logFiles =
  [ Local "sellings.log"
  , Local "sellings2.log"
  , URL "http://daniel-diaz.github.io/misc/sellings3.log"
    ]

getFile :: File -> IO ByteString
-- simpleHttp gets a lazy bytestring, while we
-- are using strict bytestrings.
getFile (URL str) = mconcat . toChunks <$> simpleHttp str
getFile (Local fp) = B.readFile fp

-----------------------
-------- TYPES --------
-----------------------

-- | Type for IP's.
data IP = IP Word8 Word8 Word8 Word8 deriving (Eq,Show)

-- | Type for products.
data Product = Mouse | Keyboard | Monitor | Speakers deriving (Eq,Show,Enum)

productFromID :: Int -> Product
productFromID n = toEnum (n-1)

data Source = Internet | Friend | NoAnswer deriving (Eq,Show)

-- | Each log entry in the log file is represented by a value
--   of this type. Modify the fields of 'LogEntry' according
--   to your log file of interest. However, 'entryTime' is a
--   reasonable field to keep and is also used for merging.
data LogEntry =
  LogEntry { entryTime :: LocalTime
           , entryIP   :: IP
           , entryProduct   :: Product
           , source    :: Source
             } deriving (Eq, Show)

instance Ord LogEntry where
  le1 <= le2 = entryTime le1 <= entryTime le2

type Log = [LogEntry]

-----------------------
------- PARSING -------
-----------------------

-- | Parser of values of type 'IP'.
parseIP :: Parser IP
parseIP = do
  d1 <- decimal
  char '.'
  d2 <- decimal
  char '.'
  d3 <- decimal
  char '.'
  d4 <- decimal
  return $ IP d1 d2 d3 d4

-- | Parser of values of type 'LocalTime'.
timeParser :: Parser LocalTime
timeParser = do
  y  <- count 4 digit
  char '-'
  mm <- count 2 digit
  char '-'
  d  <- count 2 digit
  char ' '
  h  <- count 2 digit
  char ':'
  m  <- count 2 digit
  char ':'
  s  <- count 2 digit
  return $
    LocalTime { localDay = fromGregorian (read y) (read mm) (read d)
              , localTimeOfDay = TimeOfDay (read h) (read m) (read s)
                }

-- | Parser of values of type 'Product'.
productParser :: Parser Product
productParser =
     (string "mouse"    >> return Mouse)
 <|> (string "keyboard" >> return Keyboard)
 <|> (string "monitor"  >> return Monitor)
 <|> (string "speakers" >> return Speakers)

sourceParser :: Parser Source
sourceParser =
      (string "internet" >> return Internet)
  <|> (string "friend" >> return Friend)
  <|> (string "noanswer" >> return NoAnswer)

-- | Parser of log entries.
logEntryParser :: Parser LogEntry
logEntryParser = do
  t <- timeParser
  char ' '
  ip <- parseIP
  char ' '
  p <- productParser
  s <- option NoAnswer $ char ' ' >> sourceParser
  return $ LogEntry t ip p s

logParser :: Parser Log
logParser = many $ logEntryParser <* endOfLine

-----------------------
------- MERGING -------
-----------------------

merge :: Ord a => [a] -> [a] -> [a]
merge xs [] = xs
merge [] ys = ys
merge (x:xs) (y:ys) =
  if x <= y
     then x : merge xs (y:ys)
     else y : merge (x:xs) ys

-----------------------
------ RENDERING ------
-----------------------

-- | Character that will serve as field separator.
--   It should not be one of the characters that
--   appear in the fields.
sepChar :: Char
sepChar = ','

-- | Rendering of IP's to ByteString.
renderIP :: IP -> ByteString
renderIP (IP a b c d) =
     fromString (show a)
  <> singleton '.'
  <> fromString (show b)
  <> singleton '.'
  <> fromString (show c)
  <> singleton '.'
  <> fromString (show d)

-- | Render a log entry to a CSV row as ByteString.
renderEntry :: LogEntry -> ByteString
renderEntry le =
     fromString (show $ entryTime le)
  <> singleton sepChar
  <> renderIP (entryIP le)
  <> singleton sepChar
  <> fromString (fmap toLower $ show $ entryProduct le)
  <> singleton sepChar
  <> fromString (fmap toLower $ show $ source le)

-- | Render a log file to CSV as ByteString.
renderLog :: Log -> ByteString
renderLog = foldMap $ \le -> renderEntry le <> singleton '\n'

----------------------
-------- MAIN --------
----------------------

main :: IO ()
main = do
  files <- mapM getFile logFiles
  let -- Parsed logs
      logs :: [Log]
      logs = rights $ fmap (parseOnly logParser) files
      -- Merged log
      mergedLog :: Log
      mergedLog = foldr merge [] logs
  BC.putStrLn $ renderLog mergedLog
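
Note that rights silently drops every file that fails to parse. If that is not what you want, partitionEithers from Data.Either keeps both the error messages and the successful results. Below is a minimal sketch that uses plain Either String values in place of real parseOnly results; ParseResult and splitResults are hypothetical names, and a list of Ints stands in for a Log.

```haskell
import Data.Either (partitionEithers)

-- Stand-in for the result of 'parseOnly' on one file: either an
-- error message or a parsed value (a list of Ints instead of a Log).
type ParseResult = Either String [Int]

-- Hypothetical helper: separate the failures from the parsed logs,
-- so that the caller can report the former and merge the latter.
splitResults :: [ParseResult] -> ([String], [[Int]])
splitResults = partitionEithers

-- splitResults [Right [1,2], Left "bad file", Right [3]]
--   == (["bad file"], [[1,2],[3]])
```

In main, the list of error messages could then be printed before the merged log is rendered.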

Conclusion

Parsing is one of the tasks that Haskell is really good at. Parser code is often much clearer and easier to write than in more traditional languages, and it may even run faster than a C++ parser. I invite you to try parsing bigger things. With the API reference at hand, it should not be hard. As an example, Bryan O'Sullivan wrote an HTTP parser here. I think it is easy to read once you know how HTTP is defined.