2 Jun 2013

If you're coming from a language like C, Haskell can take some getting used to. It's typical for a new language to feel a little different, but in Haskell the differences are more dramatic, and more fundamental. In particular...

## Where are the `for`

loops?

In most imperative languages, `for`

loops are all over the place, and are used for a wide variety of *different things*. Whether you're squaring every value of an array or finding its sum, you're probably using a `for`

loop.

In Haskell, control structures are more expressive. Sure, there's a counterpart to C's `for`

(Haskell's `forM_`

). But that's a discussion for another time. Today, we'll see some `for`

loops in C that can be written very differently in Haskell, and why that's a good thing.

Consider the simple example of computing the norm of a vector. For a vector **x**∈ℝ^{n}, this is just

Conceptually, there are three stages to this computation:

: Square each element`mapSq`

: Compute the sum`sum`

: Compute the square root`sqrt`

We can think of the first step as building (at least abstractly) a new array whose `i`

th element is `y[i] = x[i] * x[i]`

. In functional programming, we refer to this as *mapping* the square function over the array.

Putting everything in terms of functions, we can write this (in Haskell-like pseudocode) as

`norm(x) = sqrt(sum(mapSq(x))) ,`

To clean up the syntax a bit, we can instead use the notation for function composition and write

`norm(x) = (sqrt ○ sum ○ mapSq)(x) ,`

or just

`norm = sqrt ○ sum ○ mapSq .`

## Computing Norms in C

In C, this modular approach leads to something like

```
void mapSq(double *x, double *y, int n) {
int i;
for(i=0; i<n; i++) {
y[i] = x[i] * x[i];
}
}
double sum(double *x, int n) {
double result = 0;
int i;
for(i=0; i<n; i++) {
result += x[i];
}
return result;
}
double norm1(double *x, int n) {
double *y = malloc(n * sizeof(double));
mapSq(x, y, n);
theSum = sum(y, n);
free(y);
return sqrt(theSum);
}
```

We'd probably never do it this way, for a few reasons:

- For such a simple computation, it's way too verbose.
- It uses two separate
`for`

loops, where only one is needed. - There's an unnecessary
`malloc`

that's unlikely to be removed by the compiler.

If it weren't for these issues, we probably would write code like the above. It has the advantage of being more modular and of faithfully representing the concepts. But the problems are too big to ignore, so we typically make some code transformations in our head, and instead write

```
double norm2(double *x, int n) {
double theSum = 0.0;
for (int i = 0; i < n; i++) {
theSum += x[i]*x[i]; }
return sqrt(theSum);
}
```

This code is very clear, and performs well, but we've entirely lost modularity. In this case the code is very short, so we don't give the compromise a second thought. But at a larger scale, this kind of manual optimization reduces code reuse and makes components more complex. Code becomes harder to understand, and harder to test for correctness.

## Computing Norms in Haskell

In Haskell, we could write `norm`

as

```
import Prelude hiding (sum)
import Data.List (foldl')
-- show
mapSq x = map sq x
where
sq xi = xi * xi
sum x = foldl' (+) 0.0 x
norm x = (sqrt . sum . mapSq) x
main = print (norm [1,2,3,4,5])
```

While the C example is in terms of arrays, this Haskell example instead uses *lazy linked lists*. We'll change this in a bit.

Let's step through this code. The first function, `mapSq`

, is defined in terms of the `map`

function, and produces a new list by squaring every element of the original list. The squaring function `sq`

can also be written as `(\x -> x*x)`

, allowing us to write the function as

mapSq x = map (\x -> x*x) x

or simply

mapSq = map (\x -> x*x)

Next, the `sum`

function is defined in terms of `foldl'`

. Before you worry too much about the name, you should know that a *fold* is just a function that traverses some data structure, using a given function to update some *accumulator* as it goes. In the current case, the accumulator is the counterpart of "`theSum`

" in the C version. Accumulation is via addition, and starts at 0.0.

Lists can be folded from either side. In our case, we're using `foldl'`

. The "`l`

" in this function name is for "left", the side we're starting on, and the "`'`

" indicates that this fold is *strict*. Haskell is lazy by default, but for numeric there's no point in delaying evaluation of the accumulator (and for large lists it can lead to a stack overflow), so in this case we prefer to evaluate it at every step.

As it turns out, our `sum`

function is common enough that it's included in Haskell's Prelude; we can use it exactly as described with no need to define it.^{1} Our `norm`

function is now simple:

```
norm = sqrt . sum . map (\x -> x*x)
main = print (norm [1,2,3,4,5])
```

This looks remarkably like the mathematical specification; some have affectionately refered to Haskell as an *executable specification language*.

## But what about performance?

Let's compare performance with C. In order to measure easily across languages, we'll make the vector much bigger - lets say a billion elements. We'll start with a standard list implementation in Haskell and compare this to an implementation in C using arrays. Then we'll make a couple of minor changes in Haskell that allow us to use unboxed vectors.

We'll test using an AMD A6-3670 APU running Ubuntu 13.04. For compilation, we'll use GHC 7.6.3 (with `-O2`

) and GCC 4.7.3 (with `-O3`

).

### Haskell - Lists

Our Haskell version is certainly elegant, but using linked lists in this way is not the best approach for performance. For a billion elements, the running time is **over 2 minutes**. What's going on here?

Haskell lists are *lazy* (only those elements required by later calculations are computed) and *polymorphic* (you can have a list of elements of any type, even if elements of that type don't take up a fixed number of bytes). In order to implement this, a list in Haskell is really a *list of pointers to elements*. In cases like this, all that pointer chasing adds up.

We'll soon see an easy way to improve this and get Haskell running really fast.

### C - Arrays

Before we get Haskell zipping along, let's look at how we would do this in C. We'll use the faster version of our norm in C (`norm2`

above). Here's the plan:

`malloc`

an array of a billion elements- Fill the array so that
`x[i] = i`

- Find the norm of the vector
- Free the array and print the result

We'll use the OS to measure the time of the whole process, and insert some code to measure step (3) alone. So all together we have this:

```
#include <time.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#define x_size 1000000000
double norm(double *x, int n) {
double theSum = 0.0;
int i;
for (i=0; i<n; i++) {
theSum += x[i]*x[i];
}
return sqrt(theSum);
}
int main() {
double *x = malloc(x_size * sizeof(double));
int i;
for(i=0; i<x_size; i++) {
x[i] = (double) i;
}
clock_t start = clock();
double result = norm(x, x_size);
clock_t end = clock();
printf("%3.2f\n", result);
printf("%3.2f sec\n", (end-start)/(double)CLOCKS_PER_SEC);
free(x);
return 0;
}
```

The whole thing takes about **7 seconds** to run, with the `norm`

function itself taking **1.9 seconds**.

### Haskell - Unboxed Vectors

To close in on C, let's make some minor changes. Instead of using lists, we'll use unboxed *vector*s. The "unboxed" qualifier just means that the elements are available directly, without the need to follow a pointer.

From a data structure standpoint, Haskell's vectors are a lot like C's arrays. There is an important difference, though; many operations on vectors (even boxed ones) are subject to *fusion*. A `map`

followed by a `sum`

will be represented as a tight loop, but the code can remain modular. Remember the malloc in our original C code? Through fusion, it's eliminated entirely.

Updating the original code to operate on vectors is straightforward. And To stay entirely outside the realm of lists, we can generate our vector using `iterateN`

, which plays the role of our data-filling loop in C.

[EDIT: Because School of Haskell compilation does not use `-O2`

, there is no fusion, so we'll need to keep the values a bit smaller. In this case, I'll use an unusually small value for `oneBillion`

, though for benchmarking I still use 10^{9}]

```
import Data.Vector.Unboxed as V
norm :: Vector Double -> Double
norm = sqrt . V.sum . V.map (\x -> x*x)
oneBillion = 100
main = print $ norm $ V.iterateN oneBillion (+1) 0.0
```

This code runs in about **2.5 seconds**. It's just a bit longer than C's inner loop alone, but much less than the 7 seconds if the malloc is included.

EDIT: As Pedro Vasconcelos points out, tight numerical loops like this can often benefit from using the LLVM back-end, via `-fllvm`

. This brings the time for the entire Haskell run down to **1.9 seconds**, the same as the inner loop alone in C!

## Conclusion

Haskell's control structures express that a reduction (a *fold* in Haskell) is very different than a *map*.^{2} Libraries like **vector** implement powerful fusion techniques to combine loops and eliminate intermediate data structures. And *unboxing* elminates excessive pointer chasing required by built-in lists.

All in all, this lets us write *high-level code without sacrificing high performance*. The need for any compromise between the two is becoming more rare every day.

^{1}The Prelude's `sum` function is not defined in terms of `foldl'`, but instead relies on GHC's strictness analysis. In GHC 7.6.3 there's a regression, as described in [this ticket](http://hackage.haskell.org/trac/ghc/ticket/7954). Because of this, we stuck with `foldl' (+) 0.0` in timing our Haskell version using lists.

^{2}EDIT: Michael Sloan has pointed out that the difference is not so fundamental, since `map f = foldr ((:) . f) []`.