At FP Complete we develop many tools for our clients to help them achieve their goals. Most of these tools are written in Haskell (and, more recently, some are in Rust), and that has helped us write them more quickly and produce more maintainable apps going forward.
In this post I would like to describe one of those tools, which should be of interest to any person or company with a current or legacy SQL Server database.
A client was migrating from SQL Server to PostgreSQL across the board. However, like many large companies, they had databases spread across the country on different servers, used by many people in different departments. It is not easy to simply replace a database that many people are using directly. They still wanted to provide a SQL Server query interface to their various departments, so they asked us whether we could develop a standalone service that could pretend to be SQL Server but really talk to any JSON service behind the scenes. We did! This article takes a look at how it was done.
Requirements and Architecture
The high-level architecture of the system looks like this:
So, a user of the system would write a regular SQL Server query, such as `select 'foo' as 'bar'`.
In order to implement a fake SQL Server, we need:

- an understanding of the wire protocol that SQL Server and its clients speak; and
- a client program with which to test our server.
I looked into the protocol used by SQL Server and its clients. It’s called TDS, which means “tabular data stream”. It was initially designed and developed by Sybase Inc. for their Sybase SQL Server relational database engine in 1984. Later it was adopted by Microsoft for Microsoft SQL Server. It is a fairly straightforward binary protocol, as binary protocols go. Microsoft have documentation for the TDS protocol, which you can eyeball in your own time.
As for a client program, today it is much easier to get access to Microsoft's client libraries outside of Windows, thanks to the release of their ODBC package: you can now get it on Linux and macOS! I was surprised too. This made it easy to write a test suite for our server.
How TDS works
TDS is a binary protocol. Communication is done using messages, where each message is a sequence of so-called packets, each consisting of a header and (usually, but not always) a payload:
The header is 8 bytes long and described by the table below. I’ve crossed out the ones we don’t use.
|Field|Type|Description|
|---|---|---|
|Type|Word8|Not used by us|
|Status|Word8|0 = normal message, 1 = end of message|
|Length|Word16|Length of packet, big-endian|
|SPID|Word16|Not used by us|
|PacketID|Word8|Not used by us and ignored by SQL Server|
|Window|Word8|Not used, should be ignored|
It turns out we only need the Status and Length fields.
A typical scenario goes like this:

- The client connects and sends a pre-login message; the server acknowledges it.
- The client sends a login message; the server accepts the login.
- The client sends SQL batch messages; the server replies with results.
We dealt with both types of messages, but we’ll just look at simple batch queries in this article.
My second step was to start up a Haskell project and plan the libraries that I would use for this task.
I used these Haskell packages:

- conduit and conduit-extra, for streaming I/O
- attoparsec, for parsing the binary protocol
- bytestring, for efficiently rendering binary output
In diagram form, that pipeline looks like this:
We’ll look at each one in detail.
Stream processing: conduit
Conduit is used to achieve streaming. Other languages have similar concepts: Rust calls them iterators, and Python has generators (data producers) and coroutines (data consumers). In Haskell it’s implemented as a normal library. It is a conceptually simple streaming API, consisting of two really key pieces:

- `yield`, which sends a value downstream; and
- `await`, which waits for a value from upstream.
We make use of these things in the server and so do all the libraries that I use for this task.
Consider an example pipeline like the following:
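As a minimal sketch of such a pipeline (reading from stdin rather than a socket, and not the original example from this post):

```haskell
import Conduit
import qualified Data.ByteString.Char8 as S8

main :: IO ()
main =
  runConduit $
       stdinC                     -- source: chunks of bytes from stdin
    .| linesUnboundedAsciiC       -- conduit: split the chunks into lines
    .| filterC (not . S8.null)    -- conduit: drop empty lines
    .| mapM_C S8.putStrLn         -- sink: consume, printing each line
```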
The source may be a socket, a file or a database. These three kinds of I/O sources can yield chunks of bytes, or e.g. rows from a database query. In a conduit pipeline, the sink (the final thing in the pipeline) drives the computation, creating a domino effect back down the pipeline every time it awaits input, all the way back to the source, which consults some external provider (the socket, file or database).
I used conduit-extra because you can put together a listening service in about 3 lines. That looks like this:
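A sketch of such a service (the port number is a stand-in, and this is an echo server rather than the original code) looks roughly like:

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Conduit (runConduit, (.|))
import Data.Conduit.Network

-- Listen on port 2019 and echo every byte back to the client,
-- running the handler for each accepted connection.
main :: IO ()
main =
  runTCPServer (serverSettings 2019 "*") $ \app ->
    runConduit (appSource app .| appSink app)
```

The handler runs until the client disconnects; `appSource` and `appSink` are the socket's byte stream wrapped as conduits.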
In this example, the source conduit coming from the socket is piped into a sink that writes back to the socket.
Running it looks like this:
We’re going to build up a simple example server incrementally.
If you want to follow along and run these Haskell examples on your computer, they can be run conveniently as regular scripts when you have Stack installed.
Most libraries in Haskell pride themselves on type-safety and streaming libraries like Conduit are no exception. Let’s have a brief look at that. This is the type for a conduit: `ConduitT i o m r`.
It means: a conduit awaits inputs of type `i`, yields outputs of type `o`, performs effects in some monad `m`, and finally returns a value of type `r`.
So for example, `yield :: Monad m => o -> ConduitT i o m ()` yields an output downstream and returns unit `()`, while `await :: Monad m => ConduitT i o m (Maybe i)` awaits an input from upstream, and might return `Nothing` if the upstream source is exhausted.
A function like `mapC :: Monad m => (a -> b) -> ConduitT a b m ()` does both: it awaits each value, transforms it, and yields the result downstream.
Finally, plugging them together with `.|` has to have the correct types: `(.|) :: Monad m => ConduitT a b m () -> ConduitT b c m r -> ConduitT a c m r`.
This lets us plug pieces together like LEGO bricks, confident that our composition is correct. The inputs, outputs and returns all have to match up.
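Writing a conduit of our own is just a matter of combining `await` and `yield`. For instance, here's a sketch that re-implements `mapC` by hand:

```haskell
import Conduit

-- Await each input and yield the transformed value, looping until
-- upstream is exhausted.
myMapC :: Monad m => (a -> b) -> ConduitT a b m ()
myMapC f = do
  mx <- await
  case mx of
    Nothing -> pure ()
    Just x  -> yield (f x) >> myMapC f

main :: IO ()
main = print (runConduitPure (yieldMany [1 .. 5 :: Int] .| myMapC (* 2) .| sinkList))
-- prints [2,4,6,8,10]
```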
Incremental parsing: attoparsec
I used attoparsec to parse the binary protocol. attoparsec is a parser combinator library which supports incomplete parsing of input, i.e. parsing data in chunks.
Here’s a simple, closed example:
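As a stand-in for that example, here's a sketch of a closed (all-at-once) parse, using a hypothetical length-prefixed string format: one byte giving the length, then that many bytes of payload.

```haskell
import qualified Data.Attoparsec.ByteString as A
import qualified Data.ByteString as S

-- Read one length byte, then take exactly that many payload bytes.
lengthPrefixed :: A.Parser S.ByteString
lengthPrefixed = A.anyWord8 >>= A.take . fromIntegral

main :: IO ()
main = print (A.parseOnly lengthPrefixed (S.pack [3, 102, 111, 111]))
-- prints Right "foo"
```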
Which outputs the following:
If I create a streaming program that feeds one-byte chunks into the parser like this:
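A sketch of that program (again using the hypothetical length-prefixed parser, not the original code) looks like this:

```haskell
import qualified Data.Attoparsec.ByteString as A
import qualified Data.ByteString as S

lengthPrefixed :: A.Parser S.ByteString
lengthPrefixed = A.anyWord8 >>= A.take . fromIntegral

main :: IO ()
main = go (A.parse lengthPrefixed) (map S.singleton [3, 102, 111, 111])
  where
    -- Feed one-byte chunks to the current "eater" until done or failed.
    go eat (chunk : rest) =
      case eat chunk of
        A.Done _leftover result -> print result   -- parser satisfied
        A.Fail _ _ err          -> putStrLn err   -- parse error
        A.Partial eat'          -> go eat' rest   -- wants more input
    go _ [] = putStrLn "ran out of input"
```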
We create a loop with an “eater”. The eater eats chunks of bytes. It produces either a done/fail result, or the next eater that is ready for the next chunk.
The output is:
Note however that
The above code can be rewritten in conduit as:
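For instance (a sketch along those lines, using conduit-extra's `Data.Conduit.Attoparsec`):

```haskell
import Conduit
import Data.Conduit.Attoparsec (sinkParser)
import qualified Data.Attoparsec.ByteString as A
import qualified Data.ByteString as S

lengthPrefixed :: A.Parser S.ByteString
lengthPrefixed = A.anyWord8 >>= A.take . fromIntegral

main :: IO ()
main = do
  -- The source yields one-byte chunks; sinkParser feeds them to the
  -- parser until it completes (or throws on a parse error).
  result <- runConduit $
    yieldMany (map S.singleton [3, 102, 111, 111]) .| sinkParser lengthPrefixed
  print result
-- prints "foo"
```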
Looking at this example and the one above it, it’s easy to imagine how the server hooks the socket's byte stream up to the TDS message parser: the source keeps feeding chunks until the parser is satisfied.
Efficient binary writing: bytestring
Finally, to generate messages for the TDS protocol, I can efficiently generate binary streams using the bytestring Builder abstraction. Essentially, this abstraction lets you append strings of bytes into a buffer efficiently. Additionally, these can be written to a socket or file incrementally in a streaming fashion, and are therefore also memory efficient.
Specifically for our use-case, it makes it trivial to output a binary format that involves word8s and word16s, and to distinguish little-endian from big-endian encodings.
For example, to build our example message from above, we’d use the following:
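(A sketch using the hypothetical length-prefixed format from earlier; `word16LE` and `word16BE` are what we'd reach for when an endianness-sensitive field comes up.)

```haskell
import Data.ByteString.Builder
import qualified Data.ByteString as S
import qualified Data.ByteString.Lazy as L

-- The length byte followed by the payload "foo".
message :: Builder
message = word8 3 <> byteString (S.pack [102, 111, 111])

main :: IO ()
main = print (L.unpack (toLazyByteString message))
-- prints [3,102,111,111]
```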
Or it can be written as a conduit:
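A sketch of the conduit version (again with the hypothetical message, not the original code):

```haskell
import Conduit
import Data.ByteString.Builder (Builder, byteString, toLazyByteString, word8)
import Data.Conduit.ByteString.Builder (builderToByteString)
import qualified Data.ByteString as S
import qualified Data.ByteString.Lazy as L

msg :: Builder
msg = word8 3 <> byteString (S.pack [102, 111, 111])

main :: IO ()
main =
  runConduit $
    yield msg                -- source: a single Builder
      .| builderToByteString -- convert Builders to ByteString chunks
      .| stdoutC             -- sink: write the bytes to stdout
```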
That is, produce a source of a builder. Feed it to a conduit which converts builders to streams of bytestrings, then feed that to a sink that writes to stdout.
Implementation: the parser
With all that in mind, the message parser looks like this:
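Since the original parser isn't reproduced here, the following is a sketch of what such a parser can look like, assuming a single-packet message whose layout matches the example shown later in this article (8-byte packet header, 22 bytes of headers we skip, then the query):

```haskell
import qualified Data.Attoparsec.ByteString as A
import qualified Data.ByteString as S
import Data.Text (Text)
import Data.Text.Encoding (decodeUtf16LE)

-- Parse a SQL batch message: skip the packet header fields we don't
-- care about, then decode the rest of the packet as UTF-16LE text.
sqlBatch :: A.Parser Text
sqlBatch = do
  _type <- A.anyWord8                  -- Type (0x01 = SQL batch)
  _status <- A.anyWord8                -- Status
  len <- do hi <- A.anyWord8           -- Length: big-endian Word16
            lo <- A.anyWord8
            pure (fromIntegral hi * 256 + fromIntegral lo :: Int)
  _ <- A.take 4                        -- SPID, PacketID, Window: ignored
  _ <- A.take 22                       -- headers block: ignored
  decodeUtf16LE <$> A.take (len - 8 - 22)

main :: IO ()
main = print (A.parseOnly sqlBatch msg)
  where
    -- total length 34 = 8 header + 22 ignored + 4 bytes of "hi" in UTF-16LE
    msg = S.pack ([1, 1, 0, 34] ++ replicate 4 0 ++ replicate 22 0
                  ++ [0x68, 0, 0x69, 0])
-- prints Right "hi"
```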
We actually ignore the headers and are just interested in the SQL query. We grab that from the stream and decode it as UTF-16 little-endian, which is the text encoding TDS specifies.
An example looks like this, in hex editor format:
Or in tabular format:
|Field|Bytes|Meaning|
|---|---|---|
|Type|`01`|SQL Batch request|
|Headers|`16 00 00 00 12 00 00 00 02 00 00 00 00 00 00 00 00 01 00 00 00 00`|Not important|
|SQL|`0A 00 73 00 65 00 6C 00 65 00 63 00 74 00 20 00 27 00 66 00 6F 00 6F 00 27 00 20 00 61 00 73 00 20 00 27 00 62 00 61 00 72 00 27 00 0A 00 20 00 20 00 20 00 20 00 20 00 20 00 20 00 20 00`|`select 'foo' as 'bar'` in UTF-16LE (with surrounding whitespace)|
The other parsers follow the same patterns. Remember that everything in attoparsec is incremental, so we don’t have to worry about boundaries and packets being spread over multiple chunks. The parser is fed until it’s satisfied.
As it happens, for rendering replies to the client, there are really three server messages of interest: login acceptance, prelogin acknowledgement, and the so-called “general response”, which is what we’re interested in for the purpose of returning results from queries.
Specifically, the implementation to render a response token is:
For returning rows for a query, we have a token containing the row's column values.
Which is easy to render as a builder:
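For example (a sketch, not the original implementation: in real TDS the column encodings depend on the column metadata, so treating every column as a length-prefixed UTF-16LE string is a simplification):

```haskell
import Data.ByteString.Builder
import qualified Data.ByteString as S
import qualified Data.ByteString.Lazy as L
import qualified Data.Text as T
import Data.Text.Encoding (encodeUtf16LE)

-- Render a row: the TDS ROW token byte (0xD1) followed by each
-- column value as a length-prefixed UTF-16LE string.
renderRow :: [T.Text] -> Builder
renderRow cols = word8 0xD1 <> foldMap col cols
  where
    col t =
      let bytes = encodeUtf16LE t
      in word16LE (fromIntegral (S.length bytes)) <> byteString bytes

main :: IO ()
main = print (L.unpack (toLazyByteString (renderRow [T.pack "hi"])))
-- prints [209,4,0,104,0,105,0]
```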
Remember the properties of the Builder abstraction: appends are cheap, and the output can be streamed to the socket incrementally, so rendering rows this way stays memory efficient.
Implementation: ferrying messages
To handle a connection we have two conduits:

- an incoming conduit that parses messages from the client socket; and
- an outgoing conduit that renders replies and writes them back to the socket.
We simply need to consume the incoming stream and dispatch on the message in a loop. Let’s show a more digestible version of that with the simple example we’ve been using.
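Here's a sketch of that loop (the port and the length-prefixed message format are the stand-ins used throughout, not the real TDS handling):

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Conduit
import Data.ByteString.Builder (byteString, toLazyByteString, word8)
import Data.Conduit.Attoparsec (conduitParserEither)
import Data.Conduit.Network
import qualified Data.Attoparsec.ByteString as A
import qualified Data.ByteString as S
import qualified Data.ByteString.Char8 as S8
import qualified Data.ByteString.Lazy as L

-- Hypothetical wire format: a one-byte length then a payload.
message :: A.Parser S.ByteString
message = A.anyWord8 >>= A.take . fromIntegral

-- "Dispatch": reply by echoing the payload back, re-framed as a message.
reply :: S.ByteString -> S.ByteString
reply payload =
  L.toStrict
    (toLazyByteString (word8 (fromIntegral (S.length payload)) <> byteString payload))

main :: IO ()
main =
  runTCPServer (serverSettings 2019 "*") $ \app ->
    runConduit $
      appSource app
        .| conduitParserEither message  -- parse each incoming message
        .| mapC (either (const (S8.pack "bad message")) (reply . snd))
        .| appSink app                  -- send each reply back
```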
Here’s what’s going on:
Here’s what the behaviour looks like:
The server reports:
Just as we’d expected! The fake SQL Server behaves in the same way, but using the parsers of the packets we’ve looked at briefly, and the renderers we’ve also looked at.
It’s worth taking a moment to digest the accomplishment of being able to handle a binary protocol this trivially. That’s thanks to the abstractions provided by attoparsec and conduit.
Testing was easy because we’re using high-level abstractions. The conduit handling code is decoupled from the attoparsec parsing code and the rendering code.
For testing random outputs, we can implement an instance of QuickCheck’s Arbitrary to generate random messages. With this, we can set up a test suite that generates a vector of values, renders them, and checks that parsing the rendered bytes gives back the original values.
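A sketch of that round-trip property, again for the hypothetical length-prefixed message rather than the real TDS types:

```haskell
import Test.QuickCheck
import Data.ByteString.Builder (byteString, toLazyByteString, word8)
import qualified Data.Attoparsec.ByteString as A
import qualified Data.ByteString as S
import qualified Data.ByteString.Lazy as L

newtype Message = Message S.ByteString deriving (Eq, Show)

-- Generate a random payload of up to 255 bytes.
instance Arbitrary Message where
  arbitrary = do
    n <- choose (0, 255)
    Message . S.pack <$> vectorOf n arbitrary

render :: Message -> S.ByteString
render (Message payload) =
  L.toStrict
    (toLazyByteString (word8 (fromIntegral (S.length payload)) <> byteString payload))

parseMessage :: A.Parser Message
parseMessage = Message <$> (A.anyWord8 >>= A.take . fromIntegral)

-- Property: rendering then parsing gives back the original message.
prop_roundTrip :: Message -> Bool
prop_roundTrip msg = A.parseOnly parseMessage (render msg) == Right msg

main :: IO ()
main = quickCheck prop_roundTrip
```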
Exploring memory use
Let’s take a look at the memory use of this server. Here’s a simple server that accepts a number of rows to yield, and reads that many rows from a CSV file:
Here it simply reads a word8 value in from the input as “n thousand”. We then load up the CSV file and stream that many rows back to the client.
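The shape of that streaming read can be sketched like this (the file path and "one row per line" framing are stand-ins for the real CSV handling):

```haskell
import Conduit

-- Stream n * 1000 rows from a file without ever holding more than a
-- chunk of it in memory.
serveRows :: Int -> FilePath -> IO ()
serveRows n path =
  runConduitRes $
    sourceFile path
      .| linesUnboundedAsciiC -- one row per line
      .| takeC (n * 1000)     -- only the rows that were asked for
      .| unlinesAsciiC
      .| stdoutC              -- stand-in for writing to the client socket

main :: IO ()
main = serveRows 1 "rows.csv" -- assumes a rows.csv exists alongside
```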
Let’s compile this with runtime options enabled (GHC’s `-rtsopts` flag):
And run it with statistics output (`+RTS -s`). Here is what happens when we ask for 1,000 rows:
And here is what happens when we ask for 2,000 rows, via:
Here are the differences:
That means that we never had to exceed the size of the largest row in the CSV for our total resident memory usage. Perfect for a server that needs to be memory efficient and shuffle lots of data around!
Let’s go back and look at our original goals: pretend to be SQL Server for existing clients, talk to another service behind the scenes, and stay memory efficient while streaming results.
The development experience was smooth because we were able to focus on the things that matter, like the protocol at hand, and not boundaries, buffer sizes, scale problems or memory issues.
If your company needs Haskell consulting work like this, get in touch!