This is part two of a series of blog posts on Pantry, a new storage and download system for Haskell packages. You can see part 1.
In March of last year, there was a bug on Hackage that went something like this:
foo-1.2.3.tar.gz, at 5:00am.
foo-1.2.3.tar.gz, at 6:00am.
The problem was resolved (see linked issue), but this made me wonder: is there any reason why checksums should depend on inconsequential artifacts of the tar format, like modification times, file order, user IDs, etc?
Alright, second question. Let’s take the yaml package as an example. It’s got a bunch of modules, each about (let’s just say) 5kb in size, and a fairly sizeable C source file, let’s say 100kb in size. When I’m working on this in Git, I don’t need to store that 100kb of data on each commit. Instead, Git refers to this immutable data via its SHA1. However, each release of yaml to Hackage involves including all 100kb of that C file, and all of the Haskell source modules, in full size, with zero data deduplication. Doesn’t that seem highly inefficient?
And finally, third question. For many tools (including Stack, Stackage, and things like OSS-license inspectors), we need to be able to download and parse the cabal file for the package before downloading the entire package contents. When dealing with packages from Hackage, that’s easy, since we already have the full package index on our system, containing all cabal files. But let’s say we’ve got a package at some arbitrary URL or in a Git repository. The only way to get the cabal file is to download the entire package contents. Can we more efficiently grab just the cabal file?
Alright, enough questions. Time for some answers!
Right now, in almost all cases, a package is provided by a tarball, created via the sdist command. As mentioned, these tarballs contain lots of extra information irrelevant to actually building the package. If we pare things down, it appears that we have relatively simple requirements for a package specification:
Those familiar with it may see a strong overlap with Git’s definition of a tree. That’s not by accident, though there are some differences (left as an exercise to the reader).
It turns out that you can convert most tarballs on Hackage into a representation like the above, with a few exceptions (mentioned below). And with this representation, you get some really nice benefits:
.cabal file for this representation versus a tarball. You’re essentially looking for the only file that matches the rule
\fp -> not ('/' `elem` fp) && ".cabal" `isSuffixOf` fp.
Those are some pretty nice features for a package representation!
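The cabal file rule above is small enough to sketch directly. This is only an illustration: the name findCabalFile and the choice to represent a tree as a plain list of relative paths are assumptions made for the example, not Pantry's actual API.

```haskell
import Data.List (isSuffixOf)

-- | Hypothetical helper: pick the single top-level .cabal file out of a
-- tree's relative file paths. Returns Nothing when there are zero matches
-- or more than one.
findCabalFile :: [FilePath] -> Maybe FilePath
findCabalFile paths =
  case filter isTopLevelCabal paths of
    [fp] -> Just fp
    _    -> Nothing
  where
    -- The rule from the text: no directory separator, and a .cabal suffix.
    isTopLevelCabal fp = not ('/' `elem` fp) && ".cabal" `isSuffixOf` fp
```

With a tarball you would have to stream through the archive to answer the same question. Here it is a filter over the path list, after which only one blob needs to be fetched.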
This representation does, however, come with some downsides, which as far as I can tell always relate to dealing with existing tarballs:
Whether these are dealbreakers or not is a good question. As you probably guessed, this tree structure is what Pantry is using for its primary package representation, and it will convert the other package source types into this representation. With repositories and archives, it seems OK to simply reject input which is not compliant with these requirements.
The question comes down to Hackage. Should Pantry keep to its strict representation of packages as the trees mentioned above, and refuse to work with some packages on Hackage? Or should it put into place a fallback to work directly with the original tarball when it has one of the issues mentioned above?
From my initial testing, it seems like the vast majority of packages will work with this strict tree definition, so I’m tempted to stick to the strict definition. My major concerns are:
Fortunately, extending Pantry to support more than one tree definition is entirely possible, and I’ve already stubbed out such support.
Alright, from this point, let’s assume we’re going to just have one tree representation, and all packages, no matter their source, can be converted into it. Onward!
The tree type above is amenable to a simple binary serialization. So let’s assume we have a function:
data Tree = ...
renderTree :: Tree -> ByteString
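To make this concrete, here is a minimal sketch of what a tree type and its serialization might look like. Everything in it is an assumption for illustration: the Entry type, the field names, and the byte layout are invented here and are not Pantry's real format. The point it demonstrates is that a sorted map gives a rendering that does not depend on file order.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Builder as BB
import qualified Data.ByteString.Char8 as B8
import qualified Data.ByteString.Lazy as BL
import qualified Data.Map.Strict as Map

-- | Hypothetical entry: the hash of a file's contents plus its size.
-- The hash is kept as raw bytes so the sketch needs no crypto library.
data Entry = Entry
  { entryHash :: B.ByteString
  , entrySize :: Int
  }

-- | Hypothetical tree: relative file path -> entry. A Map is traversed in
-- key order, so insertion order cannot leak into the rendered output.
newtype Tree = Tree (Map.Map B.ByteString Entry)

-- | Render each entry as: path, NUL, decimal size, NUL, hash, newline.
-- (An invented layout, chosen only to keep the example short.)
renderTree :: Tree -> B.ByteString
renderTree (Tree m) =
  BL.toStrict . BB.toLazyByteString $ foldMap renderEntry (Map.toAscList m)
  where
    renderEntry (path, Entry h size) =
      BB.byteString path <> BB.word8 0
        <> BB.intDec size <> BB.word8 0
        <> BB.byteString h <> BB.char7 '\n'
```

Paths can be built with B8.pack. Two trees holding the same files render to the same bytes regardless of the order in which the files were added, which is exactly the property the tarball checksums lacked.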
We can now treat this tree representation like any other file, and store it in our Pantry database as yet another blob! This means that when, ultimately, we introduce our network layer, we’ll be able to use the same distributed caching logic for trees. But more importantly, it means we can do this:
data BlobKey = BlobKey !SHA256 !FileSize
getBlobKey :: ByteString -> BlobKey

newtype TreeKey = TreeKey BlobKey
getTreeKey :: Tree -> TreeKey
getTreeKey = TreeKey . getBlobKey . renderTree
And, finally, we can begin to specify packages to Stack and other tools with something like the following:
extra-deps:
- name: foo
  version: 1.2.3
  pantry-tree:
    key: deadbeef9012
    size: 9001
  archive: https://example.com/packages/foo-1.2.3.tar.gz
  sha256: abcdef
  size: 18000
By specifying the Pantry hash information here, we get two nice features:
An astute reader may ask why we need the SHA256 of the tarball itself, or the name and version fields. The former is still useful for detecting early if a tarball has changed, and potentially in the future for finding mirrors of the original tarball. The reason for embedding the redundant name and version information will be the topic of the next post in this series.
One final note: why do we need to include the file size information? The cryptographic hashes are sufficient to ensure we’re getting the right contents! One motivation would be paranoia. As many people know, SHA1 has been (to some extent) broken. However, even with this breakage, it’s still not easily possible to create a SHA1 collision where the two inputs are the same length. Adding file size into the mix reduces the impact of a future attack on our hash function of choice.
However, there’s a more direct threat to address. Assume that one of our Pantry mirrors is compromised. When you connect to it (using the network protocol we haven’t discussed yet) and say “give me the contents that have SHA256 abcdef1234”, it begins to send you an endless stream of random data. What do you do? Until the stream has ended, you have no way of knowing if the server is sending the real data. As a client, you have a few options:
Specifying the file size in addition to the hash is a much more elegant solution. The client knows exactly how much data to consume, and the worst a nefarious server can do is waste the client’s time downloading that much data, after which the client will know that the server is either malfunctioning or nefarious and stop trusting it.
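The size-bounded approach can be sketched as a pure function over a lazy list of chunks. This is a simplified model under stated assumptions: the Fetch type and fetchExactly are hypothetical names, the network stream is stood in for by a list of non-empty chunks, and the hash check that would follow a Complete result is left out.

```haskell
import qualified Data.ByteString as B

-- | Hypothetical result of reading a blob from an untrusted source.
data Fetch
  = Complete B.ByteString  -- exactly the expected number of bytes arrived
  | TooShort               -- the stream ended early
  | TooLong                -- the server sent more than expected
  deriving (Eq, Show)

-- | Consume chunks until the expected size is reached, and stop as soon as
-- the server overshoots. The chunk list is consumed lazily, so an endless
-- stream of non-empty chunks is rejected after a bounded amount of data.
fetchExactly :: Int -> [B.ByteString] -> Fetch
fetchExactly expected = go 0 []
  where
    go n acc chunks
      | n > expected = TooLong
      | otherwise =
          case chunks of
            [] | n == expected -> Complete (B.concat (reverse acc))
               | otherwise     -> TooShort
            (c : rest) -> go (n + B.length c) (c : acc) rest
```

In this model the client never reads more than the expected size plus one chunk. Only a Complete result is worth hashing and comparing against the requested SHA256. Either of the other two outcomes is already enough reason to stop trusting the server.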
Next time on our Pantry tour, we’re going to investigate how packages are specified in Stack today, the problems with that specification, why users will hate a good solution, and how to fix that problem too. Stay tuned!