Back in January, I published a two-part blog post on hash-based package downloads. Some project needs at FP Complete have pushed this to the forefront recently, and as a result I've gotten started on implementing these ideas. I'm hoping to publish regular blog posts on the topic as I continue implementation.
There are a few major goals in the refactoring I'm working on:
- Increased security and reproducibility of build plans
- More shared code across tooling (especially Stackage and Stack)
- Performance improvements (especially in Stack)
- More flexibility for the Stackage team
Today's post won't hit on all of these points, as I'm only going to discuss the first bit of rewrite I've completed: package index management. This work is occurring on the pantry branch of Stack, though you should be well aware that that branch is currently totally unusable outside of the stack update command.
What's a package index?
A package index is a term that comes from the Cabal and Hackage worlds. Hackage itself provides a package index, and cabal-install and Stack both download this index to discover packages. The index itself is a tarball (the 01-index.tar file) containing a single cabal file for each revision of a package/version combination. It also contains some other metadata files, like JSON files providing cryptographic hash information on package tarballs. The 01-index.tar file is intended to be downloaded by hackage-security, which provides both security (signature checking and other protections) and resumable downloads.
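To make the "one cabal file per revision" idea concrete, here's a minimal sketch of how revision numbers could be recovered from the order of entries in the tarball. It assumes entry paths of the form name/version/name.cabal and that a repeated path denotes the next revision of that package/version. The function names (parseCabalPath, assignRevisions) are made up for illustration and are not Stack's actual code.

```haskell
import Data.List (foldl')
import qualified Data.Map.Strict as Map

type PackageName = String
type Version = String
type Revision = Int

-- | Split a string on a separator character.
splitOn :: Char -> String -> [String]
splitOn sep s =
  case break (== sep) s of
    (chunk, [])       -> [chunk]
    (chunk, _ : rest) -> chunk : splitOn sep rest

-- | Recognize paths like "foo/1.2.3/foo.cabal" (assumed layout).
-- Other entries, such as JSON metadata files, yield Nothing.
parseCabalPath :: FilePath -> Maybe (PackageName, Version)
parseCabalPath path =
  case splitOn '/' path of
    [name, version, file]
      | file == name ++ ".cabal" -> Just (name, version)
    _ -> Nothing

-- | Walk entry paths in tarball order. The first time a cabal file
-- path is seen it is revision 0, the next time revision 1, and so on.
assignRevisions :: [FilePath] -> [((PackageName, Version), Revision)]
assignRevisions paths = reverse (snd (foldl' step (Map.empty, []) paths))
  where
    step (seen, acc) path =
      case parseCabalPath path of
        Nothing  -> (seen, acc)
        Just key ->
          let rev = Map.findWithDefault 0 key seen
          in (Map.insert key (rev + 1) seen, (key, rev) : acc)
```

Feeding it the paths foo/1.2.3/foo.cabal, bar/0.1/bar.cabal, and then foo/1.2.3/foo.cabal again would number the two foo entries as revisions 0 and 1.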
The need for an index
In its common use case, Stack discovers available packages via a snapshot configuration (e.g., lts-12.0), which tells it the name, version, and Hackage revision of any package available. As a result, it may seem like Stack doesn't really need the package index. However, it's still necessary for a few things:
- It's the only location for downloading the revised cabal files from Hackage
- When displaying error messages, we sometimes want to provide helpful information on the latest versions available on Hackage
- When using the solver from cabal-install, we must have an index available so that the solver can discover new packages
Stack will automatically download the index today when needed (e.g., a snapshot refers to a revision not yet downloaded locally), and can be told to explicitly download a new index via stack update. Because it is highly inefficient to traverse the tarball each time a lookup needs to occur, Stack will also create a cache file mapping package name/version/revision to the offset inside the tarball where it is located.
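As a rough illustration of such an offset cache, the sketch below computes where each cabal file's contents would sit inside an uncompressed tarball. It assumes every entry has exactly one 512-byte header followed by contents padded to a multiple of 512 bytes, which ignores extended headers. The Location type and buildOffsetCache are hypothetical names, not the format Stack actually serializes.

```haskell
import Data.List (foldl')
import qualified Data.Map.Strict as Map

type PackageName = String
type Version = String
type Revision = Int

-- | Where a cabal file's bytes live inside the uncompressed tarball.
data Location = Location
  { locOffset :: Integer  -- byte offset of the file contents
  , locSize   :: Integer  -- length of the file contents in bytes
  } deriving (Eq, Show)

type OffsetCache = Map.Map (PackageName, Version, Revision) Location

-- | Tar archives are organized in 512-byte blocks.
blockSize :: Integer
blockSize = 512

-- | Round a size up to the next multiple of the block size.
padded :: Integer -> Integer
padded n = ((n + blockSize - 1) `div` blockSize) * blockSize

-- | Build the cache from (key, content size) pairs in tarball order.
-- Assumption: one header block per entry, no extended headers.
buildOffsetCache
  :: [((PackageName, Version, Revision), Integer)] -> OffsetCache
buildOffsetCache = snd . foldl' step (0, Map.empty)
  where
    step (headerOffset, cache) (key, size) =
      let contentOffset = headerOffset + blockSize
          nextHeader    = contentOffset + padded size
      in (nextHeader, Map.insert key (Location contentOffset size) cache)
```

With a map like this, a lookup is a single seek plus a read of locSize bytes, rather than a scan of the whole tarball.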
Stack, like cabal-install, allows alternative package indices to be specified. There are two use cases for this. The first is the “corporate firewall” situation (though it applies to other cases too). Some companies have restrictive firewalls in place which block outgoing connections. Or, alternatively, bandwidth may be throttled, and a local mirror would be preferable. Either way, this first use case amounts to configuring an alternative location to download the Hackage package index from. To get ahead of myself a bit: there's no problem with this use case, and Stack will continue to support a configurable mirror location.
The second case is for providing access to packages which are not on Hackage. I've used this approach in the past myself. It was one of the original ways you could configure cabal-install to use Stackage. With such an alternative index in place, foo-1.2.3 could mean something different on your machine than on mine. (Epic foreshadowment right there.)
Problems with the index
Let's start with the easy one: building up the offset indexing is slow and memory hungry today. I've tried optimizing this in the past, but this is really a pessimal case for Haskell's memory management: lots of binary blobs getting inserted into a HashMap. Chris Done recently reported to me that this can take over 1GB of memory, discovered due to a build failure on a VPS with swap space disabled.
But there's a more fundamental problem with indices. I raised an issue two weeks back about a long-standing concern I've had with package indices. Remember that epic foreshadowment above? Allowing alternative, non-Hackage package indices means that foo-1.2.3 is now ambiguous. And worse yet, because package index configuration can live in a user-wide configuration file, looking at your project's stack.yaml may not reveal this at all.
This kind of trade-off made sense in the past. However, we've got two things in Stack pushing against such behavior:
- Stack's main goal is to provide reproducible build plans. Encouraging a situation where the build plan will be altered this way is an anti-pattern.
- Stack has built-in support for specifying package locations not on Hackage, via archives (HTTPS links to tarballs/zip files), repos (Git and Mercurial), and local file paths. There is no compelling reason for using the package index hack.
Since we'll allow overriding the package index location for mirroring, there's obviously no way to stop a user from providing a location that doesn't mirror Hackage itself. However, we can discourage this by allowing just one package index location instead of the current cascading fallback. We can also drop support for legacy pre-hackage-security 00-index.tar indices, which do not provide security guarantees or access to revision information.
The second change we can make is to be much more thorough about referencing packages via cryptographic hashes instead of by name/version information. This is already necessary for proper reproducibility in a world of Hackage revisions. Part of the ongoing Pantry work will be to automate the process of rewriting configuration files to use cryptographic hashes, which currently is a pain.
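Here's a small sketch of why a hash-based reference removes the ambiguity. A name/version reference resolves to whatever the configured index happens to contain, while a reference carrying a hash can detect that the index disagrees. All of the types and the resolve function are hypothetical, and hashes are plain strings standing in for real cryptographic digests.

```haskell
import qualified Data.Map.Strict as Map

type PackageName = String
type Version = String

-- | Stand-in for a cryptographic hash of a cabal file.
newtype CabalHash = CabalHash String
  deriving (Eq, Ord, Show)

-- | Two ways a configuration file might refer to a package.
data PackageRef
  = ByNameVersion PackageName Version
  | ByHash PackageName Version CabalHash
  deriving (Eq, Show)

-- | A toy index: what a given index claims a name/version contains.
type Index = Map.Map (PackageName, Version) CabalHash

data Resolution
  = Resolved CabalHash
  | Missing
  | HashMismatch CabalHash CabalHash  -- expected, then actual
  deriving (Eq, Show)

-- | Resolve a reference against one index. Only ByHash can notice
-- that the index holds different contents than the author intended.
resolve :: Index -> PackageRef -> Resolution
resolve index (ByNameVersion name version) =
  maybe Missing Resolved (Map.lookup (name, version) index)
resolve index (ByHash name version expected) =
  case Map.lookup (name, version) index of
    Nothing -> Missing
    Just actual
      | actual == expected -> Resolved actual
      | otherwise          -> HashMismatch expected actual
```

Resolving ByNameVersion "foo" "1.2.3" against a Hackage mirror and against a custom index could silently give two different answers. The ByHash form either gives the expected contents or reports a mismatch.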
Alright, so that's change one: you only get one package index in Stack, and it should be a Hackage mirror.
SQLite for the win
The overarching Pantry plans involve referencing many different kinds of files via their cryptographic hashes. We'll be able to query them over the network securely, and cache them locally. For that local cache, we're going to use SQLite, which is a great choice for lots of small files.
The pantry branch of Stack no longer creates that cache with tarball offsets. Instead, when it downloads a new 01-index.tar file from Hackage, it populates an SQLite database with the raw file contents, as well as a table which is essentially Map (PackageName, Version, RevisionNumber) HashOfCabalFile.
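The shape of that data can be sketched with two in-memory maps standing in for the two tables. This is not the actual SQLite schema, just a model of the relationship, under the assumption that the raw contents are keyed by their hash. The hash function is passed in as a parameter because base has no SHA256, and the Cache type and function names are hypothetical.

```haskell
import qualified Data.Map.Strict as Map

type PackageName = String
type Version = String
type RevisionNumber = Int

newtype HashOfCabalFile = HashOfCabalFile String
  deriving (Eq, Ord, Show)

-- | In-memory stand-in for the two tables.
data Cache = Cache
  { cacheBlobs :: Map.Map HashOfCabalFile String
    -- ^ raw cabal file contents, keyed by hash (assumed layout)
  , cacheIndex :: Map.Map (PackageName, Version, RevisionNumber) HashOfCabalFile
    -- ^ the mapping described in the text
  } deriving (Eq, Show)

emptyCache :: Cache
emptyCache = Cache Map.empty Map.empty

-- | Store one cabal file. Identical contents are stored only once,
-- since they hash to the same key.
insertCabalFile
  :: (String -> HashOfCabalFile)  -- ^ hash function, supplied by the caller
  -> (PackageName, Version, RevisionNumber)
  -> String                       -- ^ raw file contents
  -> Cache
  -> Cache
insertCabalFile hash key contents (Cache blobs index) =
  let h = hash contents
  in Cache (Map.insert h contents blobs) (Map.insert key h index)

-- | Look up contents by name/version/revision, going through the hash.
lookupCabalFile
  :: (PackageName, Version, RevisionNumber) -> Cache -> Maybe String
lookupCabalFile key (Cache blobs index) =
  Map.lookup key index >>= \h -> Map.lookup h blobs
```

Going through the hash means the same contents can also be fetched directly by hash, independent of which name, version, or revision pointed at them.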
As I was bragging about a bit on Twitter, this completely solves the high memory usage for cache creation I mentioned above. Now, updating all ~111,000 cabal files from Hackage takes less than 4MB of resident memory.
At first, it seemed like due to inability to detect Hackage rebases (where the 01-index.tar gets updated), we'd need to totally recalculate the cache each time stack update runs. This is the slow behavior we already have today. Fortunately, thanks to some insight from Oleg Grenrus, this turns out to not be necessary, and we can instead track hashes of the tarball. See Hackage issue #779 for the full discussion, as well as potentially alternative implementations like parsing the
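One way tracking tarball hashes could avoid a full recalculation is sketched below. This is an assumption about the approach, not a description of what the pantry branch does. The idea is to remember the size and hash of the tarball processed last time, and if the first that-many bytes of the new tarball still hash to the same value, only the appended bytes need processing. The Snapshot and UpdatePlan types are hypothetical, the tarball is modeled as a String, and the hash function is supplied by the caller.

```haskell
-- | What we remember about the tarball processed last time.
data Snapshot h = Snapshot
  { snapSize :: Int  -- ^ number of bytes processed last time
  , snapHash :: h    -- ^ hash of exactly those bytes
  }

data UpdatePlan
  = UpToDate        -- ^ nothing new to process
  | AppendFrom Int  -- ^ only bytes from this offset onward are new
  | Rebuild         -- ^ old contents changed (a rebase), so start over
  deriving (Eq, Show)

-- | Decide how much work an update needs.
planUpdate
  :: Eq h
  => (String -> h)       -- ^ hash function, supplied by the caller
  -> Maybe (Snapshot h)  -- ^ Nothing if we have never processed an index
  -> String              -- ^ contents of the newly downloaded tarball
  -> UpdatePlan
planUpdate _ Nothing _ = Rebuild
planUpdate hash (Just (Snapshot size old)) tarball
  | length tarball < size           = Rebuild
  | hash (take size tarball) /= old = Rebuild
  | length tarball == size          = UpToDate
  | otherwise                       = AppendFrom size
```

Under this scheme the expensive path is only taken when the previously seen prefix has actually been rewritten.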
There is a downside to this approach, namely we will end up storing all of the cabal files twice. Fortunately, the SQLite storage format with proper table normalization turns out to be pretty good, resulting in about 0.5GB of storage (around the same as the 01-index.tar file itself). However, when we get to Pantry's network layer in later posts, we'll see that in many common cases, we won't need to download the full index at all, saving both bandwidth and disk space. For now, we're treating disk space as a cheap commodity, which is basically in line with how all of Haskell tooling behaves.
Besides the advantages above, some other nice outcomes of this are:
- No need for loading up a large binary offset cache each time Stack runs. We can instead use SQLite's intelligent indexing capabilities.
- To go along with the above: we're not relying on any Haskell-specific binary serialization, which can change between versions of Stack. This means less time wasted recalculating that cache. This likely affects Stack developers more than anyone else.
- This provides the potential for a unified interface for looking up cabal files for packages coming from any location. I haven't implemented this yet, but it's coming down the pipeline for a future blog post.
Now that we're caching the contents of the cabal files themselves in the SQLite database, the next thing will be caching the contents of the tarballs as well. This raises some interesting design questions regarding whether we cache the full original tarballs as they are, or normalize to a more compact format to allow for more data sharing. After weighing the options, we're going to go with the latter. I've already implemented a proof-of-concept for this which works quite well. Now I need to integrate that with the Stack code base.
If you're interested in the work going on here and would like to discuss, come hit me up on Stack's Gitter channel.