Theseus' Data Store
By Andy Jackson, Web Archiving Technical Lead, The British Library
My father used to joke about how he’d had his hammer his whole working life. He’d bought it when he’d started out as a joiner, and decades later it was still going strong. He’d replaced the shaft five times and the head twice, but it was still the same hammer! This simple story of maintenance and renewal springs to mind because a few days ago, we finally managed to replace the most important component of our data store. Our storage cluster has been running near-continuously for almost a decade, but as of now, every single hardware component has been renewed or replaced.
We use Apache Hadoop to provide our main data store, via the Hadoop Distributed File System (HDFS). We mostly like it because it's cheap, robust, and helps us run large-scale analysis and indexing tasks over our content. But we also like it because of how we can maintain it over time.
HDFS runs across multiple computers, all working together to ensure there are at least three copies of any data stored in the system, and that these copies are in separate machines and separate server racks. It runs like a beehive. The 'queen' is called the Namenode, and although it doesn't store any data, it keeps track of where all the data is and orchestrates the ingest and replication processes. The 'worker' nodes just store and maintain their own blocks of data, and send data back and forth between themselves as instructed by the Namenode. The Namenode also provides the interface we use to access the system, referring each client to the right set of worker nodes as files are accessed. All the time, the system calculates checksums of the chunks of data and uses this to verify the integrity of the files.
This architecture was designed to anticipate hardware failure and recover from it, which makes the system much easier to maintain. If a drive, or even a full server fails, we can simply remove it, replace it, and keep an eye on it as the data is re-distributed. As new, higher-capacity drives come along, we can upgrade the drives in each node one-by-one, in a rolling update that grows the capacity of the cluster.
Similarly, over time, we can upgrade the operating system and other supporting software on every node, to make sure we're up to date. Almost all of this can be done while the system is running, without interrupting access. But the exception is the Namenode – as a hive needs its Queen, HDFS needs its Namenode, so we avoid interrupting it unless absolutely required. It had been running on the same hardware all this time, but now it's happily running on a new bit of kit. At last.
Like the Ship of Theseus, every piece has been replaced, but it's still the same store, and the data is still safe. Of course, it's not as easy to manage and as transparently scaleable as Cloud storage, but for on-site storage it does a great job. Rather than having to shift between storage silos every few years, the data is in constant motion, and this design allows the components and support contracts for the different layers to move at different speeds and rates of renewal over the years. This is one of the advantages of open source systems – they can provide a stable interface for a service, decoupled from any particular vendor or hardware, allowing support methods, contracts and contractors to change over time.
But HDFS has strong competition these days. There's many other options, many of which are compatible with the defacto standard, S3 (Simple Storage Service).. Being able to work with the same interface whether storage is local or in the cloud might make all the difference. We're happy with HDFS for now, but we'll be preparing for the day when a new ship comes alongside and it's time to shift the cargo...