We at the UK Web Archive have been archiving selected websites since 2004, and throughout we have worked to ensure that the quality of those archived sites is acceptably high. This involves a lot of manual effort; it means inspecting the web pages on each site, tracking down display issues, and re-configuring and re-crawling as necessary. On this basis, we have to date archived over 60,000 individual snapshots of websites over nearly a decade.
Now that the Legal Deposit legislation is in place, we are presented with a formidable challenge. As we move from thousands of sites to millions, what can we do to ensure the quality is high enough? We have the resources to manually inspect a few thousand sites a year, but that's now a drop in the ocean.
At large scale, even fairly basic checks become difficult. When there are only a few crawls running at once, it is easy to spot when the crawl of a single site fails for some unexpected reason. When we have very large numbers of sites being crawled simultaneously, and at varying frequencies, simply keeping track of what is going on at any given moment is not easy, and failed crawls can go unnoticed.
This is also particularly important for those rare occasions when a web publisher contacts us with an issue about our crawling activity. We need to be able to work out straight away what's been going on, and in which crawler process, so that we can modify its behaviour. This is why we began to develop Monitrix, a crawl monitoring component to complement our crawler.
The core idea is quite simple: Monitrix consumes the crawl log files produced by Heritrix3 and, in real time, derives statistics and metrics from that stream of crawl events. That critical information is then made available via a web-based interface.
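The flavour of this processing can be sketched as follows. This is not Monitrix's actual code, just a minimal illustration in Python, assuming the standard whitespace-separated Heritrix3 crawl.log layout (timestamp, status code, size, URI, discovery path, referrer, MIME type, ...); the function names are ours.

```python
from collections import Counter

def parse_crawl_log_line(line):
    """Split one Heritrix3 crawl.log line into the fields a monitor needs.

    Assumes the standard field order: timestamp, fetch status, document
    size, URI, discovery path, referrer, MIME type (further fields ignored).
    """
    parts = line.split()
    return {
        "timestamp": parts[0],
        "status": int(parts[1]),
        "size": int(parts[2]) if parts[2] != "-" else 0,  # "-" means no body
        "uri": parts[3],
        "discovery_path": parts[4],
        "referrer": parts[5],
        "mime": parts[6],
    }

def summarise(lines):
    """Derive simple crawl metrics from a stream of log lines:
    per-status-code counts and the total bytes downloaded."""
    status_counts = Counter()
    total_bytes = 0
    for line in lines:
        record = parse_crawl_log_line(line)
        status_counts[record["status"]] += 1
        total_bytes += record["size"]
    return status_counts, total_bytes
```

In the real system this runs continuously over the tail of the log, with the aggregates written to a database rather than held in memory.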
We initially trialled Monitrix during our first Legal Deposit crawl, relating to the reorganisation of the NHS in England and Wales in April. This worked very well, and the interface allowed us to track and explore the crawler activity as it happened. Simple things, like being able to flip back quickly through the chain of links that brought the crawlers to a particular site, proved very helpful in understanding the crawl's progress.
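That chain-of-links lookup amounts to walking a referrer map backwards towards the crawl seed. A hypothetical sketch (again not Monitrix's own code), assuming each captured URI is mapped to the referrer recorded in the Heritrix3 log, where "-" marks a seed:

```python
def backtrace(uri, referrer_of, limit=50):
    """Follow the chain of links from a URI back towards the crawl seed.

    referrer_of: dict mapping each crawled URI to the URI it was
    discovered from; "-" marks a seed. The limit guards against
    cycles in the referrer data.
    """
    chain = [uri]
    while uri in referrer_of and referrer_of[uri] != "-" and len(chain) < limit:
        uri = referrer_of[uri]
        chain.append(uri)
    return chain
```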
But then came the real challenge: using Monitrix during the domain crawl. The NHS collection contained only 5,500 sites, amounting to just 1.8TB of archived data. In contrast, the domain crawl would eventually include millions of sites and over 30TB of data. Initially, Monitrix worked quite well, but as the crawl went on it became clear that it could not keep up with the sheer volume of data being pushed into it. The total number of URLs climbed into the millions, at one point arriving at a rate of 857 per second. Under this bombardment, Monitrix became slower and slower.
What was the problem? With that twenty-twenty vision that comes only with hindsight, it became abundantly clear that the architecture of the MongoDB database (on which Monitrix is based) was not well suited to this, our largest-scale use case. However, we now believe we have found at least one appropriate alternative technology, Apache Cassandra, and we are in the process of moving Monitrix over to that database system.
Andy Jackson, Web Archiving Technical Lead, British Library