2020 Domain Crawl Update
By Andy Jackson, Web Archiving Technical Lead at the British Library
On the 10th of September the 2020 Domain Crawl got underway. The annual Domain Crawl usually takes about three months to complete. It visits UK-published websites on a UK Top-Level Domain (TLD) such as .uk, .cymru, .scot, .london etc., any web content hosted on a server registered in the UK, as well as all the records manually created by the UK Web Archive teams across the UK Legal Deposit Libraries.
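The TLD part of this scope rule can be sketched as a simple hostname suffix check. This is a hypothetical illustration only: the real crawl scope is configured inside the crawler, and also covers UK-hosted servers and manually curated records, which a suffix check cannot capture.

```python
# Hypothetical sketch of the TLD part of the Domain Crawl scope rule.
# The real scope also includes content hosted on UK-registered servers
# and manually curated records, which this check does not cover.
from urllib.parse import urlparse

UK_TLDS = (".uk", ".cymru", ".scot", ".london")

def in_uk_tld_scope(url: str) -> bool:
    """Return True if the URL's hostname ends with one of the UK TLDs."""
    host = urlparse(url).hostname or ""
    return host.endswith(UK_TLDS)

print(in_uk_tld_scope("https://www.bl.uk/"))        # True
print(in_uk_tld_scope("https://example.com/page"))  # False
```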
Update on crawl management
Due to the billions of URLs involved, the Domain Crawl is the most technically difficult crawl we run. As the crawl frontier grows and grows, the strain starts to show, particularly on the disk space required to store all of the status information about the URLs that have been crawled or are awaiting crawling. Worst of all, mysterious problems with how Heritrix3 manages this information meant that we could not safely stop and restart long crawls. We could usually restart once, but if we restarted again strange errors would appear, and sometimes these would be serious enough to cause the whole crawl to fail. Fortunately, in the last year, we finally tracked this down and updated the Heritrix3 crawler so that it can be safely stopped and restarted multiple times.
This has made managing the crawler much easier: we can now stop and restart the crawl with confidence whenever we need to change the software or hardware setup, which makes tasks like managing disk space far less stressful.
Update on the crawl performance
In the initial phase of the crawl, we threw in the roughly 11 million web hostnames that we have seen in past crawls, which then got whittled down to about 7 million active hosts. After this bumpy start and some system tuning, the crawl settled down and has been pretty consistently processing 250-300 URLs per second. This is acceptable, but isn’t quite as fast as we would like, so we are analysing the crawl while it runs to try and work out where the bottlenecks are.
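At those rates, a back-of-the-envelope projection is straightforward. The per-second figures are the observed ones above; the 90-day duration is taken from the usual three-month crawl length.

```python
# Back-of-the-envelope projection from the observed crawl rate.
SECONDS_PER_DAY = 24 * 60 * 60

for rate in (250, 300):  # observed URLs per second
    per_day = rate * SECONDS_PER_DAY
    per_90_days = per_day * 90  # the crawl usually runs about three months
    print(f"{rate} URLs/s -> {per_day / 1e6:.1f}M URLs/day, "
          f"~{per_90_days / 1e9:.1f}B URLs in 90 days")
```

At the lower end this gives roughly 21.6 million URLs per day, or about 1.9 billion over 90 days; at the upper end, around 2.3 billion — which is why a sustained rate in this range matters for finishing on time.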
What we have collected so far
The figure below shows the URLs collected over time.
The rather jagged start shows where we were able to stop and start the crawl in order to tune the initial hardware setup, and the flatter ‘pauses’ later on are from other maintenance activities like growing the available disk space. The advantage of being able to re-tune the crawler as we go is shown by the way the line gets steeper over time, corresponding to the increased crawl rate.
In terms of bytes downloaded, we see a similar result:
As you can see, we are rapidly approaching 90TB of downloaded data, which corresponds to roughly 50TB of compressed WARC.gz data.
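Those two figures imply a compression ratio of a little over half — a quick check using the approximate totals quoted above:

```python
# Rough compression ratio implied by the approximate totals above.
downloaded_tb = 90   # raw bytes downloaded (approximate)
stored_tb = 50       # compressed WARC.gz on disk (approximate)

ratio = stored_tb / downloaded_tb
print(f"Compressed size is about {ratio:.0%} of the raw download")
```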
Despite starting the crawl relatively late in the year (due to issues around the COVID-19 outbreak), we are making good and stable progress and are on track to download over two billion URLs by the end of the year.
Follow the UK Web Archive on Twitter for the latest updates on the Domain Crawl and other web archiving activities!