Dispatches from the domain crawl #1
After the blaze of publicity surrounding the advent of Non-Print Legal Deposit, the web archiving team have been busy putting the regulations into practice. This is the first of a series of dispatches from the domain crawl, documenting our discoveries as we begin crawling the whole of the UK web domain for the first time.
Firstly, some numbers. In the first week, we acquired nearly 3.6TB of compressed data (in its raw, uncompressed form, the data is ~40% larger) from some 191 million URLs. Although we staggered the launch as a series of smaller crawls, by the end of the week we reached a sustained rate of 300Mb/s. The bulk of this was from the general crawl of the whole domain, which we kicked off with a list of 3.8 million hostnames.
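A quick back-of-envelope check of these figures (an illustrative sketch in Python, not part of the crawl tooling):

```python
# Week-one figures from the post
compressed_tb = 3.6                 # compressed data acquired
urls = 191_000_000                  # URLs fetched
hostnames = 3_800_000               # hostnames seeding the general crawl

# Raw, uncompressed form is ~40% larger than the compressed data
uncompressed_tb = compressed_tb * 1.4

# Average URLs fetched per seed hostname
urls_per_host = urls / hostnames

print(f"~{uncompressed_tb:.1f} TB uncompressed")  # ~5.0 TB
print(f"~{urls_per_host:.0f} URLs per hostname")  # ~50
```

So the week's haul works out to roughly 5 TB uncompressed, or around 50 URLs per seed hostname on average.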
At this early stage it is also hard to reliably distinguish between an erroneous response for a real linked resource that has disappeared, and an occasion on which access to a real resource was blocked. Over time, we'll learn more about how best to answer some of these questions, which will hopefully start to reveal interesting things about the UK web as a whole.
Roger Coram / Andy Jackson / Peter Webster
Thanks for the details. So that's about 50 URLs per hostname? If you are fishing around for ideas about another post I would be really interested to hear how you assembled the list of hostnames, and are going to be maintaining it over time--although I imagine it is a work in progress. Keep up the good work!
Posted by: edsu | 01 May 2013 at 09:38 AM