Launching the UK Web Archive 2020 Annual Domain Crawl
By Helena Byrne, Curator of Web Archives at the British Library
Today (10th September 2020) the UK Web Archive team will be pushing the big red button to kickstart the annual Domain Crawl of the UK webspace. The current coronavirus pandemic will no doubt feature strongly in this year’s crawl. This will complement the curated collection that the web archive teams across the UK Legal Deposit Libraries are contributing to. The British Library and the National Library of Scotland are also selecting websites for the International Internet Preservation Consortium (IIPC) Content Development Group (CDG) Novel Coronavirus (COVID-19) collection.
What we collect
The UK Web Archive has been archiving UK published websites on a selective basis since 2005 and in 2020 is celebrating #15YearsOfUKWA. It was not until the Non-Print Legal Deposit Regulations (NPLD) were implemented in April 2013 that we were able to run a broad crawl over the UK webspace, and Domain Crawl 2020 is the seventh that has taken place. The crawl covers anything with a .uk or other UK geographic Top Level Domain (TLD), such as .scot, .cymru or .london. It also includes websites on other TLDs that have been registered in the UK or that have been manually selected.
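As a rough illustration of the TLD part of this scope rule, here is a minimal Python sketch. It is not the crawler's actual configuration: the function name in_uk_scope, the UK_TLDS tuple (limited to the TLDs named in this post) and the extra_hosts parameter (standing in for UK-registered or manually selected sites on other TLDs) are all assumptions made for the example.

```python
from urllib.parse import urlsplit

# Only the TLDs named in this post; the real crawl scope may include others.
UK_TLDS = (".uk", ".scot", ".cymru", ".london")


def in_uk_scope(url, extra_hosts=frozenset()):
    """Return True if the URL's host ends in a listed UK TLD
    or appears in extra_hosts (a hypothetical list of hosts on
    other TLDs that are UK-registered or manually selected)."""
    host = (urlsplit(url).hostname or "").lower()
    return host.endswith(UK_TLDS) or host in extra_hosts


print(in_uk_scope("https://www.bl.uk/"))
print(in_uk_scope("https://example.com/"))
print(in_uk_scope("https://example.com/", extra_hosts={"example.com"}))
```

Working out whether a site on another TLD is registered in the UK needs information from outside the URL, which is why the sketch takes those hosts as a ready-made set.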
NPLD came into effect on 6th April 2013, and the British Library hosted a special event to launch the first Domain Crawl. This was widely covered in the national press, and you can still watch a short video from the event on The Guardian website.
How much data is collected in the Domain Crawl?
The Domain Crawl usually runs for three months and starts at a different time each year to avoid seasonal biases. Roughly 5-10 million hosts (websites) are archived every year. However, the amount of data collected varies from year to year, and the way the data is collected and stored also changes over time. We compress the data we store, and as technology develops the amount of data that can be compressed into one terabyte changes. Last year 63.7 TB of compressed data was collected, bringing the total collected during Domain Crawls from 2013 to 2019 to 477.62 TB.
When can I view this content?
Due to the enormous amount of data collected each year from the annual Domain Crawl and our Frequent Crawls, there is a significant lag between when content is archived and when it is made available through the UK Web Archive website. The Frequent Crawl data collected from 2013-2019 was 250.34 TB, bringing the combined total to 727.96 TB of compressed data. To make searching content easier, the website allows you to search across all the Selectively Crawled content from 2005 to 2013 as well as the Frequent Crawl content from 2013 to 2017 and the Domain Crawl content from 2013 to 2015.
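For readers who want to check the totals, the short Python sketch below adds up the figures quoted in this post. The variable names are ours, and Decimal is used only so the two-decimal figures add up exactly.

```python
from decimal import Decimal

# Compressed data collected 2013-2019, as quoted in this post (in TB).
domain_crawl_tb = Decimal("477.62")
frequent_crawl_tb = Decimal("250.34")

combined_tb = domain_crawl_tb + frequent_crawl_tb
print(combined_tb)  # 727.96

# Share of the Domain Crawl total that came from the 2019 crawl alone.
share_2019 = Decimal("63.7") / domain_crawl_tb
print(f"{share_2019:.1%}")
```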
Under the Non-Print Legal Deposit (NPLD) Regulations 2013, we can archive all UK published websites, but we are only able to make them available to people outside the Legal Deposit Libraries Reading Rooms if the website owner has given permission.
Due to the NPLD Regulations, access to the archived content is a mix of open and onsite access. The ‘Viewable only on Library premises’ message on individual records indicates that you have to visit one of the six UK Legal Deposit Libraries to view that content. The UK Legal Deposit Libraries are the British Library, National Library of Scotland, National Library of Wales, Bodleian Libraries, Cambridge University Library and Trinity College Dublin Library.
Follow the UK Web Archive on Twitter for the latest updates on the domain crawl and other web archiving activities!