THE BRITISH LIBRARY

UK Web Archive blog

1 post from October 2019

04 October 2019

UKWA Website Crawl - One hour in One minute

By Jason Webber, Web Archive Engagement Manager, The British Library

Each year we attempt to collect as much of the UK web space as we can. This typically involves millions of websites and billions of individual assets (images, PDFs, CSS files etc.). We send out our robots across the interwebs looking for websites that we can archive. The bots follow links to pages that have more links to follow, and they keep going until we have archived (almost) everything. But what does it look like to 'crawl' the web? Here we have condensed an hour of live web crawling into a one-minute video:
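For readers who like to see the idea in code, here is a minimal sketch of the "follow links until there is nothing new" loop. It is not our actual crawler: the `TOY_WEB` dictionary and the `crawl` function are hypothetical names invented for illustration, and the "web" is a small in-memory map so nothing is fetched over the network.

```python
from collections import deque

# Hypothetical toy "web": each page maps to the pages it links to.
TOY_WEB = {
    "a.uk/": ["a.uk/about", "b.uk/"],
    "a.uk/about": ["a.uk/"],
    "b.uk/": ["c.uk/"],
    "c.uk/": [],
}


def crawl(seeds, links):
    """Breadth-first crawl: visit every reachable page exactly once."""
    frontier = deque(dict.fromkeys(seeds))  # pages waiting to be visited
    seen = set(frontier)                    # pages already queued or visited
    visited = []
    while frontier:
        page = frontier.popleft()
        visited.append(page)  # stand-in for "archive this page"
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                frontier.append(target)
    return visited


if __name__ == "__main__":
    print(crawl(["a.uk/"], TOY_WEB))
```

The `seen` set is what stops the bot going round in circles when two pages link to each other.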

Every circle is a different website, and every line represents a link that was followed between websites. The size of the circle represents how many pages we visited from that site, and the width of the line represents the number of links we followed.
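As a rough illustration of how circle sizes and line widths like these could be derived, the sketch below groups a list of followed links by website. It rests on assumptions: the `summarise` and `host_of` functions are hypothetical, circle size is taken to mean the number of distinct pages seen per host, and line width is taken to mean the number of followed links between two different hosts. The live visualisation may count things differently.

```python
from collections import Counter
from urllib.parse import urlsplit


def host_of(url):
    """Return the hostname part of an absolute URL."""
    return urlsplit(url).hostname


def summarise(followed_links):
    """Turn (source_url, target_url) pairs into circle sizes and line widths."""
    pages_per_site = {}        # host -> set of distinct pages seen
    links_between = Counter()  # (source host, target host) -> links followed
    for source, target in followed_links:
        src_host, dst_host = host_of(source), host_of(target)
        pages_per_site.setdefault(src_host, set()).add(source)
        pages_per_site.setdefault(dst_host, set()).add(target)
        if src_host != dst_host:  # only links between websites become lines
            links_between[(src_host, dst_host)] += 1
    circle_sizes = {host: len(pages) for host, pages in pages_per_site.items()}
    return circle_sizes, dict(links_between)


if __name__ == "__main__":
    sample = [
        ("https://a.uk/", "https://a.uk/about"),
        ("https://a.uk/", "https://b.uk/"),
        ("https://a.uk/about", "https://b.uk/news"),
    ]
    print(summarise(sample))
```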

If you want to see what we are crawling at the moment, look here (NOTE: this link only works while we are crawling the web): https://jumbled-eggplant.glitch.me/graph.html

You can see what we have captured on our website (www.webarchive.org.uk/ukwa/). However, many of the sites themselves can only be viewed in the reading rooms of UK Legal Deposit Libraries.

Despite our best efforts, we can't collect every UK-owned website, as many are hosted abroad and not under a .uk domain (looking at you, WordPress, Squarespace and Wix). You can nominate a website here: https://www.webarchive.org.uk/en/ukwa/info/nominate