THE BRITISH LIBRARY

UK Web Archive blog

02 September 2015

2015 UK Domain Crawl has started

 

We are proud to announce that the 2015 UK Domain Crawl has started!

Over the coming weeks, our web crawler will visit every website in the UK, download it and keep a copy safe on the British Library archive servers.

Image: Robot_icon.svg by Bilboq (Own work) [Public domain], via Wikimedia Commons
https://commons.wikimedia.org/wiki/File%3ARobot_icon.svg

Previous crawls

The first ever UK Domain Crawl was run in 2013. It resulted in:

  • 3.8 million seeds (starting URLs)
  • 31TB data
  • 1.9 billion web pages and other assets

The 2014 crawl built on that experience and yielded:

  • 20 million seeds
  • GeoIP check of UK-hosted websites (2.5 million seeds)
  • 56TB data
  • 2.5 billion webpages and other assets
  • including: 4.7GB of viruses and 3.2TB of screenshots

Guesswork

What will the 2015 crawl be like? Will we find more URLs? Surely the web grows every day, but by how much? Will there be more data? Will we have more virus content?
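
As a starting point for guessing, here is a minimal sketch that works out how much each headline figure grew between the 2013 and 2014 crawls. It uses only the numbers quoted above; the dictionary layout and the name growth_factor are illustrative choices of ours, not part of any crawler tooling, and two data points are far too few to make a real forecast.

```python
# Figures quoted in this post for the two previous UK Domain Crawls.
# The key names are illustrative, chosen only for this sketch.
crawls = {
    2013: {"seeds_millions": 3.8, "data_tb": 31, "assets_billions": 1.9},
    2014: {"seeds_millions": 20, "data_tb": 56, "assets_billions": 2.5},
}


def growth_factor(metric, start=2013, end=2014):
    """Return how many times larger `metric` was in `end` than in `start`."""
    return crawls[end][metric] / crawls[start][metric]


# Print one growth factor per metric, rounded to two decimal places.
for metric in crawls[2013]:
    print(f"{metric}: x{growth_factor(metric):.2f}")
```

By this measure the seed list grew roughly 5.3 times, the stored data about 1.8 times, and the number of pages and other assets about 1.3 times. Much of the jump in seeds likely reflects changes in how the seed list was built rather than growth of the web itself, so treat these ratios as a conversation starter, not a prediction.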

Tweet your suggestions and thoughts about the UK Domain Crawl to @UKWebArchive or use the hashtag #UKWebCrawl2015.

 

Homepage Crawl Log Flypast © Andy Jackson

 

 
