THE BRITISH LIBRARY

UK Web Archive blog

3 posts from June 2014

24 June 2014

Your Web Archive Needs You!

With the centenary of the outbreak of World War One taking place this summer, the British Library’s Web Archiving team has been working with colleagues across the Library and beyond to initiate a ‘First World War Centenary Special Collection’ of websites.

The collection is part of a wide range of centenary projects under way at the Library including:

These projects will enable thousands of people to engage with the centenary and will showcase the many significant items held by the Library relating to the war.

The Special Collection
The web archive collection will include a huge variety of websites related to the centenary, including sites for the various events which will be taking place; resources about the history of the war; academic sites on the meaning of the conflict in modern memory and on patterns of memorialisation; and critical reflections on British involvement in armed conflict more generally.

The collection will help researchers find out how the First World War shaped our society and continues to touch our lives at a personal level in our local communities and as a nation.

Archiving began in April 2014 and will continue until 2019. Some examples of websites archived so far include:

We need your help!
Do you know of a website which may be suitable for the First World War Centenary Collection? If so, we would love to hear from you, particularly if you edit or publish a WW1-themed website yourself.

Websites could include those created by museums, archives, libraries, special interest groups, universities, performing arts groups, schools and community groups, family and local history societies or individual publications. It does not cost anything to have your website archived by the British Library and involves no work on your part once nominated.

Please nominate UK-based, WW1-related websites through our nomination form.

If you have HLF funding for a First World War Centenary project, please send the URL (web address) to FWWURL@hlf.org.uk with your project reference number.

See what we have in the WW1 special collection so far.

Written by Nicola Bingham, Web Archivist, British Library

23 June 2014

Researcher in focus: Paul Thomas - UK and Canadian Parliamentary Archives

At the UK Web Archive, we’re always delighted to learn about specific uses that researchers have been able to make of our data. One such case is from the work of Paul Thomas, a doctoral student in political science at the University of Toronto.

Paul writes:

‘The UK Web Archive has been a huge asset to my dissertation. My research examines how backbench parliamentarians in Canada, the UK and Scotland are increasingly cooperating across party lines through a series of informal organizations known as All-Party Groups (APGs). For the UK, the most important source for my research is the registry of APGs that is regularly produced by the House of Commons. The document, which is published in both web and PDF formats, provides details on the more than 500 groups that are in operation, including which MPs and Peers are involved, and what funding groups have received from outside bodies like lobbyists or charities.

‘A key part of the study involved using the registries to construct a dataset that tracked membership patterns across the various groups, and how they changed over time. Unfortunately, each time a new version of the registry is produced, the previous web copy is taken down.’

While the Parliamentary Archives keep old copies of the registry on file, they only do so in PDF – a format that is not so conducive to the extraction of information into a dataset. Paul was able to find and use successive versions from the UK Web Archive going back to 2006, including a number that were missing from the Internet Archive. He was also able to obtain pre-2006 versions from the Internet Archive. ‘Without the UK Web Archive, I would have first needed to purchase the past registries in PDF from the Parliamentary Archives and then painstakingly copy the details on each group into a dataset.’ Overall, Paul writes, ‘the UK Web Archive saved me an enormous amount of time in compiling my data’.

Paul recently gave a paper drawing on this data at the Annual Conference of the Canadian Political Science Association:
http://pauledwinjames.files.wordpress.com/2014/05/paul-thomas-cpsa2014v2.pdf

12 June 2014

How big is the UK web?

The British Library is about to embark on its annual task of archiving the entire UK web space. We will be pushing the button and sending out our bots to crawl every British domain for storage in the UK Legal Deposit Web Archive. How much will we capture? Even our experts can only make an educated guess.

You’ve probably played the time-honoured village fete game of guessing how many jelly beans are in the jar, with a prize for the winner. Well, perhaps we can ask you to guess the size of the UK internet, and the nearest gets... the glory of being right. Some facts from last year might help.

2013 Web Crawl
In 2013 the Library conducted the first crawl of all .uk websites. We started with 3.86 million seeds (websites), which led to the capture of 1.9 billion URLs (web pages, documents, images). All this resulted in 30.84 terabytes (TB) of data! It took the Library’s robots 70 days to collect.
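For anyone wanting a baseline before guessing, the 2013 figures above can be turned into rough averages. The following is only a back-of-the-envelope sketch: the function name `crawl_averages` is made up for illustration, and it assumes 1 TB means 10^12 bytes, which the post does not specify.

```python
# Figures reported above for the 2013 domain crawl.
SEEDS = 3.86e6     # starting websites
URLS = 1.9e9       # captured URLs (pages, documents, images)
SIZE_TB = 30.84    # total data collected, in terabytes
DAYS = 70          # crawl duration

# Assumption: decimal terabytes (10**12 bytes); the post does not say.
BYTES_PER_TB = 10**12


def crawl_averages(seeds, urls, size_tb, days):
    """Return simple per-seed, per-URL and per-day averages for a crawl."""
    total_bytes = size_tb * BYTES_PER_TB
    return {
        "urls_per_seed": urls / seeds,
        "kb_per_url": total_bytes / urls / 1000,
        "million_urls_per_day": urls / days / 1e6,
        "gb_per_day": total_bytes / days / 1e9,
    }


averages = crawl_averages(SEEDS, URLS, SIZE_TB, DAYS)
for name, value in averages.items():
    print(f"{name}: {value:.1f}")
```

On these assumptions the 2013 crawl works out at roughly 490 URLs per seed and about 16 KB per URL, collected at around 27 million URLs a day. Scaling those averages by a guessed number of seeds gives one crude way to arrive at a prediction, though geolocation and de-duplication could shift the result in either direction.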

Geolocation
In addition to the .uk domains, the Library has the scope to collect websites that are hosted in the UK, so we will attempt to geolocate IP addresses within the geographical confines of the UK. This means that we will be pulling in many sites from .com, .net, .info and other Top Level Domains (TLDs). How many extra websites? How much data? We just don’t know at this time.

De-duplication
A huge issue in collecting the web is the large number of duplicates that are captured and saved, something that can add a great deal to the volume collected. Of the 1.9 billion URLs captured in 2013, a significant number are probably copies, and our technical team have worked hard this time to reduce this duplication, or ‘de-duplicate’. We are, however, uncertain at the moment as to how much effect this will eventually have on the total volume of data collected.
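The post does not describe how the Library’s crawler de-duplicates, so the following is only an illustrative sketch of one common approach: comparing a digest of each captured payload and keeping the first copy seen. The function name `deduplicate` and the example.co.uk URLs are hypothetical.

```python
import hashlib


def deduplicate(captures):
    """Split captures into first-seen URLs and duplicates of earlier payloads.

    captures: iterable of (url, payload_bytes) pairs.
    Returns (unique_urls, duplicates), where each duplicate is a
    (duplicate_url, original_url) pair.
    """
    seen = {}        # payload digest -> URL where it was first captured
    unique = []
    duplicates = []
    for url, payload in captures:
        digest = hashlib.sha256(payload).hexdigest()
        if digest in seen:
            # Identical bytes already stored: record a pointer, not a copy.
            duplicates.append((url, seen[digest]))
        else:
            seen[digest] = url
            unique.append(url)
    return unique, duplicates


# Hypothetical captures: two URLs serve byte-identical content.
sample = [
    ("http://example.co.uk/", b"<html>home</html>"),
    ("http://example.co.uk/index.html", b"<html>home</html>"),
    ("http://example.co.uk/about", b"<html>about</html>"),
]
unique, duplicates = deduplicate(sample)
print(unique)
print(duplicates)
```

An exact-match digest like this only catches byte-identical copies; pages that differ by a timestamp or session ID would still be stored twice, which is one reason the effect on total volume is hard to predict.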

Predictions
In summary then, in 2014 we will be looking to collect all of the .uk domain names plus all the websites that we can find that are hosted in the UK (.com, .net, .info etc.), overall a big increase in the number of ‘seeds’ (websites). It is hard, however, to predict what effect these changes will have compared to last year. What the final numbers might be is anyone’s guess. What do you think?

Let us know in the comments below, or on Twitter (@UKWebArchive), YOUR predictions for 2014: number of URLs, size in terabytes (TB) and, if you are feeling very brave, the number of hosts. Organisations like the BBC and NHS, for example, each consist of lots of websites but count as one ‘host’.

We want:

  • URLs (in billions)
  • Size (in terabytes)
  • Hosts (in millions) 

#UKWebCrawl2014

We will announce the winner when all the data is safely on our servers sometime in the summer. Good luck.