THE BRITISH LIBRARY

UK Web Archive blog

3 posts from August 2012

30 August 2012

Analysing File Formats in Web Archives

Knowledge of file formats is crucial to digital preservation. Without it, it is impossible to define a preservation strategy. Andy Jackson, Web Archiving Technical Lead at the British Library, explains how to analyse formats used in archived web resources for digital preservation purposes. This is also posted on the Open Planets Foundation blog.

The UK Web Archive recently released a new suite of visualisations and datasets. Amongst these is a format profile, summarising the data formats (MIME types) in the JISC UK Web Domain Dataset (1996-2010). This contains some 2.5 billion HTTP 200 responses, neatly packed into ARC files and stored on our HDFS cluster. Storing it in HDFS allows us to run Map-Reduce tasks over the whole dataset and analyse the results.

Given this infrastructure, my first thought was to use it to test and compare format identification processes by running multiple identification tools over the same corpus. By analysing the depth and coverage of the results, we can estimate which tools are better suited to which types of resources and collections. Furthermore, much as double re-keying can be used to establish 'ground truth' for OCR data, each tool acts as an independent opinion on the format of a resource, and so gives us a little more confidence in their assertions when they are found to coincide. This allows us to focus our attention on where the tools disagree, and helps to ensure that our efforts to improve those tools will have the greatest impact.

To this end, I wrapped up Apache Tika and the DROID binary signature identifier as part of a Map-Reduce task and ran them over the entire corpus. I mapped the results of both to a formalised extended MIME type syntax, such that each PUID has a unique MIME type of the form 'application/pdf; version=1.4', and used that to compare the results of the tools.
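As a rough illustration of that comparison step (not the code used in the actual Map-Reduce job), here is a minimal Python sketch. It assumes each tool's result has already been collected as a string: an extended MIME type from Tika and a PUID from DROID. The lookup table, the function names and the outcome labels are all hypothetical, and the few PUID entries shown are illustrative and should be checked against PRONOM.

```python
from collections import Counter

# Hypothetical, deliberately tiny PUID -> extended MIME type table.
# A real table would cover every PUID; verify these entries against PRONOM.
PUID_TO_MIME = {
    "fmt/18": "application/pdf; version=1.4",
    "fmt/100": "text/html; version=4.01",
}


def parse_mime(value):
    """Split 'type/subtype; key=value' into (base type, parameter dict)."""
    base, *rest = [part.strip() for part in value.split(";")]
    params = {}
    for item in rest:
        if "=" in item:
            key, val = item.split("=", 1)
            params[key.strip().lower()] = val.strip().strip('"')
    return base.lower(), params


def compare(tika_mime, droid_puid):
    """Classify how far the two tools agree about one resource."""
    droid_mime = PUID_TO_MIME.get(droid_puid)
    if droid_mime is None:
        return "droid-unmapped"
    tika_base, tika_params = parse_mime(tika_mime)
    droid_base, droid_params = parse_mime(droid_mime)
    if tika_base != droid_base:
        return "disagree"
    tika_ver = tika_params.get("version")
    droid_ver = droid_params.get("version")
    if tika_ver and droid_ver and tika_ver != droid_ver:
        return "version-mismatch"
    if tika_ver == droid_ver:
        return "agree"
    # Same base type, but only one tool reported a version.
    return "agree-base-only"


def tally(results):
    """Count outcomes over an iterable of (tika_mime, droid_puid) pairs."""
    return Counter(compare(tika, puid) for tika, puid in results)
```

Tallying the outcomes over a whole corpus gives a quick view of where the tools coincide and where they disagree, which is where further attention is most useful.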

Of course, as well as establishing trust in the tools, this kind of data helps us start to explore the way format usage has changed over time, and is a necessary first step in understanding the nature of format obsolescence. As a taster, here is a chart showing the usage of different versions of HTML over time:

As you can see, each version rises to dominance and then fades away, but the fade slows down each time. Across the 2010 time-slice, all the old versions of HTML are still turning up in the crawl. You can find some more information and results on the UK Web Archive site.

Finally, as well as exporting the format identifiers, I also used Apache Tika to extract any information it found about the software or hardware platform the resource was created on. All of this information was combined with the MIME type declared by the server and then aggregated by year to produce a rich and complex longitudinal multi-tool format profile for this collection.
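The aggregation step can be pictured with a small map/reduce-style sketch in Python. This is only an illustration under stated assumptions, not the job that produced the published dataset: the record field names (`timestamp`, `server_mime`, `tika_mime`, `droid_mime`) are hypothetical, and the timestamp is assumed to be a 14-digit `YYYYMMDDhhmmss` string of the kind used in ARC records.

```python
from collections import Counter


def map_record(record):
    """Emit one ((year, server, tika, droid), 1) pair per archived response."""
    year = record["timestamp"][:4]  # assumes a YYYYMMDDhhmmss timestamp
    key = (
        year,
        record["server_mime"],  # MIME type declared by the server
        record["tika_mime"],    # Tika's identification
        record["droid_mime"],   # DROID's PUID, mapped to an extended MIME type
    )
    return key, 1


def reduce_counts(pairs):
    """Sum the counts for each distinct key."""
    totals = Counter()
    for key, count in pairs:
        totals[key] += count
    return totals


def build_profile(records):
    """Aggregate records into a per-year, multi-tool format profile."""
    return reduce_counts(map_record(record) for record in records)
```

Each key in the resulting profile pairs a year with the three opinions on a resource's format, so slicing by year shows how usage shifts over time, and comparing the columns shows where the server and the tools diverge.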

[Figure: Fmt-html-versions, chart of HTML version usage over time]

If this is of interest to you, please go and download the dataset and start exploring it. Please let me know if you find this dataset useful, and please share any interesting results you dig out of the dataset.

22 August 2012

Visualising the UK Web Domain

The UK Web Archive is a selective archive containing Websites selected and preserved by the British Library and partners since 2004.

“.uk” is one of the largest country-code top level domains in the world, with 10 million registrations in March 2012. Selective archiving has many advantages but is costly and fails to capture a comprehensive picture of the national domain. The Legal Deposit Libraries in the UK will be able to collect Web resources at scale when the non-print Legal Deposit legislation is in place, expected sometime in 2013.

The benefits of archived Web resources can only be realised when these are actively used for research, learning and teaching. This was the impetus for us to work with the Joint Information Systems Committee (JISC) and the Internet Archive on a collaborative project which extracted a copy of UK Websites from the Internet Archive’s collection. This research dataset, supported by JISC funding, contains Websites crawled between 1996 and 2010 by the Internet Archive and is the largest historical dataset of the UK domain in existence. One of the objectives of the project is to develop visualisations and services to demonstrate how large-scale Web archive collections can be used for analytics, showing embedded trends and patterns which could not be found by just consulting historical copies of Websites individually.

The visualisations and secondary datasets are now released on the UK Web Archive at http://www.webarchive.org.uk/ukwa/visualisation. The N-gram search is a phrase-usage visualisation tool which charts the monthly occurrence of user-defined search terms or phrases over time, as found in the JISC UK Web domain dataset (1996-2010). The link visualisation shows the relationship between domain suffixes over time. The format profile is a visualisation of the format analysis, summarising the data formats (MIME types) contained within all of the HTTP 200 OK responses. We have also released two downloadable secondary datasets which can be used to develop further applications: a list of MIME types and a postcode index.

The JISC has also funded two additional projects that use the JISC UK Web domain dataset (1996-2010) to develop analytical access to large-scale Web archive collections. These are Analytical Access to the Domain Dark Archive and Big Data: Demonstrating the Value of the UK Web Domain Dataset for Social Science Research. We are running a joint workshop at the Digital Research 2012 Conference: Digital Research Using Web Archives. If you would like to find out more about our projects and Web archiving in general, please come along and join us.

01 August 2012

Diamond Jubilee Collection live

We are pleased to announce that our new web collection about the Queen’s Diamond Jubilee is now live. This collection represents an important historical record of online resources which, it is hoped, will provide a lasting legacy of the event and fulfil our aim to prioritise selection of websites that feature political, cultural, social and economic events of national importance.

The collection, comprising over 130 titles, was initiated in late 2011 by the British Library in collaboration with the Royal Archives and the Institute of Historical Research. Content has been selected by subject specialists from a variety of sources, including the Twittervane tool developed by the British Library, which enables curators to identify sites frequently shared on social media that are relevant to specified search terms. Websites were also selected by members of the public, who submitted nominations on the UK Web Archive’s online nomination form.

Archiving of websites commenced in January 2012, with a focused period of high-frequency, high-intensity crawls in the weeks directly before and after the Jubilee weekend on June 2nd – 5th. All harvested websites were checked for quality and completeness before submission to the archive. We will continue to collect websites until December 2012 in order to capture analysis and debate on the issues around the Jubilee.

The aim of the collection was to cover the event as comprehensively as possible and to reflect a multiplicity of strands and themes including official events, the economic impact, public sentiment and political and constitutional debate. Staff at the Royal Household nominated sites of official interest such as the website of the British Monarchy and the official website of The Queen’s Diamond Jubilee.

Websites of official events initiated by Buckingham Palace have been archived, including the Thames Diamond Jubilee Pageant, the Queen’s Diamond Jubilee Beacons, the Big Lunch and the BBC Concert at Buckingham Palace.

The Jubilee inspired local, unofficial celebrations such as street parties and other community-based events, and a selection of their websites has been captured, for example Newry Drama Festival, the Horsted Keyes Diamond Jubilee Organising Committee and Wetherby’s Diamond Jubilee Website.

Beginning in March 2012, The Queen, accompanied by The Duke of Edinburgh, conducted a series of royal tours throughout the UK to mark the Diamond Jubilee year. We have captured samples of local press coverage of Her Majesty’s regional visits. See, for example, the Queen’s visit to Ebbw Vale, Gwent and the blog by photographer Chris Seddon capturing the Queen’s Diamond Jubilee Tour of Leicester.

As much of the UK geared up to celebrate the Diamond Jubilee, the occasion also prompted debate about the future of the monarchy. Dissenting voices and opposition to the monarchy have been captured in the archive; see, for example, the website of the Jubilee Protest ‘Protest at the Pageant’ and Republic: campaigning for a democratic alternative to the monarchy.

The Mass Observation Project worked with us to record online observations from members of the public about the Diamond Jubilee. The observations were hosted on a blog which has been harvested as part of the Diamond Jubilee collection.

New content will continue to be added until December 2012. The British Library would be delighted to receive your nominations for this collection via our online form.  

Nicola Johnson, Web Archivist 1st August 2012