THE BRITISH LIBRARY

UK Web Archive blog

Information from the team at the UK Web Archive, the Library's premier resource of archived UK websites

Introduction

News and views from the British Library’s web archiving team and guests. Posts about the public UK Web Archive, and since April 2013, about web archiving as part of non-print legal deposit. Editor-in-chief: Jason Webber.

01 February 2018

A New Playback Tool for the UK Web Archive

We are delighted to announce that the UK Web Archive will be working with Rhizome to build a version of pywb (Python Wayback) that we hope will greatly improve the quality of playback for access to our archived content.

What is playback of a web archive?

When we archive the web, just downloading the content is not enough. Data can be copied from the web into an archive in a variety of ways, but making this archive actually accessible takes more than just opening downloaded files in a web browser. Pages and scripts coming out of the archive need to be presented in a way that enables them to work just like the originals, although they aren’t located on their actual servers anymore. Today’s web users have come to expect interactive features and dynamic layouts on all types of websites. Faithfully reproducing these behaviours in the archive has become an increasingly complex challenge, requiring web archive playback software that is on par with the evolution of the web as a whole.
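To give a rough sense of what "presenting pages so they work like the originals" involves, here is a minimal sketch of one small part of playback: rewriting the links in an archived page so that they point back into the archive rather than out to the live web. It is an illustration only, not how OpenWayback or pywb are implemented. The `ARCHIVE_PREFIX` address, the `rewrite_links` function and the timestamp-then-URL layout are assumptions made for the example.

```python
import re
from urllib.parse import urljoin

# Hypothetical archive endpoint for illustration; not a real UKWA or pywb address.
ARCHIVE_PREFIX = "https://archive.example/wayback/"

# Matches href="..." or src="..." with double-quoted values only (a simplification;
# real playback tools also handle CSS, JavaScript, unquoted attributes and more).
LINK_ATTR = re.compile(r'\b(href|src)="([^"]*)"')


def rewrite_links(html, page_url, timestamp):
    """Point every href/src in `html` at an archived copy instead of the live web."""

    def to_archive(match):
        attr, target = match.group(1), match.group(2)
        # Leave in-page anchors and non-HTTP schemes untouched.
        if target.startswith(("#", "mailto:", "javascript:", "data:")):
            return match.group(0)
        # Resolve relative links against the address the page originally had.
        absolute = urljoin(page_url, target)
        return f'{attr}="{ARCHIVE_PREFIX}{timestamp}/{absolute}"'

    return LINK_ATTR.sub(to_archive, html)


if __name__ == "__main__":
    page = '<a href="/about">About</a> <img src="logo.png"> <a href="#top">Top</a>'
    print(rewrite_links(page, "http://www.example.co.uk/news/index.html", "20180201120000"))
```

Even this toy version shows why playback is harder than opening saved files: every relative link has to be resolved against the page's original address, and anything generated by scripts at run time would not be caught by a simple pattern like this one.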

Why change?

Currently, we use the OpenWayback playback system, originally developed by the Internet Archive. But in more recent years, Rhizome have led the development of a new playback engine, called pywb (Python Wayback). This Python toolkit for accessing web archives is part of the Webrecorder project, and provides a modern and powerful alternative implementation that is being run as an open source project. This has led to rapid adoption of pywb, as the toolkit is already being used by the Portuguese Web Archive, perma.cc, the UK National Archives, the UK Parliamentary Archive, and a number of others.

Open development

To meet our needs we need to modify pywb, but as strong believers in open source development, all work will be in the open, and wherever appropriate, we will fold the improvements back into the core pywb project.

If all goes to plan, we expect to contribute the following back to pywb for others to use:

Other UKWA-specific changes, like theming, implementing our Legal Deposit restrictions, and deployment support, will be maintained separately.

Initially we will work with Rhizome to ensure our staff and curators can access our archived material via both pywb and OpenWayback. If the new playback tool performs as expected, we will move towards using pywb to support public access to all our web archives.

23 January 2018

Archiving the UK Copyright Literacy blog

By Louise Ashton, Copyright & Licensing Executive, The British Library
Re-posted (with permission) from copyrightliteracy.org/

We were excited to discover recently that copyrightliteracy.org had been selected for inclusion in the UK Web Archive as an information resource with historical interest. However, even we faced some trepidation when considering the copyright implications of allowing archiving of the site (i.e. not everything on the site is our copyright). Firstly, this allowed us to get our house in order, contact our fellow contributors and ensure we had the correct re-use terms on the site (you can now see a CC-BY-SA licence at the footer of each web page). Secondly, this provided an opportunity for another guest blog post and we are delighted that Louise Ashton, who works in the Copyright & Licensing Department at The British Library, has written the following extremely illuminating post for us. In her current role Louise provides copyright support to staff and readers of the British Library, including providing training, advising on copyright issues in digitisation projects and answering copyright queries from members of the public on any of their 150 million collection items! Prior to this, Louise began her career in academic libraries, quickly specialising in academic liaison and learning technologist roles.

When people think of web archiving, their initial response usually focuses on the sheer scale of the challenge. However, another important issue to consider is copyright; copyright plays a significant role both in shaping web archives and in determining if and how they can be accessed. Most people in the UK Library and Information Science (LIS) sector are aware that in 2013 our legal deposit legislation was extended to include non-print materials which, as well as e-books and online journal articles, also covers websites, blogs and public social media content. This is known as the snappily titled ‘The Legal Deposit Libraries (Non-Print Works) Regulations 2013’ and is enabling the British Library and the UK’s five other legal deposit libraries to collect and preserve the nation’s online life. Indeed, given that the web will often be the only place where certain information is made available, the importance of archiving the online world is clear.

UK Web Archive poster © British Library Board

What is less well known is that, unless site owners have given their consent, the Non-Print Legal Deposit Archive is only available within the reading rooms of the legal deposit libraries themselves and even then can only be accessed using library PCs. Although this mirrors the terms for accessing print legal deposit, because of the very nature of the non-print legal deposit collection (i.e. websites that are generally freely available to anyone with an internet connection) people naturally expect to be able to access the collection off-site. The UK Web Archive offers a solution to this by curating a separate archive of UK websites that can be freely viewed and accessed online by anyone, anywhere, and with no need to travel to a physical reading room. The purpose of the UK Web Archive is to provide permanent online access to key UK websites, with regular snapshots of the included websites being taken so that a website’s evolution can be tracked. There are no political agendas governing which sites are included in the UK Web Archive; the aim is simply to represent the UK’s online life as comprehensively and faithfully as possible (inclusion of a site does not imply endorsement).

However, a website will only be added to the (openly-accessible) UK Web Archive if the website owners’ permission has been obtained and if they are willing to sign a licence granting permission for their site to be included in the Archive and allowing for all versions of it to be made publicly accessible. Furthermore, the website owner also has to confirm that nothing in their site infringes the copyright or other intellectual property rights of any third party and, if their site does contain third party copyright, that they are authorised to give permission on the rights-holders’ behalf. Although the licence has been carefully created to be as user-friendly as possible, the presence of any formal legal documentation is often perceived as intimidating. So even if a website owner is confident that their use of third party content is legitimate they may be reluctant to formally sign a licence to this effect – seeing it in black and white somehow makes it more real! Or, despite best efforts, site owners may have been unable to locate the rights-holders of third party content used in their site and, although they may have been happy with their own risk assessments, this absence of consent prevents them from signing the licence to include the site in the UK Web Archive.

For other website owners this may be the first time they have thought about copyright. Fellow librarians will not be surprised to hear that some people are bewildered to learn that they may have needed to obtain permission to borrow content from elsewhere on the internet for use in their own sites! And then of course there are the inherent difficulties in tracking down rights-holders more generally; unless sites are produced by official bodies it can be difficult to identify who the primary site owners are and in big organisations the request may never make it to the relevant person. Others may receive the open access request but, believing it to be spam, ignore it. And of course site owners are perfectly entitled to refuse the request if they do not wish to take part. Information literacy plays its part and for sites where it is crucial that site visitors access the most recent information and advice (for example websites giving health advice) then for obvious reasons the site owners may not wish for their site to be included.

Jane and Chris asked me to write this blog post because the UK Copyright Literacy website has been selected for potential inclusion in the UK Web Archive. It was felt important that the Archive should contain a site that documented and discussed copyright issues given that copyright and online ethics are such big topics at the moment, particularly with the new General Data Protection Regulation coming into force next May. Another reason why the curators wanted to include the Copyright Literacy blog is that, because the website isn’t hosted in the UK and does not have a UK top level domain (for example .uk or .scot), it had never been automatically archived as part of the annual domain crawl. This is an unfortunate point which affects many websites as it means that many de facto UK sites are not captured unless manual intervention occurs. To try and minimise the number of UK websites that unwittingly evade inclusion, the UK Web Archive team therefore welcomes site nominations from members of the public. Consequently, if you would like to nominate a site to be added to the archive, and in doing so perhaps help to play a role in preserving UK websites, you can do so via https://www.webarchive.org.uk/ukwa/info/nominate.

As a final note, we are pleased to report that Jane and Chris have happily agreed to their site being included which is great news as it means present day copyright musings will be preserved for years to come!

22 December 2017

What can you find in the (Beta) UK Web Archive?

We recently launched a new Beta interface for searching and browsing our web archive collections, but what can you expect to find?

UKWA have been collecting websites since 2005 in a range of different ways, using changing software and hardware. This means that behind the scenes we can't bring all of the material collected into one place all at once. What isn't there currently will be added over the next six months (look out for notices on Twitter).

What is available now?

At launch on 5 December 2017 the Beta website includes all of the websites that have been 'frequently crawled' (collected more often than annually) 2013-2016. This includes a large number of 'Special Collections' selected over this time and a reasonable selection of news media.

What is coming soon?

We are aiming to add 'frequently crawled' websites from 2005-2013 to the Beta collection in January/February 2018. This will add our earliest special collections (e.g. 2005 UK General Election) and should complete all of the websites that we have permission to publicly display.

What will be available by summer 2018?

The largest and most difficult task for us is to add all of the websites that have been collected as part of the 'Legal Deposit' since 2013. We do a vast 'crawl' once per year of 'everything' that we can identify as being a UK website. This includes all .uk (and .london, .scot, .wales and .cymru) domains plus any website we identify as being on a server in the UK. This amounts to tens of millions of websites (and billions of individual assets). Due to the scale of this undertaking, we thank you for your patience.
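For readers curious how such a scope rule might look in practice, here is a minimal sketch of the test described above: a site is in scope if its address ends in one of the listed top-level domains, or if it is known to sit on a server in the UK. The function name `in_domain_crawl_scope` and the `hosted_in_uk` flag are assumptions for illustration. The sketch does not describe the crawler we actually run, and it leaves out the server location lookup entirely.

```python
from urllib.parse import urlsplit

# Top-level domains named in the post as covered by the annual domain crawl.
UK_TLDS = {"uk", "london", "scot", "wales", "cymru"}


def in_domain_crawl_scope(url, hosted_in_uk=False):
    """Return True if `url` matches the scope rule sketched in the post.

    `hosted_in_uk` is a hypothetical flag standing in for whatever server
    location check a real crawler performs; it is not computed here.
    URLs without a scheme (e.g. "www.bl.uk") have no parsed host and are
    only in scope when `hosted_in_uk` is True.
    """
    host = urlsplit(url).hostname or ""
    # Take the last dot-separated label of the host name as the TLD.
    tld = host.rstrip(".").rsplit(".", 1)[-1].lower()
    return tld in UK_TLDS or hosted_in_uk


if __name__ == "__main__":
    for address in ("https://www.bl.uk/", "https://copyrightliteracy.org/"):
        print(address, in_domain_crawl_scope(address))
```

A site such as copyrightliteracy.org fails both tests unless someone flags it by hand, which is why public nominations matter for de facto UK sites.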

We would love to hear about your experiences of using the new Beta service. Let us know here: www.surveymonkey.co.uk/r/ukwasurvey01

By Jason Webber, Web Archive Engagement Manager, The British Library