
UK Web Archive blog

Information from the team at the UK Web Archive, the Library's premier resource of archived UK websites

Introduction

News and views from the British Library’s web archiving team and guests. Posts about the public UK Web Archive and, since April 2013, about web archiving as part of non-print legal deposit. Editor-in-chief: Jason Webber.

23 January 2018

Archiving the UK Copyright Literacy blog

By Louise Ashton, Copyright & Licensing Executive, The British Library
Re-posted (with permission) from copyrightliteracy.org/

We were excited to discover recently that copyrightliteracy.org had been selected for inclusion in the UK Web Archive as an information resource of historical interest. However, even we faced some trepidation when considering the copyright implications of allowing archiving of the site (i.e. not everything on the site is our copyright). Firstly, this allowed us to get our house in order, contact our fellow contributors and ensure we had the correct re-use terms on the site (you can now see a CC-BY-SA licence at the footer of each web page). Secondly, it provided an opportunity for another guest blog post, and we are delighted that Louise Ashton, who works in the Copyright & Licensing Department at The British Library, has written the following extremely illuminating post for us. In her current role Louise provides copyright support to staff and readers of the British Library, including providing training, advising on copyright issues in digitisation projects and answering copyright queries from members of the public on any of its 150 million collection items! Prior to this, Louise began her career in academic libraries, quickly specialising in academic liaison and learning technologist roles.


When people think of web archiving, their initial response usually focuses on the sheer scale of the challenge. However, another important issue to consider is copyright; copyright plays a significant role both in shaping web archives and in determining if and how they can be accessed. Most people in the UK Library and Information Science (LIS) sector are aware that in 2013 our legal deposit legislation was extended to include non-print materials which, as well as e-books and online journal articles, also cover websites, blogs and public social media content. This legislation, snappily titled ‘The Legal Deposit Libraries (Non-Print Works) Regulations 2013’, is enabling the British Library and the UK’s five other legal deposit libraries to collect and preserve the nation’s online life. Indeed, given that the web will often be the only place where certain information is made available, the importance of archiving the online world is clear.

[Image: UK Web Archive poster © British Library Board]

What is less well known is that, unless site owners have given their consent, the Non-Print Legal Deposit Archive is only available within the reading rooms of the legal deposit libraries themselves, and even then can only be accessed on library PCs. Although this mirrors the terms for accessing print legal deposit, because of the very nature of the non-print legal deposit collection (i.e. websites that are generally freely available to anyone with an internet connection), people naturally expect to be able to access the collection off-site. The UK Web Archive offers a solution to this by curating a separate archive of UK websites that can be freely viewed and accessed online by anyone, anywhere, with no need to travel to a physical reading room. The purpose of the UK Web Archive is to provide permanent online access to key UK websites, with regular snapshots of the included websites being taken so that a website’s evolution can be tracked. There are no political agendas governing which sites are included in the UK Web Archive; the aim is simply to represent the UK’s online life as comprehensively and faithfully as possible (inclusion of a site does not imply endorsement).

However, a website will only be added to the (openly accessible) UK Web Archive if the website owner’s permission has been obtained and if they are willing to sign a licence granting permission for their site to be included in the Archive and allowing all versions of it to be made publicly accessible. Furthermore, the website owner also has to confirm that nothing in their site infringes the copyright or other intellectual property rights of any third party and, if their site does contain third party copyright, that they are authorised to give permission on the rights-holders’ behalf. Although the licence has been carefully created to be as user-friendly as possible, the presence of any formal legal documentation is often perceived as intimidating. So even if a website owner is confident that their use of third party content is legitimate, they may be reluctant to formally sign a licence to this effect – seeing it in black and white somehow makes it more real! Or, despite best efforts, site owners may have been unable to locate the rights-holders of third party content used in their site and, although they may have been happy with their own risk assessments, this absence of consent prevents them from signing the licence to include the site in the UK Web Archive.

For other website owners this may be the first time they have thought about copyright. Fellow librarians will not be surprised to hear that some people are bewildered to learn that they may have needed to obtain permission to borrow content from elsewhere on the internet for use in their own sites! And then of course there are the inherent difficulties in tracking down rights-holders more generally; unless sites are produced by official bodies it can be difficult to identify who the primary site owners are, and in big organisations the request may never make it to the relevant person. Others may receive the open access request but, believing it to be spam, ignore it. And of course site owners are perfectly entitled to refuse the request if they do not wish to take part. Information literacy also plays its part: for sites where it is crucial that visitors access the most recent information and advice (for example, websites giving health advice), the site owners may, for obvious reasons, not wish for their site to be included.

The reason Jane and Chris asked me to write this blog post is that the UK Copyright Literacy website has been selected for potential inclusion in the UK Web Archive. It was felt important that the Archive should contain a site that documented and discussed copyright issues, given that copyright and online ethics are such big topics at the moment, particularly with the new General Data Protection Regulation coming into force next May. Another reason why the curators wanted to include the Copyright Literacy blog is that, because the website isn’t hosted in the UK and therefore does not have a UK top-level domain (for example .uk or .scot), it had never been automatically archived as part of the annual domain crawl. This is an unfortunate gap which affects many websites, as it means that de facto UK sites are not captured unless there is manual intervention. To try to minimise the number of UK websites that unwittingly evade inclusion, the UK Web Archive team therefore welcomes site nominations from members of the public. Consequently, if you would like to nominate a site to be added to the archive, and in doing so perhaps help play a role in preserving UK websites, you can do so via https://www.webarchive.org.uk/ukwa/info/nominate.

As a final note, we are pleased to report that Jane and Chris have happily agreed to their site being included, which is great news as it means present-day copyright musings will be preserved for years to come!

22 December 2017

What can you find in the (Beta) UK Web Archive?

We recently launched a new Beta interface for searching and browsing our web archive collections, but what can you expect to find?

UKWA have been collecting websites since 2005 in a range of different ways, using changing software and hardware. This means that, behind the scenes, we can't bring all of the collected material into one place all at once. What isn't there currently will be added over the next six months (look out for notices on Twitter).

What is available now?

At launch on 5 December 2017 the Beta website includes all of the websites that were 'frequently crawled' (collected more often than annually) between 2013 and 2016. This includes a large number of 'Special Collections' selected over this period and a reasonable selection of news media.

[Screenshot: Brexit-related content in the web archive]

What is coming soon?

We are aiming to add 'frequently crawled' websites from 2005-2013 to the Beta collection in January/February 2018. This will add our earliest special collections (e.g. the 2005 UK General Election) and should complete the set of websites that we have permission to display publicly.

What will be available by summer 2018?

The largest and most difficult task for us is to add all of the websites that have been collected as part of 'Legal Deposit' since 2013. Once per year we do a vast 'crawl' of 'everything' that we can identify as being a UK website. This includes all .uk domains (plus .london, .scot, .wales and .cymru) as well as any website we identify as being hosted on a server in the UK. This amounts to tens of millions of websites (and billions of individual assets). Due to the scale of this undertaking we thank you for your patience.
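As a rough sketch of the top-level-domain part of that scoping rule (the 'hosted on a UK server' part would need a separate lookup, which isn't shown), a check might look something like the Python below. The function name and suffix list are illustrative assumptions rather than the crawler's actual code:

from urllib.parse import urlparse

# UK-related suffixes treated as automatically in scope for the annual
# Legal Deposit domain crawl, as listed in the post above.
UK_SUFFIXES = (".uk", ".london", ".scot", ".wales", ".cymru")

def is_uk_tld(url):
    """True if the URL's hostname ends with one of the UK-related suffixes.
    Sites on other TLDs may still be in scope if hosted in the UK, but that
    check is not sketched here."""
    host = (urlparse(url).hostname or "").lower()
    return host.endswith(UK_SUFFIXES)

print(is_uk_tld("https://www.gov.uk/"))             # True
print(is_uk_tld("https://copyrightliteracy.org/"))  # False - would need nominating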

We would love to know your experiences of using the new Beta service, let us know here: www.surveymonkey.co.uk/r/ukwasurvey01

 By Jason Webber, Web Archive Engagement Manager, The British Library

 

05 December 2017

A New (Beta) Interface for the UK Web Archive

The UK Web Archive has a new user interface! Try it now: 

beta.webarchive.org.uk/

[Screenshot: UK Web Archive Beta homepage]

What's new?

  • For the first time you can search both the 'Open UK Web Archive'1 and the 'Legal Deposit Web Archive'2 from the same search box
  • We have improved the search and have included faceting so that it's easier to find what you are looking for
  • A simple, clean design that (hopefully) allows the content to be the focus
  • Easily browsable 'Special Collections' (curated groups of websites on a theme, topic or event)

 What next?

This is just the start of what will be a series of improvements, but we need your help! Please use the beta site and tell us how it went (good, bad or meh) by filling in this short 2 minute survey:

www.surveymonkey.co.uk/r/ukwasurvey01

Thank you!

by Jason Webber, Web Archive Engagement Manager, The British Library

1 The Open UK Web Archive was started in 2005 and comprises approximately 15,000 websites that can be viewed anywhere

2 The Legal Deposit Web Archive was started in 2013 and comprises millions of websites, but these can only be viewed in the Reading Rooms of UK Legal Deposit Libraries

10 November 2017

Driving Crawls With Web Annotations

By Dr Andrew Jackson, Web Archive Technical Lead, The British Library

The heart of the idea was simple. Rather than our traditional linear harvesting process, we would think in terms of annotating the live web, and imagine how we might use those annotations to drive the web-archiving process. From this perspective, each Target in the Web Curator Tool is really very similar to a bookmark on a social bookmarking service (like Pinboard, Diigo or Delicious1), except that as well as describing the web site, the annotations also drive the archiving of that site2.

In this unified model, some annotations may simply highlight a specific site or URL at some point in time, using descriptive metadata to help ensure important resources are made available to our users. Others might more explicitly drive the crawling process, by describing how often the site should be re-crawled, whether robots.txt should be obeyed, and so on. Crucially, where a particular website cannot be ruled as in-scope for UK legal deposit automatically, the annotations can be used to record any additional evidence that permits us to crawl the site. Any permissions we have sought in order to make an archived web site available under open access can also be recorded in much the same way.

Once we have crawled the URLs and sites of interest, we can then apply the same annotation model to the captured material. In particular, we can combine one or more targets with a selection of annotated snapshots to form a collection. These ‘instance annotations’ could be quite detailed, similar to those supported by web annotation services like Hypothes.is, and indeed this may provide a way for web archives to support and interoperate with services like that.3

Thinking in terms of annotations also makes it easier to peel processes apart from their results. For example, metadata that indicates whether we have passed those instances through a QA process can be recorded as annotations on our archived web, but the actual QA process itself can be done entirely outside of the tool that records the annotations.

To test out this approach, we built a prototype Annotation & Curation Tool (ACT) based on Drupal. Drupal makes it easy to create web UIs for custom content types, and we were able to create a simple, usable interface very quickly. This allowed curators to register URLs and specify the additional metadata we needed, including the crawl permissions, schedules and frequencies. But how do we use this to drive the crawl?

Our solution was to configure Drupal so that it provided a ‘crawl feed’ in a machine-readable format. This was initially a simple list of data objects (one per Target) containing all the information we held about that Target, which could be filtered by crawl frequency (daily, weekly, monthly, and so on). However, as the number of entries in the system grew, having the entire set of data associated with each Target eventually became unmanageable. This led to a simplified description that just contains the information we need to run a crawl, which looks something like this:

[
    {
        "id": 1,
        "title": "gov.uk Publications",
        "seeds": [
            "https://www.gov.uk/government/publications"
        ],
        "schedules": [
            {
                "frequency": "MONTHLY",
                "startDate": 1438246800000,
                "endDate": null
            }
        ],
        "scope": "root",
        "depth": "DEEP",
        "ignoreRobotsTxt": false,
        "documentUrlScheme": null,
        "loginPageUrl": null,
        "secretId": null,
        "logoutUrl": null,
        "watched": false
    },
    ...
]

This simple data export became the first of our web archiving APIs – a set of application programming interfaces we use to try to split large services into modular components4.
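As a minimal sketch of how a crawl launcher might consume such a feed, the snippet below fetches a feed URL and flattens out the seed URLs. The endpoint address and its frequency parameter are assumptions for illustration; only the field names shown in the JSON example above come from the actual feed:

import json
from urllib.request import urlopen

# Hypothetical feed location - the real endpoint is internal to the
# Drupal-based Annotation & Curation Tool described above.
FEED_URL = "https://act.example.org/crawl/feed?frequency=MONTHLY"

def load_crawl_feed(feed_url=FEED_URL):
    """Fetch the crawl feed and return the list of Target descriptions."""
    with urlopen(feed_url) as response:
        return json.load(response)

def seeds_for_launch(targets):
    """Flatten each Target's 'seeds' list into one list of URLs to crawl."""
    return [seed for target in targets for seed in target.get("seeds", [])]

targets = load_crawl_feed()
print(len(targets), "targets;", len(seeds_for_launch(targets)), "seed URLs")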

Of course, the output of the crawl engines also needs to meet some kind of standard so that the downstream indexing, ingesting and access tools know what to do. This works much like the API concept described above, but is even simpler, as we just rely on standard file formats in a fixed directory layout. Any crawler can be used as long as it outputs standard WARCs and logs, and puts them into the following directory layout:

/output/logs/{job-identifier}/{launch-timestamp}/*.log
/output/warcs/{job-identifier}/{launch-timestamp}/*.warc.gz

Here, the {job-identifier} specifies which crawl job (and hence which crawl configuration) is being used, and the {launch-timestamp} separates distinct jobs launched using the same overall configuration, reflecting repeated re-crawling of the same sites over time.
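To make that convention concrete, here is a small sketch that finds the WARC files from the most recent launch of a given job. The base path and job name are placeholders; only the warcs/{job-identifier}/{launch-timestamp}/*.warc.gz layout itself comes from the description above, and it assumes launch timestamps sort correctly as plain strings:

from pathlib import Path

def latest_warcs(output_root, job_identifier):
    """Return the WARC files from the most recent launch of one crawl job,
    following the /output/warcs/{job-identifier}/{launch-timestamp}/ layout."""
    job_dir = Path(output_root) / "warcs" / job_identifier
    launches = sorted(d for d in job_dir.iterdir() if d.is_dir())
    if not launches:
        return []
    return sorted(launches[-1].glob("*.warc.gz"))

# Example call with placeholder names:
# latest_warcs("/output", "frequent-monthly")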

In other words, if we have two different crawler engines that can be driven by the same crawl feed data and output the same format results, we can switch between them easily. Similarly, we can make any kind of changes to our Annotation & Curation Tool, or even replace it entirely, and as long as it generates the same crawl feed data, the crawler engine doesn’t have to care. Finally, as we’ve also standardised the crawler output, the tools we use to post-process our crawl data can also be independent of the specific crawl engine in use.

This separation of components has been crucial to our recent progress. By de-coupling the different processes within the crawl lifecycle, each of the individual parts is able to move at its own pace. Each can be modified, tested and rolled out without affecting the others, if we so choose. True, making large changes that affect multiple components does require more careful management of the development process, but this is a small price to pay for the ease with which we can roll out improvements and bugfixes to individual components.

A prime example of this is how our Heritrix crawl engine itself has evolved over time, and that will be the subject of the next blog post.

  1. Although, noting that Delicious is now owned by Pinboard, I would like to make it clear that we are not attempting to compete with Pinboard. 

  2. Note that this is also a feature of some bookmarking sites. But we are not attempting to compete with Pinboard. 

  3. I’m not yet sure how this might work, but some combination of the Open Annotation Specification and Memento might be a good starting point. 

  4. For more information, see the Architecture section of this follow-up blog post 

03 November 2017

Guy Fawkes, Bonfire or Fireworks Night?

What do you call the 5th of November? As a child of the 70s and 80s, I knew it as 'Guy Fawkes' night, and my friends and I might make a 'guy' to throw on the bonfire. It is interesting to see, through an analysis using the UK Web Archive SHINE service, that the popularity of the term 'Guy Fawkes' was overtaken by 'Bonfire night' in 2009. I've included 'Fireworks night' too for comparison.

[Graph: SHINE trend comparison of 'Guy Fawkes', 'Bonfire night' and 'Fireworks night']

Is this part of a trend away from the original anti-Catholic remembrance and celebration towards a more neutral event?

Examine this (and other) trends on our SHINE service.

By Jason Webber, Web Archive Engagement Manager, The British Library

24 October 2017

Web Archiving Tools for Legal Deposit

By Andy Jackson, Web Archive Technical Lead, The British Library - re-blogged from anjackson.net

Before I revisit the ideas explored in the first post in the blog series I need to go back to the start of this story…

Between 2003 and 2013 – before the Non-Print Legal Deposit regulations came into force – the UK Web Archive could only archive websites by explicit permission. During this time, the Web Curator Tool (WCT) was used to manage almost the entire life-cycle of the material in the archive. Initial processing of nominations was done via a separate Selection & Permission Tool (SPT), and the final playback was via a separate instance of Wayback, but WCT drove the rest of the process.

Of course, selective archiving is valuable in its own right, but this was also seen as a way of building up the experience and expertise required to implement full domain crawling under Legal Deposit. However, WCT was not deemed to be a good match for a domain crawl. The old version of Heritrix embedded inside WCT was not considered very scalable, was not expected to be supported for much longer, and was difficult to re-use or replace because of the way it was baked inside WCT.1

The chosen solution was to use Heritrix 3 to perform the domain crawl separately from the selective harvesting process. While this was rather different to Heritrix 1, requiring incompatible methods of set-up and configuration, it scaled fairly effectively, allowing us to perform a full domain crawl on a single server2.

This was the proposed arrangement when I joined the UK Web Archive team, and this was retained through the onset of the Non-Print Legal Deposit regulations. The domain crawls and the WCT crawls continued side by side, but were treated as separate collections. It would be possible to move between them by following links in Wayback, but no more.

This is not necessarily a bad idea, but it seemed a terrible shame, largely because it made it very difficult to effectively re-use material that had been collected as part of the domain crawl. For example, what if we found we’d missed an important website that should have been in one of our high-profile collections, but which, because we didn’t know about it, had only been captured under the domain crawl? Well, we’d want to go and add those old instances to that collection, of course.

Similarly, what if we wanted to merge material collected using a range of different web archiving tools or services into our main collections? For example, for some difficult sites we may have to drive the archiving process manually. We need to be able to properly integrate that content into our systems and present it as part of a coherent whole.

But WCT makes these kinds of things really hard.

If you look at the overall architecture, the Web Curator Tool enforces what is essentially (despite the odd loop or dead-end) a linear workflow (figure taken from here). First you sort out the permissions, then you define your Target and its metadata, then you crawl it (and maybe re-crawl it for QA), then you store it, then you make it available. In that order.

[Figure: Web Curator Tool workflow]

But what if we’ve already crawled it? Or collected it some other way? What if we want to add metadata to existing Targets? What if we want to store something but not make it available? What if we want to make domain crawl material available even if we haven’t QA’d it?

Looking at WCT, the components we needed were there, but tightly integrated in one monolithic application and baked into the expected workflow. I could not see how to take it apart and rebuild it in a way that would make sense and enable us to do what we needed. Furthermore, we had already built up a rather complex arrangement of additional components around WCT (this includes applications like SPT but also a rather messy nest of database triggers, cronjobs and scripts). It therefore made some sense to revisit our architecture as a whole.

So, I made the decision to make a fresh start. Instead of the WCT and SPT, we would develop a new, more modular archiving architecture built around the concept of annotations…

  1. Although we have moved away from WCT, it is still under active development thanks to the National Library of New Zealand, including Heritrix 3 integration!
  2. Not without some stability and robustness problems. I’ll return to this point in a later post.

25 September 2017

Collecting Webcomics in the UK Web Archive

By Jen Aggleton, PhD candidate in Education at the University of Cambridge

As part of my PhD placement at the British Library, I was asked to establish a special collection of webcomics within the UK Web Archive. In order to do so, it was necessary to outline the scope of the collection, and therefore attempt to define what exactly is and is not a digital comic. As anyone with a background in comics will tell you, comics scholars have been debating what exactly a comic is for decades, and have entirely failed to reach a consensus on the issue. The matter only gets trickier when you add in digital components such as audio and animation.


Due to this lack of consensus, I felt it was important to be very transparent about exactly what criteria have been used to outline the scope of this collection. These criteria have been developed through reference to scholarship on both digital and print comics, as well as my own analysis of numerous digital comics.

The scope of this collection covers items with the following characteristics:

  • The collection item must be published in a digital format
  • The collection item must contain a single panel image or series of interdependent images
  • The collection item must have a semi-guided reading pathway1

In addition, the collection item is likely to contain the following:

  • Visible frames
  • Iconic symbols such as word balloons
  • Hand-written style lettering which may use its visual form to communicate additional meaning

The item must not be:

  • Purely moving image
  • Purely audio

For contested items, where an item meets these criteria but still does not seem to be a comic, it will be judged to be a comic only if it self-identifies as such (e.g. a digital picturebook may meet all of these criteria, but self-identifies as a picturebook, not a comic, and so would not be included).

Where the item is an adaptation of a print-born comic, it must be a new expression of the original, not merely a different manifestation, according to FRBR guidelines: www.loc.gov/cds/FRBR.html.

1 Definition of a semi-guided reading pathway: The reader has autonomy over the time they spend reading any particular aspect of the item, and some agency over the order in which they read the item, especially the visual elements. However reading is also guided in the progression through any language elements, and likely to be guided in the order of movement from one image to another, though this pathway may not always be clear. This excludes items that are purely pictures, as well as items which are purely animation.

Alongside being clear about what the collection guidelines are, it is also important to give users information on the item acquisition process – how items were identified to be added to the collection. An attempt has been made to be comprehensive: including well known webcomics published in the UK and Ireland by award-winning artists, but also webcomics by creators making comics in their spare time and self-publishing their work. This process has, however, been limited by issues of discoverability and staff time.

Well known webcomics were added to the collection, along with webcomics discovered through internet searches, and those nominated by individuals after calls for nominations were sent out on social media. This process yielded an initial collection of 42 webcomic sites (a coincidental but nonetheless highly pleasing number, as surely comics do indeed contain the answers to the ultimate question of life, the universe, and everything). However, there are many more webcomics published by UK and Ireland based creators out there. If you know of a webcomic that should be added to our collection, please do nominate it at www.webarchive.org.uk/ukwa/info/nominate.

Jen Aggleton, PhD candidate in Education at the University of Cambridge, has recently completed a three month placement at the British Library on the subject of digital comics. For more information about what the placement has entailed, you can read this earlier blog.

16 August 2017

If Websites Could Talk (again)

By Hedley Sutton, Team Leader, Asian & African studies Reference Services

Here we are again, eavesdropping on a conversation among UK domain websites as to which one has the best claim to be recognized as the most extraordinary…

“Happy to start the ball rolling,” said the British Fantasy Society. “Clue in the name, you know.”

“Ditto,” added the Ghost Club.

“Indeed,” came the response. “However … how shall I put this? … don’t you think we need a site that’s a bit more … well, intellectual?” said the National Brain Appeal.

“Couldn’t agree more,” chipped in the Register of Accredited Metallic Phosphide Standards in the United Kingdom.

“Come off it,” chortled the Pork Pie Appreciation Society. “That would rule out lots of sites straightaway. Nothing very intellectual about us!”

“Too right,” muttered London Skeptics in the Pub.

Before things became heated, the British Button Society made a suggestion. “Perhaps we could ask the Witchcraft & Human Rights Information Network to cast a spell to find out the strangest site?”

The silence that followed was broken by Campaign Bootcamp. “Come on – look lively, you ‘orrible lot! Hup-two-three, hup-two-three!”

“Sorry,” said the Leg Ulcer Forum. “I can’t, I’ll have to sit down. I’ll just have a quiet chat with the Society of Master Shoe Repairers. Preferably out of earshot of the Society for Old Age Rational Suicide.”

“Let’s not get morbid,” said Dream It Believe It Achieve It helpfully. “It’s all in the mind. You can do it if you really try.”

There was a pause. “What about two sites applying jointly?” suggested the Anglo Nubian Goat Society. “I’m sure we could come to some sort of agreement with the English Goat Breeders Association.”

“Perhaps you could even hook up with the Animal Interfaith Alliance,” mused the World Carrot Museum.

“Boo!” yelled the British Association of Skin Camouflage suddenly. “Did I fool you? I thought I would come disguised as the Chopsticks Club.”

“Be quiet!” yelled the Mouth That Roars even louder. “We must come to a decision, and soon. We’ve wasted enough time as it is.”

The minutes of the meeting show that, almost inevitably, the site that was eventually chosen was … the Brilliant Club.

If there is a UK-based website you think we should collect, suggest it here.