UK Web Archive blog

Information from the team at the UK Web Archive, the Library's premier resource of archived UK websites

The UK Web Archive, the Library's premier resource of archived UK websites

4 posts from August 2014

27 August 2014

User driven digital preservation with Interject

When we archive the web, we want to do our best to ensure that future generations will be able to access the content we have preserved. This isn’t only a matter of keeping the digital files safe and ensuring they don’t get damaged. We also need to worry about the software that is required in order to access those resources.

A Digital Dark Age?
For many years, the spectre of obsolescence and the ensuing digital dark age drove much of the research into digital preservation. However, even back in 2006, Chris Rusbridge was arguing that this concern was overblown, and since at least 2007, David Rosenthal has been arguing that this kind of obsolescence is no longer credible.

What are the risks?
The current consensus among those who care for content seems to have largely (but not universally) shifted away from perceiving obsolescence as the main risk we face. Many of us expect the vast majority of our content to remain accessible for at least one or two decades, and any attempt to predict the future beyond the next twenty years should be taken with a large pinch of salt. In the meantime, we are likely to face much more basic issues concerning the economics of storage, and concerning the need to adopt scalable collection management techniques to ensure the content we have remains safe, discoverable, and is accompanied by the contextual information it depends upon.

This is not to say obsolescence is no risk at all, but rather that the scale of the problem is uncertain. Therefore, in order to know how best to take care of our digital history, we need to find ways of gathering more evidence about this issue.

Understanding our collections
One aspect of this is to analyse the content we have, and try to understand how it has changed over time. Examples of this kind of work include our paper on Formats Over Time (more here), and more recent work on embedding this kind of preservation analysis in our full-text indexing processes so we can explore these issues more readily.

But studying the content can only tell you half the story - for the other half, we need to work out what we mean by obsolescence.

Understanding obsolescence
If there is an open source implementation of a given format, then we might say that format cannot be obsolete. But if 99.9% of the visitors to the web archive are not aware of that fact (and even if they were, would not be able to compile and build the software in order to access the content), is that really an accurate statement? If the members of our designated community can’t open it, then it’s obsolete, whether or not someone somewhere has all the time, skills and resources needed to make it work.

Obsolescence is a user experience problem that ends with frustration. So how can we better understand the needs and capabilities of our users, to enable them to help drive the digital preservation process?

How Interject can help
To this end, and working with the SCAPE Project, we have built a prototype service that is designed to help us find the content that users are having difficulties with, and where possible, to provide alternative ways of accessing that content. This prototype service, called Interject, demonstrates how a mixture of feedback systems and preservation actions can be smoothly integrated into the search infrastructure of the UK Web Archive, by acting as an ‘access helper’ for end users.

ZX Spectrum Software
For example, if you go to our historical search prototype and look for a specific file called ‘lostcave.z80’ you’ll see the Internet Archive has a number of copies of this old ZX Spectrum game but, unless you have an emulator to hand, you won’t be able to use them. However, if you click ‘Use our access helper’, the Interject service will inspect the resource, summarise what we understand about it, and where possible offer transformed versions of that resource. In the case of ‘lostcave.z80’, this includes a full browser-based emulation so that you can actually play the game yourself. (Note that this example was partially inspired by the excellent work on browser-based emulated access being carried out by the Internet Archive).

The Interject service can offer a range of transformation options for a given format. For example, instead of running the emulator in your browser, the service can spin up an emulator in the background, take a screenshot, and then deliver that image back to you, like this:


These simple screenshots are not quite as impressive as the multi-frame GIFs created by Jason Scott’s Screen Shotgun, but they do illustrate the potential a simple web API that transforms content on demand.

Early image formats
As the available development time was relatively short, we were only able to add support for a few ‘difficult’ formats. For example, the X BitMap image format was the first image format on the web. However, despite this early and important role this format and the related X PixMap format (for colour images) are not widely supported today and so may require format conversion in order to enable access. Fortunately, there are a number of open source projects that support these formats, and Interject makes them easy to use. See for example image.xbm, xterm-linux.xpm and this embedded equation image shown below as a more modern PNG:


We also added support for VRML1 and VRML97, two early web-based formats for 3D environments that required a browser plugin to explore. Those plugins are not available for modern browsers, and the formats have been superseded by the X3D format. Unfortunately these formats are not backward compatible with each other, and tool support for VRML1 is quite limited. However, we were able to find suitable tools for all three formats, and using Interject, we are able to take a VRML1 file, and then combine a two format conversions (VRML1-to-VRML97 and VRML97-to-X3D) before passing the result to a browser-based X3D renderer, like this.

The future of Interject
Each format we decide to support adds an additional development and maintenance burden, and so it is not clear how sustainable this approach will be in the long term. This is one of the reasons why Interject is open source, and we would be very happy to receive ideas or contributions from other individuals and organisations.

Letting users lead the way
But even with a limited number of transformation services, the core of this idea is to find ways to listen to our users, so we have some chance of finding out what content is ‘obsolete’ to them. By listening when they ask for help, and by allowing our visitors to decide between the available options, the real needs of our designated communities can be expressed directly to us and so taken into account as part of the preservation planning process.

By Andy Jackson, Web Archiving Technical Lead, The British Library

15 August 2014

Archiving ‘screenshots’

Since the passing of Legal Deposit legislation in April of 2013 the UK Web Archive has been generating screenshots of the front-pages of each visited website. The manner in which we chose to store these has changed over the course of our activities, from simple JPEG files on disk (a silly idea) to HAR-based records with WARC metadata records (hopefully a less silly idea). What follows is our reasoning behind these choices.

What not to do
When Legal Deposit legislation passed in April of 2013 we were experimenting with the asynchronous rendering of web pages in a headless browser (specifically PhantomJS) to avoid some of the limitations we’d encountered with our crawler software. While doing so it occurred to us that if we were taking the time to actually render a page in a browser, why not generate and store screenshots?

As we had no particular use-case in mind they were simply stored in a flat file system, the filename being the Epoch time at which they were rendered, e.g:


There’s an obvious flaw here: the complete lack of any detail as to the provenance of the image. Unfortunately that wasn’t quite obvious enough and after approximately 15 weeks and 1,118,904 screenshots we changed the naming scheme to something more useful, e.g:

The above now includes the encoded URL and the epoch timestamp. This was again changed one final time, replacing the epoch timestamp with a human-readable version, e.g:

Despite a more sensible naming convention we were still left with a large number of files sitting in a directory on disk which could not be stored along with our crawled content and as a consequence of this, could not be accessed by normal channels.

A bit better?
A simple solution was to store the images in WARC files (ISO 28500—the current de facto storage format for web archives). We could then permanently archive them alongside our regular web content and access them in a similar fashion. However, the WARC format is designed specifically for storing web content, i.e. a resource referenced by a URL. Our screenshots, unfortunately, didn’t really fit this pattern. Each represented the Webkit rendering of a web page, actually likely to be the result of any number of web resources, not just the single HTML page referenced by the site’s root URL. We therefore did what anyone does when faced with a conundrum of sufficient brevity: we took to Twitter.

The WARC format contains several types of record which we potentially could have used: response, resource, conversion and metadata. The resource and response record types are intended to store the “output of a URI-addressable service” (thanks, @gojomo). Although our screenshots are actually rendered using a webservice, the URI used to generate them would not necessarily correspond to that of the original site, thus losing the relationship between the site and the image. We could have used a further metadata record to tie the two together but others (thanks to @tef and @junklight) had already suggested that a metadata record might be the best location for the screenshots themselves. Bowing to conventional wisdom, we opted for this latter method.

WARC-type: metadata
As mentioned earlier, screenshots are actually rendered using a PhantomJS-based webservice —the same webservice used to extract links from the rendered page. It is at this point worth noting that when rendering a page and outputting details of all requests/responses made/received during said rendering, PhantomJS by default returns this information in HTTP Archive (HAR) format. Although not specifically a formal standard, HAR has become the almost de facto method for communicating HTTP transaction information (most browsers will export data about a page in this format).

The HAR format permits information to be communicated not only about the page, but about each of the component resources used to create that page (i.e. the stylesheets, Javascript, images, etc.). As part of each response entry—each pertaining to one of the aforementioned resources—the HAR format allows you to store the actual content of that response in a “text” field, e.g:


Unfortunately, this is where the HAR format doesn't quite meet our needs. Rather than storing the content for a particular resource we need to store content for the rendered page—potentially the result of several resources and thus not suited to a response record.

Thankfully though, the HAR format does permit you to record information at a page level:


Better still…
Given the above we decided to leverage the HAR format by combining elements from the content section of the response with those of the page. There were two things we realised we could store:

1. As initially desired, an image of the rendered page.
2. The final DOM of the rendered page.

With this latter addition, it occured to us that the final representation of a HTML page—thanks to client-side Javascript altering the DOM—might differ from that which the server originally returned. As this final representation was the version which PhantomJS used to render the screenshot it made sense to attempt to record that too. In order to distinguish this final, rendered version from content of the HAR’s corresponding response record the name was amended to renderedContent.

Similarly named, we stored the screenshot of the whole, rendered page under a new element, renderedElements:


Given that we were trying to store a single screenshot, storing it in an element which is firstly, plural and secondly, an array, might seem a questionable choice. PhantomJS has one further ability that we decided to leverage for future use: the ability to render images of specific elements within a page.

In the course of a crawl there are some things we can't (yet) archive—videos, embedded maps, etc. However, if we can identify the presence of some non-archivable section of a page (an iframe which references a Google Map, for instance), we can at least render an image of it for later reference.
For this reason, the whole-page screenshot is simply referenced by its CSS selector (“:root”), enabling us to include a further series of images for any part of the page.

About those jpgs…
Currently, all screenshots are stored in the above format, in WARCs and will be ingested alongside regularly-crawled data. The older set of JPEGs on disk were converted, using the URL and timestamp in the filename, to the above HAR-like format (obviously lacking details of other resources and the final DOM which had been lost) and added to a series of WARC files.

The very earliest set of images, those simply stored using the Epoch time, are regrettably left as an exercise for future workers in Digital Preservation.

by Roger G. Coram, Web Crawl Engineer, The British Library

11 August 2014

Web Archiving in the JavaScript Age

Among the responses to our earlier post, 'How much of the UK’s HTML is valid?', Gary McGath’s 'HTML and fuzzy validity' deserves to be highlighted, as it explores an issue very close to our hearts: how to cope when the modern web is dominated by JavaScript.

The Age of JavaScript
In particular, he discusses one of the central challenges of the Age of JavaScript: making sure you have copies of all the resources that are dynamically loaded as the page is rendered. We tend to call this ‘dependency analysis’, and we consider this to be a much more pressing preservation risk than bit rot or obsolescence. If you never even know you need something, you’ll never go get it and so never even get the chance to preserve it.

The <script> tag
To give you an idea of the problem, the following graph shows how the usage of the <script> tag has varied over time:


In 1995, almost no pages used the <script> tag, but fifteen years later, over 95% of web pages require JavaScript. This has been a massive sea-change in the nature of the World Wide Web, and web archives have had to react to it or face irrelevance.

For example, for the Internet Archive’s Archive-It Service, they have developed the Umbra tool, which uses a browser testing engine based on Google Chrome to process URLs sent from the Heritrix crawler, extract the additional URLs that content depends upon, and send them back to Heritrix to be crawled.

We use a similar system during our crawls, including domain crawls. However, rendering web pages takes time and resources, so we don’t render every single URL of the billions in each domain crawl. Instead, we render all host home-pages, and we render the ‘catalogued’ URLs that our curators have indicated are of particular interest. The architecture is similar to that used by Umbra, based around our own page rendering service.

We’ve been doing this since the first domain crawl in 2013, and so this seems to be one area where the web archives are ahead of Google and their attempts to understand web pages better.

Furthermore, given we are having to render the pages anyway, we have used this as an opportunity to take screenshots of the original web pages during the crawl, and to add those screenshots to the archival store (we’ll cover more of the details on that in a later blog post). This means we are in a much better position to evaluate any future preservation actions we might require reconstructing the rendering process and we expect these historical screenshots to be of great interest to the researchers of the future.

By Andy Jackson, Web Archiving Technical Lead, The British Library

06 August 2014

Web Archiving Collection Development Policies Roundup

The British Library is only one of dozens of national libraries, universities and other organisations around the world that harvest, preserve and give access to web archive resources. We are also a member of the International Internet Preservation Consortium (IIPC) that has 48 members worldwide, committed to the long-term preservation of internet resources.

A world of web archiving
Following much recent discussion, member institutions of the International Internet Preservation Consortium (IIPC) recently added their collection development policies for web archiving to their website. This page also contains links to policies of some web archiving institutions which are not IIPC members. We are not looking at an exhaustive list for either category, however, it is still interesting to see policies brought together and read them in conjunction.

Different remits
These policies reflect the different scope of web archiving activities among the listed institutions. Some have a national remit (like The British Library), others selectively archive the web or undertake web archiving as a programme or for a project. Some (more fortunate) national institutions are supported by legislative frameworks such as Legal Deposit and can therefore archive the web at scale - the national libraries of France, UK, Finland and Austria belong to this category. Although not explicitly expressed, the lack of such framework has among others contributed to the choice of selective web archiving in some countries.

The policies vary in format, length and detail, but contain some common themes. They explain why institutions undertake web archiving, the scope of the material for collection, how websites are collected, stored and used. Some policies also cover the roles and responsibilities if web archiving is done using a collaborative model. The Bentley Historical Library’s policy is a good example, which includes a section on the responsibilities of the archive, the provider of the service the Library subscribes to, and that of the content owners.

Legal issues
Various legal aspects are covered by these polices, ranging from intellectual properties, permissions to sensitive data within the web archives. The Finnish policy, by far the most comprehensive among all policies, lists and describes legislations relevant to web archiving including the library’s interpretation. It also deals with the topic of sensitive data comprehensively, dividing it into the following categories:

  • Personal Data Illegitimately Published
  • Inspection and Correction of Personal Data
  • Confidential and Secret Information
  • Web Contents that Violate Law
  • Web Contents Illegal to Hold

Common strategies
The National Library of Austria provide a short overview of the common strategies for web harvesting, which are adopted by institutions individually or in some form of combination. Access and use of the archive is another common theme - a number of policies state restricted access to web archives, especially those collected under national legal framework as an exemption to copyright.

Web archiving institutions share the same mission, face similar legal and technical challenges. The IIPC is a great platform to collaborate and learn transferrable lessons and practices. It may be a good idea to develop a common template for policy statements and make sure the most up to date versions are published.

By Helen Hockx-Yu, Head of Web Archiving, The British Library