THE BRITISH LIBRARY

UK Web Archive blog

Information from the team at the UK Web Archive, the Library's premier resource of archived UK websites

Introduction

News and views from the British Library’s web archiving team and guests. Posts about the public UK Web Archive and, since April 2013, about web archiving as part of non-print legal deposit. Editor-in-chief: Jason Webber.

21 March 2019

Save UK Published Google + Accounts Now!

The fragility of social media data was highlighted recently when Myspace accidentally deleted users’ audio and video files without warning. This almost certainly resulted in the loss of many unique and original pieces of work. It is another reminder that online social media platforms should not be treated as archives: if content is important to you, it should also be stored elsewhere. The UK Web Archive can play a role in this and we do what we can to preserve websites and selected social media. We do, however, need your help!

Google+
If you have a Google+ account you will have seen the warning that the service is shutting down on 2 April 2019 and that users are advised to download any data they want to keep by 31 March 2019.

However, it’s not easy to know how to preserve data from social media accounts, and this data, taken out of the context of the platform that hosted it, often doesn’t give the full picture. In a previous blog post we outlined the challenges involved in archiving social media. Currently the most widely represented social media platform in the UK Web Archive is Twitter, followed by Facebook (which we haven’t been able to capture successfully since 2015), along with a limited amount of Instagram, Weibo, WeChat and Google+.

Under the 2013 Non-Print Legal Deposit Regulations we can legally only collect digital content published in the UK. As these platforms are hosted outside the UK there is no automated way to identify UK accounts, so a person has to look through and identify each profile that is added. In general, these are profiles of politicians, public figures, people renowned in their field of study, campaign groups and institutions.

So far, we have only a handful of Google+ profiles in the UK Web Archive but we are keen to have more.

How to save your Google+ data
If you have a Google+ profile or know of other profiles published in the UK that you think should be preserved, fill in our nomination form before 29 March 2019: https://www.webarchive.org.uk/en/ukwa/info/nominate

If the profiles you want to archive are published outside the UK, you can use the ‘save a website now’ function on the Internet Archive website: https://archive.org/web/

By Helena Byrne, Curator of Web Archiving, The British Library

02 January 2019

Extracting Place Names from Web Archives at Archives Unleashed Vancouver

By Gethin Rees, Lead Curator of Digital Mapping, The British Library

I recently attended the Archives Unleashed hackathon in Vancouver. The fantastic Archives Unleashed project aims to help scholars research the recent past by using big data from web archives. The project organises a series of datathons where researchers work collaboratively with web archive collections over the course of two days. The participants divide into small teams, each aiming to produce a piece of research using the archives that they present at the end of the event in competition for a prize. One of the most important tools that we used in the datathon was the Archives Unleashed Toolkit (AUT).


The team I was on chose to use a dataset documenting a series of wildfires in British Columbia in 2017 and 2018 (ubc-bc-wildfires). I came to the datathon with an interest in visualising web archive data geographically: place names, or toponyms, contained in the text from web pages would form the core of such a visualisation. I had little experience of natural language processing before the datathon but, keen to improve my Python skills, I decided to take on the challenge in the true spirit of unleashing archives!

My plan to produce such a visualisation consisted of several steps:

1) Pre-process the web archive data (Clean)
2) Extract named entities from the text (NER)
3) Determine which are place names (Geoparse)
4) Add coordinates to place names (Geocode)
5) Visualise the place names (Map)

This blog post is concerned primarily with steps 2 and 3.

An important lesson from the datathon for me is that web archive data are very messy. In order to get decent results from steps 2 and 3 it is important to clean the data as thoroughly as possible. Luckily, the AUT contains several methods that can help to do this (outlined here). The analyses that follow were all run on the output of the AUT ‘Plain text minus boilerplate’ method.

There is a wealth of options available for steps 2 and 3; the discussion that follows does not aim to be exhaustive but to evaluate the methods we attempted in the datathon.

AUT NER

The first method we attempted was the AUT NER method (discussed here). The AUT does a great job of packaging up the Stanford Named Entity Recognizer for easy use with a simple Scala command. We ran the method on the AUT derivative of the 2017 section of our wildfires dataset (around 300 MB) using the powerful virtual machines helpfully provided by the organisers. However, we found it difficult to get results, as the analysis took a long time and often crashed the virtual machine. These problems persisted even when running the NER method on a small subset of the wildfires dataset, making it difficult to use on even a smallish set of WARCs.

The results came back in the following format:

    (20170809,dns:www.nytimes.com,{"PERSON":[],"ORGANIZATION":[],"LOCATION":[]})

This output required further processing with a simple Python script.
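To give a flavour of that post-processing, a minimal Python sketch along the following lines would pull the location arrays out of lines in the format shown above (the file name 'ner-output.txt' is a placeholder, and the parsing assumes one record per line):

import json
import re

def parse_aut_ner_line(line):
    # Each line looks like:
    # (20170809,dns:www.nytimes.com,{"PERSON":[],"ORGANIZATION":[],"LOCATION":[]})
    match = re.match(r'^\((\d+),([^,]+),(\{.*\})\)\s*$', line.strip())
    if match is None:
        return None
    crawl_date, domain, entities_json = match.groups()
    entities = json.loads(entities_json)
    return crawl_date, domain, entities.get('LOCATION', [])

# 'ner-output.txt' stands in for whatever file holds the NER results
locations = []
with open('ner-output.txt', encoding='utf-8') as f:
    for line in f:
        parsed = parse_aut_ner_line(line)
        if parsed:
            locations.extend(parsed[2])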

When we did obtain results, the “LOCATION” arrays seemed to contain only a fraction of the total place names that appeared in the text.

AUT
- Positives: Simple to execute, tailored to web archive data
- Negatives: Time consuming, processor intensive, output requires processing, not all locations returned

Geoparser

So we next turned our attention to the Edinburgh Geoparser and the excellent accompanying tutorial that I have used to great effect on other projects. Unfortunately the analysis resulted in several errors which prevented the Geoparser from returning results, and during the time available in the datathon we were not able to resolve them. The Geoparser appeared unable to deal with the output of AUT’s ‘Plain text minus boilerplate’ method. I attempted other ways of cleaning the data, including changing the encoding and removing control characters. The following Python commands:

import re

# Read the plain-text derivative, stripping any UTF-8 byte order mark
s = open('9196-fulltext.txt', mode='r', encoding='utf-8-sig').read()
# Remove ASCII control characters and trailing whitespace
s = re.sub(r'[\x00-\x1F]+', '', s)
s = s.rstrip()

removed these errors:

Error: Input error: Illegal character <0x1f> immediately before file offset 6307408
in unnamed entity at line 2169 char 1 of <stream>
Error: Expected whitespace or tag end in start tag
in unnamed entity at line 4 char 6 of <stream>

However, the following error remained, which we could not fix even after breaking the text into small chunks:

Error: Document ends too soon
in unnamed entity at line 1 char 1 of <stream>
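
For reference, the chunking itself is straightforward; a minimal sketch might look like the following (the chunk size and file names are arbitrary assumptions, and the split ignores word boundaries):

def write_chunks(text, chunk_size=50000, prefix='chunk'):
    # Write the text out as a series of smaller files, e.g. chunk-0000.txt,
    # so each piece can be fed to the Geoparser separately
    for i in range(0, len(text), chunk_size):
        with open(f'{prefix}-{i // chunk_size:04d}.txt', 'w', encoding='utf-8') as out:
            out.write(text[i:i + chunk_size])

write_chunks(s)  # 's' is the cleaned text from the snippet above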

I would be grateful for any input on how to overcome this error, as I would love to use the Geoparser to extract place names from WARC files in the future.

Geoparser
- Positives: well-documented, powerful software. Fairly easy to use. Excellent results with OCR or plain text.
- Negatives: didn’t seem to deal well with the scale and/or messiness of web archive data.

NLTK

My final attempt to extract place names involved using the Python NLTK library with the following packages: 'averaged_perceptron_tagger', 'maxent_ne_chunker' and 'words'. The initial aim was to extract the named entities from the text. A preliminary script designed to achieve this can be found here.
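
A minimal sketch of this kind of extraction with NLTK might look like the following (note it also assumes the 'punkt' tokeniser models, which are not in the package list above):

import nltk

# One-off downloads of the models used below
for package in ('punkt', 'averaged_perceptron_tagger', 'maxent_ne_chunker', 'words'):
    nltk.download(package)

def extract_named_entities(text):
    # Tokenise, part-of-speech tag and chunk the text, then collect
    # every chunk that NLTK labels as a named entity
    entities = []
    for sentence in nltk.sent_tokenize(text):
        tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sentence)))
        for subtree in tree:
            if hasattr(subtree, 'label'):
                name = ' '.join(token for token, pos in subtree.leaves())
                entities.append((name, subtree.label()))
    return entities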

This extraction does not separate place names from other named entities such as proper nouns, and therefore a second stage involved checking whether the entities returned by NLTK were present in a gazetteer. We found a suitable gazetteer with a wealth of different information, and in the final hours of the datathon I attempted to hack together something to match the NER results with the gazetteer.
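
A rough sketch of that matching step, assuming a CSV gazetteer with a place-name column (the file and column names below are hypothetical), might be:

import csv

def load_gazetteer(path, name_column='name'):
    # Build a set of lower-cased place names from a CSV gazetteer
    with open(path, newline='', encoding='utf-8') as f:
        return {row[name_column].strip().lower() for row in csv.DictReader(f)}

def filter_place_names(entities, gazetteer):
    # Keep only the named entities whose text appears in the gazetteer
    return [name for name, label in entities if name.lower() in gazetteer]

gazetteer = load_gazetteer('gazetteer.csv')           # hypothetical file name
place_names = filter_place_names(entities, gazetteer) # 'entities' from the NLTK sketch above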

Unfortunately I ran out of time both to write the necessary code and to run the script over the dataset. The script badly needs improvement using dataframes and other optimisation. Notwithstanding its preliminary nature, it is clear that this method of extracting place names is slow. The quality of results is also highly dependent on the quality and size of the gazetteer. Only place names found within the gazetteer will be extracted and therefore, if the gazetteer is biased or deficient in some way, the resulting output will be skewed. Furthermore, as the gazetteer becomes larger, the extraction of place names will become painfully slow.

The method described replicates the functionality of geoparser tools yet is a little more flexible, allowing the participant to account for the idiosyncrasies of web archive data, such as unusual characters.

NLTK
- Positives: flexibility, works
- Negatives: slow, reliant on the gazetteer, requires Python skills


Concluding Remarks

Despite the travails that I have outlined, my teammates, adopting a non-programmatic approach, came up with this brilliant map by doing some nifty things with a gazetteer, Voyant Tools and QGIS.

[Image: map produced with Voyant Tools and QGIS]

From a programmatic perspective it appears that there is still work required to develop a method for extracting place names from web archive data at scale, particularly in the hectic and fast-paced environment of a datathon. The main challenge is the messiness of the data, with many tools throwing errors that were difficult to rectify. For future datathons, speed of analysis and implementation is a critical consideration, as datathons aim to deal with big data in a short amount of time. Of course, the preceding discussion has hardly considered the quality of the information output by the tools; this is another essential consideration and requires further work. Another future direction would be to examine other tools such as spaCy, Polyglot and NER-Tagger, as described in this article.
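
As a taster of that direction, a minimal spaCy sketch (assuming the small English model en_core_web_sm, which is not discussed in this post) could be as simple as:

import spacy

# Requires: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm')

def extract_places_spacy(text):
    # Keep the entities spaCy labels as geopolitical entities or locations
    doc = nlp(text)
    return [ent.text for ent in doc.ents if ent.label_ in ('GPE', 'LOC')]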

 

20 December 2018

The UK Web Archive gets a fresh look

Until recently, if you wanted to research a historic UK website you may have had to look in a number of different places. There was the 'Open' UK Web Archive, which contained the 15,000 or so publicly available websites collected since 2005. If you also wanted to check the vast 'Legal Deposit' web archive (containing the whole UK web space) then you would need to travel to the reading room of a UK Legal Deposit Library to see if what you needed was there. For the first time, the new UKWA website offers:

  • The ability to search the entire collection in one place
  • The opportunity to browse over 100 curated collections on a wide range of topics.

[Screenshot: the new UKWA homepage at www.webarchive.org.uk]

Who is the UK Web Archive?
UKWA is a partnership of all the UK Legal Deposit Libraries: the British Library, National Library of Scotland, National Library of Wales, Bodleian Libraries Oxford, Cambridge University Libraries and Trinity College Dublin. The Legal Deposit Web Archive is available in the reading rooms of all these libraries. A reader's pass for the relevant library is required to gain access to a reading room.

How much is available now?
At the time of writing, everything that a human (curators and collaborators) has selected since 2005 is searchable. This constitutes many thousands of websites and millions of individual web pages. We will be adding the huge yearly Legal Deposit collections over the coming year - we'll let you know as they become available.

Among the many websites available are the BBC and many newspaper websites such as The Sun, The Daily Mail and The Guardian.

Do the websites look and work as they did originally?
Yes and no. Every effort is made so that websites look how they did originally, and internal links should work. However, due to a variety of technical issues, many websites will look different or some elements may be missing. As a minimum, all of the text in the collection is searchable and most images should be there. Whilst we collect a considerable amount of video, much of it will not play back.

Is every UK website available?
We aim to collect every website made or owned by a UK resident; however, in reality it is extremely difficult to be comprehensive! Our annual Legal Deposit collections include every .uk (and .london, .scot, .wales and .cymru) domain, plus any website on a server located in the UK. Of course, many UK websites are .com, .info etc. and sit on servers in other countries.

If you have or know of a UK website that should be in the archive, we encourage you to nominate it here.

Keep in touch by following us on Twitter.

By Jason Webber, Web Archive Engagement Manager, The British Library