THE BRITISH LIBRARY

UK Web Archive blog


23 June 2020

WARCnet and the UK Web Archive


By Jason Webber, Web Archiving Engagement Manager

 

We at the UK Web Archive (UKWA) have recently taken part in a new initiative called WARCnet led by the University of Aarhus in Denmark (and funded by Independent Research Fund Denmark).

“The aim of the WARCnet network is to promote high-quality national and transnational research that will help us to understand the history of (trans)national web domains and of transnational events on the web, drawing on the increasingly important digital cultural heritage held in national web archives.”

 

WARCnet logo

 

The majority of participants are researchers currently using web archives as part of their studies, many with extensive experience and others new to the field. This makes it an exciting project to be part of, as it is an excellent way for content holders such as UKWA to work closely with a group of researchers and try to understand their needs and challenges. The project had a kick-off meeting in May 2020 that was originally intended to be in person but took place virtually. All the speakers pre-recorded their talks, which means that these are now all available (including one by myself). I’d particularly recommend viewing the two keynote speakers, Matthew S. Weber and Ian Milligan.

 

Title slide for Jason Webber's WARCnet presentation

 

Working Groups
It is intended for any outcomes from WARCnet to be driven by the participants themselves and to this end four working groups have been formed:

 

  • Working Group 1 - Comparing entire web domains
  • Working Group 2 - Analysing transnational events
  • Working Group 3 - Digital research methods and tools
  • Working Group 4 - Research data management across borders

 

The UKWA team is involved with each of the first three working groups, all of which have met in recent weeks to see how we can take this project forward. You can read more about each group here.

There are at least three more small conferences planned (currently as in person), one later this year in Luxembourg and two next year in London and Aarhus.

Look out for updates on our involvement with this initiative on this blog and on Twitter via @UKWebArchive and @WARC_net.

29 May 2020

Using Webrecorder to archive UK political party leaders' social media after the UK General Election 2019


This blog post is by Nicola Bingham, Helena Byrne, Carlos Lelkes-Rarugal and Giulia Carla Rossi

Introduction to Webrecorder

The UK Web Archive aims to capture the whole of the UK web space at least once a year, and targeted websites at more frequent intervals. We conduct this activity under the auspices of the Legal Deposit Regulations 2013 which enable us to capture, preserve and make accessible the UK Web for the benefit of researchers now and in the future.

Along with many cultural and heritage institutions that perform at-scale web archiving, we use Heritrix 3, the state of the art crawler developed by the Internet Archive and maintained and improved by an international community of web archiving technologists.

Heritrix copes very well with large scale, bulk crawling but is not optimised for high fidelity crawling of dynamic content, and in particular does not archive social media content very well.

Researchers are increasingly turning their attention to social media as a significant witness to our times, therefore we have a requirement to capture this content, in certain circumstances and in line with our collection development policy. Usually this will be around public events such as General Elections where much of the campaigning over recent years has been played out online and increasingly on social media in particular. 

For this reason we have looked at alternative web archiving tools such as Webrecorder to complement our existing toolset. 

Webrecorder was developed by Ilya Kreymer under the auspices of Rhizome (a non-profit organisation based in New York which commissions, presents and preserves digital art), under its digital preservation program. It is available as a browser-based service, which offers free accounts with up to 5GB of storage, and as a Desktop App.

Webrecorder was already well known to us at the UK Web Archive although we had not used it until recently. It is a web archiving service which creates an interactive copy of web pages that the user explores in their browser, including content revealed by interactions such as playing video and audio, scrolling, clicking buttons etc. This is a much more sophisticated method of acquisition than that used by Heritrix, which essentially only follows HTML links and doesn’t handle dynamic content very well.


What we planned to do

The UK General Election Campaign ran from the 6th of November 2019, when Parliament was dissolved, until polling day on the 12th of December 2019. On the 13th of December 2019 the UK Web Archive team, based at the British Library, attempted to archive various social media accounts of the main political party leaders. Seventeen political leaders from the four home nations were identified and three social media platforms were targeted: Twitter, Facebook and Instagram. Not all leaders have accounts on all three platforms, but in total forty-four social media accounts were archived. These accounts are identified in the table below by an X.

List of UK political party leaders' social media accounts archived
Image credit: Carlos Lelkes-Rarugal

 

 

How we did it

On the 13th of December 2019 we ran the Webrecorder Desktop App across twelve office PCs. Many were running the Webrecorder Autopilot function over the accounts, but we had mixed success, in that not all captures collected the same amount of data. As the Autopilot functionality didn’t work well on all accounts, a combination of automated and manual capture processes was used where necessary. It took the team a lot longer than expected to archive the accounts, so some were archived on a range of dates the following week.

 

Large political parties’ vs smaller parties’ social media accounts

The leaders of the two largest political parties, Jeremy Corbyn and Boris Johnson, have many more social media followers than the other home nations party leaders. This meant that it was more difficult to get a comprehensive capture of Corbyn’s and Johnson’s Twitter accounts than, for example, Arlene Foster’s. The more popular Twitter accounts took many hours to crawl; Corbyn’s took almost ten hours to archive thirteen days’ worth of Tweets (which only took us up to 1st December).

 

Technical Issues

We experienced several technical issues with crawling, mainly concerning IP addresses, the app crashing, and Autopilot working on some computers and not others. It was hard to get the app restarted after it crashed, so some time was lost when this happened. Different computers with the same specs ran differently. The Autopilot captures for Jeremy Corbyn’s and Boris Johnson’s Twitter accounts were started at the same time, but Corbyn’s ran uninterrupted while Johnson’s crashed when it reached 475 MB. Although Corbyn’s account was crawled for nearly ten hours, it only collected 93 MB of data. In contrast, Nigel Farage’s Twitter page was crawled for over four hours and produced 506 MB. It is important to check the size of crawled data, as the number of hours the Webrecorder Desktop App runs on Autopilot does not necessarily translate into a high fidelity crawl.
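To make the gap between crawl time and data collected concrete, the figures quoted above can be turned into a rough capture rate. The sketch below is illustrative only: it treats "nearly ten hours" and "over four hours" as 10 and 4 hours, which are approximations, and the function and variable names are made up for this example.

```python
def mb_per_hour(size_mb: float, hours: float) -> float:
    """Return the average amount of data captured per hour of crawling."""
    if hours <= 0:
        raise ValueError("hours must be positive")
    return size_mb / hours


# Sizes are taken from the text; durations are rounded approximations.
crawls = {
    "Corbyn (Twitter)": (93, 10),   # 93 MB in nearly ten hours
    "Farage (Twitter)": (506, 4),   # 506 MB in over four hours
}

rates = {name: mb_per_hour(size, hours) for name, (size, hours) in crawls.items()}

for name, rate in rates.items():
    print(f"{name}: about {rate:.1f} MB per hour")
```

A low rate does not prove a capture is poor, but it is a cheap signal that a long-running job deserves a closer look during QA.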

 

Added complications when using multiple devices with the same user profile:

Complications arose mainly from the auditing and collating of WARC files; performing QA and keeping track of which jobs were successful and those that were not. 

Initially, all participants in this project had planned to use their own work PC or work laptop and a local desktop installation of Webrecorder. However, an hour or so into the process (early in the day), it became apparent that there would not be enough time to archive all of the social media accounts within our time frame, given the volume of social media accounts and the unanticipated time it would take to archive each one. For example, it took one instance of a desktop Webrecorder application almost ten hours to archive Jeremy Corbyn’s Twitter account (only able to capture Tweets up to a month prior to the day of archiving).

It was then decided that we could potentially, and experimentally, run multiple parallel Webrecorder applications across a number of office desktop PCs; PCs that were free and available for us to use. This was possible because of the IT Architecture in place, allowing users to log into any office machine with the correct credentials and making their personal desktop load up along with all their files and user settings, regardless of the PC they log into. 

The British Library’s IT system, which incorporates a lot of the Windows ecosystem, gives each user their own dedicated central work directory, where they are given a virtual hard drive and their own storage space for all their documents and any other work-related files. This allowed one user to be logged into several office PCs at the same time and therefore to run a separate desktop Webrecorder application on each machine. This was indeed very helpful as it allowed each machine to focus on one particular social media account, which in many cases took hours to archive.

Having multiple Webrecorder jobs greatly increased our capacity to archive by removing the previous bottleneck of one Webrecorder job per user. Instead, each user could run several Webrecorder jobs.

Work flow of gathering WARC files from Webrecorder
Image credit: Carlos Lelkes-Rarugal

 

 

Having multiple Webrecorder jobs added complications down the line, not necessarily impacting the archiving process, but rather complicating the auditing and collating of WARC files. When a user had several Webrecorder jobs running concurrently, each job would still be downloading to the same user work directory (the user’s virtual hard drive). So if a user had many parallel jobs running, this would create multiple WARC files in the same folder (but with different names, so no clashes), with WARC files being produced by the different desktop PCs that the user had logged in to. This was quite an elaborate setup because once a job had completed, the entire contents of the Webrecorder folder (where the WARCs were stored) were copied to a USB drive so that an initial Quality Assurance (QA) check could be performed on the completed job on a more capable laptop. The difficulty was in finding the WARC file that corresponded to the completed job, which was somewhat convoluted as there would have been multiple WARC files with this type of file-naming convention:

 “rec-20191213100335021576-DESKTOP-AOCGH38-7B5SEXKS.warc.gz”. 

As you can imagine, a copy of Webrecorder’s folder contents includes not only the completed job but also the WARC files from other incomplete jobs. With multiple jobs per PC and multiple PCs per user, keeping track of what had completed and which WARCs were either corrupted or not up to standard was quite demanding.
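One way to ease this kind of audit is to group the WARC files by the machine named in each filename. The following is a minimal sketch, not the workflow we used: it assumes the filename follows the pattern rec-<timestamp>-<hostname>-<random id>.warc.gz, with the first 14 digits of the timestamp read as year to second, an interpretation inferred from the single example above. The second and third filenames in the usage example are invented for illustration.

```python
import re
from collections import defaultdict
from datetime import datetime

# Assumed pattern: rec-<14-digit timestamp><6 further digits>-<hostname>-<random id>.warc.gz
WARC_NAME = re.compile(
    r"^rec-(?P<stamp>\d{14})(?P<extra>\d{6})-(?P<host>.+)-(?P<rand>[A-Z0-9]+)\.warc\.gz$"
)


def parse_warc_name(filename):
    """Split a Webrecorder-style WARC filename into its assumed parts."""
    match = WARC_NAME.match(filename)
    if match is None:
        return None  # Not a name this sketch understands
    return {
        "started": datetime.strptime(match.group("stamp"), "%Y%m%d%H%M%S"),
        "host": match.group("host"),
        "id": match.group("rand"),
    }


def group_by_host(filenames):
    """Return {hostname: [filenames ordered by start time]}."""
    groups = defaultdict(list)
    for name in filenames:
        info = parse_warc_name(name)
        if info is not None:
            groups[info["host"]].append((info["started"], name))
    return {host: [name for _, name in sorted(items)] for host, items in groups.items()}


files = [
    "rec-20191213100335021576-DESKTOP-AOCGH38-7B5SEXKS.warc.gz",  # from the text
    "rec-20191213141500000000-DESKTOP-AOCGH38-AAAA1111.warc.gz",  # hypothetical
    "rec-20191213093000000000-DESKTOP-HYPOTH1-BBBB2222.warc.gz",  # hypothetical
]
grouped = group_by_host(files)
for host, names in grouped.items():
    print(host, len(names))
```

Grouping by hostname and start time narrows down which file belongs to which job, although it cannot tell a finished job from an interrupted one; that still needs a QA check of the file itself.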

 


Review of the data collected 

File size of data collected from UK political party leaders' social media accounts
Image credit: Carlos Lelkes-Rarugal

 

How to access this data

The archived social media accounts can be accessed through the UK General Election 2019 collection in a UK Legal Deposit Library Reading Room. The UK Legal Deposit Libraries are the British Library, National Library of Scotland, National Library of Wales, Bodleian Libraries, Cambridge University Libraries and Trinity College Dublin Library.

The 2019 collection is part of a time series of UK General Election collections dating from 2005. These can be accessed over the Internet on the Topics and Themes page of the UK Web Archive website. All the party leaders' social media accounts are tagged into the subsection UK Party Leaders Social Media Accounts (access to individual websites depends on whether we have an additional permission to allow ‘open’ access). More information about what is included in the UK General Election 2019 collection is available through the UK Web Archive blog.

 

Conclusion


Overall, undertaking this experiment was an interesting experience for our small team of British Library Web Archive Curators. Many valuable lessons were learnt on how best to utilise Webrecorder in our current practice. The major takeaway was that it was a lot more time consuming than we expected. Instead of taking up one working day, it took nearly a whole week to archive our targeted social media accounts with Webrecorder. Our usual practice is to archive social media accounts with the Heritrix crawler, which works reasonably well with Twitter but is less suited to capturing other platforms. For a long time, we were unable to capture any Facebook content with Heritrix, mainly due to the platform’s publishing model. However, the way the platform is published has changed recently, allowing us limited success. Archiving social media will always remain challenging for the UK Web Archive, for myriad technical, ethical and legal reasons. The sheer scale of the UK’s social media output is too large for us to capture adequately (and indeed, this may not even be desirable) and certainly too large a task for us to tackle with manual, high fidelity tools such as Webrecorder. However, our recent experience during the 2019 UK General Election has convinced us that using Webrecorder to capture significant events is a worthwhile exercise, as long as we target selected, in-scope accounts on a case by case basis.

 

15 April 2020

Adding Poetry Websites to the UK Web Archive


By Pete Hebden, PhD Student Placement, Newcastle University

One of the great features of the UK Web Archive is its series of curated collections, which can be found on the UKWA Topics and Themes page. Each collection centres on a specific topic, some responding to particular events, such as Brexit or the First World War Centenary, others drawing upon the knowledge of contributors to create a set of in-depth examples around a particular subject. During my time at the British Library, I spent some time contributing to the Poetry Zines and Journals collection, originally started in 2016 by previous PhD placement student Joe McCarthy. The collection contains an amazing range of UK-based online outlets for poetry, encompassing blogs, Twitter accounts, online journals, and the personal websites of some poets, where there is a significant amount of the poet’s creative work on the site.

Poetry writing

 Although the collection was already very well curated when I came to it, it had not been significantly updated in several years, so many newer publications were not included in the collection. The past few years have seen a serious increase in the number of high-quality online literary journals – a trend that this collection was very astutely responding to when it was first created – and so there were a number of recent but well-established poetry titles that I could add to the list. One example is perverse, an online-only poetry journal started by the poet Chrissy Williams in 2018.

Along with roughly a third of the Poetry Zines and Journals collection, the archived version of perverse is only accessible on-site at the reading rooms of the UK’s legal deposit libraries. The rest of the content, for which open access permission has been obtained, can be viewed from anywhere. The Rialto and Porridge magazine are two examples of recently added sites that are open access, and the links here lead to the archived versions of those sites.

My other choices for inclusion in the collection were guided by some of my own specific areas of knowledge and interest. I included several online journals that are based in, or focus on, the north of England and Scotland, as these are literary scenes that I am more familiar with. Butcher’s Dog and Another North are two relatively recent literary journals based in the north of England. Another North is entirely digital, while Butcher’s Dog is a print journal with a strong online presence. I also added several more websites for print magazines that feature a significant amount of poetry on their site. For example, Popshot and The Rialto, both print magazines, regularly feature poetry from their most recent issues on their websites and/or social media, which gives readers an idea of the journal’s editorial policy and marks a significant change in the way that poetry is distributed by these publications thanks to the internet.

One interesting problem that I encountered during the process was around the formats that some digital publications use for distribution. While most online poetry journals choose to publish in a standard website or blog form, some distribute each issue as a downloadable file, such as a PDF or EPUB. This method of delivery presents a problem when attempting to archive the content, as the web crawler is not necessarily able to access and download these files, meaning that the poetry itself goes unrecorded. For these journals, we had to use alternative ways of recording their poetry in the collection. For example, perverse (mentioned above), as well as distributing each issue as a PDF download, also posts each poem individually to Twitter, and so we set up a regular capture of their Twitter account in order to record all of the poetry. Many other journals use social media in a similar way, and so in these cases I was able to use this as a way of archiving the journal’s output.

Over the past few decades, the web has provided an exciting platform for a diverse range of poets and publishers to showcase their work and it has been a very enjoyable challenge to contribute to the cataloguing of this transformation. I hope that my work on the Poetry Zines and Journals collection will help other readers and researchers exploring the breadth and variety of UK poetry available online today. 

If you know of any websites that should be included in this collection or in the general UK Web Archive, please nominate them.

 

19 February 2015

Building a 'Historical Search Engine' is no easy thing


Over the last year the UK Web Archive has been part of the Big UK Domain Data for the Arts and Humanities project, with the ambitious goal of building a ‘historical search engine’ covering the early history of the UK web. This continues the work of the Analytical Access to the Domain Dark Archive project but at a greater scale, and moreover, with a much more challenging range of use cases. We presented the current prototype at the International Digital Curation Conference last week (written up by the DCC), and received largely positive feedback, at least in terms of how we have so far handled the scale of the collection.

What the researchers found
However, we are eagerly awaiting the results of the real test of this system, from the project’s bursary holders. Ten researchers have been funded as ‘expert users’ of the system, each with a genuine historical research question in mind. Their feedback will be critical in helping us understand the successes and failures of the system, and how it might be improved.

One of those bursary holders, Gareth Millward, has already talked about his experience, including this (somewhat mis-titled but otherwise excellent) Washington Post article “I tried to use the Internet to do historical research. It was nearly impossible.” Based on that, it seems like the results are something of a mixed bag (and from our informal conversations with the other bursary holders, we suspect that Gareth’s experiences are representative of the overall outcome). But digging deeper, it seems that this situation arises not simply because of problems with the technical solution, but because of conflicting expectations of how the search should behave.

For example, as Gareth states, if you search for RNIB using Google, the RNIB site and information about it is delivered right at the top of the results.

But does this reflect what our search engine should do?

Is a historical search engine like Google?
When Google ranks its results, it is making many assumptions: about the most important meanings of terms, the current needs of its users and the information interests of specific users (also known as the filter bubble). What assumptions should we make? Are we even playing the same game?

One of the most important things we have learned so far is that we are not playing the same game, and the information needs of our researchers might be very different to those of a normal search (and indeed different between different users). When a user searches for ‘iphone’, Google might guess that they care about the popular one, but perhaps a historian of technology might mean the late 1990s Internet Phone by VocalTec. Terms change their meaning over time, and we must enable our researchers to discover and distinguish the different usages. As Gareth says, “what is ‘relevant’ is completely in the eye of the beholder.”

Moreover, in a very fundamental way, the historians we have worked with are not searching for the one top document, or a small set of documents about a specific topic. They look to the web archive as a refracting lens onto the society that built it, and are using these documents as intermediaries, carrying messages from the past and about the past. In this sense, caring about the first few hits makes no sense. Every result is equally important.

How results are sorted
To help understand these whole sets of results, we have endeavoured to add appropriate filtering and sorting options that can be used to ‘slice and dice’ the data down into more manageable chunks. At the most basic level (and contrary to the Washington Post article), the results are sorted, and the default is to sort by ascending harvest date. The contrast with a normal search engine is perhaps no more stark than here: where Bing or Google will generally seek to bring you the most recent hits, we focus on the past, something that is very difficult to achieve using a normal search engine.

With so many search options, perhaps the biggest challenge has been to present them to our users in a comprehensible way. For example, the problem where the RNIB advertisements for a talking watch were polluting the search results can be easily remedied if you combine the right search terms. The text of the advert is highly consistent, and therefore it is possible to precisely identify those advertisements by searching for the text “in associate with the RNIB”. This means it is possible to refine a search for RNIB to make sure we exclude those results (as you can see below).

Shine-rnib-no-watch
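The combination of excluding a known phrase and sorting by ascending harvest date can be sketched in a few lines. This is only a toy illustration over in-memory records, using simple substring matching; it is not how the search service itself is implemented, and the records, field names and `search` function are all hypothetical.

```python
from datetime import date

# Hypothetical records standing in for search results; not real archive data.
records = [
    {"url": "http://example.org/a", "harvested": date(2004, 5, 1),
     "text": "Talking watch, in associate with the RNIB"},
    {"url": "http://example.org/b", "harvested": date(1999, 3, 9),
     "text": "The RNIB publishes guidance on accessible web design"},
    {"url": "http://example.org/c", "harvested": date(2001, 7, 20),
     "text": "RNIB campaign news"},
    {"url": "http://example.org/d", "harvested": date(2000, 1, 1),
     "text": "An unrelated page"},
]


def search(records, term, exclude_phrase=None):
    """Return records containing term, minus those containing exclude_phrase,
    ordered by ascending harvest date (oldest first)."""
    hits = [r for r in records if term.lower() in r["text"].lower()]
    if exclude_phrase:
        hits = [r for r in hits if exclude_phrase.lower() not in r["text"].lower()]
    return sorted(hits, key=lambda r: r["harvested"])


refined = search(records, "RNIB", exclude_phrase="in associate with the RNIB")
for r in refined:
    print(r["harvested"], r["url"])
```

Sorting oldest first is the opposite of what a general-purpose search engine would usually do, which is the point made above.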


The problems are even more marked when it comes to trying to allow network analysis to be exploited. We do already extract links from the documents, and so it is already possible to show how the number of sites linking to the RNIB has changed over time, but it is not yet clear how best to expose and utilize that information. At the moment, the best solution we have found is to present these network links as additional search facets. For example, here are the results for the sites that linked to rnib.org.uk in 2000, which you can contrast with those for 2010.
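As a rough illustration of the underlying idea, counting the distinct sites that link to a domain in each crawl year could look something like the sketch below. The link records are invented, and the tuple layout and function name are assumptions made for this example rather than a description of our index.

```python
from collections import defaultdict
from urllib.parse import urlsplit

# Hypothetical (crawl year, source URL, target URL) link records.
links = [
    (2000, "http://a.example.co.uk/page1", "http://www.rnib.org.uk/"),
    (2000, "http://a.example.co.uk/page2", "http://www.rnib.org.uk/help"),
    (2000, "http://b.example.co.uk/", "http://rnib.org.uk/"),
    (2010, "http://c.example.co.uk/", "http://www.rnib.org.uk/"),
    (2010, "http://c.example.co.uk/", "http://other.example.org/"),
]


def linking_hosts_by_year(links, target_domain):
    """Count distinct source hosts linking to target_domain, per crawl year."""
    hosts = defaultdict(set)
    for year, source, target in links:
        target_host = urlsplit(target).hostname or ""
        # Accept the bare domain and any subdomain such as www.
        if target_host == target_domain or target_host.endswith("." + target_domain):
            hosts[year].add(urlsplit(source).hostname)
    return {year: len(found) for year, found in sorted(hosts.items())}


counts = linking_hosts_by_year(links, "rnib.org.uk")
print(counts)
```

Counting hosts rather than individual links stops one site with many pages from dominating the picture.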

Refining searches further
Currently, we expect that refining a search on the web archive will involve a lot of this kind of operation, combining new search terms and clauses to help focus in on the documents of interest. Therefore, looking further ahead, we envisage that future iterations of this kind of service might take the research queries and curatorial annotations we collect and start to try to use that information to semi-automatically classify resources and better predict user needs.

A ‘Macroscope’ rather than a search engine
Despite the fact that it helps get the overall idea across, calling this system a ‘historical search engine’ turns out to be rather misleading. The actual experience and ‘information needs’ of our researchers are very different from that case. This is why we tend to refer to this system as a Macroscope (see here for more on macroscopes), or as a Web Observatory. Sometimes a new tool needs a new term.

Throughout all of this, the most crucial part has been to find ways of working closely with our users, so we can all work together to understand what a ‘Macroscope’ might mean. We can build prototypes, and use our users’ feedback to guide us, but at the same time those researchers have had to learn how to approach such a complex, messy dataset.
Both the questions and the answers have changed over time, and all parties have had their expectations challenged. We look forward to continuing to build a better Macroscope, in partnership with that research community.

By Dr Andrew Jackson, Web Archiving Technical Lead, The British Library

28 January 2015

Spam as a very ephemeral (and annoying) genre…


Spam is a part of modern life. Anyone who hasn’t received any recently is a lucky person indeed. But just try putting your email address out there in the open and you’ll be blessed with endless messages you don’t want, from people you don’t know, from places you’ve never heard about! And then just delete, de-le-te, block sender command…

Imagine though someone researching our web lives in say 50 years and this part of our daily existence is nowhere to be found. Spam is the ugly sister of the Web Archive, it is unlikely we’ll keep spam messages in our inboxes, and almost certainly no institution will keep them for posterity. And yet they are such great research materials. They vary in topics, they can be funny, they can be dangerous (especially to your wallet), and they make you shake your head in disbelief…

We all know the spam emails about people who got stuck somewhere, can’t pay the bill and ask for a modest sum of £2,500 or so. These always make me think: if I had a spare £2,500, it’d be Bora Bora here I come, but that’s just selfish me! Now these are taken to a new level. It’s about giving us the money that is inconveniently placed in a bank somewhere far, far away:

Charity spree

From Mrs A.J., a widow of a Kuwait embassy worker in Ivory Coast with a very English surname:

…Currently, this money is still in the bank. Recently, my doctor told me I would not last for the next eight months due to cancer problem. What disturbs me most is my stroke sickness. Having known my condition I decided to donate this fund to a charity or the man or woman who will utilize this money the way I am going to instruct here godly.

Strangely, two weeks later a Libyan lady, who is also a widow, wrote to me that she had also suffered a stroke and that all she wants is to shower me with money as part of her charity spree:

Having donated to several individuals and charity organization from our savings, I have decided to anonymously donate the last of our family savings to you. Irrespective of your previous financial status, please do accept this kind and peaceful offer on behalf of my beloved family.



Mr. P. N. ‘an accountant with the ministry of Energy and natural resources South Africa’ was straight to the point:

… presently we discovered the sum of 8.6 million British pounds sterling, floating in our suspense Account. This money as a matter of fact was an over invoiced Contract payment which has been approved for payment Since 2006, now we want to secretly transfer This money out for our personal use into an overseas Account if you will allow us to use your account to Receive this fund, we shall give you 30% for all your Effort and expenses you will incure if you agree to Help.

My favourite is quite light-hearted. Got it from a 32 year old Swedish girl:

My aim of writing you is for us to be friends, a distance friend and from there we can take it to the next level, I writing this with the purest of heart and I do hope that it will your attention. In terms of what I seek in a relationship, I'd like to find a balance of independence and true intimacy, two separate minds and identities forged by trust and open communication. If any of this strikes your fancy, do let me know...

So what if I’m a girl too, with a husband and a kid? You never know what may be handy…

Blog post by Dorota Walker 
Assistant Web Archivist

@DorotaWalker 

 

Further reading: Spam emails received by web-archivist@bl.uk. Please note that the quotations come from the emails and I left the original spelling intact.

11 November 2014

Collecting First World War Websites – November 2014 update


Earlier in 2014 we blogged about the new Special Collection of websites related to World War One that we’ve put together to mark the Centenary. As today is Armistice Day, commemorating the cessation of hostilities on the Western Front, it seems fitting to see what we have collected so far.


The collection has been growing steadily over the past few months and now totals 111 websites. A significant subset of the WW1 special collection comes from the output of the Heritage Lottery Funded projects. The collection also includes websites selected by subject specialists at the British Library and nominations from members of the public.

A wide variety of websites have been archived so far which can broadly be categorised into a few different types:

Critical reflections
These include critical reflections on British involvement in armed conflict more generally, for example the Arming All Sides website, which features a discussion of the arms trade around WW1, and Naval-History.net, an invaluable academic resource on the history of naval conflict in the First and Second World Wars.

Artistic and literary
The First World War inspired a wealth of artistic and literary output. One example is the website dedicated to Eugene Burnand (1850-1921), a Swiss artist who created a series of pencil and pastel portraits depicting various ‘military types’ of all races and nationalities drawn into the conflict on all sides. Burnand was a man of great humanity and his subjects included typical men and women who served in the War as well as those of more significant military rank.

The Collection also includes websites of contemporary artists who in connection with the Centenary are creating work reflecting on the history of the conflict. One such artist is Dawn Cole whose work on WW1 has focused on the archive of WW1 VAD Nurse Clarice Spratling’s diaries, creating a project of live performance, spoken word and art installations.

Similar creative reflections from the world of theatre, film and radio can be seen in the archive. See for example Med Theatre: Dartmoor in WW1, an eighteen-month project investigating the effect the First World War had on Dartmoor and its communities. Pals for Life is a project based in the north-west aiming to create short films enabling local communities to learn about World War One. Subterranean Sepoys is a radio play resulting from the work of volunteers researching the forgotten stories of Indian soldiers and their British Officers in the trenches of the Western Front in the first year of the Great War.

Community stories
The largest number of websites archived so far comprise projects produced by individuals or local groups telling stories of the War at a community level across the UK. The Bottesford Parish 1st World War Centenary Project focusses on 220 local recruits who served in the War using wartime biographies, memorabilia and memories still in the community to tell their stories.

The Wylye Valley 1914 project has been set up by a Wiltshire-based local history group researching the Great War and the sudden dramatic social and practical effects this had on the local population. In 1914, 24,000 troops descended suddenly on the Wylye Valley villages, the largest of which had a population of 500, in response to Kitchener’s appeals for recruits. These men arrived without uniform, accommodation or any experience of organisation. The project explores the effects of the War on these men and the impact on the local communities.

An important outcome of commemorations of the Centenary of WW1 has been the restoration and transcription of war memorials across the UK. Many local projects have used the opportunity to introduce the stories of those who were lost in the conflict. Examples include the Dover War Memorial Project; the Flintshire War Memorials Project; the Leicester City, County and Rutland War Memorials project; and the St. James Toxteth War memorials project.

Collecting continues
This shows just some of the many ways people are choosing to commemorate the First World War and demonstrates the continued fascination with it.

We will continue collecting First World War websites through the Centenary period to 2018 and beyond. If you own a website or know of a website about WW1 and would like to nominate it for archiving then we would love to hear from you. Please submit the details on our nominate form.

By Nicola Bingham, Web Archivist, The British Library

16 October 2014

What is still on the web after 10 years of archiving?

The UK Web Archive started archiving web content towards the end of 2004 (e.g. The Hutton Enquiry). If we want to look back at the (almost) ten years that have passed since then, can we find a way to see how much we’ve achieved? Are the URLs we’ve archived still available on the live web? Or are they long since gone? If those URLs are still working, is the content the same as it was? How has our archival sliver of the web changed?

Looking Back
One option would be to go through our archives and exhaustively examine every single URL, and work out what has happened to it. However, the Open UK Web Archive contains many millions of archived resources, and even just checking their basic status would be very time-consuming, never mind performing any kind of comparison of the content of those pages.

Fortunately, to get a good idea of what has happened, we don’t need to visit every single item. Our full-text index categorizes our holdings by, among other things, the year in which the item was crawled. We can therefore use this facet of the search index to randomly sample a number of URLs from each year the archive has been in operation, and use those to build up a picture that compares those holdings to the current web.
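
As a rough illustration of this per-year sampling, here is a minimal Python sketch. It is not the script we used: it works on an in-memory list of (URL, crawl year) pairs rather than a search index, and the names `sample_urls_by_year`, `per_year` and `seed` are hypothetical.

```python
import random
from collections import defaultdict


def sample_urls_by_year(records, per_year=1000, seed=None):
    """Pick up to `per_year` random URLs for each crawl year.

    `records` is an iterable of (url, crawl_year) pairs. This in-memory
    grouping is an assumption standing in for a faceted search query
    with randomised result ordering.
    """
    rng = random.Random(seed)
    by_year = defaultdict(list)
    for url, year in records:
        by_year[year].append(url)
    # Sample without replacement; take everything if a year is small.
    return {
        year: rng.sample(urls, min(per_year, len(urls)))
        for year, urls in sorted(by_year.items())
    }
```

Sampling the same number of URLs from every year keeps the years comparable, even though the archive holds far more material from some years than from others.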

URLs by the Thousand
Our search system has built-in support for randomizing the order of the results, so a simple script that performs a faceted search was all that was needed to build up a list of one thousand URLs for each year. A second script was used to attempt to re-download each of those URLs, and record the outcome of that process. Those results were then aggregated into an overall table showing how many URLs fell into each different class of outcome, versus crawl date, as shown below:

What-have-we-saved-01

Here, ‘GONE’ means that not only is the URL missing, but the host that originally served that URL has disappeared from the web. ‘ERROR’, on the other hand, means that a server still responded to our request, but that our once-valid URL now causes the server to fail.

The next class, ‘MISSING’, ably illustrates the fate of the majority of our archived content - the server is there, and responds, but no longer recognizes that URL. Those early URLs have become 404 Not Found (either directly, or via redirects). The remaining two classes show URLs that end with a valid HTTP 200 OK response, either via redirects (‘MOVED’) or directly (‘OK’).
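
The five classes could be assigned along the following lines. This is only a sketch using Python's standard library, not our actual script, and it makes two assumptions the description above does not settle: any failure to obtain an HTTP response at all is counted as ‘GONE’, and any final status other than 200 or 404 is counted as ‘ERROR’. The function names are hypothetical.

```python
import urllib.request
from urllib.error import HTTPError


def classify_outcome(host_responded, final_status, redirected):
    """Map the result of one re-download attempt to an outcome class."""
    if not host_responded:
        return "GONE"
    if final_status == 200:
        return "MOVED" if redirected else "OK"
    if final_status == 404:
        return "MISSING"
    # Assumption: every other final status is treated as a server failure.
    return "ERROR"


def probe(url, timeout=10):
    """Re-download `url` (a live network request) and classify the result."""
    try:
        # urlopen follows redirects, so geturl() is the final URL.
        with urllib.request.urlopen(url, timeout=timeout) as response:
            redirected = response.geturl() != url
            return classify_outcome(True, response.status, redirected)
    except HTTPError as err:
        # The server answered, but with an error status such as 404 or 500.
        return classify_outcome(True, err.code, err.url != url)
    except OSError:
        # Assumption: DNS failures, refused connections and timeouts
        # all mean the host has gone.
        return classify_outcome(False, None, False)
```

Keeping the classification separate from the network request means the rules can be checked without touching the live web.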

The horizontal axis shows the results over time, since late 2004, broken down by each quarter (i.e. 2004-4 is the fourth quarter of 2004). The overall trend clearly shows how the items we have archived have disappeared from the web, with individual URLs being forgotten as time passes. This is in contrast to the fairly stable baseline of ‘GONE’ web hosts, which reflects our policy of removing dead sites from the crawl schedules promptly.
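
For readers who want to build a similar table, the quarterly grouping might look like this in Python. It is a sketch rather than our aggregation script; `quarter_label` and `tally_by_quarter` are hypothetical names, and the input is assumed to be a list of (crawl date, outcome) pairs.

```python
from collections import Counter, defaultdict
from datetime import date


def quarter_label(crawl_date):
    """Return a label such as '2004-4' for the fourth quarter of 2004."""
    quarter = (crawl_date.month - 1) // 3 + 1
    return f"{crawl_date.year}-{quarter}"


def tally_by_quarter(results):
    """Count outcome classes per crawl quarter.

    `results` is an iterable of (crawl_date, outcome) pairs, where
    `outcome` is a class name such as 'GONE' or 'OK'.
    """
    table = defaultdict(Counter)
    for crawl_date, outcome in results:
        table[quarter_label(crawl_date)][outcome] += 1
    return table


# Example with made-up data:
results = [
    (date(2004, 11, 5), "MISSING"),
    (date(2004, 12, 1), "GONE"),
    (date(2005, 2, 14), "OK"),
]
table = tally_by_quarter(results)
```

Dividing each count by the total for its quarter gives the proportions that a stacked chart would plot.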

Is OK okay?
However, so far, this only tells us which URLs are still active - the content of those resources could have changed completely. To explore this issue, we have to dig a little deeper by downloading the content and trying to compare what’s inside.

This is very hard to do in a way that is both automated and highly accurate, simply because there are currently no reliable methods for automatically determining when two resources carry the same meaning, despite being written in different words. So, we have to settle for something that is less accurate, but that can be done automatically.

The easy case is when the content is exactly the same – we can just record that the resources are identical at the binary level. If not, we extract whatever text we can from the archived and live URLs, and compare them to see how much the text has changed. To do this, we compute a fingerprint from the text contained in each resource, and then compare those to determine how similar the resources are. This technique has been used for many years in computer forensics applications, such as helping to identify ‘bad’ software, and here we adapt the approach in order to find similar web pages.

Specifically, we generate ssdeep ‘fuzzy hash’ fingerprints, and compare them in order to determine the degree of overlap in the textual content of the items. If the algorithm is able to find any similarity at all, we record the result as ‘SIMILAR’. Otherwise, we record that the items are ‘DISSIMILAR’.
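
The decision logic can be sketched in Python as follows. To stay within the standard library, this sketch does not use ssdeep: it swaps in a simple overlap of three-word ‘shingles’ as the similarity score, which is a different and cruder technique. It also assumes the text has already been extracted from both versions. All function names are hypothetical.

```python
def shingles(text, size=3):
    """Return the set of overlapping word sequences of length `size`."""
    words = text.lower().split()
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def shingle_overlap(text_a, text_b):
    """Stand-in similarity score in [0, 1]; 0 means nothing in common.

    This is NOT ssdeep, only a simple substitute for illustration.
    """
    a, b = shingles(text_a), shingles(text_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def compare_resources(archived_bytes, live_bytes, archived_text, live_text,
                      score=shingle_overlap):
    """Classify a live resource against its archived copy."""
    if archived_bytes == live_bytes:
        return "SAME"  # identical at the binary level
    # Any similarity at all counts as SIMILAR, as described above.
    if score(archived_text, live_text) > 0:
        return "SIMILAR"
    return "DISSIMILAR"
```

Because the scoring function is passed in as the `score` parameter, a fuzzy-hash comparison could be substituted without changing the rest of the logic, provided it also returns zero when it finds nothing in common.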

Processing all of the ‘MOVED’ or ‘OK’ results in this way leads to this graph:

What-have-we-saved-02

So, for all those ‘OK’ or ‘MOVED’ URLs, the vast majority appear to have changed. Very few are binary identical (‘SAME’), and while many of the others remain ‘SIMILAR’ at first, that fraction tails off as we go back in time.

Summarising Similarity
Combining the similarity data with the original graph, we can replace the ‘OK’ and ‘MOVED’ parts of the graph with the similarity results in order to see those trends in context:

What-have-we-saved-03

Shown in this way, it is clear that very few archived resources are still available, unchanged, on the current web. Or, in other words, very few of our archived URLs are ‘cool’, in the sense of the W3C maxim that cool URIs don’t change.

Local Vs Global Trends
While this analysis helps us understand the trends and value of our open archive, it’s not yet clear how much it tells us about other collections, or global trends. Historically, the UK Web Archive has focused on high-status sites and sites known to be at risk, and these selection criteria are likely to affect the overall trends. In particular, the very rapid loss of content observed here is likely due to the fact that so many of the sites we archive were known to be ‘at risk’ (such as the sites lost during the 2012 NHS reforms). We can partially address this by running the same kind of analysis over our broader, domain-scale collections. However, that would still bias things towards the UK, and it would be interesting to understand how these trends might differ across countries, and globally.

By Andy Jackson, Web Archiving Technical Lead, The British Library

07 October 2014

Thoughts on website selecting for the UK Web Archive

Hedley Sutton, Asian & African Studies Reference Team Leader at The British Library, gives his thoughts and experiences of web archiving.

A Reference Team Leader spends most of their day answering queries sent in by e-mail, fax and letter or manning Reading Room enquiry desks. Some, however, also contribute to the selection of sites for inclusion in the UK Web Archive.

The rise of digital
Digital content is of course increasingly important for researchers, and is certain to become ever more so as publishers slowly move away from print to online formats. The Library recognized this when it began to archive websites in 2004, aiming to harvest a segment of the vast national web domain and to provide free access both to live sites and to snapshots of existing and defunct sites as they developed over time.

Those which have been fully ‘accessioned’, as it were, are available to view online, and can be found alphabetically by title, or subject/keyword, or in some cases grouped in themed collections such as the 2012 London Olympics or the ‘Credit crunch’. 

Websites of interest
I volunteered to become a selector in 2008, planning initially to concentrate on tracing websites within my own specialism of Asian and African studies. I soon discovered, however, that it was more rewarding (addictive, even) to look beyond conventional subject divisions to home in on all and anything that looked of potential interest to present and future users of the archive.

Worthy, unusual and not-quite-believe-it
Over the years this has ranged from the worthy (such as the UK Web Designers’ Association and the Centre for Freudian Analysis and Research), through the unusual (step forward the Federation of Holistic Therapists, the Fellowship of Christian Magicians, and the Society for the Assistance of Ladies in Reduced Circumstances), to the I-see-it-but-do-not-quite-believe-it (yes, I mean you, British Leafy Salads Association; no, don’t try and run away, Ferret Education and Research Trust; all power to you, Campaign Against Living Miserably). Being paid to spend part of your time surfing the web – what’s not to like?

Permission required
The only mildly disappointing aspect of selecting websites is the fact that at present only about 20% of recommended sites actually make it into the Open UK Web Archive. The explanation is simple – the Library requires formal permission from website owners before it can ingest and display their sites.

This is offset in part by the amendment to the Legal Deposit legislation that (since 2013) has allowed The British Library to archive all UK websites. These, however, can only be viewed in the Reading Rooms of the UK Legal Deposit Libraries.

If you know of a website that you feel should be in the Open UK Web Archive, please nominate it.


By Hedley Sutton, Asian & African Studies Reference Team Leader, The British Library