UK Web Archive blog

Information from the team at the UK Web Archive, the Library's premier resource of archived UK websites


19 January 2022

Explore Women’s Football in the UK Web Archive

By Helena Byrne, Curator Web Archives, The British Library

On 5 December 1921, the Football Association (FA) banned women from playing football on affiliated grounds, stating that football is “quite unsuitable for females and ought not to be encouraged” (FIFA.com). It took almost fifty years to overturn this ban. With the formation of the Women’s Football Association (WFA) in 1969, the FA came under increasing pressure to remove it, and at the FA Council meeting on 19 January 1970 the FA decided to rescind the Council’s resolution of 1921.

To celebrate 52 years since the ban was lifted, this blog post gives a quick overview of women’s football in the UK Web Archive (UKWA). To mark National Sporting Heritage Day back in 2018 we published a blog post outlining the UKWA sports collection policies. 

History of the Women's FA, archived in 2018

Sport has been included in UKWA since its formation in 2005, and in recent years we have been blogging more about these collections. Football in all its varieties is probably the most popular sport in the UK, which is why there is a collection dedicated exclusively to football and related activities. The most developed subsection of this collection is on soccer, with almost 4,000 items ranging from individual web pages and subsections of websites to full websites, blogs and some social media platforms.

Explore the extensive Soccer collection on the UK Web Archive website.

We have collected a wide range of content from sports clubs (amateur and professional), fan sites, football research and events. There is no distinction in the collection based on gender as all content related to the sport is treated equally. 

Accessing the UK Web Archive
Under the Non-Print Legal Deposit Regulations 2013, we can archive UK published websites, but we can only make an archived website available to people outside the Legal Deposit Libraries’ Reading Rooms if the website owner has given permission.

Some of the websites in UKWA that already have permission granted include Charlton Athletic Women, Sent Her Forward and Tartan Kicks: The Magazine For Scottish Women's Football. Examples of websites with onsite-only access include the Crawley Old Girls (COGS), Her Game Too and Dick, Kerr Ladies FC 1917-1965: Women's Football History.

Tartan Kicks website, archived in 2019

As access to UKWA content is mixed, the message ‘Viewable only on Library premises’ will appear under the title of a website if you need to visit a Legal Deposit Library to view it. If there is no message underneath, the archived version of the website should be available on your personal device.

Get involved with preserving women’s football online with the UK Web Archive
The UK Web Archive works across the six UK Legal Deposit Libraries and with other external partners to try to bridge gaps in our subject expertise. But we can’t curate the whole of the UK web on our own; we need your help to ensure that information, discussion and creative output related to women’s football are preserved for future generations. Anyone can suggest UK published websites for inclusion in the UK Web Archive by filling in our nominations form.

Keep an eye on the UKWA blog and Twitter account for more details of our forthcoming collection to preserve the UEFA Women's Euro 2022 competition, taking place across England from 6 to 31 July 2022.

06 January 2022

UKWA 2021 Technical update

By Andy Jackson, UKWA Technical Lead, The British Library

During the last quarter of 2021, the technical services that make up the UK Web Archive underwent a lot of changes behind the scenes. These changes should help us to improve our services, so it’s worth explaining a little about what’s been going on.

Starting the Hadoop 3 Migration
Our Hadoop cluster is now quite old, and updating this to a newer version has been a long-standing issue. The old Hadoop version no longer gets updates, and is not supported by modern tools and libraries, which prevents us from making the most of what’s available.

For a long time, it was unclear how best to proceed – an in-place update seemed too risky, but a cluster-to-cluster migration appeared to require too much hardware. So, over recent years, we have spent time learning how to set up and maintain a Hadoop 3 cluster, and evaluating different migration strategies, focusing on how we might maintain service during any migration.

We eventually decided a cluster-to-cluster migration should be possible, as long as we can purchase higher-density storage so we have enough headroom to migrate content over ahead of migrating hardware. Earlier in the year, following some procurement delays, we were able to purchase and establish this new Hadoop 3 cluster, with each server providing over 450TB of raw storage (compared to about 85TB per server for the older cluster).

While this was being set up, we also had to generalize our services so that all important processes can run across both clusters, and so that WARC records can be retrieved from either. This has been quite time-consuming, but as 2021 drew to a close (and space on the older cluster was getting tight!), we were finally able to shift things so that newly-harvested content is written to the new Hadoop 3 cluster.

Behind the scenes, our file tracking database was updated to scan both clusters and act as a record of which files are where, and to update this record hourly rather than just once per day. A new WARC Server component was created that takes Wayback requests for WARC records, uses the tracking database to work out which cluster each file is on, and then grabs and returns the WARC record in question.
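
To make that flow concrete, here is a minimal sketch of the lookup-and-fetch logic, assuming a simple tracking table and WebHDFS access to both clusters. The table schema, host names and ports are hypothetical; the real WARC Server is part of UKWA's internal services and does considerably more.

```python
# Minimal sketch of a WARC Server lookup: given a WARC filename, offset and
# length (as found in a CDX index), ask the tracking database which cluster
# holds the file, then fetch just that record over WebHDFS.
# All names (table, hosts, ports) are hypothetical.
import sqlite3
import requests

# Hypothetical WebHDFS endpoints for the two clusters
CLUSTERS = {
    "h020": "http://h020-namenode:14000/webhdfs/v1",  # older cluster
    "h3": "http://h3-namenode:9870/webhdfs/v1",       # new Hadoop 3 cluster
}

def locate(filename):
    """Return (cluster_id, hdfs_path) for a WARC file from the tracking DB."""
    db = sqlite3.connect("tracking.db")  # stand-in for the real tracking database
    row = db.execute(
        "SELECT cluster, path FROM files WHERE filename = ?", (filename,)
    ).fetchone()
    if row is None:
        raise FileNotFoundError(filename)
    return row

def fetch_record(filename, offset, length):
    """Fetch a single (gzipped) WARC record by byte range."""
    cluster, path = locate(filename)
    # WebHDFS supports ranged reads via the offset/length parameters of OPEN
    resp = requests.get(
        CLUSTERS[cluster] + path,
        params={"op": "OPEN", "offset": offset, "length": length},
    )
    resp.raise_for_status()
    return resp.content
```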

In the future, the tracking database will be used to help orchestrate the movement of content to Hadoop 3, with hardware being shifted over as it becomes available. The new WARC Server means that we will be able to maintain an uninterrupted service throughout.
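
The post doesn't specify what will actually move the content; purely as an illustration, a tracking-database-driven migration task might wrap Hadoop's standard distcp tool along the following lines. The namenode addresses and SQL schema are again hypothetical.

```python
# Illustrative only: copy files that the tracking database reports as being
# held only on the old cluster over to Hadoop 3, using Hadoop's standard
# distcp tool. The real migration mechanism is not specified in the post.
import sqlite3
import subprocess

OLD = "hdfs://h020-namenode:8020"  # hypothetical old-cluster namenode
NEW = "hdfs://h3-namenode:8020"    # hypothetical Hadoop 3 namenode

db = sqlite3.connect("tracking.db")
to_move = db.execute("SELECT path FROM files WHERE cluster = 'h020'")

for (path,) in to_move:
    # -update skips files that already exist unchanged on the destination
    subprocess.run(
        ["hadoop", "distcp", "-update", OLD + path, NEW + path],
        check=True,
    )
```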

But to avoid interruption now, we also needed to enable access to the newer content on Hadoop 3 by indexing it for playback. To this end, a new CDX indexer implementation has been created that can run on either cluster, built with Webrecorder’s Python tools rather than Java. As before, the tracking database is used to keep track of what’s been indexed, but both clusters can now be indexed promptly.
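
As a rough flavour of what CDX indexing involves, here is a simplified sketch built on Webrecorder's warcio library: it walks a WARC file and prints, for each response record, the fields a playback tool needs to find that record again. The actual UKWA indexer is more sophisticated (canonicalised keys, digests, status codes and so on).

```python
# Simplified CDX-style indexing with Webrecorder's warcio library: emit the
# URL, timestamp and byte offset/length of each response record, which is
# what playback tools use to seek directly to a record inside a WARC file.
import sys
from warcio.archiveiterator import ArchiveIterator

def index_warc(path):
    with open(path, "rb") as stream:
        records = ArchiveIterator(stream)
        for record in records:
            if record.rec_type != "response":
                continue
            url = record.rec_headers.get_header("WARC-Target-URI")
            timestamp = record.rec_headers.get_header("WARC-Date")
            # Offset/length let a Wayback service fetch just this record
            print(url, timestamp,
                  records.get_record_offset(), records.get_record_length(),
                  path)

if __name__ == "__main__":
    index_warc(sys.argv[1])
```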

Similarly, although not fully moved into production yet, the Document Harvester document extractor and the Solr full-text indexing tasks have been re-written to be able to run on either cluster, and be more robust than the prior implementations.

At the time of writing, the main public website and the internal Storage Report have not been fully moved over to run across both systems, so there may be some slight inconsistencies there in the short term. However, we expect to resolve this in the next week or two.

Task Orchestration via Apache Airflow
This large set of changes has also been used as an opportunity to update how our critical web-archiving tasks are implemented and orchestrated. We were using the Luigi framework to define tasks and their dependencies, but over time we have found this to be problematic in a number of ways:

  • The code that performs tasks and the code that orchestrates those tasks were mixed together in the same source files. This made it very hard to work on improving any individual task on its own, and made testing difficult.
  • The Luigi task scheduling seems to be unreliable, with processors occasionally getting stuck and not making any progress, or not raising any errors on failure. This particularly affected the Document Harvester, leading to a number of outages.
  • The Luigi task management interface is not very useful. It does not make it easy to look at previous runs, and presents very little detail.
  • The way Luigi encourages task dependencies to be coded makes it very difficult to clear out those dependencies so tasks can be re-run.

Therefore, while updating the various web archive tasks, they have been modified to run under Apache Airflow.

This is a popular and very widely used workflow definition and scheduling system; both Google and Amazon offer Airflow as a fully-managed cloud service, and there is a healthy open source community around it. Along with this choice of workflow platform, we have also chosen to implement each task tool as a separate standalone Python command-line program. This means:

  • Task code is separate from orchestration, can be developed independently, and tasks can be deployed as Docker containers, which keeps the underlying software dependencies apart.
  • We get to use the Airflow scheduler, which appears to be more reliable, will warn us when tasks get stuck or fail, and provides Prometheus integration for monitoring.
  • The Airflow Web UI is very detailed, allows access to task logs, summaries of runs and statistics, makes workflow management easier, and provides a framework for documenting each workflow.
  • The Airflow Web UI also makes it easy to clear the status of failed workflow runs so they can be re-run as needed.
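
To give a flavour of the setup, here is a minimal, hypothetical Airflow DAG in this style: two containerised command-line tasks with a dependency between them. The image names, commands and schedule are invented for illustration, not taken from UKWA's actual workflows.

```python
# Hypothetical Airflow DAG in the style described above: each task is a
# standalone command-line tool packaged as a Docker container, and Airflow
# only handles scheduling, dependencies and retries.
from datetime import datetime

from airflow import DAG
from airflow.providers.docker.operators.docker import DockerOperator

with DAG(
    dag_id="move_warcs_to_hdfs",       # invented name
    schedule_interval="@hourly",
    start_date=datetime(2021, 12, 1),
    catchup=False,
) as dag:
    # Upload newly-harvested WARCs from crawler disk to the cluster
    upload = DockerOperator(
        task_id="upload_warcs",
        image="ukwa/example-warc-uploader:latest",  # hypothetical image
        command="upload-warcs --source /heritrix/output --dest hdfs:///warcs",
    )

    # Rescan so the tracking database records the newly-uploaded files
    track = DockerOperator(
        task_id="update_tracking_db",
        image="ukwa/example-file-tracker:latest",   # hypothetical image
        command="track-files --scan hdfs:///warcs",
    )

    upload >> track  # run the tracking scan only after the upload succeeds
```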

Over time, we expect to move all web archiving tasks over to this system.

W3ACT
W3ACT is used by UKWA curators and other authorised users to add targets and manage quality assurance and licensing. There have only been minor updates to the W3ACT curation service lately, rolled out towards the end of December.

  • QA Wayback is now running PyWB version 2.6.3 for improved playback (e.g. ukwa-pywb#70).
  • Improvements to how the W3ACT authentication cookie is handled, resolving w3act#662.

UKWA Website
Most of the recent work on the UKWA website (www.webarchive.org.uk) user interface has focused on improving the presentation of our large set of curated collections by grouping them into categories. This work is still being discussed and developed internally, so isn’t part of the public website yet. However, we’re making good progress and hope to release a new version of the website over the coming weeks.

Apart from the interface itself, some additional work has been done to update the internal services (e.g. updating PyWB to version 2.6.3 and adding the WARC Server to read content from both Hadoop clusters), and to move the deployment to our newer production platform. As indicated above, these updates should be rolled out shortly.

2021 Domain Crawl
As in 2020, the 2021 Domain Crawl was run on the Amazon Web Services cloud. This time, following improvements to Heritrix and building on prior experience, the crawl ran more smoothly and efficiently than in 2020, using less memory and disk space for the crawl frontier. The crawler was started up early in August for penetration testing, and then taken down while the security concerns were addressed. The actual crawl began on the 24th of August, starting with 10 million seed URLs, and the vast majority of the crawl had completed by mid-November. Most of the 27 million hosts we visited were crawled completely, but ~57,200 hosts did hit the 500MB size cap. However, some of these were content distribution networks (CDNs), i.e. services hosting resources for other sites, so some caps were lifted manually and the crawl was allowed to continue.

URL rates during the UKWA 2021 domain crawl

On the 30th of December, the crawl was stopped, having processed 2.04 billion URLs and downloaded 99.6 TB of data (uncompressed). However, a lot of the CDN content remained uncollected, and would take a very long time to collect under Heritrix’s normal ‘politeness’ rules. In the future, it would be good to find a way to allow Heritrix to crawl these sites much more quickly, without having to manually intervene to decide which hosts are CDNs.
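
As an illustration of how capped or CDN-like hosts might be spotted, here is a simplified sketch that tallies downloaded bytes per host from a Heritrix crawl log. It assumes the standard crawl.log layout, where the third field is the document size in bytes and the fourth is the URL; real logs need more careful handling than this.

```python
# Simplified sketch: tally downloaded bytes per host from a Heritrix
# crawl.log to spot hosts at or near a 500MB size cap. Assumes the standard
# log layout (field 3 = document size in bytes, field 4 = URL).
import sys
from collections import Counter
from urllib.parse import urlsplit

CAP = 500 * 1024 * 1024  # 500MB per-host size cap

bytes_per_host = Counter()
with open(sys.argv[1]) as log:
    for line in log:
        fields = line.split()
        if len(fields) < 4 or not fields[2].isdigit():
            continue  # skip failed fetches (size "-") and malformed lines
        host = urlsplit(fields[3]).hostname
        if host:
            bytes_per_host[host] += int(fields[2])

for host, total in bytes_per_host.most_common(20):
    marker = "  <-- at cap" if total >= CAP else ""
    print(f"{total:>15,}  {host}{marker}")
```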

At this time, it has not been decided whether the 2022 Domain Crawl will be run in the cloud or from our Boston Spa site. Either way, we expect to begin transferring the 2020/2021 domain crawl content from AWS to our Hadoop 3 cluster over the next year.

Upcoming work
In the next quarter (Jan-Mar 2022), as well as the future updates outlined above, we are also expecting to:

  • Receive hardware for the additional Hadoop 3 replication cluster, then start setting it up and populating it ahead of it being transferred to the National Library of Scotland later in the year
  • Improve monitoring of the process of moving WARCs and logs to Hadoop (in part to ensure we spot problems with the Document Harvester earlier)
  • Add improved reporting services, replacing the current Storage Report with one that is up-to-date and runs across both clusters (ukwa-notebook-apps#12)
  • Integrate static documentation and translations into the main website, via a simple CMS (ukwa-services#48). This will make it easier to add more pages and manage the translation of those pages to/from Welsh and Scottish Gaelic.
  • Begin implementing the NPLD Player, which we need in order to improve reading-room access across the Legal Deposit libraries. We’re currently finalizing the details of how our external partner will help us do this, and more details will be made available over the next couple of months.