UK Web Archive Technical Update - Autumn 2022
By Andy Jackson, Web Archive Technical Lead, British Library
This is a summary of what’s been going on since the update at the start of the summer.
On 16 August 2022 we relaunched the UK Web Archive website, although you might not have noticed!
The previous version of the website treated page content like it was software, so updating what the pages said was far too difficult. This quarter, we finally got to release some changes we’d made so that most of the website pages are statically generated from Markdown source held on GitHub, using Hugo. This means we could add in a content management system called NetlifyCMS, which should make editing and translating the pages of our site much easier.
We’ve taken care to match the old website presentation and carefully overlay the new system while falling back on the old system for more complex dynamic pages. You might notice some minor differences to the styling between the two, if you look closely…
An important part of this was our automated accessibility testing. While accessibility evaluation cannot be fully automated, these tools help us manage changes to our website and minimise the risk of making things worse in the periods between full accessibility evaluations.
2022 Domain Crawl Launch
As the British Library networks are in the final stages of being upgraded, 2022 is the last year we expect to run the domain crawl on Amazon Web Services.
We launched the 2022 crawl on 17 August 2022, and since the British Library is now a member of Nominet, we were able to use an up-to-date list of UK domains as our starting point.
So far, we’ve processed around 500 million URLs, totalling over 20 TiB of data (uncompressed).
However, we’ve noticed what seems to be an uptick in systems like fail2ban automatically mis-reporting our crawler activity as abusive behaviour. This means we have to put more work into managing our relationship with AWS, and it has slowed things down a bit. Nevertheless, we expect the crawl to run successfully until the end of the year, as in previous years.
After many weeks of steady progress, our replica Hadoop storage service is now pretty much at capacity. Filling it up with about one petabyte of content took a while, but it’s taking us a bit longer to double-check that the transfer worked.
We are now awaiting a decision on whether we can purchase another server for this cluster, so we can make sure there’s room for the most recent crawls, and for content we expect to get in the near future. Either way, we’ll then start to plan shifting the hardware up to the National Library of Scotland.
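For readers curious about what double-checking a transfer like this can involve, here is a minimal sketch in Python. It is not a description of our actual tooling: it assumes the check is a checksum comparison, it uses ordinary local directories as stand-ins for the two Hadoop clusters, and the function names (sha256_of, build_manifest, compare_manifests) are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its checksum.

    Assumption: local directories stand in for the source and replica stores.
    """
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def compare_manifests(source: dict[str, str], replica: dict[str, str]) -> dict[str, list[str]]:
    """Report files missing from the replica, unexpected extras, and checksum mismatches."""
    shared = source.keys() & replica.keys()
    return {
        "missing": sorted(source.keys() - replica.keys()),
        "extra": sorted(replica.keys() - source.keys()),
        "mismatched": sorted(k for k in shared if source[k] != replica[k]),
    }
```

A transfer would count as verified under this sketch only when all three lists in the report come back empty. At petabyte scale the checksums would more likely be computed where the data lives and only the manifests compared, but the comparison step would look much the same.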
Exporting Collection Metadata
Working with the Archives of Tomorrow project, we’ve been developing a way to export our collection metadata so it’s more suitable for reuse.
Having real use cases drive the work has been useful, and over the next weeks we’re hoping to integrate the outputs into the UKWA API so anyone can use that data.
Legal Deposit Access & NPLD Player
Working with Webrecorder we’ve seen some good progress on a new version of PyWB that supports direct rendering of PDFs and ePubs, and on the secure player application that will be used to provide access in some reading rooms.
Much of the work has focussed on the challenges around testing and preparation for a new version of a service that works across multiple independent institutions. But it’s been good to start to get some user feedback on how the system works in practice, which has already flushed out some additional requirements for the first release.
As covered in this dedicated blog post, iPres 2022 included a presentation partly based on lessons learned from managing the technical aspects of the UK Web Archive. The plan is to publish a longer version of that work later in the year.
After the successes of the iPres conference, we were quickly brought back down to earth by a severe hardware failure on 25 September. One of the network switches failed, and the whole UKWA dedicated network locked up in a way that made it difficult to understand and route around the failure.
This took a while to diagnose and resolve, so we moved some critical components onto other machines so our curators and users could use our services. While this was relatively successful, it also showed that some of our automated tasks need breaking down so that different functions can be managed independently. For example, we need crawl launches to be able to proceed even if nothing else is running. These problems meant that our daily crawling activity was delayed and patchy for most of last week.
These complications mean it’s taken a bit longer than expected to undo all the interim changes that were made during the hardware outage. However, as of last week, everything is back to normal.