UK Web Archive blog

Information from the team at the UK Web Archive, the Library's premier resource of archived UK websites

2 posts from April 2023

20 April 2023

UK Web Archive Technical Update - Spring 2023

By Andy Jackson, Web Archive Technical Lead, British Library

This is a summary of what’s been going on since the 2022 Q4 report.

Summarising Our Holdings

We regularly report on our holdings so other teams across the Legal Deposit Libraries have an understanding of how much data we hold and how we grow over time. Until recently, the reporting mechanism we used did not fully take into account the storage used across different clusters, and on Amazon Web Services.

In January the old reporting mechanism was replaced with a new implementation, better integrated with our other systems and covering all storage services. The Airflow scheduler (discussed in previous reports) generates updated lists of holdings from different systems, and a Jupyter notebook is then used as a dashboard. This is made accessible via the W3ACT curation service, unlike the old system, which was only available to British Library staff.

While it doesn’t get updated automatically, there’s also an older copy of the notebook on GitHub. See UK Web Archive Holdings Summary Report. As you can see there, the UK Web Archive now holds over 1.4 PB of WARCs and logs.
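
As a rough illustration of the kind of aggregation a holdings report like this involves, the sketch below totals file sizes per storage service. The record layout, the service names and all the sizes are made up for the example and are not taken from the actual notebook or from our real holdings.

```python
from collections import defaultdict

# Hypothetical holdings records: (storage service, file kind, size in bytes).
# The layout, names and figures are illustrative assumptions only.
HOLDINGS = [
    ("hdfs-cluster-a", "warc", 700 * 10**12),
    ("hdfs-cluster-b", "warc", 500 * 10**12),
    ("aws-s3", "warc", 200 * 10**12),
    ("hdfs-cluster-a", "log", 5 * 10**12),
]


def summarise(records):
    """Return the total number of bytes held per storage service."""
    totals = defaultdict(int)
    for service, _kind, size in records:
        totals[service] += size
    return dict(totals)


def to_petabytes(size_bytes):
    """Convert bytes to decimal petabytes (1 PB = 10**15 bytes)."""
    return size_bytes / 10**15


totals = summarise(HOLDINGS)
grand_total_pb = to_petabytes(sum(totals.values()))
print(f"Total across all services: {grand_total_pb:.3f} PB")
```

Grouping by storage service first is what lets a single report cover several clusters and a cloud provider at once.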

The new system for Reading Room access to Non-Print Legal Deposit material has also made steady progress. An alpha version of the system has been rolled out across all LDLs so staff can access the service for testing, and a beta service is being rolled out to run alongside the current system in reading rooms. The deployment of the services themselves has also been automated, using GitLab CI/CD to update the systems rather than relying on updating them by hand.

Staff testing raised some additional requirements to be met before the service roll-out can proceed. Working with Webrecorder to meet these requirements will be the focus for the next quarter.

UKWA Website

Edited 28th April 2023 to include translation updates.

The main website has been updated to run version 2.6.9 of our PyWB playback engine, and version 1.4.5 of the main search interface. Version 1.4.5 does not change the site’s basic functionality, but does significantly improve the Scottish Gaelic version of the site.

However, we’ve also looked at more significant changes to the public interface to the archive.

Firstly, we’d like to update to a newer version of PyWB, which now features an updated timeline and calendar display. Secondly, some experimentation with letting search engines index selected websites showed that it may be necessary to include links to the archived sites somewhere in the main site so that the crawler finds and prioritizes those URLs for indexing. To test this out, a page has been added to the site that lists any archived sites that require indexing, and that page has been included in the site map.

Finally, we’ve found a lot of queries are better answered by direct URL search than keyword search, so we wanted to find ways to better integrate PyWB’s URL search functionality with the main site. To make URL search easier to use, we want to change the main search interface on the front page of the website to spot URL searches and direct the user to the right results.
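
One way a search box could spot URL searches is with a simple heuristic along the following lines. This is only a sketch: the function name and the rules are assumptions made for illustration, not the logic used on the UKWA site.

```python
import re
from urllib.parse import urlparse

# Hostname-like pattern: dot-separated labels ending in an alphabetic TLD.
_HOST_RE = re.compile(r"^([a-z0-9-]+\.)+[a-z]{2,}$", re.IGNORECASE)


def looks_like_url(query: str) -> bool:
    """Guess whether a search query is a URL rather than keywords.

    Hypothetical heuristic: a single token that either starts with
    http(s):// or begins with something shaped like a hostname.
    """
    query = query.strip()
    # Empty input or several whitespace-separated words: treat as keywords.
    if len(query.split()) != 1:
        return False
    if re.match(r"^https?://", query, re.IGNORECASE):
        host = urlparse(query).hostname
    else:
        # No scheme given: take everything before the first "/" or ":".
        host = query.split("/", 1)[0].split(":", 1)[0]
    return bool(host) and bool(_HOST_RE.match(host))


for q in ["https://www.bl.uk/", "www.webarchive.org.uk/en/ukwa", "brexit"]:
    target = "URL search" if looks_like_url(q) else "full-text search"
    print(f"{q!r} -> {target}")
```

A heuristic like this will sometimes guess wrong, which is one reason to keep both search options visible to the user.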

The BETA version of the website has been updated to include these changes, and is now available for review. If you have any feedback, please let us know.

Image: The BETA homepage for the UK Web Archive, offering URL or Full Text search

Web Archive Discovery tool updates

One long-standing issue we have is that our full-text search does not contain recent material, and over the next year we hope to revisit the scaling problems we’ve seen and try to improve the situation.

As an initial step towards this, we spent some time updating our search tools. The webarchive-discovery indexer has been updated to use version 2 of Apache Tika, along with upgrades to other dependencies like the Nanite wrapper that makes it possible for us to use The National Archives’ PRONOM/DROID format identification engine. These changes are quite significant, so the version number has been bumped from 3.3.x to 3.4.x.

We are also considering an alternative workflow, where we store the extracted metadata in an intermediate form, rather than going directly to Apache Solr or Elasticsearch. To enable us to experiment with this approach, the indexer has been modified to support writing the extracted metadata to JSON Lines output files so that we can use it to support multiple forms of indexing or analysis.
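
To show what the JSON Lines format looks like in practice, here is a minimal sketch that writes one JSON object per line and reads the records back. The field names and example records are hypothetical and do not reflect the actual output schema of the webarchive-discovery indexer.

```python
import io
import json


def write_jsonl(records, stream):
    """Write each record as one JSON object per line."""
    for record in records:
        stream.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(stream):
    """Yield one record per non-empty line."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# Hypothetical extracted-metadata records (field names are assumptions).
records = [
    {"url": "http://example.org/", "content_type": "text/html", "title": "Example"},
    {"url": "http://example.org/doc.pdf", "content_type": "application/pdf", "title": None},
]

# An in-memory buffer stands in for an output file.
buffer = io.StringIO()
write_jsonl(records, buffer)
buffer.seek(0)
round_tripped = list(read_jsonl(buffer))
```

Because every line is an independent record, the same file can be streamed into a search index or into an analysis job without being loaded whole.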

2023 Domain Crawl Preparation

As discussed in the previous report, this year we are bringing the domain crawl back on-site rather than running it on the cloud. The technical preparation for this was fairly straightforward, given the deployment of the crawl is largely automated. The main change from the last on-site crawl is that we switched to using a server with plenty of fast SSD disks. The cloud crawls had shown us how much the whole thing can benefit from faster disks, so we have attempted to match that when running on our own servers.

Add some updated seed lists from Nominet and from our curators, and we are ready to roll on the anniversary of the first Non-Print Legal Deposit domain crawl. That one started on the 12th of April 2013, and so we’ve chosen that for our start date this year. This will be part of the wider celebrations from across the legal deposit libraries.


Addendum - 13th April 2023

Due to staff holidays, we are only now publishing this quarterly report, so we can add some notes on the launch of the 2023 domain crawl.

The crawl was set up on the 11th, and loaded with the 11 million seed URLs from Nominet and the 27,059 domain crawl seeds from W3ACT (including 13,460 non-UK seeds). On the morning of the 12th, the crawl was launched, and it seems to be running well, at around 400 URLs per second. If the system can sustain this rate, which corresponds to around one billion URLs per month, the whole crawl should complete in 2-3 months’ time.
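
The rate conversion can be checked with a few lines of arithmetic. In this sketch the 30-day month is a simplifying assumption, and the total crawl size passed to the helper function is a hypothetical figure chosen only to illustrate the 2-3 month estimate.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400
RATE_PER_SECOND = 400           # observed crawl rate quoted in the post
DAYS_PER_MONTH = 30             # assumption: a 30-day month

# 400 * 86,400 * 30 = 1,036,800,000, i.e. roughly one billion URLs.
urls_per_month = RATE_PER_SECOND * SECONDS_PER_DAY * DAYS_PER_MONTH


def months_to_crawl(total_urls, per_month=urls_per_month):
    """Estimate crawl duration in months at a sustained rate."""
    return total_urls / per_month


# Hypothetical total of 2.5 billion URLs, for illustration only.
estimate = months_to_crawl(2_500_000_000)
print(f"{urls_per_month:,} URLs per month; about {estimate:.1f} months")
```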

Image: Dashboard for the first 24 hours of the 2023 Domain

For more information on the anniversary of Non-Print Legal Deposit, see Celebrating ten years of collecting the UK Web Space.

04 April 2023

Celebrating ten years of collecting the UK Web Space

Nicola Bingham, Lead Curator, Web Archiving, British Library

This April, we are celebrating ten years of collecting and preserving digital publications in the UK such as websites, e-books, and online journals, under legal deposit regulations. The UK Web Archive forms an important part of our collecting activity, across all six legal deposit libraries. We aim to preserve a copy of every UK website that we can identify, reflecting the broad range of experience and expression across the UK.

Image: Large upper case text in a dark colour that reads - Everything Forever. The subtitle is - 10 Years Electronic Legal Deposit. At the bottom of the image is the logo of the six UK Legal Deposit Libraries - British Library, Bodleian Libraries, Cambridge University Library, National Library of Scotland, The Library of Trinity College Dublin and the National Library of Wales.

The UK Web Archive provides a detailed insight into the evolution of online public communication over the past two decades. Communication on the web is central to understanding the history, politics, culture and society of the 21st century. However, we know that information shared publicly on the web is rapidly changed, deleted and replaced. The UK Web Archive helps people to understand current events, and the recent past, by preserving that information before it is lost.

Here are a few examples of topics and themes that we have preserved in the archive:

  • General elections: We have archived websites related to every UK general election since 2005. These websites provide a fascinating insight into the political campaigns, issues, and debates of each election.
  • London Olympics and Paralympics 2012: These websites document the planning, organisation, and events of the games, as well as the cultural and social impact they had on the UK.
  • Brexit: This collection documents the political, social, and economic impacts of Brexit. It contains official sources as well as voices from all sides of the debate across the UK.
  • Online Enthusiast Communities: This collection provides insight into hobbyists in the UK. It covers a wide range of interests from more traditional areas, such as stamp collecting and cycling, to the more esoteric, such as the UK Roundabout Appreciation Society.

The UK Web Archive is used by researchers to answer significant questions on various topics.

The UK Web Archive has been in existence since 2004. Legal deposit regulations came into effect on 6 April 2013, which increased our capacity to collect the UK’s online heritage and ensure it is available for future generations to research and study.

Prior to these regulations, we had to ‘hand pick’ websites to archive, and then could only proceed with written permission of the website owner. From 6 April 2013, the six legal deposit libraries of the UK and Ireland (the British Library, the National Library of Scotland, the National Library of Wales, the Bodleian Libraries, Cambridge University Library and the Library of Trinity College Dublin) were empowered to collect and preserve all web content that could be identified as published in the UK. Since then, we have been archiving the UK Web at the “domain” level and now hold many millions of websites, amounting to over a petabyte of digital content. The 11th annual “domain crawl” will be launched this week.

How can I access it?

Anyone can access the UK Web Archive, free of charge, at the six UK Legal Deposit Libraries.

You can search the archive, and view thousands of openly accessible archived websites at https://www.webarchive.org.uk/

Help us build the archive

Even though we aim to collect as much of the UK Web as possible, we miss many websites as we cannot automatically identify all of them as being published in the UK. If you know of a UK website that should be preserved, please suggest it here: https://www.webarchive.org.uk/en/ukwa/info/nominate