UK Web Archive blog

Information from the team at the UK Web Archive, the Library's premier resource of archived UK websites

Introduction

News and views from the British Library’s web archiving team and guests. Posts about the public UK Web Archive and, since April 2013, about web archiving as part of non-print legal deposit. Editor-in-chief: Jason Webber.

25 September 2024

Archiving the 2024 General Election

By Carlos Lelkes-Rarugal, Assistant Web Archivist

While the UK Web Archiving team have extensive experience archiving UK General Elections, the 2024 election presented us with significant challenges. Not only did we have less time to prepare because of the snap election, but our usual workflows were also severely disrupted by the cyberattack on the British Library in October 2023.

To successfully archive the social media presence of prominent political figures and organisations, we turned to a new tool—one that could handle the high-fidelity crawling and archival of web content.

Challenges with Our Previous Workflow

Overview of the Previous Workflow

The UK Web Archive is a collaboration between the six Legal Deposit Libraries in the UK, working together to build many collections. For each collection, we designate a volunteer lead to create a thorough scoping document outlining the collection's purpose and the type of content to include.

Historically, we used the Annotation and Curation Tool (ACT), a web-based application that allowed us to create records for websites and build collections collaboratively. ACT made it easy to nominate new content or re-tag existing records of websites into specific collections. This system worked well for many years, but after the 2023 cyberattack, all of our systems were taken offline as a precaution. Consequently, we had no option but to adapt quickly, finding alternative workflows that minimised disruption.

Thanks to the web archiving technical team, we were still able to archive websites, even without access to ACT and our servers. Instead, we transitioned to using online spreadsheets to manage nominations and cloud services to run crawls, allowing us to maintain collaboration across institutions.

Collaboration is Key

Collaboration has always been at the heart of the UK Web Archive’s success. For example, during the 2017 General Election, many ACT users from the various UK Legal Deposit Libraries contributed to the collection-building effort, archiving circa 2,300 websites. The 2024 General Election, however, was different. With ACT unavailable and with less time to prepare, we had to adjust quickly.

Our goal was clear: build a large collection in a short timeframe using new workflows and tools. The 2024 General Election collection concluded with around 2200 websites.

Each web archive collection can be broad in scope and is sometimes divided into subcategories. The 2024 General Election collection, like those for past elections, followed a familiar structure, allowing a more consistent comparison across the years of general themes and of how the web is used as a communication tool. The scope of the collection was decided collaboratively, with a lead curator guiding the overall process. The strong representation of diverse political topics reflected the collective efforts of our colleagues and volunteers.

The Archiving Process

To understand how we archive websites, it’s helpful to think of online search engines. Search engines crawl websites, extract information like text and images, and store it to provide search results. When archiving, however, we go further—we don’t just extract information; we attempt to download the entire website (with some exceptions), allowing us to recreate it in a web browser as closely as possible to the original.

We use Heritrix as our web crawler to download website resources, and pywb (Python Wayback) to replay archived copies from those downloaded resources. Previously, ACT served as the tool for curating metadata (such as URLs, crawl depth, frequency and rights information) that is used to drive the Heritrix crawls; this metadata gives us the granularity to customise how we archive each website.

After the cyberattack, our systems were offline, but we wanted to retain as much of our old workflow as possible. Heritrix remained our main crawler, but without ACT, we needed a new method for collaboration and for managing metadata. Our technical lead proposed using online spreadsheets to emulate ACT’s metadata fields, with each column representing a field needed for the Heritrix crawls. A custom script then transformed this spreadsheet data into a format Heritrix could use (.JSONL).

Here’s a simplified view of the workflow:

  1. Websites are nominated and listed in a spreadsheet with appropriate metadata.
  2. The spreadsheet is transformed into a Heritrix-compatible format (.JSONL).
  3. Heritrix crawls the nominated sites, archiving the content.
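
To make step 2 concrete, here is a minimal Python sketch of the kind of transformation script involved, assuming a TSV export of the nominations spreadsheet. The column headings and output keys below are illustrative placeholders, not the actual spreadsheet layout or crawl-feed schema.

import csv
import json

# Convert a TSV export of the nominations spreadsheet into JSON Lines (.jsonl),
# one JSON object per nominated website, for the crawler to ingest.
# Column headings ("Primary Seed", "Depth", "Crawl Frequency") and output keys
# are illustrative assumptions, not the real schema.
def spreadsheet_to_jsonl(tsv_path, jsonl_path):
    with open(tsv_path, newline="", encoding="utf-8") as tsv_file, \
         open(jsonl_path, "w", encoding="utf-8") as jsonl_file:
        for row in csv.DictReader(tsv_file, delimiter="\t"):
            record = {
                "url": row["Primary Seed"],
                "depth": row.get("Depth", ""),
                "frequency": row.get("Crawl Frequency", ""),
            }
            jsonl_file.write(json.dumps(record) + "\n")

spreadsheet_to_jsonl("nominations.tsv", "heritrix_seeds.jsonl")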

Introducing Browsertrix

What is Browsertrix?

Browsertrix is the new application we adopted to archive social media and other complex web content. It offers advanced capabilities, such as high-fidelity crawling, which is particularly useful for platforms that are traditionally difficult to archive, such as social media sites.

Why Browsertrix?

We chose Browsertrix for several reasons. Its ability to handle JavaScript-heavy websites made it a natural fit for capturing modern web content. We had also experimented with the platform in its early development phase, so we knew what to expect.

It is important to note that Browsertrix is open source and can be deployed locally (something we were very close to completing prior to the cyberattack), but it is also offered as a paid-for cloud service, which we were keen to sign up to given our situation.

Key Features

  • High-Fidelity Crawling: Browsertrix can accurately capture dynamic and interactive content from social media platforms.
  • Ease of Use: Browsertrix was relatively easy for our team to adopt, thanks to its intuitive interface and extensive documentation. It is also used by many International Internet Preservation Consortium (IIPC) members, so we could draw on them for additional information and support if needed.

By using Browsertrix, we addressed many of our current challenges; we’ll cover our use of Browsertrix in more detail in another blog post.

Bringing It All Together

Though we anticipated the 2024 General Election, we didn’t expect it to be called as quickly as it was. Early in 2024, we had already started discussing what the archival process might look like in a post-cyberattack environment. When the election was officially announced, we quickly organised, dividing tasks and focusing on different political parties and subcategories.

The new workflow was tested and, with few changes, was ready to go. We chose Google Sheets to manage website nominations, which were downloaded weekly as .TSV files and then transformed into .JSONL for Heritrix to ingest. Meanwhile (and separately), Browsertrix was used to archive social media and other difficult-to-capture content. Adopting two workflows for the 2024 General Election, one traditional using Heritrix and one experimental using Browsertrix, meant that in a relatively short time we were able to collaborate effectively and archive a wider range of content than in previous General Elections. In building the collection, we succeeded at:

  • organising and managing the division of work
  • remote collaboration using online spreadsheets
  • deploying two separate archival workflows that focused on different types of content
  • quickly adapting to and learning new technologies
  • effectively communicating and giving timely feedback

It's still uncertain what our workflows will look like once we're fully back online, but we're actively discussing and exploring various possibilities. One thing is clear: we have a wealth of experience and technical expertise to draw upon, both from the Legal Deposit Libraries and our external partnerships. Regardless of the solutions we choose, we’re confident we’ll continue delivering our vital work.

18 September 2024

Creating and Sharing Collection Datasets from the UK Web Archive

By Carlos Lelkes-Rarugal, Assistant Web Archivist

We have data, lots and lots of data, which is of unique importance to researchers but presents significant challenges for those wanting to interact with it. Our holdings grow by terabytes each month, creating hurdles both for the UK Web Archive team, who are tasked with organising the data, and for researchers who wish to access it. Given the scale and complexity of the data, how can one begin to comprehend what it is that they are dealing with and understand how a collection came into being?

This challenge is not unique to digital humanities. It is a common issue in any field dealing with vast amounts of data. A recent special report on the skills required by researchers working with web archives was produced by the Web ARChive studies network (WARCnet). This report, based on the Web Archive Research Skills and Tools Survey (WARST), provides valuable insights and can be accessed here: WARCnet Special Report - An overview of Skills, Tools & Knowledge Ecologies in Web Archive Research.

At the UK Web Archive, legal and technical restrictions dictate how we can collect, store and provide access to the data. To enhance researcher engagement, Helena Byrne, Curator of Web Archives at the British Library, and Emily Maemura, Assistant Professor at the School of Information Sciences at the University of Illinois Urbana-Champaign, have been collaborating to explore how and which types of datasets can be published. Their efforts include developing options that would enable users to programmatically examine the metadata of the UK Web Archive collections.

Thematic collections and our metadata

To understand this rich metadata, we first have to examine how it is created and where it is held.

Since 2005 we have used a number of applications, systems and tools to curate websites, the most recent being the Annotation and Curation Tool (ACT), which enables authenticated users, mainly curators and archivists, to create metadata that defines and describes targeted websites. ACT also helps users build collections around topics and themes, such as the UEFA Women's Euro England 2022. To build collections, ACT users first input basic metadata to create a record for a website, including information such as the website's URL, description, title, and crawl frequency. With this basic ACT record in place, additional metadata can be added, for example to assign the website record to a collection. One of the great features of ACT is its extensibility, allowing us, for instance, to create new collections.

These collections, which are based around a theme or an event, give us the ability to highlight archived content. The UK Web Archive holds millions of archived websites, many of which may be unknown or rarely viewed, and so to help showcase a fraction of our holdings, we build these collections which draw on the expertise of both internal and external partners.

Exporting metadata as CSV and JSON files

That’s how we create the metadata, but how is it stored? ACT is a web application and the metadata created through it is stored in a Postgres relational database, allowing authenticated users to input metadata in accordance with the fields within ACT. As the Assistant Web Archivist, I was tasked with extracting the metadata from the database, exporting each selected collection as a CSV and a JSON file. To get to that stage, the Curatorial team first had to decide which fields were to be exported.

The ACT database is quite complex, in that there are 50+ tables which need to be considered. To enable local analysis of the database, a static copy is loaded into a database administration application, in this case DBeaver. Using this free-to-use tool, I was able to create entity relationship diagrams of the tables and provide an extensive list of fields to the curators so that they could determine which were the most appropriate to export.

I then worked from a refined version of the list of fields, running a script for each designated collection and pulling out the specific metadata to be exported. To extract the fields and the metadata into an exportable format, I created an SQL (Structured Query Language) script which can be used to export results as JSON and/or CSV:

Select
    taxonomy.parent_id as "Higher Level Collection",
    collection_target.collection_id as "Collection ID",
    taxonomy.name as "Collection or Subsection Name",
    CASE
        WHEN collection_target.collection_id = 4278 THEN 'Main Collection'
        ELSE 'Subsection'
    END as "Main Collection or Subsection",
    target.created_at as "Date Created",
    target.id as "Record ID",
    field_url.url as "Primary Seed",
    target.title as "Title of Target",
    target.description as "Description",
    target.language as "Language",
    target.license_status as "Licence Status",
    target.no_ld_criteria_met as "LD Criteria",
    target.organisation_id as "Institution ID",
    target.updated_at as "Updated",
    target.depth as "Depth",
    target.scope as "Scope",
    target.ignore_robots_txt as "Robots.txt",
    target.crawl_frequency as "Crawl Frequency",
    target.crawl_start_date as "Crawl Start Date",
    target.crawl_end_date as "Crawl End Date"
From
    collection_target
    Inner Join target On collection_target.target_id = target.id
    Left Join taxonomy On collection_target.collection_id = taxonomy.id
    Left Join organisation On target.organisation_id = organisation.id
    Inner Join field_url On field_url.target_id = target.id
Where
    collection_target.collection_id In (4278, 4279, 4280, 4281, 4282, 4283, 4284) And
    (field_url.position Is Null Or field_url.position In (0))
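
To show how results from a query like this might be written out in both formats, here is a brief Python sketch, assuming a local read-only copy of the ACT database and the psycopg2 driver; the connection string, saved query file name and output file names are placeholders.

import csv
import json
import psycopg2

# Run the collection-export query against a local copy of the ACT database and
# write the results as both CSV and JSON. The connection string, query file
# name and output file names below are placeholder assumptions.
QUERY = open("collection_export.sql", encoding="utf-8").read()  # the SQL shown above

with psycopg2.connect("dbname=act_copy user=readonly") as conn:
    with conn.cursor() as cur:
        cur.execute(QUERY)
        columns = [col[0] for col in cur.description]
        rows = [dict(zip(columns, row)) for row in cur.fetchall()]

with open("collection_export.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=columns)
    writer.writeheader()
    writer.writerows(rows)

with open("collection_export.json", "w", encoding="utf-8") as f:
    json.dump(rows, f, ensure_ascii=False, indent=2, default=str)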

[Image: JSON output example for the Women’s Euro Collection]

Accessing and using the data

The published metadata is available from the BL Research Repository within the UK Web Archive section, in the folder “UK Web Archive: Data”. Each dataset includes the metadata seed list in both CSV and JSON formats, a datasheet giving provenance information about how the dataset was created, and a data dictionary that defines each of the data fields. The first collections selected for publication were:

  1. Indian Ocean Tsunami December 2004 (January-March 2005) [https://doi.org/10.23636/sgkz-g054]
  2. Blogs (2005 onwards) [https://doi.org/10.23636/ec9m-nj89] 
  3. UEFA Women's Euro England 2022 (June-October 2022) [https://doi.org/10.23636/amm7-4y46] 
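
These published files can be examined programmatically. As a simple illustration, the sketch below loads a downloaded copy of a seed-list CSV with pandas and summarises it; the file name is a placeholder and the column names are assumptions based on the export fields described above, not necessarily the exact published layout.

import pandas as pd

# Explore a downloaded copy of a published seed-list CSV.
# The file name and column names are illustrative assumptions.
seeds = pd.read_csv("ukwa_womens_euro_2022_seeds.csv")

print(len(seeds), "records in the collection")
print(seeds["Collection or Subsection Name"].value_counts())
print(seeds[["Title of Target", "Primary Seed"]].head())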

31 July 2024

If websites could talk (part 6)

By Ely Nott, Library, Information and Archives Services Apprentice

After another extended break, we return to a conversation between UK domain websites as they try to parse out who among them should be crowned the most extraordinary…

“Where should we start this time?” asked Following the Lights. “Any suggestions?”

“If we’re talking weird and wonderful, clearly we should be considered first,” urged Temporary Temples, cutting off Concorde Memorabilia before they could make a sound.

“We should choose a website with a real grounding in reality,” countered the UK Association of Fossil Hunters.

“So, us, then,” shrugged the Grampian Speleological Group. “Or if not, perhaps the Geocaching Association of Great Britain?”

“We’ve got a bright idea!” said Lightbulb Languages. “Why not pick us?”

“There is no hurry,” soothed the World Poohsticks Championships. “We have plenty of time to think, think, think it over.”

“This is all a bit too exciting for us,” sighed the Dull Men’s Club, who was drowned out by the others.

“The title would be right at gnome with us,” said The Home of Gnome, with a little wink and a nudge to the Clown Egg Gallery, who cracked a smile.

“Don’t be so corny,” chided the Corn Exchange Benevolent Society. “Surely the title should go to the website that does the most social good?”

“Then what about Froglife?” piped up the Society of Recorder Players.

“If we’re talking ecology, we’d like to be considered!” the Mushroom enthused, egged on by Moth Dissection UK. “We have both aesthetic and environmental value.”

“Surely, any discussion of aesthetics should prioritise us,” preened Visit Stained Glass, as Old so Kool rolled their eyes.

The back and forth continued, with time ticking on until they eventually concluded that the most extraordinary site of all had to be… Saving Old Seagulls.

Check out previous episodes in this series by Hedley Sutton: Part 1, Part 2, Part 3, Part 4 and Part 5.