Archiving the 2024 General Election
By Carlos Lelkes-Rarugal, Assistant Web Archivist
While the UK Web Archiving team have extensive experience archiving UK General Elections, the 2024 election presented us with significant challenges. Not only did we have less time to prepare, because it was a snap election, but our usual workflows were also significantly disrupted by the October 2023 cyberattack on the British Library.
To successfully archive the social media presence of prominent political figures and organisations, we turned to a new tool—one that could handle the high-fidelity crawling and archival of web content.
Challenges with Our Previous Workflow
Overview of the Previous Workflow
The UK Web Archive is a collaboration between the six Legal Deposit Libraries in the UK, working together to build many collections. For each collection, we designate a volunteer lead to create a thorough scoping document, outlining the collection's purpose and the type of content to include.
Historically, we used the Annotation and Curation Tool (ACT), a web-based application that allowed us to create records for websites and build collections collaboratively. ACT made it easy to nominate new content or re-tag existing records of websites into specific collections. This system worked well for many years, but after the 2023 cyberattack, all of our systems were taken offline as a precaution. Consequently, we had no option but to adapt quickly, finding alternative workflows that minimised disruption.
Thanks to the web archiving technical team, we were still able to archive websites, even without access to ACT and our servers. Instead, we transitioned to using online spreadsheets to manage nominations and cloud services to run crawls, allowing us to maintain collaboration across institutions.
Collaboration is Key
Collaboration has always been at the heart of the UK Web Archive’s success. For example, during the 2017 General Election, many ACT users from the UK’s various Legal Deposit Libraries contributed to the collection-building effort, archiving around 2,300 websites. The 2024 General Election, however, was different. With ACT unavailable and with less time to prepare, we had to adjust quickly.
Our goal was clear: build a large collection in a short timeframe using new workflows and tools. The 2024 General Election collection concluded with around 2,200 websites.
Each web archive collection can be broad in nature and is sometimes divided into subcategories. The 2024 General Election collection, like those for past elections, followed a familiar structure, which allows for more consistent comparison across the years, both of general themes and of how the web is used as a communication tool. The scope of the collection was decided collaboratively, with a lead curator guiding the overall process. The strong representation of diverse political topics reflected the collective efforts of our colleagues and volunteers.
The Archiving Process
To understand how we archive websites, it’s helpful to think of online search engines. Search engines crawl websites, extract information like text and images, and store it to provide search results. When archiving, however, we go further—we don’t just extract information; we attempt to download the entire website (with some exceptions), allowing us to recreate it in a web browser as closely as possible to the original.
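To make that distinction concrete, here is a minimal illustrative sketch in Python (the URL and filename are placeholders, not part of our actual workflow): a search-engine-style crawler keeps only the extracted text, whereas an archival capture keeps the complete response so the page can be replayed later.

```python
import requests
from bs4 import BeautifulSoup

url = "https://example.com/"  # placeholder page to capture
response = requests.get(url, timeout=30)

# Search-engine-style indexing: keep only the extracted text.
text = BeautifulSoup(response.content, "html.parser").get_text()

# Archival capture: keep the complete response bytes so the page can be
# replayed as-is. (A production crawler like Heritrix also fetches every
# linked resource and writes standardised WARC records, not loose files.)
with open("capture.html", "wb") as f:
    f.write(response.content)
```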
We use Heritrix as our web crawler to download website resources, and PythonWayback to replay archived copies from those downloaded resources. Previously, ACT served as the tool for curating metadata (such as URLs, crawl depth, frequency and rights information); this metadata gives us the granularity to customise how each website is archived and is used to drive the Heritrix crawls.
After the cyberattack, our systems were offline, but we wanted to retain as much of our old workflow as possible. Heritrix remained our main crawler, but without ACT, we needed a new method for collaboration and for managing metadata. Our technical lead proposed using online spreadsheets to emulate ACT’s metadata fields, with each column representing a field needed for the Heritrix crawls. A custom script then transformed this spreadsheet data into a format Heritrix could use (.JSONL).
Here’s a simplified view of the workflow:
- Websites are nominated and listed in a spreadsheet with appropriate metadata.
- The spreadsheet is transformed into a Heritrix-compatible format (.JSONL).
- Heritrix crawls the nominated sites, archiving the content.
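As a rough illustration of the middle step, the sketch below shows what such a spreadsheet-to-JSONL transformation can look like in Python. The file names, column names and output fields are hypothetical stand-ins for the ACT-style metadata, not the exact ones our script uses.

```python
import csv
import json

# Hypothetical file names and columns; the real spreadsheet mirrors
# ACT's metadata fields (URL, crawl depth, frequency, rights information).
INPUT_TSV = "nominations.tsv"
OUTPUT_JSONL = "seeds.jsonl"

with open(INPUT_TSV, newline="", encoding="utf-8") as tsv_file, \
        open(OUTPUT_JSONL, "w", encoding="utf-8") as jsonl_file:
    for row in csv.DictReader(tsv_file, delimiter="\t"):
        record = {
            "url": row["url"],
            "depth": row.get("depth", "domain"),
            "frequency": row.get("frequency", "once"),
        }
        # One JSON object per line: the .JSONL shape a crawler can ingest.
        jsonl_file.write(json.dumps(record) + "\n")
```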
Introducing Browsertrix
What is Browsertrix?
Browsertrix is the new application we adopted to archive social media and other complex web content. It offers advanced capabilities, such as high-fidelity crawling, which are particularly useful for platforms that are traditionally difficult to archive, like social media sites.
Why Browsertrix?
We chose Browsertrix for several reasons. Its ability to handle JavaScript-heavy websites made it a natural fit for capturing modern web content. We had also experimented with the platform in its early development phase, so we knew what to expect.
It is important to note that Browsertrix is open source and can be deployed locally (something we were very close to completing prior to the cyberattack), but it also offers a paid-for cloud service, which we were keen to sign up to given our situation.
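For a flavour of the open-source side, the crawler component that powers Browsertrix, browsertrix-crawler, can be run locally through Docker. The sketch below (with a placeholder URL and collection name, and assuming Docker is installed) launches a single crawl and packages the result as a portable WACZ file; it is an illustration rather than our production setup.

```python
import os
import subprocess

# Illustrative only: run the open-source browsertrix-crawler image to
# capture one site into a WACZ archive. The URL and collection name are
# placeholders; output is persisted to ./crawls on the host machine.
subprocess.run(
    [
        "docker", "run",
        "-v", f"{os.getcwd()}/crawls:/crawls/",
        "webrecorder/browsertrix-crawler", "crawl",
        "--url", "https://example.com/",
        "--generateWACZ",
        "--collection", "election-sample",
    ],
    check=True,
)
```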
Key Features
- High-Fidelity Crawling: Browsertrix can accurately capture dynamic and interactive content from social media platforms.
- Ease of Use: Browsertrix was relatively easy for our team to adopt, thanks to its intuitive interface and extensive documentation. It is also used by many International Internet Preservation Consortium (IIPC) members, so we could draw on them for additional information and support if needed.
By using Browsertrix, we addressed many of our current challenges; we’ll cover our use of Browsertrix in more detail in another blog post.
Bringing It All Together
Though we anticipated the 2024 General Election, we didn’t expect it to be called as quickly as it was. Early in 2024, we had already started discussing what the archival process might look like in a post-cyberattack environment. When the election was officially announced, we quickly organised, dividing tasks and focusing on different political parties and subcategories.
The new workflow was tested and, with few changes, was ready to go. We chose Google Sheets to manage website nominations, which were downloaded weekly as .TSV files and then transformed into .JSONL for Heritrix to ingest. Meanwhile (and separately), Browsertrix was used to archive social media and other difficult-to-capture content. Adopting two workflows for the 2024 General Election, one traditional using Heritrix and one experimental using Browsertrix, meant that in a relatively short time we were able to collaborate effectively and archive a wider range of content than in previous General Elections. In building the collection, we succeeded in:
- organising and managing the division of work
- remote collaboration using online spreadsheets
- deploying two separate archival workflows that focused on different types of content
- quickly adapting to and learning new technologies
- effectively communicating and giving timely feedback
It's still uncertain what our workflows will look like once we're fully back online, but we're actively discussing and exploring various possibilities. One thing is clear: we have a wealth of experience and technical expertise to draw upon, both from the Legal Deposit Libraries and our external partnerships. Regardless of the solutions we choose, we’re confident we’ll continue delivering our vital work.