Creating and Sharing Collection Datasets from the UK Web Archive
By Carlos Lelkes-Rarugal, Assistant Web Archivist
We have data, lots and lots of data, which is of unique importance to researchers but challenging to interact with. Our holdings grow by terabytes each month, creating hurdles both for the UK Web Archive team, who are tasked with organising the data, and for researchers who wish to access it. Given the scale and complexity of the data, how can one begin to comprehend what one is dealing with and understand how a collection came into being?
This challenge is not unique to digital humanities; it is a common issue in any field dealing with vast amounts of data. A recent special report on the skills required by researchers working with web archives was produced by the Web ARChive studies network (WARCnet). This report, based on the Web Archive Research Skills and Tools Survey (WARST), provides valuable insights and can be accessed here: WARCnet Special Report - An overview of Skills, Tools & Knowledge Ecologies in Web Archive Research.
At the UK Web Archive, legal and technical restrictions dictate how we can collect, store, and provide access to the data. To enhance researcher engagement, Helena Byrne, Curator of Web Archives at the British Library, and Emily Maemura, Assistant Professor at the School of Information Sciences at the University of Illinois Urbana-Champaign, have been collaborating to explore which types of datasets can be published and how. Their efforts include developing options that would enable users to programmatically examine the metadata of the UK Web Archive collections.
Thematic collections and our metadata
To understand this rich metadata, we first have to examine how it is created and where it is held.
Since 2005 we have used a number of applications, systems, and tools to curate websites, the most recent of which is the Annotation and Curation Tool (ACT). ACT enables authenticated users, mainly curators and archivists, to create metadata that define and describe targeted websites, and it also helps users build collections around topics and themes, such as the UEFA Women's Euro England 2022. To build a collection, ACT users first enter basic metadata to create a record for a website, including information such as its URLs, description, title, and crawl frequency. Additional metadata can then be added to this basic record, for example to assign the website to a collection, as sketched below. One of the great features of ACT is its extensibility, allowing us, for instance, to create new collections.
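To picture how these pieces fit together, the sketch below shows a website record, its primary seed URL, and its collection assignment as rows in the underlying tables. This is a simplified, hypothetical illustration: the table and column names are borrowed from the export query later in this post, and the values are invented:

-- Hypothetical example of one website record and its collection assignment.
INSERT INTO target (id, title, description, crawl_frequency)
VALUES (101, 'Example Site', 'An illustrative website record', 'MONTHLY');

INSERT INTO field_url (target_id, url, position)
VALUES (101, 'https://example.co.uk/', 0);  -- position 0 marks the primary seed

INSERT INTO collection_target (target_id, collection_id)
VALUES (101, 4278);  -- 4278 is the main collection ID used in the export query below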
These collections, each based around a theme or an event, give us the ability to highlight archived content. The UK Web Archive holds millions of archived websites, many of which may be unknown or rarely viewed, so to showcase a fraction of our holdings we build these collections, drawing on the expertise of both internal and external partners.
Exporting metadata as CSV and JSON files
That’s how we create the metadata, but how is it stored? ACT is a web application, and the metadata created through it is stored in a Postgres relational database, allowing authenticated users to input metadata in accordance with the fields within ACT. As the Assistant Web Archivist, I was tasked with extracting the metadata from the database and exporting each selected collection as a CSV and JSON file. To get to that stage, the Curatorial team first had to decide which fields were to be exported.
The ACT database is quite complex, with 50+ tables to consider. To enable local analysis, a static copy of the database is loaded into a database administration application, in this case DBeaver. Using this free tool, I was able to create entity relationship diagrams of the tables and provide an extensive list of fields to the curators, so that they could determine which fields were the most appropriate to export.
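As an aside, the same field inventory can be generated straight from Postgres, which is handy for cross-checking the diagrams. A minimal sketch, assuming the ACT tables live in the default public schema:

-- List every table, column, and type in the schema, in declaration order.
SELECT table_name, column_name, data_type
FROM information_schema.columns
WHERE table_schema = 'public'
ORDER BY table_name, ordinal_position;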
I then worked from a refined version of this field list, running a script for each designated collection to pull out the specific metadata to be exported. To extract the fields and metadata into an exportable format, I created an SQL (Structured Query Language) script that can be used to export results as both JSON and CSV:
SELECT
    taxonomy.parent_id AS "Higher Level Collection",
    collection_target.collection_id AS "Collection ID",
    taxonomy.name AS "Collection or Subsection Name",
    -- Flag whether the row belongs to the main collection or one of its subsections
    CASE
        WHEN collection_target.collection_id = 4278 THEN 'Main Collection'
        ELSE 'Subsection'
    END AS "Main Collection or Subsection",
    target.created_at AS "Date Created",
    target.id AS "Record ID",
    field_url.url AS "Primary Seed",
    target.title AS "Title of Target",
    target.description AS "Description",
    target.language AS "Language",
    target.license_status AS "Licence Status",
    target.no_ld_criteria_met AS "LD Criteria",
    target.organisation_id AS "Institution ID",
    target.updated_at AS "Updated",
    target.depth AS "Depth",
    target.scope AS "Scope",
    target.ignore_robots_txt AS "Robots.txt",
    target.crawl_frequency AS "Crawl Frequency",
    target.crawl_start_date AS "Crawl Start Date",
    target.crawl_end_date AS "Crawl End Date"
FROM
    collection_target
    INNER JOIN target ON collection_target.target_id = target.id
    LEFT JOIN taxonomy ON collection_target.collection_id = taxonomy.id
    LEFT JOIN organisation ON target.organisation_id = organisation.id
    INNER JOIN field_url ON field_url.target_id = target.id
WHERE
    collection_target.collection_id IN (4278, 4279, 4280, 4281, 4282, 4283, 4284)
    -- Keep only each record's primary seed URL (position 0 or unset)
    AND (field_url.position IS NULL OR field_url.position IN (0))
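For completeness, here is one way the two formats can be produced from a query like this using psql and Postgres's built-in JSON functions. This is a sketch rather than the exact script we ran, and the query is abbreviated to two fields for readability:

-- CSV: run inside psql; \copy executes the query and writes the result client-side.
-- (psql requires the whole \copy command on a single line.)
\copy (SELECT target.id, target.title FROM collection_target INNER JOIN target ON collection_target.target_id = target.id WHERE collection_target.collection_id IN (4278)) TO 'collection_metadata.csv' WITH (FORMAT csv, HEADER)

-- JSON: wrap the same query so Postgres returns a single JSON array of records.
SELECT json_agg(row_to_json(t))
FROM (
    SELECT target.id, target.title  -- abbreviated; use the full field list above
    FROM collection_target
    INNER JOIN target ON collection_target.target_id = target.id
    WHERE collection_target.collection_id IN (4278)
) AS t;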
Accessing and using the data
The published metadata is available from the BL Research Repository within the UK Web Archive section, in the folder “UK Web Archive: Data”. Each dataset includes the metadata seed list in both CSV and JSON formats, a data dictionary that defines each of the data fields, and a datasheet giving provenance information about how the dataset was created. The first collections selected for publication are listed below, followed by a short sketch of how the published files can be queried:
- Indian Ocean Tsunami December 2004 (January-March 2005) [https://doi.org/10.23636/sgkz-g054]
- Blogs (2005 onwards) [https://doi.org/10.23636/ec9m-nj89]
- UEFA Women's Euro England 2022 (June-October 2022) [https://doi.org/10.23636/amm7-4y46]
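If you would like to explore one of these datasets without setting up a database, an in-process SQL engine such as DuckDB can query the downloaded CSV directly. A minimal sketch; the filename is a hypothetical stand-in for whichever dataset you download, and the column names come from the headers produced by the export query above:

-- Hypothetical filename: substitute the CSV you downloaded from the repository.
SELECT "Crawl Frequency", count(*) AS records
FROM read_csv_auto('ukwa_collection_metadata.csv')
GROUP BY "Crawl Frequency"
ORDER BY records DESC;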