21 April 2020
Clean. Migrate. Validate. Enhance. Processing Archival Metadata with Open Refine
This blogpost is by Graham Jevon, Cataloguer, Endangered Archives Programme
Creating detailed and consistent metadata is a challenge common to most archives. Many rely on an army of volunteers with varying degrees of cataloguing experience. And no matter how diligent any team of cataloguers are, human error and individual idiosyncrasies are inevitable.
This challenge is particularly pertinent to the Endangered Archives Programme (EAP), which has hitherto funded in excess of 400 projects in more than 90 countries. Each project is unique and employs its own team of one or more cataloguers based in the particular country where the archival content is digitised. But all this disparately created metadata must be uniform when ingested into the British Library’s cataloguing system and uploaded to eap.bl.uk.
Finding an efficient, low-cost method to process large volumes of metadata generated by hundreds of unique teams is a challenge; one that in 2019, EAP sought to alleviate using freely available open source software Open Refine – a power tool for processing data.
This blog highlights some of the ways that we are using Open Refine. It is not an instructional how-to guide (though we are happy to follow-up with more detailed blogs if there is interest), but an introductory overview of some of the Open Refine methods we use to process large volumes of metadata.
Initial metadata capture
Our metadata is initially created by project teams using an Excel spreadsheet template provided by EAP. In the past year we have completely redesigned this template in order to make it as user friendly and controlled as possible.
But while Excel is perfect for metadata creation, it is not best suited for checking and editing large volumes of data. This is where Open Refine excels (pardon the pun!), so when the final completed spreadsheet is delivered to EAP, we use Open Refine to clean, validate, migrate, and enhance this data.
Replicating repetitive tasks
Open Refine came to the forefront of our attention after a one-day introductory training session led by Owen Stephens where the key takeaway for EAP was that a sequence of functions performed in Open Refine can be copied and re-used on subsequent datasets.
This encouraged us to design and create a sequence of processes that can be re-applied every time we receive a new batch of metadata, thus automating large parts of our workflow.
No computer programming skills required
Building this sequence required no computer programming experience (though this can help); just logical thinking, a generous online community willing to share their knowledge and experience, and a willingness to learn Open Refine’s GREL language and generic regular expressions. Some functions can be performed simply by using Open Refine’s built-in menu options. But the limits of Open Refine’s capabilities are almost infinite; the more you explore and experiment, the further you can push the boundaries.
Initially, it was hoped that our whole Open Refine sequence could be repeated in one single large batch of operations. The complexity of the data and the need for archivist intervention meant that it was more appropriate to divide the process into several steps. Our workflow is divided into 7 stages:
- Migration
- Dates
- Languages and Scripts
- Related subjects
- Related places and other authorities
- Uniform Titles
- Digital content validation
Each of these stages performs one or more of four tasks: clean, migrate, validate, and enhance.
Task 1: Clean
The first part of our workflow provides basic data cleaning. Across all columns it trims any white space at the beginning or end of a cell, removes any double spaces, and capitalises the first letter of every cell. In just a few seconds, this tidies the entire dataset.
Task 1 Example: Trimming white space (menu option)
Trimming whitespace on an individual column is an easy function to perform as Open Refine has a built in “Common transform” that performs this function.
Although this is a simple function to perform, we no longer need to repeatedly select this menu option for each column of each dataset we process because this task is now part of the workflow that we simply copy and paste.
Task 1 Example: Capitalising the first letter (using GREL)
Capitalising the first letter of each cell is less straightforward for a new user as it does not have a built-in function that can be selected from a menu. Instead it requires a custom “Transform” using Open Refine’s own expression language (GREL).
Having to write an expression like this should not put off any Open Refine novices. This is an example of Open Refine’s flexibility and many expressions can be found and copied from the Open Refine wiki pages or from blogs like this. The more you copy others, the more you learn, and the easier you will find it to adapt expressions to your own unique requirements.
Moreover, we do not have to repeat this expression again. Just like the trim whitespace transformation, this is also now part of our copy and paste workflow. One click performs both these tasks and more.
Task 2: Migrate
As previously mentioned, the listing template used by the project teams is not the same as the spreadsheet template required for ingest into the British Library’s cataloguing system. But Open Refine helps us convert the listing template to the ingest template. In just one click, it renames, reorders, and restructures the data from the human friendly listing template to the computer friendly ingest template.
Task 2 example: Variant Titles
The ingest spreadsheet has a “Title” column and a single “Additional Titles” column where all other title variations are compiled. It is not practical to expect temporary cataloguers to understand how to use the “Title” and “Additional Titles” columns on the ingest spreadsheet. It is much more effective to provide cataloguers with a listing template that has three prescriptive title columns. This helps them clearly understand what type of titles are required and where they should be put.
The EAP team then uses Open Refine to move these titles into the appropriate columns (illustrated above). It places one in the main “Title” field and concatenates the other two titles (if they exist) into the “Additional Titles” field. It also creates two new title type columns, which the ingest process requires so that it knows which title is which.
This is just one part of the migration stage of the workflow, which performs several renaming, re-ordering, and concatenation tasks like this to prepare the data for ingest into the British Library’s cataloguing system.
Task 3: Validate
While cleaning and preparing the data for migration is important, it also vital that we check that the data is accurate and reliable. But who has the time, inclination, or eye stamina to read thousands of rows of data in an Excel spreadsheet? What we require is a computational method to validate data. Perhaps the best way of doing this is to write a bespoke computer program. This indeed is something that I am now working on while learning to write computer code using the Python language (look out for a further blog on this later).
In the meantime, though, Open Refine has helped us to validate large volumes of metadata with no programming experience required.
Task 3 Example: Validating metadata-content connections
When we receive the final output from a digitisation project, one of our most important tasks is to ensure that all of digital content (images, audio and video recordings) correlate with the metadata on the spreadsheet and vice versa.
We begin by running a command line report on the folders containing the digital content. This provides us with a csv file which we can read in Excel. However, the data is not presented in a neat format for comparison purposes.
Restructuring data ready for validation comparisons
For this particular task what we want is a simple list of all the digital folder names (not the full directory) and the number of TIFF images each folder contains. Open Refine enables just that, as the next image illustrates.
Constructing the sequence that restructures this data required careful planning and good familiarity with Open Refine and the GREL expression language. But after the data had been successfully restructured once, we never have to think about how to do this again. As with other parts of the workflow, we now just have to copy and paste the sequence to repeat this transformation on new datasets in the same format.
Cross referencing data for validation
With the data in this neat format, we can now do a number of simple cross referencing checks. We can check that:
- Each digital folder has a corresponding row of metadata – if not, this indicates that the metadata is incomplete
- Each row of metadata has a corresponding digital folder – if not, this indicates that some digital folders containing images are missing
- The actual number of TIFF images in each folder exactly matches the number of images recorded by the cataloguer – if not this may indicate that some images are missing.
For each of these checks we use Open Refine’s cell.cross expression to cross reference the digital folder report with the metadata listing.
In the screenshot below we can see the results of the first validation check. Each digital folder name should match the reference number of a record in the metadata listing. If we find a match it returns that reference number in the “CrossRef” column. If no match is found, that column is left blank. By filtering that column by blanks, we can very quickly identify all of the digital folders that do not contain a corresponding row of metadata. In this example, before applying the filter, we can already see that at least one digital folder is missing metadata. An archivist can then investigate why that is and fix the problem.
Task 4: Enhance
We enhance our metadata in a number of ways. For example, we import authority codes for languages and scripts, and we assign subject headings and authority records based on keywords and phrases found in the titles and description columns.
Named Entity Extraction
One of Open Refine’s most dynamic features is its ability to connect to other online databases and thanks to the generous support of Dandelion API we are able to use its service to identify entities such as people, places, organisations, and titles of work.
In just a few simple steps, Dandelion API reads our metadata and returns new linked data, which we can filter by category. For example, we can list all of the entities it has extracted and categorised as a place or all the entities categorised as people.
Not every named entity it finds will be accurate. In the above example “Baptism” is clearly not a place. But it is much easier for an archivist to manually validate a list of 29 phrases identified as places, than to read 10,000 scope and content descriptions looking for named entities.
Clustering inconsistencies
If there is inconsistency in the metadata, the returned entities might contain multiple variants. This can be overcome using Open Refine’s clustering feature. This identifies and collates similar phrases and offers the opportunity to merge them into one consistent spelling.
Linked data reconciliation
Having identified and validated a list of entities, we then use other linked data services to help create authority records. For this particular task, we use the Wikidata reconciliation service. Wikidata is a structured data sister project to Wikipedia. And the Open Refine reconciliation service enables us to link an entity in our dataset to its corresponding item in Wikidata, which in turn allows us to pull in additional information from Wikidata relating to that item.
For a South American photograph project we recently catalogued, Dandelion API helped identify 335 people (including actors and performers). By subsequently reconciling these people with their corresponding records in Wikidata, we were able to pull in their job title, date of birth, date of death, unique persistent identifiers, and other details required to create a full authority record for that person.
Creating individual authority records for 335 people would otherwise take days of work. It is a task that previously we might have deemed infeasible. But Open Refine and Wikidata drastically reduces the human effort required.
Summary
In many ways, that is the key benefit. By placing Open Refine at the heart of our workflow for processing metadata, it now takes us less time to do more. Our workflow is not perfect. We are constantly finding new ways to improve it. But we now have a semi-automated method for processing large volumes of metadata.
This blog puts just some of those methods in the spotlight. In the interest of brevity, we refrained from providing step-by-step detail. But if there is interest, we will be happy to write further blogs to help others use this as a starting point for their own metadata processing workflows.