This guest post is by Alex Hailey, Curator of Modern Archives and Manuscripts. He's on Twitter as @ajrhailey.
In late 2019 I was lucky enough to join BL and National Archives staff to trial a PG Certificate in Computing for Cultural Heritage at Birkbeck. The course provided an introduction to programming with Python, the basics of SQL, and using the two to work with data. Fellow attendees Graham, Nick, Chris and Giulia have written about their work previously, and I am going to briefly introduce one of my project tasks addressing issues with legacy metadata within the India Office Records.
The original data
The IOR/E/4 Correspondence with India series consists of 1,112 volumes dating from 1703 to 1858: four series of letters received by the East India Company (EIC) Court of Directors from the administration in India, and four series of dispatches sent to India. Catalogue entries for these volumes contain only basic information: title, dates, language, reference and former references. Subject, name and place access to the dispatches is instead provided through 72 index volumes (reference IOR/Z/E/4), which contain around 430,000 entries.
The original indexes were produced between 1901 and 1929 by staff of the Secretarial Bureau, led by indexing pioneer Mary Petherbridge; my colleague Antonia Moon has written about Petherbridge’s work in a previous post. When these indexes were converted to the catalogue in the early 2010s, entries within the index volumes were entered as child or sub-items of the index volumes themselves, with information on the related correspondence volumes entered into the free-text Related material field, as shown in the image above.
Problem and solution
This approach has caused two issues. First, users attempting to order the related correspondence regularly end up trying to order an index volume instead, which is frustrating. Second, there is no quick and easy way to determine the whole contents of a particular correspondence volume, which hinders access and use.
Manually working through 430,000 entries to group them by volume would be an impossible task, but I was able to use Python and a library called Pandas. Pandas has a number of useful features for examining and manipulating catalogue data: methods for reading and writing data from multiple sources, flexible reshaping of datasets, and methods for aggregation, indexing, and splitting and replacing strings (including with regular expressions).
Using Pandas I was able to separate information in the Related material field, restructure the data so that each instance of an index entry formed an individual record, and then group these by volume and further arrange them alphabetically or by page order.
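As a rough illustration of these steps (not the code used in the project), the sketch below splits a free-text field, reshapes the data to one row per reference, and sorts within each volume. The column names, the sample entries, the volume numbers and the "reference p page" layout separated by semicolons are all assumptions made for the example; the real Related material field may be structured differently.

```python
import pandas as pd

# Hypothetical sample data: column names, entry titles, volume numbers and
# the "IOR/E/4/<n> p <page>; ..." layout are assumptions for illustration.
index_entries = pd.DataFrame({
    "title": [
        "Madras, fortifications at",
        "Bombay, customs duties",
        "Calcutta, mint at",
    ],
    "related_material": [
        "IOR/E/4/861 p 12; IOR/E/4/862 p 340",
        "IOR/E/4/861 p 5",
        "IOR/E/4/862 p 17; IOR/E/4/861 p 200",
    ],
})

# 1. Split the free-text field into a list of references per index entry.
index_entries["related_material"] = index_entries["related_material"].str.split(";")

# 2. Reshape so that each (entry, reference) pair is its own record.
records = index_entries.explode("related_material").reset_index(drop=True)
records["related_material"] = records["related_material"].str.strip()

# 3. Pull the volume reference and page number out with a regular expression.
parts = records["related_material"].str.extract(
    r"^(?P<volume>IOR/E/4/\d+)\s+p\s+(?P<page>\d+)$"
)
records = records.join(parts)
records["page"] = records["page"].astype(int)

# 4. Group by volume, arranged alphabetically or by page order.
by_title = records.sort_values(["volume", "title"]).reset_index(drop=True)
by_page = records.sort_values(["volume", "page"]).reset_index(drop=True)

print(by_page[["volume", "page", "title"]])
```

Keeping the original field alongside the extracted columns makes it easy to check any reference that the pattern failed to parse.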
Outputs and analysis
Examining these outputs gave us new insights into the data. We now know that the indexes cover 230 volumes of the dispatches only. We were also able to identify incomplete references originally recorded in the Related material field, as well as what appear to be keying errors (references which fall outside of the range of the dispatches series). We can now follow these up and correct errors in the catalogue which were previously unknown.
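Checks of this kind can be sketched in Pandas along the following lines. This is an illustrative example only: the reference pattern, the sample values and the volume range (FIRST_VOLUME to LAST_VOLUME) are hypothetical and do not reflect the actual extent of the dispatches series.

```python
import pandas as pd

# Hypothetical references and an assumed volume range, for illustration only.
refs = pd.DataFrame({
    "related_material": [
        "IOR/E/4/861 p 12",   # well formed and in range
        "IOR/E/4/86",         # page number missing
        "IOR/E/4/9999 p 3",   # well formed but outside the assumed range
        "IOR/E/4/ p 40",      # volume number missing
    ],
})
FIRST_VOLUME, LAST_VOLUME = 800, 900  # assumed bounds, not the real ones

# A reference is treated as complete only if it matches the whole pattern.
parts = refs["related_material"].str.extract(
    r"^IOR/E/4/(?P<volume_no>\d+)\s+p\s+(?P<page>\d+)$"
)
refs["incomplete"] = parts["volume_no"].isna()

# Complete references whose volume number falls outside the series range
# are candidates for keying errors.
volume_no = pd.to_numeric(parts["volume_no"])
refs["out_of_range"] = volume_no.notna() & ~volume_no.between(FIRST_VOLUME, LAST_VOLUME)

print(refs)
```

Flagging rather than dropping the problem rows keeps them available for manual follow-up in the catalogue.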
Comparing the data at volume level arranged alphabetically and by page order, we could appreciate just how much depth there was to the index. Traditional indexes are written with a lot of information redundancy, which only becomes apparent once you group the entries according to their location within a particular volume.
After discussion with the IOR team we have decided to take the alphabetically arranged data and import it into the archives catalogue, so that users selecting a dispatches volume are presented with the relevant index entries immediately.
The original dataset and derived datasets have been uploaded to the Library’s research repository where they are available for download and reuse under a CC0 licence.
To enable further analysis of the index data I have also tried my hand at creating a Jupyter Notebook to use with the derived data. This is intended to introduce colleagues to using Notebooks, Python and the Pandas library to examine catalogue metadata, conducting basic queries, producing a visualisation and exporting subsets for further investigation.
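The kind of basic query and export such a Notebook might contain could look something like the sketch below. It is not taken from the Notebook itself: the column names and sample rows are invented for illustration, and the derived datasets in the repository may use different headings.

```python
import pandas as pd

# Hypothetical derived data: column names and rows are assumptions.
derived = pd.DataFrame({
    "volume": ["IOR/E/4/861", "IOR/E/4/861", "IOR/E/4/862"],
    "title": [
        "Bombay, customs duties",
        "Madras, fortifications at",
        "Calcutta, mint at",
    ],
    "page": [5, 12, 17],
})

# Basic query: index entries whose title mentions a given keyword.
madras = derived[derived["title"].str.contains("Madras", case=False)]

# Simple aggregation: number of index entries per volume.
entries_per_volume = derived.groupby("volume").size()

# Export a subset as CSV text; pass a file path to to_csv to write a file.
csv_text = madras.to_csv(index=False)

print(entries_per_volume)
print(csv_text)
```

With matplotlib installed, calling entries_per_volume.plot.bar() would turn the counts into a simple bar chart.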
My Birkbeck project also included work to create place and institution authority files for the Proceedings of the Governments of India series using keyword extraction with existing catalogue metadata, and this will be discussed in a future post.
Huge thanks must go to Nora McGregor, Jo Pugh and the folks at Birkbeck Department of Computer Science for developing the course and providing us with this opportunity; Antonia Moon and the IOR team for helpful discussions about the IOR data; and the rest of the cohort for moral support when the computer just wouldn’t behave.