03 August 2023
My AHRC-RLUK Professional Practice Fellowship: A year on
A year ago I started work on my RLUK Professional Practice Fellowship project to analyse computationally the descriptions in the Library’s incunabula printed catalogue. As the project comes to a close this week, I would like to give an update on the work of the last few months, which led to the publication of the incunabula printed catalogue data as a featured collection on the British Library’s Research Repository. In a separate blogpost I will discuss the findings from the text analysis and next steps, as well as share my reflections on the fellowship experience.
Since Isaac’s blogpost about the automated detection of the catalogue entries in the OCR files, a lot of effort has gone into improving the code and outputting the descriptions in the format required for the text analysis and as open datasets. With the invaluable help of Harry Lloyd, who had joined the Library’s Digital Research team as Research Software Engineer, we verified the results and identified new rules for detecting sub-entries signalled by Another Copy rather than by a main entry heading. We also reassembled and parsed the XML files, originally split into two sets per volume for the purpose of generating the OCR, so that the entries are listed in the order in which they appear in the printed volume. We prepared new text files containing all the entries from each volume, with each entry represented as a single line of text, which I could use for the corpus linguistics analysis with AntConc. In consultation with the Curator, Karen Limper-Herz, and colleagues in Collection Metadata, we agreed how best to store the data for evaluation and in preparation for updating the Library’s online catalogue.
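To give a flavour of the last two steps, here is a minimal sketch in Python. It is not the project's actual code: the function names, the regular expression and the sample text are all hypothetical, and it assumes each entry has already been extracted as a list of OCR text lines. It shows one way to flag a sub-entry that opens with "Another copy" and to flatten each entry into a single line for a tool such as AntConc.

```python
import re

# Hypothetical rule: a sub-entry opens with the words "Another copy".
# The real detection rules used in the project may well differ.
ANOTHER_COPY = re.compile(r"^\s*another\s+copy\b", re.IGNORECASE)


def is_sub_entry(first_line):
    """Return True if the first line of an entry looks like a sub-entry."""
    return bool(ANOTHER_COPY.match(first_line))


def entry_to_single_line(lines):
    """Join the OCR lines of one entry and collapse runs of whitespace."""
    return " ".join(" ".join(lines).split())


def volume_to_text(entries):
    """Render a volume as text with one entry per line, in printed order."""
    return "\n".join(entry_to_single_line(entry) for entry in entries)


if __name__ == "__main__":
    # Invented sample text, for illustration only.
    sample = [
        ["A main entry heading.", "  Description of the edition. "],
        ["Another copy.", "Imperfect, wanting leaf 1."],
    ]
    print(volume_to_text(sample))
    print([is_sub_entry(entry[0]) for entry in sample])
```

Keeping one entry per line means a concordance tool can treat each description as a unit, while the order of the lines preserves the order of the printed volume.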
Whilst all this work was taking place, I started the computational analysis of the English text from the descriptions. The reason for using these partial descriptions was to separate what was merely transcribed from the incunabula from the language used by the cataloguer in their own ‘voice’. I have recorded my initial observations in the poster I presented at the Digital Humanities Conference 2023. Discussing my fellowship project with the conference attendees was extremely rewarding; there was much interest in the way I had used Transkribus to derive the OCR data, some questions about how the project methodology applies to other data, and agreement on the need to contextualise collections descriptions and reflect on any bias in the transmission of knowledge. In the poster I also highlight the importance of the cross-disciplinary collaboration required for this type of work, which resonated well with the conference theme of Collaboration as Opportunity.
I have started disseminating the knowledge gained from the project with members of the GLAM community. At the British Library Harry, Karen and I ran an informal ‘Hack & Yack’ training session showcasing the project aims and methodology through the use of Jupyter notebooks. I also enjoyed the opportunity to discuss my research at a recent Research Libraries UK Digital Scholarship Network workshop and look forward to further conversations on this topic with colleagues in the wider GLAM community.
We intend to continue to enrich the datasets to enable better access to the collection and the development of new resources for incunabula research and digital scholarship projects. I would like to end by adding my thanks to Graham Jevon, for assisting with the timely publication of the project datasets, and above all to James, Karen and Harry for supporting me throughout this project.
This blogpost is by Dr Rossitza Atanassova, Digital Curator, British Library. She is on Twitter @RossiAtanassova and Mastodon @[email protected]