23 December 2024
AI (and machine learning, etc) with British Library collections
Machine learning (ML) is a hot topic, especially when it’s hyped as ‘AI’. How might libraries use machine learning / AI to enrich collections, making them more findable and usable in computational research? Digital Curator Mia Ridge lists some examples of external collaborations, internal experiments and staff training with AI / ML and digitised and born-digital collections.
Background
The trust that the public places in libraries is hugely important to us, so all our use of 'AI' should be responsible and ethical. The British Library was a partner in Sheffield University's FRAIM: Framing Responsible AI Implementation & Management project (2024). We've also used lessons from the projects described here to draft our AI Strategy and Ethical Guide.
Many of the projects below have contributed to our Digital Scholarship Training Programme and our Reading Group has been discussing deep learning, big data and AI for many years. It's important that libraries are part of conversations about AI, supporting AI and data literacy and helping users understand how ML models and datasets were created.
If you're interested in AI and machine learning in libraries, museums and archives, keep an eye out for news about the AI4LAM community's Fantastic Futures 2025 conference at the British Library, 3-5 December 2025. If you can't wait that long, join us for the 'AI Debates' at the British Library.
Using ML / AI tools to enrich collections
Generative AI tends to get the headlines, but at the time of writing, tools that use non-generative machine learning to automate specific parts of a workflow have more practical applications for cultural heritage collections. That is, 'AI' is currently more process than product.
Text transcription is a foundational task that makes digitised books and manuscripts more accessible to search, analysis and other computational methods. For example, oral history staff have experimented with speech transcription tools, raising important theoretical and practical questions about automatic speech recognition (ASR) tools and chatbots.
We've used Transkribus and eScriptorium to transcribe handwritten and printed text in a range of scripts and alphabets. For example:
- ‘Using Transkribus for Arabic Handwritten Text Recognition’, ‘Using Transkribus for automated text recognition of historical Bengali Books’.
- Investigating the legacies of curatorial voice in the descriptions of incunabula collections at the British Library and student work on Detecting Catalogue Entries in Printed Catalogue Data
- Handwritten Text Recognition of the Dunhuang manuscripts: the challenges of machine learning on ancient Chinese texts (eScriptorium)
- Reinventing the 'Convert-a-Card' crowdsourcing project as a semi-automated workflow: Convert-a-Card: Past, Present and Future of Catalogue Cards Retroconversion, Convert-a-Card: Helping Cataloguers Derive Records with OCLC APIs and Python; Convert-a-Card: Extracting Entities from Catalogue Cards to Create E-Records
Creating tools and demonstrators through external collaborations
Mining the UK Web Archive for Semantic Change Detection (2021)
This project used word vectors with web archives to track words whose meanings changed over time. Resources: DUKweb (Diachronic UK web) and blog post ‘Clouds and blackberries: how web archives can help us to track the changing meaning of words’.
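The core signal behind this kind of semantic change detection can be sketched in a few lines: if a word's vector from one time slice has low cosine similarity to its vector from a later slice (assuming the embedding spaces have been aligned), its meaning has probably shifted. The vectors below are invented toy numbers for illustration, not DUKweb's actual diachronic embeddings:

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine similarity between two word vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy vectors for words in two time slices of a web archive. Real diachronic
# embeddings are learned from corpus text and aligned across periods;
# these numbers are invented purely to illustrate the comparison.
vectors_2000 = {"blackberry": np.array([0.9, 0.1, 0.0]),   # near 'fruit'
                "fruit":      np.array([0.8, 0.2, 0.1]),
                "phone":      np.array([0.1, 0.9, 0.3])}
vectors_2010 = {"blackberry": np.array([0.2, 0.8, 0.4])}   # drifted towards 'phone'

# A drop in similarity between a word's earlier and later vectors flags change.
drift = cosine_similarity(vectors_2000["blackberry"], vectors_2010["blackberry"])
stable = cosine_similarity(vectors_2000["blackberry"], vectors_2000["fruit"])
print(f"'blackberry' 2000 vs 2010 similarity: {drift:.2f}")
```

In practice the interesting words are those whose self-similarity across periods falls well below that of stable vocabulary.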
Living with Machines (2018-2023)
Our Living With Machines project with The Alan Turing Institute pioneered new AI, data science and ML methods to analyse masses of newspapers, books and maps to understand the impact of the industrial revolution on ordinary people. Resources: short video case studies, our project website, final report and over 100 outputs in the British Library's Research Repository.
Outputs that used AI / machine learning / data science methods such as lexicon expansion, computer vision, classification and word embeddings included:
- T-Res: A Toponym Resolution Pipeline for Digitised Historical Newspapers
- MapReader: A computer vision pipeline for exploring and analyzing images at scale
- DeezyMatch: A Flexible Deep Neural Network Approach to Fuzzy String Matching
- The Living Machine: A Computational Approach to the Nineteenth-Century Language of Technology
- Generating metadata from digitised 'Mitchells' Press Directories
- Generating metadata from digitised Parliamentary Road Acts (with CogApp)
- Experimental pipelines for processing digitised pages with structured or semi-structured content
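DeezyMatch itself learns fuzzy matching with a deep neural network, so the sketch below is not its method; it is a minimal classical baseline for the same task (linking OCR'd or variant spellings of place names to a gazetteer, as in toponym resolution), using plain edit distance. The place names are illustrative:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def best_match(query: str, gazetteer: list[str]) -> str:
    """Return the gazetteer entry with the smallest edit distance to query."""
    return min(gazetteer, key=lambda name: levenshtein(query.lower(), name.lower()))

# OCR of historical newspapers often mangles place names.
places = ["Manchester", "Lancaster", "Winchester"]
print(best_match("Manchestre", places))  # -> Manchester
```

Edit distance breaks down on historical spelling variation and cross-script forms, which is exactly where a learned matcher like DeezyMatch earns its keep.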
Tools and demonstrators created via internal pilots and experiments
Many of these examples were enabled by the skills and enthusiasm for ML experimentation of in-house Research Software Engineers and the Living with Machines (LwM) team at the British Library, combined with long-term Library staff's knowledge of collection records and processes:
- Identifying upside-down images in the Endangered Archives Project – projects within this important collection were often digitised under trying circumstances, so training machine learning to identify image attributes is useful.
- Languid: Language Identification Project (2020) – Metadata Services' Victoria Morris experimented with machine learning (plus 'human review' from c. 40 enthusiastic language experts checking the results) and was able to add language codes to over 3 million catalogue records. Her project identified 471 languages in the records, 141 of which were not previously represented. Resources: short video, longer video, and the publication 'Automated Language Identification of Bibliographic Resources', Cataloging & Classification Quarterly, Vol. 58, No. 1.
- Flyswot (2021) – BL staff trained a machine learning model to find images of digitised manuscripts incorrectly labelled as ‘flysheets’.
- Trialling a book genre classification model (2022) – the team concluded that the model worked well, but not yet well enough to use for creating catalogue data; they shared their model on Hugging Face and the training data created by British Library staff on Zooniverse. Resources: blog post and tutorial.
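The Languid project's exact method isn't described above, but a common, minimal approach to identifying the language of short catalogue strings is to compare character n-gram profiles. The sketch below uses invented two-language 'training' samples; a real system would train on large corpora covering many more languages:

```python
from collections import Counter

def char_ngrams(text: str, n: int = 3) -> Counter:
    """Frequency profile of character n-grams, the signal many
    language identifiers rely on."""
    text = f" {text.lower()} "
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def similarity(p: Counter, q: Counter) -> float:
    """Overlap score between two n-gram profiles."""
    shared = set(p) & set(q)
    return sum(min(p[g], q[g]) for g in shared) / max(sum(p.values()), 1)

# Toy training samples; far too small for real use, illustrative only.
profiles = {
    "en": char_ngrams("the history of the library and the collection of books"),
    "fr": char_ngrams("l'histoire de la bibliothèque et la collection des livres"),
}

def identify(title: str) -> str:
    """Assign the language whose profile best matches a catalogue title."""
    probe = char_ngrams(title)
    return max(profiles, key=lambda lang: similarity(probe, profiles[lang]))

print(identify("a catalogue of the books in the library"))  # -> en
```

As the project found, automated labels still benefit from human review, especially for closely related languages and very short titles.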
British Library resources for re-use in ML / AI
Our Research Repository includes datasets suitable for ground truth training, including 'Ground truth transcriptions of 18th & 19th century English language documents relating to botany from the India Office Records'.
Our ‘1 million images’ on Flickr Commons have inspired many ML experiments, including:
- Mario Klingemann (AKA Quasimondo) used semi-automated image classification and machine learning to group images into thematic collections (2014-16). See also: ‘BL Labs Awards (2015): Creative/Artistic category Award winning project’
- British Library & Flickr Commons: The many hands (and some machines) making light work
- University of Oxford's Visual Geometry Group (VGG) using ‘BL1M’ for multi-modal image search
- SherlockNet: Using Convolutional Neural Networks (CNNs) to automatically tag and caption the British Library Flickr collection
The Library has also shared models and datasets for re-use on the machine learning platform Hugging Face.