Digital scholarship blog

Enabling innovative research with British Library digital collections

12 April 2022

Making British Library collections (even) more accessible

Daniel van Strien, Digital Curator, Living with Machines, writes:

The British Library’s digital scholarship department has made many digitised materials available to researchers. This includes a collection of digitised books created by the British Library in partnership with Microsoft. This is a collection of books that have been digitised and processed using Optical Character Recognition (OCR) software to make the text machine-readable. There is also a collection of books digitised in partnership with Google. 

Since being digitised, this collection of digitised books has been used for many different projects. This includes recent work to try and augment this dataset with genre metadata and a project using machine learning to tag images extracted from the books. The books have also served as training data for a historic language model.

This blog post will focus on two challenges of working with this dataset: size and documentation, and discuss how we’ve experimented with one potential approach to addressing these challenges. 

One of the challenges of working with this collection is its size. The OCR output is over 20GB. This poses some challenges for researchers and other interested users wanting to work with these collections. Projects like Living with Machines are one avenue in which the British Library seeks to develop new methods for working at scale. For an individual researcher, one of the possible barriers to working with a collection like this is the computational resources required to process it. 

Recently we have been experimenting with a Python library, datasets, to see if this can help make this collection easier to work with. The datasets library is part of the Hugging Face ecosystem. If you have been following developments in machine learning, you have probably heard of Hugging Face already. If not, Hugging Face is a delightfully named company focusing on developing open-source tools aimed at democratising machine learning. 

The datasets library is a tool aiming to make it easier for researchers to share and process large datasets for machine learning efficiently. Whilst this was the library’s original focus, there may also be other uses cases for which the datasets library may help make datasets held by the British Library more accessible. 

Some features of the datasets library:

  • Tools for efficiently processing large datasets 
  • Support for easily sharing datasets via a ‘dataset hub’ 
  • Support for documenting datasets hosted on the hub (more on this later). 

As a result of these and other features, we have recently worked on adding the British Library books dataset library to the Hugging Face hub. Making the dataset available via the datasets library has now made the dataset more accessible in a few different ways.

Firstly, it is now possible to download the dataset in two lines of Python code: 

Image of a line of code: "from datasets import load_dataset ds = load_dataset('blbooks', '1700_1799')"

We can also use the Hugging Face library to process large datasets. For example, we only want to include data with a high OCR confidence score (this partially helps filter out text with many OCR errors): 

Image of a line of code: "ds.filter(lambda example: example['mean_wc_ocr'] > 0.9)"

One of the particularly nice features here is that the library uses memory mapping to store the dataset under the hood. This means that you can process data that is larger than the RAM you have available on your machine. This can make the process of working with large datasets more accessible. We could also use this as a first step in processing data before getting back to more familiar tools like pandas. 

Image of a line of code: "dogs_data = ds['train'].filter(lamda example: "dog" in example['text'].lower()) df = dogs_data_to_pandas()

In a follow on blog post, we’ll dig into the technical details of datasets in some more detail. Whilst making the technical processing of datasets more accessible is one part of the puzzle, there are also non-technical challenges to making a dataset more usable. 

 

Documenting datasets 

One of the challenges of sharing large datasets is documenting the data effectively. Traditionally libraries have mainly focused on describing material at the ‘item level,’ i.e. documenting one dataset at a time. However, there is a difference between documenting one book and 100,000 books. There are no easy answers to this, but libraries could explore one possible avenue by using Datasheets. Timnit Gebru et al. proposed the idea of Datasheets in ‘Datasheets for Datasets’. A datasheet aims to provide a structured format for describing a dataset. This includes questions like how and why it was constructed, what the data consists of, and how it could potentially be used. Crucially, datasheets also encourage a discussion of the bias and limitations of a dataset. Whilst you can identify some of these limitations by working with the data, there is also a crucial amount of information known by curators of the data that might not be obvious to end-users of the data. Datasheets offer one possible way for libraries to begin more systematically commuting this information. 

The dataset hub adopts the practice of writing datasheets and encourages users of the hub to write a datasheet for their dataset. For the British library books, we have attempted to write one of these datacards. Whilst it is certainly not perfect, it hopefully begins to outline some of the challenges of this dataset and gives end-users a better sense of how they should approach a dataset. 

.