21 June 2024
blplaybills.org: leveraging open data from the British Library
In this guest post, developer Sak Supple describes his work turning digitised images of playbills into fully searchable documents... Digital Curator Mia Ridge says, 'we're absolutely delighted by Sak's work, and hope that his post helps others working with digitised collections'.
This blog post explores the creation of blplaybills.org, a website that showcases data made publicly available by the British Library.
The blplaybills.org website provides a way to search for, view and download archival playbills from Great Britain and Ireland, 1600-1902, as curated by the British Library (BL).
The website is independently produced using assets made available by the British Library under a Creative Commons licence as part of an open data initiative.
The playbill data
Playbills were promotional flyers advertising entertainment events at theatres, fairs and pleasure gardens.
The BL playbills data originated as document scans (digitised from microfilm, the most viable approach for fragile artefacts) in PDF format, each file containing hundreds of individual playbills, grouped by volume (usually organised by theatre, region and/or period of history).
In total there are more than 80,000 scanned playbills available.
Besides the PDFs, there is also metadata describing where in the Library these playbills can be found (volumes, shelfmarks etc). Including this information meant researchers could search for information online, and also have the volume reference at hand when visiting the Library.
This data is useful to anyone researching theatre, music, history and literature. Making it easy to find, view and download playbills using simple text searches over the internet is a good way to bring the playbills to a wider audience.
This is how blplaybills.org came into existence: the goal was to turn playbill data from the British Library into a searchable online database and image store.
The workflows
It is notoriously difficult to search PDF documents containing scans.
The text in these playbills is embedded in an image. This makes it especially difficult for computers to search the content of a scan, since a computer interprets the text as a set of lines and curves within the image, without recognising it as text.
Because internet technologies are well suited to searching for text, the first challenge is to turn the scanned playbill text into searchable text that a computer can more easily understand.
The chosen approach was to use Optical Character Recognition (OCR) software to capture text contained in the playbills.
OCR is a pattern-matching technique, enhanced with machine learning, that finds text in an image by first using text-detection algorithms to isolate character images, called glyphs. These glyphs are then broken down into features (lines, loops etc), which are used to find the best match amongst pre-trained glyphs.
The recognised text can then be processed using techniques like contextual analysis and grammar checking to improve accuracy.
The result can then be stored in a computer file to form text that a computer can recognise in the form of characters, words, phrases and sentences.
The resulting text is associated with individual playbills and related metadata, and both are stored in an online database to make them searchable.
In parallel to the above processes, high and low resolution JPEG versions of individual playbills were generated and uploaded to cloud storage for online access.
The general flow is shown below.
Each of these workflows is discussed in more detail below.
Text generation workflow
Since the goal is to make it possible to search for individual playbills, the first step was to break up PDFs containing multiple playbills into individual documents containing one playbill each.
This was done using open source software called poppler-utils that provides command line utilities for manipulating PDF documents, including generating single page documents from one multipage document.
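The post does not show the exact invocation, but this step can be sketched with poppler-utils' pdfseparate utility. The filenames and naming pattern below are illustrative, not real BL asset names:

```python
from pathlib import Path

def split_volume_cmd(volume_pdf: str, out_dir: str) -> list[str]:
    """Build the pdfseparate command that splits a multipage volume PDF
    into single-page PDFs named after the volume and page number."""
    stem = Path(volume_pdf).stem
    # pdfseparate substitutes %d with the page number in the output pattern
    return ["pdfseparate", volume_pdf, f"{out_dir}/{stem}-page-%d.pdf"]
```

The returned list can then be handed to something like subprocess.run(cmd, check=True) in an automated pipeline.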
The next step is to extract text using OCR. In 2018 my research showed that an effective open source solution for this was Tesseract.
Experiments showed that Tesseract produced the best results when the PDF document was first converted to a lossless raster format like TIFF (Tagged Image File Format) before running the OCR program. In fact, it was found that resizing the document, increasing the resolution and contrast, and then converting to TIFF produced good output from Tesseract OCR.
The conversion from PDF to TIFF for each playbill was achieved using open source software called ImageMagick.
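As a sketch, the two steps can be expressed as command builders. The ImageMagick settings shown (density, resize, contrast stretch) are illustrative assumptions, not the exact values used for the site:

```python
def pdf_to_tiff_cmd(pdf_page: str, tiff_out: str) -> list[str]:
    # ImageMagick: rasterise the single-page PDF at high resolution,
    # stretch the contrast, and write a lossless TIFF for the OCR step
    return ["convert", "-density", "300", pdf_page,
            "-resize", "200%", "-contrast-stretch", "1%", tiff_out]

def tesseract_cmd(tiff_in: str, out_base: str) -> list[str]:
    # Tesseract writes the recognised text to <out_base>.txt
    return ["tesseract", tiff_in, out_base, "-l", "eng"]
```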
This workflow is shown below.
Processing 80,000+ individual playbills was achieved by automating the above workflow and handling multiple playbills in parallel. Each individual playbill could be uniquely identified by the name of the original multipage PDF together with the page number of the playbill.
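The identifier scheme described above (original PDF name plus page number) can be sketched as a simple function; the separator and zero-padding here are assumptions:

```python
from pathlib import Path

def playbill_id(volume_pdf: str, page: int) -> str:
    """Combine the source volume PDF name and page number into a
    unique identifier for one playbill (format is illustrative)."""
    return f"{Path(volume_pdf).stem}-{page:04d}"

print(playbill_id("vol-117.pdf", 42))  # → vol-117-0042
```

The same identifier can then name the TIFF, the OCR text file and the JPEGs, keeping every artefact of one playbill traceable back to its source page.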
Two other workflows were set up to work in parallel with this:
- Convert individual PDF playbills into high and low resolution JPEGs for online viewing
- Add metadata to the OCR text (volume, shelfmark, date, theatre etc) to produce a JSON file, and upload and index this information in a searchable online database
JPEG generation
As individual PDF playbills were generated from the multipage PDFs, a copy of each single-page PDF was sent to the JPEG generation workflow, where its arrival triggered processing.
ImageMagick was used to create thumbnail and high resolution JPEG versions of the playbill suitable for online viewing.
The resulting JPEG files, identified by the original PDF filename and page number of the playbill, were then uploaded to cloud storage.
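This step can be sketched the same way, again with illustrative ImageMagick settings and file naming:

```python
from pathlib import Path

def jpeg_cmds(pdf_page: str, out_dir: str) -> list[list[str]]:
    """Build two ImageMagick commands per playbill: a low-resolution
    thumbnail and a high-resolution JPEG (sizes are illustrative)."""
    stem = Path(pdf_page).stem  # e.g. vol-117-page-42
    return [
        ["convert", "-density", "72", pdf_page,
         "-resize", "200x", f"{out_dir}/{stem}-thumb.jpg"],
        ["convert", "-density", "300", pdf_page,
         f"{out_dir}/{stem}-full.jpg"],
    ]
```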
JSON generation
A popular choice for storing and searching text in JSON format is a database called Elasticsearch. It provides fast indexing and search capabilities, and is available for non-commercial use.
This JSON should include the searchable playbill text and relevant metadata.
Each output from the text generation workflow triggered the JSON generation, allowing metadata for the individual playbill to be merged with OCR text into a single JSON file.
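The merge itself is straightforward; a minimal sketch might look like this, where the field names are illustrative rather than the site's actual schema:

```python
import json

def build_document(ocr_text: str, metadata: dict) -> str:
    """Merge OCR text and playbill metadata into one JSON document
    ready for indexing (field names are illustrative)."""
    return json.dumps({"text": ocr_text, **metadata}, ensure_ascii=False)

doc = build_document(
    "THEATRE ROYAL ...",
    {"volume": "Vol. 117", "shelfmark": "Playbills 117", "theatre": "Drury Lane"},
)
```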
The resulting JSON was uploaded and indexed in an online Elasticsearch database. This became the searchable datastore for the web application that researchers use when visiting blplaybills.org.
The search interface
At this point the data is stored in a searchable online database, and images of individual playbills have been made available in online cloud storage.
The next step is to allow researchers to search for, view and download playbills.
The main requirements of the interface are:
- Simple text search to return playbills containing matching text
- Quick filtering of results using faceted search based on date, theatre, location, organisation and volume
- Quick copy of playbill text
- View and download a high resolution version of the playbill
- Responsive design
The interface is shown in Figure 3 below.
The web interface is hosted on AWS EC2 (Amazon Web Services' cloud compute service) and uses standard web frameworks for the creation of single-page applications.
Software development
Wherever open source software was available it was used: Tesseract, ImageMagick and poppler-utils.
Some software development was necessary to create backend workflows, and to automate and integrate them with each other.
This was achieved using a combination of scripting (NodeJS, Bourne shell and Python) and C programs.
The front end was developed with JavaScript, NodeJS, Angular and HTML5/CSS3.
Recent work and next steps
I recently made some modifications to the above approach to improve the quality of OCR generated text for each playbill.
Specifically, Tesseract has been replaced by a utility called textra (Swift/macOS) that uses the Apple Vision framework for character recognition. This significantly improved the quality of the text generated by the OCR process, resulting in improved search accuracy. This technology was not available in 2018, when blplaybills.org was first created.
Another method to improve the accuracy of search might be to enhance OCR text with text transcribed as part of a crowdsourcing initiative from the British Library: In the Spotlight. This involved members of the public transcribing titles, names and locations in playbills. By adding this information to the indexed data already generated, search accuracy could be further improved.
An interesting piece of research would be to consider whether LLMs (Large Language Models) could be fine-tuned to enhance the results of traditional OCR techniques.
The goal would be to find a generalised approach that uses modern natural language processing techniques to improve the automatic transcription of less machine-readable archival material such as, but not limited to, these playbills. Ideally these techniques could also be applied to multi-lingual material.
This will be the focus of future work to improve the data behind blplaybills.org.