Digital scholarship blog

Enabling innovative research with British Library digital collections

4 posts from June 2024

28 June 2024

IIIF Annual Conference 2024: A Journey of Innovation and Inspiration

The British Library Universal Viewer team were delighted to attend the IIIF conference and showcase 2024 at UCLA in Los Angeles, California. This was our first official event since the team formed earlier in the year, and we felt incredibly fortunate to be travelling across numerous time zones to join over 70 members of the IIIF community for four days of innovation, learning and inspiration.

The Universal Viewer team outside the De Neve Plaza at UCLA

The first two days of the conference were held at the De Neve Plaza and took the form of lightning talks from delegates from a variety of different industries, and on many different topics. This format meant there was something to interest everyone, regardless of experience, and was great for keeping concentration levels high despite the jet lag! 

Birds of a feather sessions were held on the third day of the conference, with a last-minute entry from the Universal Viewer team – although lack of space meant that this was an impromptu meeting in the Kerckhoff Coffee House. However, this meant we were able to plan future work, specifically on annotations, in the sunshine on the terrace. 


Attendees of the UV Birds of a Feather session at the Kerckhoff Coffee House

Here are our takeaways!

Lanie Okorodudu: I was interested in how IIIF resources and IIIF-related tools could be used as part of curriculums in online learning platforms to create meaningful, knowledge-rich experiences for students. I was also intrigued by “Tropiiify”, a plug-in for exporting IIIF collections that is designed for non-technical users.

Erin Burnand: I loved hearing about how IIIF can provide innovative solutions for incredible (but complex) collections such as the Judy Chicago Research Portal (Pennsylvania State University Library) and the work on Eastern Silk Road collections for the International Dunhuang Programme (presented by the BL’s Anastasia Pineschi).

James Misson: The conference was an amazing opportunity to connect with fellow IIIF users, from IIIF newcomers to those who helped define the original specifications. I enjoyed hearing about work on the carbon footprint of OCR, and the transformation of historical textiles into sound to make an exhibition more accessible to visually impaired people. It was inspiring to see the range of uses IIIF has, and I was especially excited by Allmaps (allmaps.org), a toolbox for working with IIIF maps. The conference was a testament to how open the IIIF community is, and everyone generously shared their knowledge with our new team – conversations that continued in the bars of Westwood and In-n-Out Burger.

Saira Akhter: I found the discussions on the use of AI within IIIF interesting, such as for facial recognition within historic photographs and future integration with OCR/HTR tools and outputs. The showcase at the Getty was great for learning more about IIIF itself, and it was cool to see how the idea for IIIF was first written on a napkin at a restaurant. I also enjoyed seeing more novel uses of IIIF, such as for importing paintings into Animal Crossing.

Recordings of the conference are now available on YouTube.   

26 June 2024

Join the British Library as a Digital Curator, OCR/HTR

This is a repeated and updated blog post by Dr Adi Keinan-Schoonbaert, Digital Curator for Asian and African Collections. She shares some background information on how a new post advertised for a Digital Curator for OCR/HTR will help the Library streamline post-digitisation work to make its collections even more accessible to users. Our previous run of this recruitment was curtailed due to the cyber-attack on the Library - but we are now ready to restart the process!

 

We’ve been digitising our collections for about three decades, opening up access to incredibly diverse and rich collections, for our users to study and enjoy. However, it is important that we further support discovery and digital research by unlocking the huge potential in automatically transcribing our collections.

We’ve done some work over the years towards making our collection items available in machine-readable format, in order to enable full-text search and analysis. Optical Character Recognition (OCR) technology has been around for a while, and there are several large-scale projects that produced OCRed text alongside digitised images – such as the Microsoft Books project. Until recently, Western languages print collections have been the main focus, especially newspaper collections. A flagship collaboration with the Alan Turing Institute, the Living with Machines project, applied OCR technology to UK newspapers, designing and implementing new methods in data science and artificial intelligence, and analysing these materials at scale.

OCR of Bengali books using Transkribus, Two Centuries of Indian Print Project

Machine Learning technologies have been dealing increasingly well with both modern and historical collections, whether printed, typewritten or handwritten. Taking a broader perspective on Library collections, we have been exploring opportunities with non-Western collections too. Library staff have been engaging closely with the exploration of OCR and Handwritten Text Recognition (HTR) systems for English, Bangla, Arabic, Urdu and Chinese. Digital Curators Tom Derrick, Nora McGregor and Adi Keinan-Schoonbaert teamed up with PRImA Research Lab and the Alan Turing Institute to run four competitions in 2017-2019, inviting providers of text recognition methods to try them out on our historical material.

We have been working with Transkribus as well – for example, Alex Hailey, Curator for Modern Archives and Manuscripts, used the software to automatically transcribe 19th century botanical records from the India Office Records. A digital humanities work strand led by former colleague Tom Derrick saw the OCR of most of our digitised collection of Bengali printed texts, digitised as part of the Two Centuries of Indian Print project. More recently Transkribus has been used to extract text from catalogue cards in a project called Convert-a-Card, as well as from Incunabula print catalogues.

An example of a catalogue card in Transkribus, showing segmentation and transcription

We've also collaborated with Colin Brisson from the READ_Chinese project on Chinese HTR, working with eScriptorium to enhance binarisation, segmentation and transcription models using manuscripts that were digitised as part of the International Dunhuang Programme. You can read more about this work in this brilliant blog post by Peter Smith, who did a PhD placement with us last year.

The British Library is now looking for someone to join us to further improve the access and usability of our digital collections, by integrating a standardised OCR and HTR production process into our existing workflows, in line with industry best practice.

For more information and to apply please visit the ad for Digital Curator for OCR/HTR on the British Library recruitment site. Applications close on Sunday 21 July 2024. Please pay close attention to questions asked in the application process. Any questions? Drop us a line at [email protected].

Good luck!

24 June 2024

China trip report – IDP, DH, and everything in between

This blog post is by Dr Adi Keinan-Schoonbaert, Digital Curator for Asian and African Collections, British Library. She's on Mastodon as @[email protected]. 

 

In April 2024 I was part of a British Library delegation to China, which was a wholesome and fulfilling experience. It aimed to refresh collaborations and partnerships with the National Library of China and the Dunhuang Academy, explore new connections and strengthen existing ones with many other institutions and individuals. I will explore this trip through a digital scholarship lens, but you can read all about the trip and its larger aims and accomplishments in a post on the IDP blog by International Dunhuang Programme Project Manager, Anastasia Pineschi.

The Mogao Caves in Dunhuang

My primary objective was to attend and present at the IDP conference (19-20 April 2024), co-organised by the British Library and the Dunhuang Academy and synchronised with IDP’s 30th anniversary and the launch of a new, fresh and accomplished IDP website. This meant sharing our work and learning from others during the conference and the IDP workshop that took place the following day. But I was also looking to reconnect with peers and get to know new colleagues working in the fields of DH and the interchange of AI, cultural heritage and historical digital collections; explore opportunities for collaboration in the field of OCR/HTR (Optical Character Recognition, Handwritten Text Recognition); and get ideas for DH opportunities for IDP.

British Library and Dunhuang Academy colleagues in front of Mogao Cave 96 (Nine Story Temple) 

Colleagues from the Dunhuang Academy showed us such outstanding hospitality, with our Dunhuang trip including many behind-the-scenes visits and unique experiences. These included, naturally, the extraordinary Mogao Grottoes, but also another cave site called the Western Thousand Buddha Caves, and stunning natural spots such as the Singing Sand Dune (Mingsha Mountain) and the Crescent Moon Spring. We also visited places such as the Digital Exhibition and Visitor Center, the Multi-field lab at the Dunhuang Studies Information Center, the Grottoes Monitoring Center and Conservation Lab, and the Dunhuang City Museum. All have left long-lasting impressions. 

One of the dashboards managing the Mogao Grottoes at the Grottoes Monitoring Center

But let’s get back to the main purpose of this post, which is to report on some of the outstanding work happening out there at the intersection of Chinese historical collections and DH.

 

Conference (DH) Highlights  

I’ll start with one of the earliest platforms to enable and encourage DH research in the context of Chinese works, the Chinese Text Project. Dr Donald Sturgeon (Durham University) presented this well-known digital library of pre-20th century Chinese texts, which started in 2005 and is still impressively active, being one of the largest and most widely used digital libraries of premodern Chinese texts. Crowdsourcing and AI are now used to enhance the texts available via this platform. Machine Learning OCR is used to automate transcriptions, automated punctuation is added through deep learning, and OCR corrections are done via a crowdsourcing interface. This sees quite a high volume of engagement, typically ca. 1,000 edits per day! Sturgeon also talked about the automated annotation of named historical entities in transcribed texts, as well as using deep learning to assert periods and dates, being able to transition between Chinese and Western calendars. These annotations can then turn into structured data, which enables linking up to other data.

Dr Donald Sturgeon presents about extracting structured data from annotations

While on the topic of state-of-the-art platforms, Prof Kiyonori Nagasaki (International Institute for Digital Humanities, Tokyo) talked about the SAT Daizokyo Text Database, a digital editing system for Buddhist canons and manuscripts which uses AI-OCR developed and recently released by the National Diet Library of Japan. Its IIIF-compliant database of Buddhist icons includes over 20,000 annotated items, enabling search by various attributes. Nagasaki gave us a website demo, displaying an illustration with 400 annotations. One can search annotated parts of this image and compare images in the search results. Like the Chinese Text Project, the SAT platform also incorporates crowdsourcing ‘editing’ with clever Machine Learning techniques. It was good to hear that there is an intention for SAT to gradually include Dunhuang manuscripts in the future.

Prof Kiyonori Nagasaki demonstrated how the interface interaction is facilitated by IIIF: clicking on the text brings up the right area in the IIIF image

Another well-established, IIIF-based system, presented by Dr Hongxing Zhang (V&A Museum), is the Chinese Iconography Thesaurus (CIT). CIT has been an ongoing project since 2016, developed at the V&A and aiming to work towards a subject indexing standard for Chinese art. A system of controlled vocabulary is crucial to improve access to collections and to link up multiple collections. CIT focuses on Chinese iconography – motifs, themes, and subject matters of cultural objects, with almost 15,000 concepts and entities. And it’s IIIF-supported: images and annotations can be viewed in the IIIF Mirador lightbox.

Not just Chinese

While much of the work around Dunhuang or Silk Road manuscripts has to do with the Chinese language, several scholars emphasised the importance of addressing other languages as well. Dunhuang manuscripts were written in languages such as Sogdian, Middle Persian, Parthian, Bactrian, Tocharian, Khotanese, Sanskrit, Tibetan, Old Uighur, and Tangut. Prof Xinjiang Rong (Peking University) emphasised the importance of providing transcriptions, transliterations and translations alongside digitised images. These languages require special language expertise; therefore, cooperation between institutions and scholars is crucial. Prof Tieshan Zhang (Minzu University of China) also urged researchers to address and publish non-Chinese Dunhuang manuscripts. He especially highlighted the importance of making better use of text recognition technologies for languages other than Chinese. Last year, the Computer Science department of Minzu University of China applied for a research project to do just that. They started with non-Chinese languages and aim to increase recognition accuracy to over 90%.

The talk by Prof Hannes Fellner (University of Vienna) came as a perfect example of how one could address the study of material in other languages, using computational methods. He introduced a project aiming to trace the development of Tarim Brahmi – one of the major writing systems of the Eastern Silk Road during the 1st millennium CE, which includes Khotanese, Sanskrit, Tocharian, and Saka. The project compiles a database of characters in Tarim Brahmi languages (currently primarily Tocharian), with palaeographic and linguistic annotations, presented as a web application. With the aim to create a research tool for texts in this writing system, such a platform could facilitate the study of palaeographic variation, which in turn could help explore scribal identification, language development stages, and correlations between palaeographic and linguistic variations. Fellner works with Transkribus and IIIF to retrieve the coordinates of characters and words, returning the relevant ‘cut-outs’ of the photos to the web application. These can then be visualised, displaying character or word variations alongside their transliteration.
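To illustrate the general idea of a IIIF ‘cut-out’ (this is a minimal sketch, not the project’s own code): the IIIF Image API lets a client request a rectangular region of an image simply by putting pixel coordinates in the URL. The base URL and the coordinates below are made-up placeholders; in practice the coordinates would come from a text recognition tool’s layout output.

```python
def iiif_region_url(image_base: str, x: int, y: int, w: int, h: int,
                    size: str = "max") -> str:
    """Build a IIIF Image API URL for one rectangular region.

    URL pattern: {image_base}/{region}/{size}/{rotation}/{quality}.{format}
    where region is "x,y,w,h" in pixels of the full image.
    """
    region = f"{x},{y},{w},{h}"
    return f"{image_base.rstrip('/')}/{region}/{size}/0/default.jpg"


# Hypothetical image service and word bounding box, for illustration only.
url = iiif_region_url("https://example.org/iiif/abc123", 10, 20, 300, 400)
print(url)
```

Because the region is part of the URL, a web application only needs to store coordinates per character or word; the image server does the cropping on request.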

Prof Hannes Fellner shows how working with Transkribus and IIIF makes it possible to retrieve ‘cut-outs’ from photographs corresponding to the query string

Coming back to Chinese OCR/HTR, there’s quite a lot of activity in this area. I presented work at the British Library aiming to advance Chinese HTR methods, in the wider context of the Library’s OCR/HTR work. We’ve focused on using the eScriptorium platform by collaborating with Colin Brisson (École Pratique des Hautes Études) and the French consortium Numerica Sinologica (now working on the READ_Chinese project). I talked about the work of our PhD Placement student, Peter Smith (University of Oxford), contributing to processes such as binarisation, segmentation and text recognition. I recently presented this work at Ryukoku University in Kyoto, and you can read more about it in Peter’s excellent blog post.

Dr Adi Keinan-Schoonbaert talking about OCR/HTR activities at the British Library

 

Dunhuang online platforms

It is crucial to embed such technologies and software into user-friendly platforms, where different functionalities are available for different types of needs and audiences. Dr Peter Zhou (University of California, Berkeley) talked about the importance of building a sustainable platform that can support the complete digital lifecycle, including data curation and management, long-term preservation, and dissemination. Zhou’s objectives for the Digital Dunhuang platform are to connect resources that are otherwise isolated, featuring uniform standards for data exchanges. Such a platform must support different kinds of data formats, including raw images, historical photos, videos, cave QTVRs, digitised texts and artifacts, reproductions, microfilm, interactive visuals, conservation data, spatial info, 3D modelling data, and immersive media. This Digital Dunhuang platform should be flexible, able to scale up and deal with mass content in different formats, have Machine Learning capabilities, and aggregate knowledge content through linking.

We can see many of these elements in a platform developed by the Dunhuang Academy. Xiaogang Zhang and Tianxiu Yu of the Dunhuang Academy introduced the Digital Library Cave platform (Digital Dunhuang), built in collaboration with Tencent, and its plans. The platform presents both a database of Dunhuang materials and murals, as well as a playable game focused on the narrative of the Library Cave. This platform displays an engaging, immersive mixture of 3D environments and artifacts, in addition to 2D items. The aim for the Digital Dunhuang platform is to present digital resources relating to the Mogao Grottoes in one integrated and comprehensive resource for Dunhuang studies. (Side note: access to the database requires a login and input of personal data). 

Tianxiu Yu showing a Knowledge Graph connecting different types of data resources

The richness and variety of data available now and in future on this platform is remarkable. The entire cliff of the Mogao Grottoes and some of the large-scale cultural relics are available in 3D, and this is complemented by other data used in conservation and research. And there’s an impressive array of AI technologies applied to both images and texts. For images, annotated mural datasets and automatic object detection would allow for search and retrieval; AI is used to enhance old photos; line drawings are extracted from art scenes; and image stitching is automated. For texts, functionalities will include, at a later stage, character text recognition, providing full text retrieval at a 90% precision rate; Traditional to Simplified Chinese conversion; automatic punctuation; entity extraction; and the creation of knowledge graphs. When completed, this platform will be open and share all resources available online.

With a solid focus on text retrieval and analysis, Dr Xiaoxing Zhao (Dunhuang Academy) presented about the Dunhuang Documents Database, collating digitised manuscripts and prints dating from the 4th to the 11th centuries discovered in the Library Cave at Mogao, Dunhuang. Providing full-text retrieval for Chinese, Tibetan, and Uighur (and a plan to add Tangut), it includes search functionality using keywords, and features transliteration in Traditional Chinese, which can be conveniently viewed alongside the image. It’s great to see how far AI text recognition has come! 

Dr Xiaoxing Zhao demonstrating the Dunhuang Documents Database’s transliteration in Traditional Chinese, which can be seen side by side with the image

However, technological advances are not just restricted to AI and Machine Learning. Prof Simon Mahony (Emeritus Professor, UCL) gave a fascinating, image-rich talk about non-invasive and non-destructive computational imaging of ancient texts. Mahony introduced different techniques to address research questions arising from textual manuscripts. These methods allow, for example, reading illegible texts and seeing artworks, determining the composition of pigments, or detecting characteristics of ink. One of the projects that he was involved with was the Great Parchment Book project. Damaged in a fire, the book’s content became inaccessible for researchers – but a series of steps taken to digitally straighten, flatten and stretch the book, turned it back to a readable state. This and other computational methods applied to images are indeed very inspirational! 

Prof Simon Mahony talking about how computational methods were used to enable the reading of the text in the Great Parchment Book project

 

Back to Beijing 

Coming back to Beijing, we had several visits such as the National Library of China and the Palace Museum’s Conservation Department. But I’ll focus here on two visits which are directly related to DH and computational methods – the first at the Chinese Academy of Sciences (CAS), and the second at the National Key Laboratory of General Artificial Intelligence, Peking University. 

We were kindly hosted by Prof Cheng-Lin Liu from the State Key Laboratory of Multimodal AI Systems (MAIS), Institute of Automation, CAS, and joined by Drs Fei Yin, Heng Zhang, and Xiao-Hui Li. Prof Liu gave an excellent keynote talk at the Machine Learning workshop at the ICDAR2023 conference, which I attended in August 2023. It was about “Plane Geometry, Diagram Parsing and Problem Solving,” which well exemplifies MAIS’ areas of work. It is a national platform specialising in document analysis, computer vision, robotics, Machine Learning, Natural Language Processing (NLP), and medical AI research – the first to start Pattern Recognition research in China, and one of its main AI research centres. We enjoyed an excellent exchange – and a fruitful discussion.  

MAIS and British Library colleagues at the CAS offices in the Haidian District, Beijing

 

From there, we travelled to Peking University for another stimulating knowledge exchange meeting with Prof Jun Wang, Director of the Research Center for Digital Humanities (PKUDH) and Vice Dean, Artificial Intelligence Institute, joined by Dr Qi Su, Dr Pengyi Zhang, Dr Hao Yang, Honglei San, Kairan Liu, and Siyu Duan. We watched videos of two Shidian platforms – open access web platforms for reading, editing and analysing ancient Chinese books, developed through a partnership between PKUDH and the Douyin Group. One platform is the Open Access Ancient Book Reading Platform, and the second is the AI-powered Ancient Book Collation Platform. The AI-empowered editing and compiling system includes an impressive array of functionalities. 

Screenshot from the YouTube video, showing features of the Shidian reading platform

Our session also included presentations and discussions around topics such as AI character reconstruction, cultural heritage curation and crowdsourcing, automatic text annotation and linked data. For example, PhD student Siyu Duan (supervised by Prof Su Qi) presented about dealing with ancient ideograph restoration, including a little experiment on Dunhuang data that showed suggested restoration of damaged or illegible characters. The whole session was an absolute delight!  

I am so grateful for everyone's generosity and hospitality – I have learned so much, so thank you. Until next time!

Dr Adi Keinan-Schoonbaert enjoying the dunes and the Crescent Moon Spring, Dunhuang

 

21 June 2024

blplaybills.org: leveraging open data from the British Library

In this guest post, developer Sak Supple describes his work turning digitised images of playbills into fully searchable documents... Digital Curator Mia Ridge says, 'we're absolutely delighted by Sak's work, and hope that his post helps others working with digitised collections'.

Screenshot of digitised playbills showing their varied layouts and typefaces
Sample playbills from the British Library's collection

This blog post explores the creation of blplaybills.org, a website that showcases data made publicly available by the British Library.

The blplaybills.org website provides a way to search for, view and download archival playbills from Great Britain and Ireland, 1600-1902, as curated by the British Library (BL).

The website is independently produced using assets made available by the British Library under a Creative Commons licence as part of an open data initiative.

The playbill data

Playbills were promotional flyers advertising entertainment events at theatres, fairs and pleasure gardens.

The BL playbills data originated as document scans (digitised from microfilm, the most viable approach for fragile artefacts) in PDF format, each file containing hundreds of individual playbills, grouped by volume (usually organised by theatre, region and/or period of history).

In total there are more than 80,000 scanned playbills available.

Besides the PDFs, there is also metadata describing where in the Library these playbills could be found (volumes, shelfmarks etc). Including this information meant researchers could search for information online, and also have the volume reference at hand when visiting the Library.

This data is useful to anyone researching theatre, music, history and literature. Making it easy to find, view and download playbills using simple text searches over the internet is a good way to bring the playbills to a wider audience.

This is how blplaybills.org came into existence: the goal was to turn playbill data from the British Library into a searchable online database and image store.

The workflows

It is notoriously difficult to search PDF documents containing scans.

The text in these playbills is embedded in an image. This makes it especially difficult for computers to search the content of a scan, since a computer will interpret the text as a number of lines and curves within the image, without recognizing it as text.

Because internet technologies are well suited to searching for text, the first challenge is to turn the scanned playbill text into searchable text that a computer can more easily understand.

The chosen approach was to use Optical Character Recognition (OCR) software to capture text contained in the playbills.

OCR is a pattern matching technique, enhanced with machine learning, that finds text in an image by first using text detection algorithms to isolate character images, called glyphs, and comparing these with similarly stored glyphs. These glyphs are then further broken down into features (lines, loops etc), which are then used to find the best match amongst pre-trained glyphs.

The recognised text can then be processed using techniques like contextual analysis and grammar checking to improve accuracy.

The result can then be stored in a computer file to form text that a computer can recognise in the form of characters, words, phrases and sentences.

The resulting text is associated with individual playbills and related metadata, and the text and metadata stored in an online database to make it searchable.

In parallel to the above processes, high and low resolution JPEG versions of individual playbills were generated and uploaded to cloud storage for online access.

The general flow is shown below.

Diagram showing how PDFs and metadata were processed
Figure 1: Flow of data from original data to structured online resources

Each of these workflows is discussed in more detail below.

Text generation workflow

Since the goal is to make it possible to search for individual playbills, the first step was to break up PDFs containing multiple playbills into individual documents containing one playbill each.

This was done using open source software called poppler-utils that provides command line utilities for manipulating PDF documents, including generating single page documents from one multipage document.
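As a minimal sketch of this step (not the code actually used for blplaybills.org), the poppler-utils tool pdfseparate can write one PDF per page, replacing %d in the output pattern with the page number. The file and directory names here are hypothetical.

```python
from pathlib import PurePosixPath


def pdfseparate_command(volume_pdf: str, out_dir: str) -> list[str]:
    """Build a pdfseparate call that splits a volume into single-page PDFs.

    pdfseparate substitutes the page number for %d in the output pattern,
    so each page keeps the volume name plus its page number.
    """
    stem = PurePosixPath(volume_pdf).stem
    pattern = str(PurePosixPath(out_dir) / f"{stem}-%d.pdf")
    return ["pdfseparate", volume_pdf, pattern]


# Hypothetical volume file; pass the list to subprocess.run(..., check=True)
# on a machine where poppler-utils is installed.
command = pdfseparate_command("volume_001.pdf", "pages")
print(" ".join(command))
```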

The next step is to extract text using OCR. In 2018 my research showed that an effective open source solution for this was Tesseract.

Experiments showed that Tesseract produced the best results when the PDF document was converted to a lossless raster format like TIFF (Tag Image File Format) before running the OCR program. In fact, it was found that changing the size of the document, increasing the resolution and contrast and then converting to TIFF produced good output from Tesseract OCR.

The conversion from PDF to TIFF for each playbill was achieved using open source software called ImageMagick.
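The two commands in this step might look something like the sketch below. The specific values (300 dpi, 200% resize, greyscale, LZW compression) are assumptions chosen for illustration, not the settings used for blplaybills.org, and the right values would need to be found by experiment as described above.

```python
def tiff_command(page_pdf: str, tiff_path: str) -> list[str]:
    """Build an ImageMagick call that rasterises one PDF page to TIFF."""
    return [
        "convert",               # "magick" in ImageMagick 7
        "-density", "300",       # rasterise at 300 dpi (assumed value)
        page_pdf,
        "-colorspace", "Gray",   # drop colour information
        "-resize", "200%",       # enlarge the page (assumed value)
        "-normalize",            # stretch contrast across the full range
        "-compress", "LZW",      # lossless TIFF compression
        tiff_path,
    ]


def tesseract_command(tiff_path: str, output_base: str,
                      lang: str = "eng") -> list[str]:
    """Build a Tesseract call; Tesseract writes <output_base>.txt."""
    return ["tesseract", tiff_path, output_base, "-l", lang]


# Hypothetical file names for a single playbill page.
print(" ".join(tiff_command("volume_001-12.pdf", "volume_001-12.tiff")))
print(" ".join(tesseract_command("volume_001-12.tiff", "volume_001-12")))
```

Note that -density is placed before the input file, because it controls how the PDF is rasterised when it is read.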

This workflow is shown below.

Diagram from multipage PDF to single page to high contrast TIFF to OCR text file
Figure 2: Workflow to produce OCR text for each individual playbill

Doing this for 80,000+ individual playbills was achieved by automating the above workflow and processing multiple playbills in parallel. The individual playbills could be uniquely identified by the name of the original multipage PDF, together with the page number of the playbill.
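One way to sketch the parallel processing and the identification scheme is shown below. The exact ID format (volume name, underscore, zero-padded page number) and the worker function are hypothetical; a thread pool is used on the assumption that each worker mostly waits on external programs such as the OCR tool.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import PurePosixPath


def playbill_id(volume_pdf: str, page: int) -> str:
    """Identify a playbill by its source PDF name and page number."""
    return f"{PurePosixPath(volume_pdf).stem}_{page:04d}"


def process_all(jobs, worker, max_workers=4):
    """Run worker(volume_pdf, page) for every job in parallel.

    Returns a dict mapping each playbill ID to the worker's result.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            playbill_id(volume, page): pool.submit(worker, volume, page)
            for volume, page in jobs
        }
        return {pid: future.result() for pid, future in futures.items()}


# Stand-in worker for illustration; a real one would run the OCR workflow.
results = process_all(
    [("scans/vol1.pdf", 1), ("scans/vol1.pdf", 2)],
    lambda volume, page: f"text of page {page}",
)
print(results)
```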

Two other workflows were set up to work in parallel with this:

  • Convert individual PDF playbills into high and low resolution JPEGs for online viewing
  • Add metadata to the OCR text (volume, shelfmark, date, theatre etc) to produce a JSON file, and upload and index this information in a searchable online database

JPEG generation

As individual PDF playbills were generated from multipage PDFs, a copy of each single page PDF was sent to the JPEG generation workflow where its arrival triggered the workflow.

ImageMagick was used to create thumbnail and high resolution JPEG versions of the playbill suitable for online viewing.

The resulting JPEG files, identified by the original PDF filename and page number of the playbill, were then uploaded to cloud storage.

JSON generation

A popular choice to store searchable text in JSON format is a database called Elasticsearch. This provides fast indexing and search capabilities, and is available for non-commercial use.

This JSON should include the searchable playbill text and relevant metadata.

Each output from the text generation workflow triggered the JSON generation, allowing metadata for the individual playbill to be merged with OCR text into a single JSON file.
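The merge itself can be very simple. In this sketch the field names ("id", "text", and the metadata keys) are assumptions made for illustration; the actual schema behind blplaybills.org is not described in this post.

```python
import json


def build_playbill_json(playbill_id: str, ocr_text: str, metadata: dict) -> str:
    """Merge OCR text and catalogue metadata into one JSON document."""
    document = dict(metadata)          # copy, so the caller's dict is untouched
    document["id"] = playbill_id
    document["text"] = ocr_text.strip()
    return json.dumps(document, ensure_ascii=False, sort_keys=True)


# Hypothetical playbill record.
print(build_playbill_json(
    "vol1_0012",
    " THEATRE ROYAL \n",
    {"volume": "vol1", "theatre": "Theatre Royal"},
))
```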

The resulting JSON was uploaded and indexed in an online Elasticsearch database. This became the searchable datastore for the web application that researchers use when visiting blplaybills.org.

The search interface

At this point the data is stored in a searchable online database, and images of individual playbills have been made available in online cloud storage.

The next step is to allow researchers to search for, view and download playbills.

The main requirements of the interface are:

  • Simple text search to return playbills containing matching text
  • These results to be quickly filtered using faceted search based on date, theatre, location, organisation and volume
  • Quick copy of playbill text
  • View and download a high resolution version of the playbill
  • Responsive design
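The first two requirements map naturally onto an Elasticsearch request body: a full-text match, optional exact-value filters, and one terms aggregation per facet to supply the counts shown beside each filter. This is a sketch only; the field names are assumed, and it further assumes the facet fields are indexed as exact-match (keyword) fields.

```python
FACETS = ("date", "theatre", "location", "organisation", "volume")


def build_search_body(text: str, filters=None, size: int = 20) -> dict:
    """Build an Elasticsearch query body for a faceted playbill search."""
    filters = filters or {}
    return {
        "size": size,
        "query": {
            "bool": {
                # Full-text search over the OCR text of each playbill.
                "must": [{"match": {"text": text}}],
                # Exact-value filters chosen by the user from the facets.
                "filter": [
                    {"term": {field: value}} for field, value in filters.items()
                ],
            }
        },
        # One bucket list per facet, used to display the available filters.
        "aggs": {facet: {"terms": {"field": facet}} for facet in FACETS},
    }


body = build_search_body("Hamlet", {"theatre": "Theatre Royal, Drury Lane"})
print(body["query"])
```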

The interface is shown in Figure 3 below.

Screenshot of the blplaybills.org search interface
Figure 3: Online search interface

The web interface is hosted in AWS/EC2 (Amazon Web Services cloud compute service) and uses standard web frameworks used for the creation of single page applications.

Software development

Wherever open source software was available it was used: Tesseract, ImageMagick and poppler-utils.

Some software development was necessary to create backend workflows, and to automate and integrate them with each other.

This was achieved using a combination of scripting (NodeJS, Bourne shell and Python) and C programs.

The front-end was developed with JavaScript, NodeJS, Angular and HTML5/CSS3.

Recent work and next steps

I recently made some modifications to the above approach to improve the quality of OCR generated text for each playbill.

Specifically, Tesseract has been replaced by a utility called textra (Swift/MacOS) that uses the Apple Vision framework for character recognition. This significantly improved the quality of the text generated by the OCR process, resulting in improved search accuracy. This technology was not available in 2018 when blplaybills.org was first created.

Another method to improve the accuracy of search might be to enhance OCR text with text transcribed as part of a crowdsourcing initiative from the British Library: In the Spotlight. This involved members of the public transcribing titles, names and locations in playbills. By adding this information to the indexed data already generated, search accuracy could be further improved.

An interesting piece of research would be to consider if LLMs (Large Language Models) could be fine tuned to enhance the results of traditional OCR techniques.

The goal would be to find a generalised approach that uses modern natural language processing techniques to improve the automatic transcription of less machine-readable archival material such as, but not limited to, these playbills. Ideally these techniques could also be applied to multi-lingual material.

This will be the focus of future work to improve the data behind blplaybills.org.