THE BRITISH LIBRARY

Digital scholarship blog

Enabling innovative research with British Library digital collections


23 January 2018

Using Transkribus for handwritten text recognition with the India Office Records

In this post, Alex Hailey, Curator, Modern Archives and Manuscripts, describes the Library's work with handwritten text recognition.

National Handwriting Day seems like a good time to introduce the Library’s initial work with the Transkribus platform to produce automatic Handwritten Text Recognition models for use with the India Office Records.

Transkribus is produced and supported as part of the READ project, and provides a platform 'for the automated recognition, transcription and searching of historical documents'. Users upload images and then identify areas of writing (text regions) and lines within those regions. Once a page has been segmented in this way, users transcribe the text to produce a 'ground truth' transcription – an accurate representation of the text on the page. The ground truth texts and images are then used to train a recurrent neural network to produce a tool to transcribe texts from images: a Handwritten Text Recognition (HTR) model.
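
Transkribus stores this segmentation and transcription data in the PAGE XML format, so ground truth can also be inspected and processed outside the platform. As a minimal sketch (the file name is hypothetical, and the namespace below is the 2013 PAGE schema commonly used by Transkribus exports; check your own export if parsing fails), the following Python extracts the transcribed text of every line in a page:

import xml.etree.ElementTree as ET

# PAGE XML namespace (2013 schema, commonly used by Transkribus exports;
# verify against your own export, as other schema versions exist).
NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"}

def extract_lines(path):
    """Return the transcribed text of every TextLine in a PAGE XML file."""
    root = ET.parse(path).getroot()
    lines = []
    for line in root.iter("{%s}TextLine" % NS["pc"]):
        unicode_el = line.find("pc:TextEquiv/pc:Unicode", NS)
        if unicode_el is not None and unicode_el.text:
            lines.append(unicode_el.text)
    return lines

print("\n".join(extract_lines("page_0001.xml")))  # hypothetical file name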

Image: page segmented using the automated line identification tool; the document structure tree can be seen in the left panel.

After hearing about the project at the Linnean Society’s From Cabinet to Internet conference in 2015, we decided to run a small pilot project using material digitised as part of the Botany in British India project.

Producing ground truth text and Handwritten Text Recognition (HTR) models

We created an initial set of ground truth training data for 200 images, produced by India Office curators with the help of a PhD student. This data was sent to the Transkribus team to produce our first HTR model. We also supplied material for the construction of a dictionary to be used alongside the HTR, based on the text from the botany chapter of Science and the Changing Environment in India 1780-1920 and contemporary botanical texts.

The accuracy of an HTR model can be determined by generating an automated transcription, correcting any errors, and then comparing the two versions. The Transkribus comparison tool calculates a Character Error Rate (CER) and a Word Error Rate (WER), and also provides a handy visualisation. With our first HTR model we saw an average CER of 30% and WER of 50%, which reflected the small size of the training set and the number of different hands across the collections.

(Transkribus recommends using collections with one or two consistent hands, but we thought we would push on regardless to get an idea of the challenges when using complex, multi-authored archives).
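
Both measures are normalised edit distances: the minimum number of single-character (CER) or whole-word (WER) insertions, deletions and substitutions needed to turn the automated transcription into the ground truth, divided by the length of the ground truth. A minimal sketch of the calculation in Python (independent of Transkribus's own comparison tool; the example strings are invented):

def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions
    needed to turn sequence a into sequence b."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[len(b)]

def cer(hypothesis, reference):
    """Character Error Rate: edits over characters."""
    return levenshtein(hypothesis, reference) / len(reference)

def wer(hypothesis, reference):
    """Word Error Rate: edits over whitespace-separated words."""
    return levenshtein(hypothesis.split(), reference.split()) / len(reference.split())

auto = "Tne Botanic Gardcn at Calcutta"   # invented automated transcription
truth = "The Botanic Garden at Calcutta"  # invented ground truth
print(f"CER: {cer(auto, truth):.1%}, WER: {wer(auto, truth):.1%}")

Note how a single wrong character counts as one error out of many characters for CER, but makes the whole word wrong for WER, which is why WER is always the higher of the two figures.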

Image: an example page with 18.5% WER and 9.5% CER, illustrating that WER and CER are quite unforgiving measures of accuracy.

For our second model we created an additional 500 pages of ground truth text, resulting in a training set of 83,358 words over 14,599 lines. We saw a marked improvement in results with this second HTR model – an average WER of 30%, and CER of 15%.

Image: graph showing the learning curve for our second HTR model, measured in CER.

Improvements in the automatic layout detection and the ability to run the HTR over images in batch mean that we can now generate ground truth more quickly by correcting computer-produced transcriptions than we could through a fully manual process. We have since generated and corrected an additional 200 pages of transcriptions, and have expanded the training dataset for our next HTR model.

Lessons learned and next steps

We have now produced over 800 pages of corrected transcriptions using Transkribus, and have a much better idea of the challenges that the India Office material poses for current HTR technologies. Pages with margins and inconsistent paragraph widths prove challenging for the automatic layout detection, although the line identification has improved significantly, and tends to require only minor corrections (if any). Faint text, numerals, and tabulated text appeared to pose problems for our HTR models, as did particularly elaborate or lengthy ascenders and descenders.

More positively, we have signed a Memorandum of Understanding with the READ project, and are now able to take part in the exciting conversations around the transcription and searching of digitised manuscript materials, which we can hopefully start to feed into developments at the Library. The presentations from the recent Transkribus Conference are a good place to start if you want to learn more.

The transcriptions will be made available to researchers via data.bl.uk, and we are also planning to use them to test the ingest and delivery of transcriptions for manuscript material via the Universal Viewer.

By Alex Hailey, Curator, Modern Archives and Manuscripts

22 January 2018

BL Labs 2017 Symposium: Data Mining Verse in 18th Century Newspapers by Jennifer Batt

Dr Jennifer Batt, Senior Lecturer at the University of Bristol, reported on an investigation that used text- and data-mining methods to find verse in the British Library's Burney Collection of digitised eighteenth-century newspapers, with the aim of recovering a complex, expansive, ephemeral poetic culture that has been lost to us for well over 250 years. The collection amounts to around 1 million pages: roughly 700 bound volumes comprising 1,271 titles of newspapers and news pamphlets published in London, along with some English provincial, Irish and Scottish papers, and a few examples from the American colonies.

A video of her presentation is available below:

Jennifer's slides are available on SlideShare by clicking on the image below or following the link:

Datamining for verse in eighteenth-century newspapers

https://www.slideshare.net/labsbl/datamining-for-verse-in-eighteenthcentury-newsapers


19 January 2018

BL Labs 2017 Symposium: Imaginary Cities by Michael Takeo Magruder - Artistic Award Winner

Artist Michael Takeo Magruder has been working with the British Library's digitised collections to produce stunning and thought-provoking artworks for his project, Imaginary Cities. This is an Arts-meets-Humanities research project exploring how large digital repositories of historical cultural materials can be used to create new born-digital artworks and real-time experiences which are relevant and exciting to 21st century audiences.

The project uses images - and the associated metadata - of pre-20th century urban maps drawn from the British Library's online 1 Million Images from Scanned Books collection on Flickr Commons, transforming this material into provocative fictional cityscapes.
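
By way of illustration only (this is not a description of Michael's own pipeline), the Flickr Commons collection can be queried programmatically through the public Flickr API. A minimal Python sketch, assuming you have registered your own API key; the British Library user ID and the search term below are assumptions to verify:

import requests

API_KEY = "YOUR_FLICKR_API_KEY"  # assumption: register your own key with Flickr
BL_USER_ID = "12403504@N02"      # assumed ID of the British Library's Flickr account; verify

# flickr.photos.search is part of the public Flickr REST API.
resp = requests.get("https://api.flickr.com/services/rest/", params={
    "method": "flickr.photos.search",
    "api_key": API_KEY,
    "user_id": BL_USER_ID,
    "text": "map",               # hypothetical query for urban maps
    "per_page": 10,
    "format": "json",
    "nojsoncallback": 1,
})
for photo in resp.json()["photos"]["photo"]:
    # Standard Flickr static-image URL pattern ('b' requests the large size).
    url = (f"https://live.staticflickr.com/{photo['server']}/"
           f"{photo['id']}_{photo['secret']}_b.jpg")
    print(photo["title"], url)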

Michael was unable to attend the fifth annual British Library Labs Symposium in person, but gave a virtual presentation about his work, which you can watch in the video below:

Michael was also announced as the winner of the BL Labs Artistic Award 2017 and here is a short clip of him receiving his award via Skype:

(Michael's award is announced at 14 minutes and 30 seconds into the video.)

If you are inspired to create something with the British Library's collections, find out more on the British Library Labs Awards pages; the deadline this year is midnight BST on 11 October 2018. The winners will be announced at our sixth BL Labs Symposium on Monday 12 November 2018.

Posted by Eleanor Cooper, Project Officer BL Labs.