THE BRITISH LIBRARY

Digital scholarship blog

2 posts from October 2018

29 October 2018

Using Transkribus for automated text recognition of historical Bengali Books

In this post Tom Derrick, Digital Curator, Two Centuries of Indian Print, explains the Library's recent use of Transkribus for automated text recognition of Bengali printed books.

Are you working with digitised printed collections that you want to 'unlock' for keyword search and text mining? Maybe you have already heard about Transkribus but thought it could only be used for automated recognition of handwritten texts. If so, you might be surprised to hear it does a pretty good job with printed texts too. You might be even more surprised to hear it does an impressive job with printed texts in Indian scripts! At least that is what we have found from recent testing with a batch of 19th century printed books written in Bengali script that have been digitised through the British Library’s Two Centuries of Indian Print project.

Transkribus is a READ project and is available as a free tool for users who want to automate recognition of historical documents. The British Library has already had some success using Transkribus on manuscripts from our India Office collection, and it was that which inspired me to see how it would perform on the Bengali texts, which provide an altogether different type of challenge.

For a start, most text recognition solutions either do not support Indian scripts, or do not come close to the level of recognition they achieve with documents written in English or other Latin scripts. In part this is down to supply and demand. Mainstream providers of tools have prioritised Western customers, but there is also a relative lack of digitised Indian texts that can be used to train text recognition engines.

These text recognition engines have also been trained largely on modern dictionaries, whereas a collection of historical texts like the Bengali books will often contain words that are no longer in use. Their aged physicality also brings with it the delights of faded print, blotchy paper and other paper-based gremlins that keep conservationists in work yet disrupt automated text recognition. Throw in an extensive alphabet that contains more diverse and complicated character forms than English and you can start to piece together how difficult it can be to train recognition engines to achieve comparable results with Bengali texts.

So it was with more hope than expectation that I approached Transkribus. We began by selecting 50 pages from the Bengali books that represent, as far as possible, the variety of typographical and layout styles within the wider collection of c. 500,000 pages. Not an easy task! We uploaded these to Transkribus, manually segmenting paragraphs into text regions and automating line recognition. We then manually transcribed the texts to create a ground truth which, together with the scanned page images, was used to train the recurrent neural network within Transkribus to create a model based on the 5,700 transcribed words.

View of a segmented page from one of the British Library's Bengali books along with its transcription, within the Transkribus viewer.

The model was tested on a few pages from the wider collection, and the results are shown in the graph below. The model achieved an average character error rate (CER) of 21.9%, which is comparable to the best results we have seen from other text recognition services. Word accuracy of 61% was based on the number of words that were misspelled in the automated transcription compared to the ground truth. Eventually we would like to use automated transcriptions to support keyword searching of the Bengali books online, and the higher the word accuracy, the greater the chances of users pulling back all relevant hits from their keyword search. We noticed the results often missed the upper zone of certain Bengali characters, i.e. the part of the character or glyph which resides above the matra line that connects characters in Bengali words. Further training focused on recognition of these characters may improve the results.

Graph showing the learning curve of the Bengali model using the Transkribus HTR tool.
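
For readers unfamiliar with the two measures quoted above, here is a minimal Python sketch of how a character error rate and a word accuracy figure can be computed from a ground truth and an automated transcription. It is an illustration only and not necessarily how Transkribus calculates its figures: it assumes plain edit distance (insertions, deletions and substitutions), counts Unicode code points as characters (so Bengali vowel signs count separately), and splits words on whitespace. The function names and the sample strings are our own.

```python
def levenshtein(ref, hyp):
    """Edit distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(ground_truth, recognised):
    """Character-level edits needed, divided by ground truth length."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return levenshtein(ground_truth, recognised) / len(ground_truth)


def word_accuracy(ground_truth, recognised):
    """Share of ground truth words recovered, based on word-level edits."""
    ref_words = ground_truth.split()
    hyp_words = recognised.split()
    if not ref_words:
        raise ValueError("ground truth must contain at least one word")
    errors = levenshtein(ref_words, hyp_words)
    return max(0.0, 1 - errors / len(ref_words))


# Illustrative strings only: the recognised line drops one vowel sign
# from the second word of the ground truth.
truth = "আমার সোনার বাংলা"
output = "আমার সোনর বাংলা"
print(f"CER: {character_error_rate(truth, output):.1%}")
print(f"Word accuracy: {word_accuracy(truth, output):.1%}")
```

A single wrong character makes the whole word count as an error, which is why word accuracy is usually noticeably lower than the character-level figure would suggest, and why it matters more for keyword search.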

Our training set of 50 pages is very small compared to other projects using Transkribus and so we think the accuracy could be vastly improved by creating more transcriptions and re-training the model. However, we're happy with these initial results and would encourage others in a similar position to give Transkribus a try.

 

 

03 October 2018

The submission deadline for BL Labs Awards 2018 is next week!

The British Library has a vast, and continuously expanding, collection of material in digital form. You can dig into our datasets with text and data mining tools, conduct image analysis while listening to wildlife recordings, browse thousands of digitised manuscripts, and get lost in the million public domain images from BL publications available on Flickr.

To celebrate the variety of ways in which people have engaged with these amazing resources, the British Library Labs team run an Awards competition every autumn. Awards are given for completed projects in four categories:

  • Research - A project or activity which shows the development of new knowledge, research methods, or tools.
  • Commercial - An activity that delivers or develops commercial value in the context of new products, tools, or services that build on, incorporate, or enhance the Library's digital content.
  • Artistic - An artistic or creative endeavour which inspires, stimulates, amazes and provokes.
  • Teaching / Learning - Quality learning experiences created for learners of any age and ability that use the Library's digital content.

The competition is open to applicants from anywhere in the world – providing they have based their work on the British Library’s data or digital collections. There is also a Staff Award for a project by a current member of staff (or a team) at the British Library. In each category, winners receive £500 and runners-up £100 – as well as fame, glory and prestige, of course – and will be presented with their awards at the annual BL Labs Symposium on Monday 12th November 2018.

The deadline for submitting your project for one of this year’s external awards is midnight (BST) on Thursday 11th October – just over a week from now! You can read the small print (Terms & Conditions etc) here, and submit your entry using this online form.

Lucky British Library staff get an extra 12 hours to submit a project for an award – deadline midday (BST) on Friday 12th October.

BL Labs Awards 2017 Winners. Top-Left, Research – A large-scale comparison of world music corpora with computational tools; Top-Right, Commercial – Movable Type: The Card Game; Bottom-Left, Artistic – Imaginary Cities; Bottom-Right, Teaching/Learning – Vittoria’s World of Stories.

We encourage applications from all fields: digital humanities researchers, artists, musicians, entrepreneurs, game designers, writers, poets, statisticians, library scientists … the list really is endless and every year we are surprised and delighted by the new ways in which the digital collections have been used. If you would like to read about some of the previous projects, click on the links below which take you to blogs about last year’s star entrants. You can also browse previous submissions in any of the categories using this guide to the digital projects archive.

Read about some of the fantastic projects that won awards in 2017:

So hurry - get your applications in, and join the party on the 12th November!!

For any further information about BL Labs or our Awards, please contact us at labs@bl.uk.