THE BRITISH LIBRARY

Digital scholarship blog

26 February 2019

Competition to automate text recognition for printed Bangla books

You may have seen the exciting news last week that the British Library has launched a competition on recognition of historical Arabic scientific manuscripts that will run as part of ICDAR2019. We thought it only fair to cover printed material too! So we’re running another competition, also at ICDAR, for automated text recognition of rare and unique printed books written in Bangla that have been digitised through the Library's Two Centuries of Indian Print project.

Some of you may remember the Bangla printed books competition at ICDAR2017, which generated significant interest among academic institutions and technology providers both in India and across the world. The 2017 competition set the challenge of finding an optimal solution for automating recognition of Bangla printed text, and Google’s method performed best for both text detection and layout analysis.

Fast forward to 2019 and, thanks to Jadavpur University in Kolkata, we have added more ground truth transcriptions for competition entrants to train their OCR systems with. We hope the competition again attracts submissions from cutting-edge OCR methods, leading to a solution that can truly open up these historic books, dating between 1713 and 1914, for text mining. That would enable scholars of South Asian studies to explore hundreds of thousands of pages on a scale that has not been possible until now.

Image (AletheiaGroundTruth) showing a transcribed page from one of the Bengali books featured in the ICDAR2019 competition

As with the Arabic competition, we are collaborating with PRImA (Pattern Recognition & Image Analysis Research Lab) who will provide expert and objective evaluation of OCR results produced through the competition. The final results will be revealed at the ICDAR2019 conference in Sydney in September.

So if you missed out last time but are interested in testing your OCR systems on our books, the competition is now open! For instructions on how to apply and more about the competition, please visit https://www.primaresearch.org/REID2019/

 

This post is by Tom Derrick, Digital Curator for Two Centuries of Indian Print, British Library. He is on Twitter as @TommyID83, and Two Centuries of Indian Print tweets from @BL_IndianPrint.