A workshop on Optical Character Recognition for Bangla
I was fortunate enough to travel to Kolkata recently along with other members of the Two Centuries of Indian Print team, where we ran a workshop on ‘Developments with Optical Character Recognition for Bangla’. The event, which took place at Jadavpur University, proved an excellent forum for sharing knowledge in this area of growing interest, as reflected in the range of library professionals, academics and computer scientists who attended from ten institutions across Bengal and from the US.
Applying Optical Character Recognition (OCR) to printed texts is one of the key expectations of 21st century scholars and library users, who want to quickly find information online that accurately meets their research needs. Cultural institutions are gateways to millions of items containing knowledge that can transform modern research. The workshop built on our recently launched OCR Competition for Rare Indian Books and looked at the developments, challenges and opportunities of OCR in opening up vast quantities of knowledge to digital researchers.
Dr. Naira Khan from the University of Dhaka’s Computational Linguistics department kicked off the workshop by introducing the key stages of how OCR works, including ‘pre-processing’ steps such as binarisation, which reduces a scanned page of text to its binary form, removing background noise and isolating only the text on the page. Skew detection, another pre-processing technique, corrects scans with angled text, which can cause problems for OCR systems that require perfectly horizontal or vertical text. Dr. Khan moved on to explain how OCR systems segment pages into text and non-text regions, right down to pixel-level detection of word boundaries. When it comes to recognising individual characters, Bangla script presents some unique challenges, with its vast range of compound characters, vowel signs and ligatures, not to mention the distinctive top line connecting characters known as the ‘Matra’. Breaking characters into their geometric features, such as lines, arcs and circles, enables combinations of features to be formed, classified as characters and expressed in digital form as OCR output.
Dr. Khan introducing the concepts of OCR
After Dr. Khan’s inspiring talk, attendees learned of the British Library’s particular challenge in finding an OCR solution for our 19th century Bengali books currently being digitised, and the potential use of an OCR’d dataset for Digital Humanities researchers wanting to perform text and data mining. The books span an enormous range of genres, from works by religious missionaries to those covering food, science and works of fiction. Obtaining OCR would therefore enable automated searching and analysis of the full text across hundreds of thousands of pages, which could lead to exciting research discoveries in South Asian studies.
The event concluded with a practical session during which attendees used different OCR software on a sample of the BL’s digitised Bengali books. They experimented with Tesseract, Google Drive, i2ocr and newOCR. The general consensus was that Google Drive proved the most accurate, although there are other tools we have only just begun to try out, such as Transkribus, that may be useful.
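For anyone wanting to repeat the Tesseract part of the exercise, the invocation is a one-liner. This is a sketch assuming Tesseract is installed locally along with its Bengali model (the `ben` traineddata file); the filenames are placeholders, not files from the workshop.

```shell
# OCR a scanned page image with Tesseract's Bengali model.
# Writes the recognised text to output.txt alongside the image.
tesseract scanned_page.png output -l ben
```

Results will vary considerably with scan quality, which is exactly why the pre-processing steps covered earlier in the workshop matter so much for 19th century material.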
All in all, the workshop proved a really worthwhile exercise in widening knowledge among Indian institutions about the challenges and possible uses of OCR for Bangla. The work currently being undertaken by universities and technology centres using state-of-the-art machine learning techniques to perform text recognition will hopefully close the gap between Bangla (as well as other Indic scripts) and Latin scripts when it comes to efficient OCR tools.
This is a post by Tom Derrick, Digital Curator for the Two Centuries of Indian Print project.