22 March 2017
British Library Launches OCR Competition for Rare Indian Books
Calling all transcription enthusiasts! We’ve launched a competition to find an accurate and automated transcription solution for our rare Indian books and printed catalogue records, currently being digitised through the Two Centuries of Indian Print project.
The competition, in partnership with the University of Salford’s PRIMA Research Lab, is part of the International Conference on Document Analysis and Recognition, taking place in Kyoto, Japan this November. The winners will be announced at a special event during the conference.
Digitised images of the books will be made openly available through the Library’s website, and we hope this competition will produce transcriptions that enable full-text search and discovery of this rich material. Sharing XML transcriptions will also give researchers the foundation to apply computational tools and methods, such as text mining, that may lead to new insights into book and publishing history in India.
The competition is split into two challenges, and participants can enter either or both.
The first challenge is to find an automated transcription solution for the 19th-century printed books written in Bengali script. Optical Character Recognition (OCR) for many non-Latin scripts is still a developing area, and the lack of reliable tools presents a considerable barrier for libraries and other cultural institutions hoping to open up their material for scholarly research.
Above: A page from 'Animal Biography', one of the Bengali books being digitised as part of Two Centuries of Indian Print (VT 1712)
Challenge number two involves our printed catalogue records, known as ‘Quarterly Lists’. These describe books published in India between 1867 and 1967. The lists are arranged in tables, so accurately representing the layout of the data is important if researchers are to use computational methods to identify chunks of information such as the place of publication and cost of the book.
Above: A typical double page from the Quarterly Lists (SV 412/8)
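To illustrate why table layout matters for the Quarterly Lists, here is a minimal Python sketch of what becomes possible once each transcribed row keeps its cells in column order. The column names, the helper functions `row_to_record` and `field_values`, and the sample rows are all assumptions made for illustration; they are not the actual structure of the lists or the competition's output format.

```python
# Hypothetical column order for one table; the real Quarterly Lists may differ.
COLUMNS = ["title", "place_of_publication", "price"]


def row_to_record(cells, columns=COLUMNS):
    """Pair each transcribed cell with its column name, preserving table layout."""
    if len(cells) != len(columns):
        raise ValueError(f"expected {len(columns)} cells, got {len(cells)}")
    return dict(zip(columns, (cell.strip() for cell in cells)))


def field_values(rows, field):
    """Collect one field (for example, place of publication) across all rows."""
    return [row_to_record(cells)[field] for cells in rows]


# Placeholder rows invented for illustration, not taken from the Quarterly Lists.
rows = [
    ["Example Title A", " Calcutta ", "1 rupee"],
    ["Example Title B", "Dacca", "8 annas"],
]

places = field_values(rows, "place_of_publication")
print(places)
```

If the transcription loses the cell boundaries, the same text is still there, but a lookup like `field_values(rows, "place_of_publication")` is no longer possible without guessing where one field ends and the next begins.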
With the competition now open, we’ve already gone some way to helping participants by manually transcribing a few pages to create ‘ground truth’ using PRIMA's editing tool, Aletheia. You can watch a video introducing the competition. So if you or anyone you know would like to enter, do please register and you could be contributing to this landmark project, and picking up an award for your troubles!