Digital scholarship blog

Enabling innovative research with British Library digital collections

05 February 2018

Building a Handwritten Arabic Manuscript Ground Truth Dataset يد واحدة لا تصفـّق (One hand cannot clap)

Are you able to read handwritten Arabic from historical manuscripts such as these? Then we could use your help!

In conjunction with our ICFHR2018 Competition on Recognition of Historical Arabic Scientific Manuscripts, we aim to build a substantial image and ground truth dataset that can serve as a basis for advancing research in historical handwritten Arabic text analysis. This data will be made freely available to anyone wishing to advance the state of the art in optical character recognition technology.

What is Ground Truth?

The Impact Centre of Competence in Digitisation explains:

In digital imaging and OCR, ground truth is the objective verification of the particular properties of a digital image, used to test the accuracy of automated image analysis processes. The ground truth of an image’s text content, for instance, is the complete and accurate record of every character and word in the image. This can be compared to the output of an OCR engine and used to assess the engine’s accuracy, and how important any deviation from ground truth is in that instance.
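To make the comparison described above concrete, here is a minimal sketch of one common way to score OCR output against ground truth: the character error rate (CER), which is the edit distance between the two texts divided by the length of the ground truth. This is an illustration only, not necessarily the metric used in the competition. The function names and the sample strings are assumptions made up for this example.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance between two strings, counted in Unicode code points."""
    # previous[j] holds the distance between the reference prefix processed
    # so far and the first j characters of the hypothesis.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def character_error_rate(ground_truth: str, ocr_output: str) -> float:
    """Edit distance divided by ground truth length (0.0 means a perfect match)."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return edit_distance(ground_truth, ocr_output) / len(ground_truth)


# Hypothetical sample: the OCR engine dropped one letter of a four-letter word.
ground_truth = "كتاب"
ocr_output = "كتب"
print(character_error_rate(ground_truth, ocr_output))
```

Note that this sketch counts Unicode code points, so Arabic diacritics and the tatweel character each count as a separate character. A real evaluation would need to decide how to normalise such marks before comparing.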

Creating such a dataset is an enormous task, however, so we're looking to build a network of people who might be willing to spare some time to transcribe a page or two.

If you're interested in learning more, and possibly contributing, we would love to hear from you! Please send us your details and we'll be in touch about upcoming workshops and activities to be held both in London and remotely.

 
