THE BRITISH LIBRARY

Digital scholarship blog

12 March 2018

The Ground Truth: Transcribing historical Arabic Scientific Manuscripts for OCR research

Announcing a collaborative transcription project to support state-of-the-art research in automatic handwritten text recognition for historical Arabic texts

Cultural heritage institutions around the world are digitising hundreds of thousands of pages of historical Arabic manuscript and archive collections. Making these fully text searchable has the potential to truly transform scholarship, opening up this rich content for discovery and enabling large-scale analysis.

Computer scientists and scholars are working on this challenge, building systems which can automatically transcribe images of handwritten text, but for historical Arabic script a solution remains just out of reach.

Our aim is to contribute to continued research in this area by building an open image and ground truth dataset of historical handwritten Arabic texts, ensuring historical Arabic collections benefit from state-of-the-art developments in handwritten text recognition.

What is Ground Truth?

Optical Character Recognition (OCR) systems essentially turn a picture of text into text itself. In other words, they produce something like a .TXT or .DOC file from a scanned .JPG of a printed or handwritten page. Most OCR systems require ground truth, a set of files that represents the true record of elements of an image, for training and evaluation purposes.

The ground truth of an image’s text content, for instance, is the complete and accurate record of every character and word in the image.

By knowing what the system is supposed to recognise on a page of handwritten text, researchers can both train their system to recognise the characters and test how well it performs once trained.
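To make the evaluation step concrete, here is a minimal sketch of how a system's output might be scored against ground truth. It assumes the character error rate (CER), a commonly used measure in text recognition research; this post does not specify which measure any particular project uses, and the function names and sample strings below are purely illustrative.

```python
import unicodedata


def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions and substitutions that turn one string into the other."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # character missing from the hypothesis
                current[j - 1] + 1,      # extra character in the hypothesis
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def character_error_rate(ground_truth: str, ocr_output: str) -> float:
    """Edit distance divided by the length of the ground truth.
    0.0 means the output matches the ground truth exactly."""
    # Assumption: normalise both strings to NFC so that the same text encoded
    # in two different Unicode forms is not counted as an error.
    reference = unicodedata.normalize("NFC", ground_truth)
    hypothesis = unicodedata.normalize("NFC", ocr_output)
    if not reference:
        raise ValueError("ground truth must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Illustrative strings only, not taken from any manuscript.
    truth = "كتاب العلم"
    recognised = "كتاب القلم"
    print(f"CER: {character_error_rate(truth, recognised):.2f}")
```

The lower the score, the closer the system's output is to the human transcription, which is why the accuracy of the ground truth itself matters so much.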

Image: transcription in progress. View more transcriptions in progress from this manuscript (Or 3366) on the platform.

A collaborative approach

This project is a proof of concept exploring whether the creation of such a dataset can be done collaboratively at scale, using the collective expertise of volunteers around the world. At the heart of this approach is the Library’s enduring commitment to creating new and interesting ways to connect diverse communities of interest and expertise, be it scholars, the general public, computer scientists, students, or curators, around our collections. For this we are using a free and open-source platform, From the Page, which allows anyone with an interest in historical Arabic manuscripts to experience them up close, many for the first time, and to discuss, learn and share expertise in their transcription.

Helping transform research

The Digital Scholarship Department was able to fund the development of this open-source platform to support right-to-left transcription, a feature which will benefit any scholar wishing to use the software for their own transcription needs. Any transcriptions produced in this pilot will be transformed into ground truth resources, hosted by the British Library and made freely available, without rights restriction, for anyone wishing to advance the state of the art in optical character recognition technology. Specifically, resources created will be contributed to ground-breaking projects already underway such as Transkribus, the Open Islamicate Texts Initiative, the IMPACT Centre of Competence Image and Ground Truth Resources and more!

Visit the new Arabic Scientific Manuscripts of the British Library transcription platform and download our Getting Started Guide for more detail (an Arabic version will be available shortly). 

  

Posted by Nora McGregor, Digital Curator, British Library