Digital scholarship blog

Enabling innovative research with British Library digital collections

21 February 2019

Automatic Transcription of Historical Arabic Scientific Manuscripts - Round 2

I am very pleased to announce that the British Library, in collaboration with PRImA (Pattern Recognition & Image Analysis Research Lab) and the Alan Turing Institute, is launching the ICDAR2019 Competition on Recognition of Historical Arabic Scientific Manuscripts.

Why are we doing this?

The British Library has a significant collection of Arabic manuscripts, among the largest in Europe and North America. These include copies of major religious, historical, literary and scientific works. As a post-digitisation step, we aim to make their contents more discoverable and usable by creating machine-readable text from scanned images. Opening up this content for full-text search and enabling text analysis at scale can revolutionise research!

Screenshot of a page featuring handwritten Arabic text from the manuscript Add MS 7474_0032

What did we do last year?

In collaboration with the aforementioned partners, we launched a competition as part of the 16th International Conference on Frontiers in Handwriting Recognition (ICFHR2018). This competition aimed to find an optimal solution for the automatic Recognition of Historical Arabic Scientific Manuscripts (RASM2018).

For this purpose, we provided competition participants with a ground truth set – digitised images and XML files – derived from the British Library/Qatar Foundation Partnership digitised collection of historical Arabic manuscripts available on the Qatar Digital Library. The XML files in this set indicated the different text regions and lines, alongside their accurate transcriptions. Participants used the set to train their text recognition systems to automatically identify Arabic script in other images. We supplied participants with an additional set of 85 digitised images to try this out – and then PRImA evaluated the results using objective comparative evaluation methods.
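
To give a feel for what working with such a ground truth file might involve, here is a minimal Python sketch that reads text regions, text lines and their transcriptions from an XML document. It is an illustration under assumptions, not the competition's tooling: it assumes the XML follows a PAGE-style layout (TextRegion and TextLine elements, each with a Coords element carrying a points attribute, and the transcription in TextEquiv/Unicode). The embedded sample, the function names and the output dictionary layout are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample, assuming a PAGE-style layout; not taken from the dataset.
SAMPLE = """
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2017-07-15">
  <Page imageFilename="example.tif" imageWidth="1000" imageHeight="1500">
    <TextRegion id="r1">
      <Coords points="10,10 200,10 200,60 10,60"/>
      <TextLine id="l1">
        <Coords points="12,12 198,12 198,58 12,58"/>
        <TextEquiv><Unicode>كتاب</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>
"""


def local(tag):
    """Strip the XML namespace so the code does not depend on a schema version."""
    return tag.rsplit("}", 1)[-1]


def child(elem, name):
    """Return the first direct child with the given local name, or None."""
    for c in elem:
        if local(c.tag) == name:
            return c
    return None


def polygon(elem):
    """Parse 'x,y x,y ...' from the element's Coords child into (x, y) tuples."""
    coords = child(elem, "Coords")
    if coords is None or not coords.get("points"):
        return []
    return [tuple(int(v) for v in pair.split(","))
            for pair in coords.get("points").split()]


def transcription(line):
    """Return the line's transcription, or an empty string if none is given."""
    equiv = child(line, "TextEquiv")
    uni = child(equiv, "Unicode") if equiv is not None else None
    return (uni.text or "") if uni is not None else ""


def read_ground_truth(xml_string):
    """Collect every text region with its outline, lines and transcriptions."""
    root = ET.fromstring(xml_string)
    regions = []
    for elem in root.iter():
        if local(elem.tag) != "TextRegion":
            continue
        lines = [
            {"id": c.get("id"), "polygon": polygon(c), "text": transcription(c)}
            for c in elem if local(c.tag) == "TextLine"
        ]
        regions.append({"id": elem.get("id"),
                        "polygon": polygon(elem),
                        "lines": lines})
    return regions


regions = read_ground_truth(SAMPLE)
for region in regions:
    for line in region["lines"]:
        print(region["id"], line["id"], line["text"])
```

Matching on local element names rather than a fixed namespace is a deliberate choice here, so the sketch does not break if the files use a different schema version.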

Who won?

Participants had to address one or more of these three challenges: page segmentation, text line detection and Optical Character Recognition (OCR).

We had two winners, for two different tasks:

  • Page segmentation: Berat Kurar Barakat, Ben-Gurion University of the Negev
  • Text line segmentation & text recognition: Hany Ahmed, RDI Company, Cairo University

You can read more about it in this article, published in the proceedings of ICFHR2018.

Why another competition?

The field of OCR and HTR (Handwritten Text Recognition) is rapidly evolving, and we would like to provide text recognition communities with a larger, enhanced ground truth set to train their systems. Our goal is to leave the research community with the most useful dataset for developing state-of-the-art solutions for Arabic HTR.

We are also adding another challenge in the current competition! Our Arabic manuscripts present text recognition systems with many difficulties, such as varying text column widths and font sizes, different text directions, faded ink, non-rectangular text regions, decorations and much more. This time we are trying to tackle marginalia – text written in the margins of the manuscripts – which is often less standardised and less legible than the main text, and frequently runs in different directions.

Now what?

We are now inviting anyone with text recognition software to try it out on our unique Arabic material. This competition is held in the context of the 15th International Conference on Document Analysis and Recognition (ICDAR2019).

This is the official RASM2019 website: https://www.primaresearch.org/RASM2019/

There you will find more information on this competition, its schedule and resources. To enter the competition, please e-mail [email protected]

Organisers:

  • Prof Apostolos Antonacopoulos, Professor of Pattern Recognition, University of Salford and Head of the PRImA research lab
  • Christian Clausner, Research Fellow at the Pattern Recognition and Image Analysis (PRImA) research lab
  • Dr Adi Keinan-Schoonbaert, Digital Curator for Asian and African Collections at the British Library
  • Lynda Barraclough, Head of Curatorial Operations for the British Library’s partnership with the Qatar Foundation
  • Daniel Lowe, Curator for Arabic Collections at the British Library
  • Dr Bink Hallum, Arabic Scientific Manuscripts Curator for the British Library/Qatar Foundation Partnership
  • Daniel Wilson-Nunn, PhD student at the University of Warwick & Turing PhD Student based at the Alan Turing Institute

If you have any questions, do get in touch with [email protected] or [email protected]

Good Luck!

Screenshot of a page featuring handwritten Arabic text from the manuscript Delhi Arabic 1901_0154

 

This post is by Adi Keinan-Schoonbaert, Digital Curator for Asian and African Collections, British Library. She is on Twitter as @BL_AdiKS

 
