THE BRITISH LIBRARY

Digital scholarship blog


14 September 2019

BL Labs Awards 2019: enter before 2100 on Sunday 29th September! (deadline extended)

We have extended the deadline for the BL Labs Awards to 21:00 (BST) on Sunday 29th September; submit your entry here. If you have already entered, you don't need to resubmit, but we are happy to receive updated entries too.

The BL Labs Awards formally recognise outstanding and innovative work that has been created using the British Library’s digital collections and data.

Submit your entry, and help us spread the word to all interested parties!

This year, BL Labs is commending work in four key areas:

  • Research - A project or activity that shows the development of new knowledge, research methods, or tools.
  • Commercial - An activity that delivers or develops commercial value in the context of new products, tools, or services that build on, incorporate, or enhance the Library's digital content.
  • Artistic - An artistic or creative endeavour that inspires, stimulates, amazes and provokes.
  • Teaching / Learning - Quality learning experiences created for learners of any age and ability that use the Library's digital content.

After the submission deadline of 21:00 (BST) on Sunday 29th September has passed, the entries will be shortlisted. Shortlisted entrants will be notified via email by midnight (BST) on Thursday 10th October 2019.

A prize of £500 will be awarded to the winner and £100 to the runner up in each Awards category at the BL Labs Symposium on 11th November 2019 at the British Library, St Pancras, London.

The talent of the BL Labs Awards winners and runners up over the last four years has led to the production of a remarkable and varied collection of innovative projects. In 2018, the Awards commended work in four main categories – Research, Artistic, Commercial and Teaching & Learning:


  • Research category Award (2018) winner: The Delius Catalogue of Works: the production of a comprehensive catalogue of works by the composer Delius, based on research using (and integrated with) the BL’s Archives and Manuscripts Catalogue by Joanna Bullivant, Daniel Grimley, David Lewis and Kevin Page from Oxford University’s Music department.
  • Artistic Award (2018) winner: Another Intelligence Sings (AI Sings): an interactive, immersive sound-art installation, which uses AI to transform environmental sound recordings from the BL’s sound archive, by Amanda Baum, Rose Leahy and Rob Walker, independent artists and experience designers.
  • Commercial Award (2018) winner: Fashion presentation for London Fashion Week by Nabil Nayal: the Library collection - a fashion collection inspired by digitised Elizabethan-era manuscripts from the BL, culminating in several fashion shows/events/commissions including one at the BL in London.
  • Teaching and Learning (2018) winner: Pocket Miscellanies: ten online pocket-book ‘zines’ featuring images taken from the BL digitised medieval manuscripts collection by Jonah Coman, PhD student at Glasgow School of Art.

For further information about BL Labs or our Awards, please contact us at labs@bl.uk.

Posted by Mahendra Mahey, Manager of British Library Labs.

13 September 2019

Results of the RASM2019 Competition on Recognition of Historical Arabic Scientific Manuscripts

This blog post is by Dr Adi Keinan-Schoonbaert, Digital Curator for Asian and African Collections, British Library. She's on Twitter as @BL_AdiKS.

 

Earlier this year, the British Library, in collaboration with PRImA Research Lab and the Alan Turing Institute, launched a competition on the Recognition of Historical Arabic Scientific Manuscripts, or in short, RASM2019. This competition was held in the context of the 15th International Conference on Document Analysis and Recognition (ICDAR2019). It was the second competition of this type, following RASM2018.

The Library has an extensive collection of Arabic manuscripts, comprising almost 15,000 works. We have been digitising several hundred manuscripts as part of the British Library/Qatar Foundation Partnership, making them available on the Qatar Digital Library. A natural next step would be the creation of machine-readable content from scanned images, for enhanced search and whole new avenues of research.

Running a competition helps us identify software providers and tool developers, as well as introducing us to the specific challenges that pattern recognition systems face when dealing with historic, handwritten materials. For this year’s competition we provided a ground truth set of 120 images and associated XML files: 20 pages to be used to train text recognition systems to automatically identify Arabic script, and 100 pages to evaluate the training.
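Ground truth of this kind is commonly distributed in the PAGE XML format maintained by PRImA. As an illustrative sketch only (this is not the competition's official tooling, and the sample page below is invented), here is how region-level transcriptions could be pulled out of a PAGE XML file with Python's standard library:

```python
# Illustrative sketch: reading text regions from a PAGE XML ground truth file.
# Element and attribute names follow the public PAGE schema published by PRImA;
# the sample page content below is invented for demonstration.
import xml.etree.ElementTree as ET

PAGE_NS = "{http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15}"

def extract_text_regions(xml_string):
    """Return (region_id, transcription) pairs for every TextRegion on a page."""
    root = ET.fromstring(xml_string)
    regions = []
    for region in root.iter(f"{PAGE_NS}TextRegion"):
        # The transcription sits in a TextEquiv/Unicode element at region level.
        unicode_el = region.find(f"{PAGE_NS}TextEquiv/{PAGE_NS}Unicode")
        text = unicode_el.text if unicode_el is not None else ""
        regions.append((region.get("id"), text))
    return regions

sample = """<?xml version="1.0" encoding="UTF-8"?>
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="folio_001.jpg" imageWidth="2000" imageHeight="3000">
    <TextRegion id="r1">
      <Coords points="100,100 900,100 900,400 100,400"/>
      <TextEquiv><Unicode>بسم الله</Unicode></TextEquiv>
    </TextRegion>
  </Page>
</PcGts>
"""

print(extract_text_regions(sample))  # [('r1', 'بسم الله')]
```

The `Coords` element carries the region's polygon outline, which is what layout-analysis challenges are scored against.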

Aside from providing larger training and evaluation sets, for this year’s competition we’ve added an extra challenge – marginalia. Notes written in the margins are often less consistent and less coherent than main blocks of text, and can go in different directions. The competition set out three different challenges: page segmentation, text line detection and Optical Character Recognition (OCR). Tackling marginalia was a bonus challenge!

We had just one submission for this year’s competition – RDI Company, Cairo University, who previously participated in 2018 and did very well. RDI submitted three different methods, and participated in two challenges: text line segmentation and OCR. When evaluating the results, PRImA compared established systems used in industry and academia – Tesseract 4.0, ABBYY FineReader Engine 12 (FRE12), and Google Cloud Vision API – to RDI’s submitted methods. The evaluation approach was the same as last year’s, with PRImA evaluating page analysis and recognition methods using different evaluation metrics, in order to gain an insight into the algorithms.
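PRImA's actual evaluation uses richer, weighted error metrics, but the core idea of scoring layout analysis can be illustrated with a much simpler overlap-based measure. The sketch below (all names and thresholds are my own, not the competition's) matches detected rectangles against ground-truth rectangles by intersection-over-union:

```python
# Simplified illustration of overlap-based layout evaluation. PRImA's real
# metrics weight different error types (merge, split, miss, etc.); here we
# just match detected boxes to ground truth by intersection-over-union (IoU).

def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def region_detection_rate(ground_truth, detected, threshold=0.5):
    """Fraction of ground-truth regions matched by a detection at IoU >= threshold."""
    matched = sum(
        1 for gt in ground_truth
        if any(iou(gt, d) >= threshold for d in detected)
    )
    return matched / len(ground_truth) if ground_truth else 0.0

gt = [(0, 0, 100, 100), (0, 200, 100, 300)]   # main text block + a marginal note
det = [(5, 5, 95, 95)]                        # only the main block was found
print(region_detection_rate(gt, det))         # 0.5
```

A system that detects main text blocks well but ignores marginalia scores only half marks under such a measure, which is exactly the pressure the marginalia challenge adds.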

 

Results

Challenge 1 - Page Layout Analysis

The first challenge was set out to identify regions in a page, and find out where blocks of text are located on the page. RDI did not participate in this challenge, therefore an analysis was made only on common industry software mentioned above. The results can be seen in the chart below:

Chart showing RASM2019 page segmentation results

 

Google did relatively well here, and the results are quite similar to last year’s. Despite dealing with the more challenging marginalia text, Google’s previous accuracy score (70.6%) has gone down only very slightly to a still impressive 69.3%.

Example image showing Google’s page segmentation

 

Tesseract 4 and FRE12 scored very similarly, with Tesseract decreasing from last year’s 54.5%. Interestingly, FRE12’s performance on text blocks including marginalia (42.5%) was better than last year’s FRE11 performance without marginalia, scoring at 40.9%. Analysis showed that Tesseract and FRE often misclassified text areas as illustrations, with FRE doing better than Tesseract in this regard.

 

Challenge 2 - Text Line Segmentation

The second challenge looked into segmenting text into distinct text lines. RDI submitted three methods for this challenge, all of which returned the text lines of the main text block (as they did not wish to participate in the marginalia challenge). Results were then compared with Tesseract and FineReader, and are reflected below:

Chart showing RASM2019 text line segmentation results

 

RDI did very well with its three methods, with accuracy levels ranging between 76.6% and 77.6%. However, despite not attempting to segment marginalia text lines, their methods did not perform as well as last year’s method (which reached 81.6% accuracy). Their methods did detect some marginalia, though very little overall, as seen in the screenshot below.

Example image showing RDI’s text line segmentation results

 

Tesseract and FineReader again scored lower than RDI, both with decreased accuracy compared to RASM2018 (where Tesseract 4 scored 44.2% and FRE11 43.2%). This is due to the additional marginalia challenge. The Google method does not detect text lines, so the text line chart above does not include its results.

 

Challenge 3 - OCR Accuracy

The third and last challenge was all about text recognition, tackling the correct identification of characters and words in the text. Evaluation for this challenge was conducted four times: 1) on the whole page, including marginalia, 2) only on main blocks of text, excluding marginalia, 3) using the original texts, and 4) using normalised texts. Text normalisation was performed for both ground truth and OCR results, due to the historic nature of the material, occasional unusual spelling, and use/lack of diacritics. All methods performed slightly better when not tested on marginalia; accuracy rates are demonstrated in the charts below:

Chart showing OCR accuracy results, for main text body only (normalised, no marginalia)
 
Chart showing OCR accuracy results for all text regions (normalised, with marginalia)
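The normalise-then-score procedure described above can be sketched in a few lines. The competition's exact normalisation rules are not spelled out here, so as an assumption this illustration simply strips Arabic diacritics (Unicode combining marks) before computing a standard Levenshtein-based character accuracy:

```python
# Hedged sketch of character-accuracy evaluation with text normalisation.
# As an illustrative assumption, normalisation here just removes combining
# marks (e.g. Arabic diacritics); the competition's rules may differ.
import unicodedata

def normalise(text):
    """Strip combining marks (such as Arabic diacritics) from text."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def char_accuracy(ground_truth, ocr_output):
    """Character accuracy = 1 - edit distance / ground-truth length, floored at 0."""
    gt, out = normalise(ground_truth), normalise(ocr_output)
    if not gt:
        return 1.0 if not out else 0.0
    return max(0.0, 1 - levenshtein(gt, out) / len(gt))

# Diacritics differ but the base letters match, so accuracy is 1.0:
print(char_accuracy("كَتَبَ", "كتب"))
```

Running the same comparison without normalisation would count every missing diacritic as an error, which is why normalised scores are a fairer measure on historical material with inconsistent pointing.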

 

It is evident that there are minor differences in the character accuracies for the three RDI methods, with RDI2 performing slightly better than the others. When comparing the OCR accuracy between texts with and without marginalia, there are slightly higher success rates for the latter, though the difference is not significant. This means that tested methods performed on the marginalia almost as well as they did on the main text, which is encouraging.

Compared with RASM2018’s results, RDI’s results are good but not as good as last year’s (85.44% accuracy), likely because marginalia were added to the recognition challenge. Google performed very well too, considering it did not specifically train or optimise for this competition. Tesseract’s accuracy went down from 30.45% to 25.13%, while FineReader Engine 12 improved on its previous version FRE11, going up from 12.23% to 17.53%. That is still very low, however, as handwritten texts are not among its target material.

 

Further Thoughts

RDI-Corporation has its own historical Arabic handwritten and typewritten OCR system, which has been built using a range of historical manuscripts. Its methods have done well, given the very challenging nature of the documents. Neither Tesseract nor ABBYY FineReader produces usable results, but that is not surprising, since both are optimised for printed texts and target contemporary material rather than historical manuscripts.

As next steps, we would like to test these materials with Transkribus, which produced promising results for early printed Indian texts (see e.g. Tom Derrick’s blog post – stay tuned for some even more impressive results!), and potentially Kraken as well. All ground truth will be released through the Library’s future Open Access repository (now in testing phase), as well as through the website of IMPACT Centre for Competence. Watch this space for any developments!