18 October 2023
In this post, Dr Adi Keinan-Schoonbaert, Digital Curator for Asian and African Collections, shares some background information on how the new post advertised for a Digital Curator for OCR/HTR will help the Library streamline post-digitisation work to make its collections even more accessible to users.
We’ve been digitising our collections for almost three decades, opening up access to incredibly diverse and rich collections, for our users to study and enjoy. However, it is important that we further support discovery and digital research by unlocking the huge potential in automatically transcribing our collections!
We’ve done some work over the years towards making our collection items available in machine-readable format, in order to enable full-text search and analysis. Optical Character Recognition (OCR) technology has been around for a while, and there are several large-scale projects that produced OCRed text alongside digitised images – such as the Microsoft Books project. Until recently, Western languages print collections have been the main focus, especially newspaper collections. A flagship collaboration with the Alan Turing Institute, the Living with Machines project, applied OCR technology to UK newspapers, designing and implementing new methods in data science and artificial intelligence, and analysing these materials at scale.
Machine Learning technologies have been dealing increasingly well with both modern and historical collections, whether printed, typewritten or handwritten. Taking a broader perspective on Library collections, we have been exploring opportunities with non-Western collections too. Library staff have been engaging closely with the exploration of OCR and Handwritten Text Recognition (HTR) systems for English, Bangla and Arabic. Digital Curators Tom Derrick, Nora McGregor and Adi Keinan-Schoonbaert have teamed up with PRImA Research Lab and the Alan Turing Institute to run four competitions in 2017-2019, inviting providers of text recognition methods to try them out on our historical material.
We have been working with Transkribus as well – for example, Alex Hailey, Curator for Modern Archives and Manuscripts, used the software to automatically transcribe 19th century botanical records from the India Office Records. A digital humanities work strand led by former colleague Tom Derrick saw the OCR of most of our digitised collection of Bengali printed texts, digitised as part of the Two Centuries of Indian Print project. More recently Transkribus has been used to extract text from catalogue cards in a project called Convert-a-Card, as well as from Incunabula print catalogues.
The British Library is now looking for someone to join us to further improve the access and usability of our digital collections, by integrating a standardised OCR and HTR production process into our existing workflows, in line with industry best practice.
For more information and to apply please visit the British Library recruitment site and look for the Digital Curator for OCR/HTR position. Applications close on Sunday 5 November 2023. Please pay close attention to questions asked in the application process. Any questions? Drop us a line at [email protected].