THE BRITISH LIBRARY

Digital scholarship blog

21 August 2019

Chevening British Library Fellowship working with Chinese historical texts

Chevening is the UK government’s international awards programme aimed at developing global leaders. Since 2015, the Foreign and Commonwealth Office (FCO) has partnered with the British Library to offer professionals two new fellowships every year. These fellowships are unique opportunities for one-year placements at the Library, working with exceptional collections under the Library’s custodianship. Past and present Chevening Fellows at the Library have focused on geographically diverse collections, from Latin America through Africa to South Asia, with themes such as “Nationalism, Independence, and Partition in South Asia, 1900-1950” and “Big Data and Libraries”.

We are thrilled to announce that one of the two placements available for the 2020/2021 academic year will focus on automating the recognition of historical Chinese handwritten texts. This is a special opportunity to work in the Library’s Digital Scholarship Department, and engage with unique historical collections digitised as part of the International Dunhuang Project and the Lotus Sutra Manuscripts Digitisation Project. Focusing on material from Dunhuang (China), part of the Stein collection, this Fellowship will engage with new digital tools and techniques in order to explore possible solutions to automate the transcription of these handwritten texts.

Chinese Lotus Sutra scroll with Tibetan divination texts on the back (Shelfmark: Or.8210/S.155). Digitised as part of the Lotus Sutra Manuscripts Digitisation Project. © The British Library

The context for this fellowship is the Library’s efforts towards making its collection items available in machine-readable format, to enable full-text search and analysis. The Library has been digitising its collections at scale for over two decades, with digitisation opening up access to rich and diverse collections. However, it’s important for us to further support discovery and digital research by unlocking the huge potential in automatically transcribing our collections. Until recently, Western language print collections have been the main focus, especially newspaper collections. A flagship collaboration with the Alan Turing Institute, a project called “Living with Machines,” is underway to apply Optical Character Recognition (OCR) to UK newspapers, design and implement new methods in data science and artificial intelligence, and analyse these materials at scale.

Taking a broader perspective on Library collections, we have started to explore opportunities with non-Latin collections too. Members of the Digital Scholarship team are engaging closely with the exploration of OCR and Handwritten Text Recognition (HTR) systems for Bangla and Arabic. Digital Curators Tom Derrick, Nora McGregor and Adi Keinan-Schoonbaert have teamed up with PRImA Research Lab and the Alan Turing Institute to run four competitions in 2017-2019, inviting providers of text recognition methods to try them out on our historical material. Another initiative which Tom is engaged with is exploring Transkribus for Bengali printed texts. He trained Transkribus’ HTR+ recognition engine, which ended up transcribing this material at 94% character accuracy! Tom and Adi’s recent blog post in EuropeanaTech Insight (issue on OCR) summarises these initiatives.

Regions and text lines demarcated as ground truth for RASM2019 ICDAR2019 Competition on Recognition of Historical Arabic Scientific Manuscripts (Shelfmark: Add MS 7474). Digitised and available on Qatar Digital Library.

The Chevening Fellow will contribute to our efforts to identify OCR/HTR systems that can tackle digitised historical collections. They will explore the current landscape of Chinese handwritten text recognition, look into methods, challenges, tools and software, use them to test our material, and demonstrate digital research opportunities arising from the availability of these texts in machine-readable format.

This fellowship programme will start in September 2020 for a 12-month period of project-based activity at the British Library. The successful candidate will receive support and supervision from Library staff, and will benefit from professional development opportunities, networking and stakeholder engagement, gaining access to a range of organisational training and development opportunities (such as the Digital Scholarship Training Programme), as well as staff-level access to unique British Library collections and research resources.

For more information and to apply, please visit the Chevening British Library Fellowship page: https://www.chevening.org/fellowship/british-library/, and the “Automating the recognition of historical Chinese handwritten texts” Fellow page: https://www.chevening.org/fellowship/british-library-chinese-handwritten-texts/.

Applications close at 12pm (GMT), 5 November 2019. Good luck!

 

This blog post is by Dr Adi Keinan-Schoonbaert, Digital Curator for Asian and African Collections, British Library. She's on Twitter as @BL_AdiKS.