Digital scholarship blog

03 August 2021

Automating the Recognition of Chinese Manuscripts: New Chevening British Library Fellowship

 

The Chevening Fellowship Programme is the UK government’s international awards scheme aimed at fostering knowledge exchange and collaboration, and developing global leaders. In 2015, the Foreign, Commonwealth & Development Office (FCDO) partnered with the British Library to offer professionals two new fellowships every year, and the two organisations recently announced the renewal of their partnership until 2024/25.

Chevening logo and the British Library logo

These fellowships are unique opportunities for one-year placements at the Library, working with exceptional collections under the Library’s custodianship. The Library has hosted international fellows through this scheme since 2016, with each fellowship framing a distinct project inspired by Library collections. Past and present Chevening Fellows at the Library have focused on geographically diverse collections, from Latin America through Africa to South Asia, with themes including archival material from Latin America and the Caribbean; African-language printed books; Nationalism, Independence, and Partition in South Asia; and Big Data and Libraries.

We are thrilled to (re-)announce that one of the two placements available for the 2022/2023 academic year will focus on automating the recognition of historical Chinese handwritten texts. This fellowship, originally announced two years ago, had to be postponed due to the pandemic – and we are excited to be able to offer it again. This is a special opportunity to work in the Library’s Digital Research Team, and engage with unique historical collections digitised as part of the International Dunhuang Project and the Lotus Sutra Manuscripts Digitisation Project. Focusing on material from Dunhuang (China), part of the Stein collection, this fellowship will engage with new digital tools and techniques in order to explore possible solutions to automate the transcription of these handwritten texts.

End piece of a Chinese Lotus Sutra Scroll (shelfmark: Or.8210/S.1606). Digitised as part of the Lotus Sutra Manuscripts Digitisation Project.

 

The context for this fellowship is the Library’s efforts towards making its collection items available in machine-readable format, to enable full-text search and analysis. The Library has been digitising its collections at scale for over two decades, with digitisation opening up access to rich and diverse collections. However, it is important for us to further support discovery and digital research by unlocking the huge potential in automatically transcribing our collections. Until recently, Western-language print collections have been the main focus, especially newspaper collections. A flagship collaboration with the Alan Turing Institute, the Living with Machines project, has been applying Optical Character Recognition (OCR) technology to UK newspapers, designing and implementing new methods in data science and artificial intelligence, and analysing these materials at scale.

Taking a broader perspective on Library collections, we have been exploring opportunities with non-Western collections too. Library staff have been engaging closely with the exploration of OCR and Handwritten Text Recognition (HTR) systems for English, Bangla and Arabic. Digital Curators Tom Derrick, Nora McGregor and Adi Keinan-Schoonbaert teamed up with PRImA Research Lab and the Alan Turing Institute to run four competitions in 2017-2019, inviting providers of text recognition methods to try them out on our historical material. We have been working with Transkribus as well: for example, Alex Hailey, Curator for Modern Archives and Manuscripts, used the software to automatically transcribe 19th-century botanical records from the India Office Records. Ongoing work led by Tom Derrick aims to OCR our collection of Bengali printed texts, digitised as part of the Two Centuries of Indian Print project.

 

Regions, text lines and illustrations demarcated as ground truth, as shown in Transkribus (Shelfmark: Or 3366). Digitised and available on Qatar Digital Library.
 
 
Another screenshot from Transkribus, showing automatically transcribed Bengali printed text (Shelfmark: VT 1914 d). Digitised as part of the Two Centuries of Indian Print project.

 

The Chevening Fellow will contribute to our efforts to identify OCR/HTR systems that can tackle digitised historical collections. They will explore the current landscape of Chinese handwritten text recognition, look into methods, challenges, tools and software, use them to test our material, and demonstrate digital research opportunities arising from the availability of these texts in machine-readable format.

This fellowship programme will start in September 2022 for a 12-month period of project-based activity at the British Library. The successful candidate will receive support and supervision from Library staff, and will benefit from professional development opportunities, networking and stakeholder engagement, gaining access to a range of organisational training and development opportunities (such as the Digital Scholarship Training Programme), as well as staff-level access to unique British Library collections and research resources.

For more information and to apply, please visit the Chevening British Library Fellowship page: https://www.chevening.org/fellowship/british-library/, and the “Automating the recognition of historical Chinese handwritten texts” fellowship page: https://www.chevening.org/fellowship/british-library-historical-chinese-texts/.

Applications open on 3 August, 12:00 (midday) BST and close on 2 November, 12:00 (midday) GMT.

Good luck!

This post is by Dr Adi Keinan-Schoonbaert, Digital Curator for Asian and African Collections, British Library. She is on Twitter as @BL_AdiKS.

 
