28 May 2020
Automated text extraction from colonial-era maps of eastern Africa
After recently completing a pilot course in Computing for Information Professionals at Birkbeck University, I have just released a new dataset containing the text extracted from almost 2,000 colonial-era maps and documents covering eastern Africa. The resource is available now from the Shared Research Repository, and provides access to thousands of names of historical settlements and regions, descriptions of historical land use, topography and vegetation, and notes of ethnographic, military or administrative context.
The resource consists of a downloadable spreadsheet, which lets users browse or search the extracted text. I hope it will be of particular use in identifying and locating place names in eastern Africa during the colonial period, for which there is a gap in current research resources. I also hope it will help these maps contribute to studies of the history of the environment.
The text was harvested from maps and documents that are held at the British Library in the War Office Archive, a collection of over 14,000 mostly unique, hand-drawn items originally kept by the British War Office between c.1880 and 1940 and used to compile printed maps over large parts of the world. They came from a variety of sources, including military surveyors, explorers, missionaries and spies. Generous funding from Indigo Trust recently allowed us to digitise those items relating to eastern Africa.
Automated extraction of the text was carried out using the Google Vision API, which found a total of 633,451 pieces of ‘text’ on the maps. However, once most of the erroneous or unhelpful results had been cleaned out, the final dataset came to 317,133 transcriptions, roughly half of the raw output. These are sorted alphabetically and displayed in an Excel spreadsheet, shown in the following screenshot:
The order in which the pieces of text were transcribed from the maps was retained in the second column of the spreadsheet so that, if the spreadsheet is re-ordered by that column, each word can also be seen in its original context – for example, the text in the screenshot below can be read from top to bottom (‘The topography has been supplied...’):
The spreadsheet enables a user to identify the image in which any piece of text appears, and links to a geographical search interface for the archive, shown below, which in turn provides links to high-res versions of the images and their catalogue records on the BL website. The combination of these resources lets users identify each piece of text and see it in context on the face of the map.
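For anyone working with the data outside Excel, re-ordering by the second column to recover the original reading order can be sketched in a few lines of Python. This is only an illustration: the tuple layout, the sequence numbers, the image identifier `map_001` and the function names are assumptions made for the example, not the actual layout of the spreadsheet.

```python
# Hypothetical rows: (transcribed text, transcription order, image identifier).
# In the spreadsheet these arrive sorted alphabetically by the text.
rows = [
    ("been", 4, "map_001"),
    ("has", 3, "map_001"),
    ("supplied", 5, "map_001"),
    ("The", 1, "map_001"),
    ("topography", 2, "map_001"),
]


def in_original_order(rows):
    """Return the rows sorted by the transcription-order column."""
    return sorted(rows, key=lambda row: row[1])


def read_in_context(rows):
    """Join the text fragments in the order they were transcribed."""
    return " ".join(text for text, _, _ in in_original_order(rows))


print(read_in_context(rows))
```

Sorting on the order column rather than the text is what turns an alphabetical word list back into readable phrases.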
The maps are drawn in a wide variety of hands, and the text often overlaps or is written over background features, making automated transcription tricky. Some errors do remain: for example, individual characters have sometimes been incorrectly transcribed within words, though the words themselves should still be identifiable. In addition, not all words appearing on the maps were captured.
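The clean-up that removed unhelpful results before the list was sorted alphabetically might look something like the sketch below. The actual cleaning rules are not described in this post, so the rule used here (keep a fragment only if it contains at least two letters) and the names `is_useful` and `clean` are purely hypothetical.

```python
def is_useful(fragment, min_letters=2):
    """Hypothetical rule: keep a fragment only if it has enough letters."""
    letters = sum(ch.isalpha() for ch in fragment)
    return letters >= min_letters


def clean(fragments):
    """Drop unhelpful fragments, trim whitespace, and sort alphabetically."""
    kept = [fragment.strip() for fragment in fragments if is_useful(fragment)]
    # casefold() makes the sort case-insensitive, as a spreadsheet sort would be.
    return sorted(kept, key=str.casefold)


raw = ["Mombasa", "--", "7", " Kilima ", "x", "Nairobi"]
print(clean(raw))
```

A rule this simple would not catch mis-read characters inside otherwise valid words, which is the kind of error that remains in the dataset.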
The resource came about after I was fortunate enough to join a cohort of colleagues from the British Library and the National Archives attending the pilot postgraduate course at Birkbeck. After speed-learning the Python and SQL coding languages in the first term, I focussed on developing a software tool that enlists the Google Vision API to auto-transcribe text found on maps. Once it was built, I set it to work harvesting words from the eastern Africa maps.
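As a rough idea of what such a tool involves, here is a minimal sketch using the `google-cloud-vision` Python client. It is not the tool itself: the choice of the `text_detection` feature, the function names and the row layout are assumptions, and running `detect_text` requires the library to be installed and Google Cloud credentials to be configured.

```python
def detect_text(image_path):
    """Send one image to the Vision API and return its text annotations."""
    # Imported here so the rest of the sketch runs without the library installed.
    from google.cloud import vision

    client = vision.ImageAnnotatorClient()
    with open(image_path, "rb") as handle:
        image = vision.Image(content=handle.read())
    response = client.text_detection(image=image)
    if response.error.message:
        raise RuntimeError(response.error.message)
    return response.text_annotations


def annotations_to_rows(annotations, image_id, start=1):
    """Turn annotations into (text, order, image_id) rows (hypothetical layout)."""
    # The first annotation holds the whole detected text, so it is skipped;
    # the remaining ones are the individual pieces of text.
    pieces = list(annotations)[1:]
    return [
        (piece.description, start + offset, image_id)
        for offset, piece in enumerate(pieces)
    ]
```

Keeping a running order number alongside each piece of text is what later allows the fragments to be read back in their original sequence.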
I am very grateful to BL Digital Curator Nora McGregor, who set up and coordinated the initial pilot (now launching this autumn as an Applied Data Science Postgraduate Certificate), to the Institute of Coding, who funded it, and to BL managers for allocating study time during work. This project would also not have been possible without Indigo Trust, whose generous funding to conserve, catalogue and digitise War Office maps over the last five years has made them accessible to the world online, and enabled further initiatives such as this.