THE BRITISH LIBRARY

Digital scholarship blog

10 posts from February 2018

22 February 2018

BL Labs 2017 Symposium: Picturing Canada and Interactive Map (Staff Award Runner Up)

Putting collection metadata on the map: Picturing Canada

The Picturing Canada project began in 2012 as a British Library, Eccles Centre and Wikimedia UK collaboration to digitise a collection and experiment with releasing high quality reproductions of collection items into the public domain. At its heart the project sought to open up an under-used collection of photographs, connecting them with new audiences and uses outside of the walls of the British Library. It also provided a template for the Library’s subsequent public domain releases and has provided many with an insight into the depth of the Library’s Canadian collections.

Before the collection could be released it needed to be digitised and robust metadata created. Fortunately the Library had a good working batch of metadata created off the back of work done by researchers from Dalhousie University in the 1980s. The initial use of this to the project was clear, but in digitising the images and putting them and the metadata online something became apparent: most images had some sort of information (be it a title or a photographer’s studio address) that could be used to determine a geographical location for the images.

At the time, this realisation was parked for future investigation but the 2015 exhibition, ‘Canada Through the Lens’, drawing off the same digitised collection, opened up an opportunity to try and use this information to map the collection and generate new insights into its contents. Much of the coordinate determination and mapping was done by Joan Francis, co-awardee of the BL Labs runner-up prize, who worked to find and add coordinates for the photographs. This was a relatively simple but time-consuming process involving finding locations in the metadata image title or, in the case of a photographer’s studio address, on the photograph itself. These text-based locations were then converted into co-ordinates compatible with Google Fusion Tables (there’s an excellent tutorial here) and added to records for each image.
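The step of turning text-based locations into coordinates can be sketched in a few lines. This is a minimal illustration only, not the process the project followed: the tiny gazetteer, the sample titles, the column names and the function names are all assumptions made for the example, and it assumes the mapping tool can import a CSV with latitude and longitude columns.

```python
import csv
import io

# Hypothetical mini-gazetteer: approximate city-centre coordinates (lat, lon).
GAZETTEER = {
    "toronto": (43.65, -79.38),
    "montreal": (45.50, -73.57),
    "quebec": (46.81, -71.21),
}

def locate(text):
    """Return (place, lat, lon) for the first gazetteer entry named in text, else None."""
    lowered = text.lower()
    for place, (lat, lon) in GAZETTEER.items():
        if place in lowered:
            return place, lat, lon
    return None

def records_to_csv(records):
    """Write records with a 'title' field to CSV text, adding coordinate columns."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["title", "place", "latitude", "longitude"])
    for rec in records:
        hit = locate(rec["title"])
        if hit is None:
            # No place name recognised: leave blank for manual review.
            writer.writerow([rec["title"], "", "", ""])
        else:
            place, lat, lon = hit
            writer.writerow([rec["title"], place, lat, lon])
    return out.getvalue()

# Made-up titles, for illustration only.
sample = [
    {"title": "Street scene, Toronto"},
    {"title": "Unidentified lake view"},
]
csv_text = records_to_csv(sample)
```

Rows without a recognised place are left blank rather than guessed, which mirrors the point that the work is time-consuming and needs human checking.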

 

The result of this is the map that you see above, a series of points which can be clicked on to see a partial metadata record for the item as well as a link to the photograph itself on Wikimedia Commons. As the work is time-consuming and fraught with potential error we have so far completed a robust mapping of only about four fifths of the collection, and this is the work you see here. Interestingly, the map is not just a useful finding aid – although it performs this function very well.

Mapping the collection also provides insight into the geography of photographic production in Canada during the period this collection was created (1895 – 1923). It is clear, for instance, how significant the eastern metropolitan areas of Toronto, Montreal and Quebec are to Canada’s photographic production in this period. Similarly, the corridors of production seen running close to the Canada-US border and occasionally spurring north also suggest the significance of the railroad to Canada’s photographic economy. So the map helps users to find images but also offers more questions; an exciting prospect for continued work.

Posted by BL Labs on behalf of Philip Hatfield and Joan Francis

Submit a project for one of the BL Labs 2018 Awards! Join us on 12 November 2018 for the BL Labs annual Symposium at the British Library.

21 February 2018

BL Labs 2017 Symposium: Opening up the British Library’s Early Indian Printed Books Collection (Staff Award Winner)

Making the British Library’s valuable collection of early Bengali books more accessible to researchers and the general public around the world rests heavily on the collaborative work undertaken across different teams of the library and partners in the UK and abroad. The commitment and passion of the project team has relied on the contribution and expertise of collaborators, as well as the forward thinking vision of the library, partners and fundraisers.

Receiving the BL Labs Staff Award 2017 is a great opportunity to thank everyone involved. 

Members of the Two Centuries of Indian Print team receiving the British Library Labs award at the Symposium on 30th October 2017
 
Tom Derrick (Digital Curator) was in India at the same time the team received their Award

The Two Centuries of Indian Print project is a partnership between the British Library, the School of Cultural Texts and Records (SCTR) at Jadavpur University, Srishti Institute of Art, Design and Technology, and the Library at SOAS University of London, among others. It has also involved collaborations with the National Library of India, and other institutions in India.

The AHRC Newton-Bhabha Fund and the Department for Business, Energy and Industrial Strategy have generously funded the work undertaken so far by the project, focusing on early printed Bengali books. Many are unavailable in other library collections or are extremely difficult to locate and access. The project has undertaken a variety of initiatives from the digitisation of books and enhancement of the catalogue records in English and Bengali, to stimulating the use of digital humanities tools and techniques, running a programme of digital skills sharing and capacity building workshops, and hosting the South Asia Series seminars. All of these initiatives greatly contribute to the discovery and study of the collection. The project is also conducting groundbreaking work in finding a solution to Optical Character Recognition (OCR) in Bangla script. OCR is not currently available for South Asian languages. Harnessing viable OCR technology would enable full text search of the books, paving the way for researchers to use natural language processing techniques to perform large scale analysis across a corpus of text covering a diverse range of topics relating to Indian society, religion, and politics, to name but a few. Doing so will increase the possibilities for new discoveries in this academic field.

However, despite its status as one of the most widely spoken languages in the world, Bangla script has been greatly underserved by providers of OCR solutions. This is due in part to the orthographical and typographical variances that have taken place in recent centuries that make building a dictionary and character ‘classifier’ more challenging. Due to the wide date range of the books we are digitising, these issues affect the quality of OCR. The physical condition of our historical books, including faded text, presents additional difficulties for creating machine readable versions of the books. 

To overcome these obstacles, the project team has been advancing the development of OCR for Bangla through the organisation of an international competition which reviewed the state-of-the-art in commercial and open source text recognition tools. The results of the competition will be announced at the ICDAR 2017 conference in Kyoto later this month. Watch this space! The competition dataset has been made openly available for download and reuse for any researchers or institutions who would like to experiment with OCR for Bengali.

A page from the Animal Biographies, VT 1712 showing its transcription produced for the ICDAR 2017 competition

The project has organised two Skills Exchange Programmes, hosting mid-career Library professionals from the National Library of India at the British Library for a week, providing a packed programme of tours and talks from all areas of the Library. The project has also conducted digital skills sharing and capacity building workshops for library professionals and archivists from cultural heritage institutions in India. The first workshop took place at Jadavpur University, Kolkata, in December 2016. Library and information professionals from cultural heritage institutions in Bengal took part in a one-day event to learn more about how information technology is transforming humanities research today and in turn Library services, as well as the methods for interrogating humanities-related datasets.

After the success of this first workshop another event was held in July 2017, at which more than 30 library professionals discussed OCR developments for Bangla, trying out different tools and discussing digital scholarship techniques and projects. Most recently, the project’s digital curator facilitated a workshop around Digitisation Standards at the International Conference of Asian Libraries in Delhi. The workshops continue in earnest in the new year with another digital humanities skills workshop planned for January 2018 to be held in partnership with the Srishti Institute of Art, Design, and Technology.

Attendees of the workshop held at Jadavpur University in December 2016 taking part in a group activity to discuss the application of digital humanities methods to library collections

The Project Team also held a two day Academic Symposium on South Asian book history at Jadavpur University in the summer, with 17 speakers from India, wider South Asia, and the UK. Attendance was between 50 and 70 people a day and feedback was very good. We plan to have a publication arising from this Symposium, and to upload a video to our project webspace. The project also hosts a popular series of talks based around the Two Centuries of Indian Print project and the British Library’s South Asia collections. The seminars take place fortnightly at the British Library. So far we have hosted a range of academics and researchers, from PhD students to senior academics from the UK and abroad, who share cutting-edge research with discussion chaired by curators and specialists in the field. The seminars have been a great success, attracting large attendances and speakers from around the world. We also host a number of show and tells of our material to raise awareness of our collection and to engage in community outreach.

Everyone on the project is thrilled to have won this award and we will be working hard in 2018 to continue bringing the Two Centuries of Indian Print project to the attention and use of researchers and the general public.

Submit a project for one of the BL Labs 2018 Awards! Join us on 12 November 2018 for the BL Labs annual Symposium at the British Library.

Posted by BL Labs on behalf of The Two Centuries of Indian Print team.

15 February 2018

BL Labs 2017 Symposium: Git Lit, Learning & Teaching Award Runner Up

Applications of Distributed Version Control Technologies Toward the Creation of 50,000 Digital Scholarly Editions

The British Library maintains a collection of roughly 50,000 digital texts, scanned from public-domain books, most of which were originally published in the 19th century. As scanned books, their text format is Analyzed Layout and Text Object (ALTO) Extensible Markup Language (XML), a verbose markup format created by Optical Character Recognition (OCR) software, and one which is only marginally human-readable. Our project, Git-Lit, converts each text to the plain text format Markdown, creates version-controlled repositories for each using the distributed version control system Git, and posts the repositories to the project management platform GitHub, where they can be edited by anyone. Along the way, websites for each text, optimized for readability, are automatically generated via GitHub Pages. These websites integrate with the annotation platform Hypothes.is, enabling them to be annotated. In this way, Git-Lit aims to make this collection of British Library electronic texts discoverable, readable, editable, annotatable, and downloadable.
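The conversion step described above can be sketched as follows. This is a minimal illustration, not Git-Lit's actual code: it assumes the usual ALTO layout in which `TextBlock` elements contain `TextLine` elements whose `String` children carry the recognised words in a `CONTENT` attribute, it ignores hyphenation and page structure, and the function names and sample document are made up for the example.

```python
import xml.etree.ElementTree as ET

def local_name(tag):
    """Strip any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]

def alto_to_markdown(alto_xml, title=None):
    """Join String/@CONTENT values into lines, and TextBlocks into paragraphs."""
    root = ET.fromstring(alto_xml)
    paragraphs = []
    for block in root.iter():
        if local_name(block.tag) != "TextBlock":
            continue
        lines = []
        for line in block.iter():
            if local_name(line.tag) != "TextLine":
                continue
            # Only direct String children hold words; SP elements are spacing.
            words = [s.get("CONTENT", "") for s in line
                     if local_name(s.tag) == "String"]
            lines.append(" ".join(words))
        paragraphs.append(" ".join(lines))
    body = "\n\n".join(p for p in paragraphs if p)
    return f"# {title}\n\n{body}\n" if title else body + "\n"

# Tiny made-up ALTO fragment, for illustration only.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace>
    <TextBlock>
      <TextLine><String CONTENT="It"/><SP/><String CONTENT="was"/></TextLine>
      <TextLine><String CONTENT="a"/><SP/><String CONTENT="dark"/><SP/><String CONTENT="night."/></TextLine>
    </TextBlock>
    <TextBlock>
      <TextLine><String CONTENT="Chapter"/><SP/><String CONTENT="two."/></TextLine>
    </TextBlock>
  </PrintSpace></Page></Layout>
</alto>"""

markdown = alto_to_markdown(SAMPLE, title="Sample")
```

Matching on local tag names rather than a fixed namespace is a design choice made here so the sketch tolerates different ALTO schema versions.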

A Screenshot of the Website Automatically Generated from the British Library Electronic Text


The biggest advantage of using a distributed version control system like Git is that it leverages the kinds of decentralized collaboration workflows that have long been in use in software development. Open-source software and web development, for which Git and GitHub were originally designed, is a much-studied methodology, long proven to be more effective than closed-source methods. Rather than maintain a central silo for serving code and electronic texts, the decentralized approach ensures a plurality of textual versions. Since anyone may copy ("fork") a project, modify it, and create their own version, there is no one central, canonical text, but many. Each version may freely borrow ("pull") from others, request that others integrate their changes ("pull request"), and discuss potential changes ("issues") using the project management subsystems of GitHub. This workflow streamlines collaboration, and encourages external contributions. Furthermore, since each change ("commit") is recorded with a message describing the change and, ideally, the reasons for it, Git encourages the kind of editorial documentation necessary for scholarly editing. We like to think of Git-based editing, therefore, as scholarly editing, and GitHub-based collaboration as a democratization of scholarly editing.

Furthermore, since GitHub allows instant editing of texts in the web browser, it is a simple and intuitive method of crowdsourcing the text cleanup process. Since OCR'd texts are often full of errors, GitHub allows any reader to correct an obvious OCR error she or he finds. The analogous process of reporting errors to centralized text repositories like Project Gutenberg has been known to take several years. On GitHub, however, a correction can be submitted immediately.

Not the least advantage of this setup is the automated creation of websites from the plain text sources. Not only does this transform the Markdown to a clean, readable edition of the text, but it provides integration with the annotation platform Hypothes.is. Hypothes.is allows for social annotation of a text, making it ideal for classroom use. Professors may assign a British Library text as a course reading, and may require their students to annotate it, an activity which can generate discussions in the limitless virtual margins of this electronic textual space.

The Git-Lit project has so far posted around 50 texts to GitHub, as prototypes, with the full corpus of roughly 50,000 texts soon to come. After the full corpus is processed in this way, we'll begin enhancing some of the metadata. So far, we have developed techniques for probabilistically inferring the language of each text, and using Ben Schmidt's document vectorization method, Stable Random Projection, we have been able to probabilistically infer Library of Congress classifications as well. This enables the automatic generation of sub-corpora like PR (British Literature), or PS (American Literature).

In the coming year, we hope to integrate the Git-Lit transformed British Library texts into a structured database, further enhancing the discoverability of its texts. We have just received a micro-grant from NYC-DH to help launch Corpus-DB, a project also aiming to produce textual corpora, and through Corpus-DB, we will soon create a SQL database containing the metadata, our enhanced and inferred metadata, and other aggregated book data gleaned from public APIs. This will soon allow readers and computational text analysts to download groups of British Library electronic texts. Users interested in downloading, say, all novels set in London, will be able to get a complete full-text dump of all public-domain novels in this category by visiting a URL such as api.corpus-db.org/novels/setting/London. We expect that this will greatly streamline the corpus creation process that takes up so much of the time in computational text analysis.

Both Git-Lit and Corpus-DB are open-source projects, open to contributions from anyone, regardless of skill. If you'd like to contribute to our project in some way, get in contact with us, and we'll tell you how you can help.

Jonathan Reeve

Jonathan Reeve is a third-year graduate student in the Department of English and Comparative Literature at Columbia University, where he specializes in computational literary analysis. Find his recent experiments at jonreeve.com.

If this blog post has stimulated your interest in working with the British Library's digital collections, start a project and enter it for one of the BL Labs 2018 Awards! Join us on 12 November 2018 for the BL Labs annual Symposium at the British Library to find out who wins.

Posted by BL Labs on behalf of Jonathan Reeve

14 February 2018

BL Labs 2017 Symposium: Movable Type, Commercial Award Winner

Movable Type is a tabletop word game, and something of a love letter to classic books and authors, made completely of custom playing cards. While the game’s appearance might remind you of Scrabble, it has some tricks up its sleeve to give it a much more modern and dynamic feel.

Figure 1: Movable Type Card Game


The initial idea for Movable Type was born around two years ago. I had been making games for some years and knew I wanted to do something with a word game. My main objective was to have a game that was very interactive, easy to grasp, and tactical, while also being very quick to play. As much as I love word games, some of them have a tendency to outstay their welcome. 

The central mechanism in Movable Type is called card-drafting. This method allows players to pick their letter cards each round – this does away with the large amounts of luck you find in many classic word games, and instead shifts attention onto the tactical decisions of the players. It also means that rules are kept very simple and that players can take their turns simultaneously, creating a much more dynamic play environment.

Movable Type - The Cards


The prototype for Movable Type was only a few weeks old when I settled on the art style I was going to use in the final product. I’ve been a long-time fan of the British Library Flickr account, which lets users browse through images from public domain books. Once I had spotted the large collection of initial capitals, I was sold!

Movable Type - Illustrated Letters


My wife, Tiffany Moon, is a graphic designer by trade. She helped clean up the images and present these beautiful pieces of art in a colourful new fashion, appropriate for a retail product. I also wanted portraits of some of my favourite authors to be in the game, so I commissioned Alisdair Wood to create woodcut-style images of ten classic, influential and diverse literary figures. Without those initial capital images taken from the British Library collection and used to direct the game’s overall style, Movable Type would likely not look half as impressive and definitely wouldn’t resonate with me and many players like the current style does.

Movable Type - Illustrated Famous People


I launched Movable Type on the crowdfunding platform, Kickstarter, last year. Upon its release, it won the Imirt Irish Game Award for Best Analog Game and second runner-up for Game of the Year. It received good reception at several public events and sold out of its initial print run, so I decided that a second edition of the game was in order. That bigger and better second edition is funding on Kickstarter and should be in some select retail stores by mid to late 2018 (fingers crossed!).

Receiving the Commercial Category BL Labs Award was a huge boon for the reputation of the game and for me as a game designer. Furthermore, it was a genuine honour to be at the British Library for this event and able to share my product with the audience there.

If this blog post has stimulated your interest in working with the British Library's digital collections, start a project and enter it for one of the BL Labs 2018 Awards! Join us on 12 November 2018 for the BL Labs annual Symposium at the British Library.

Posted by BL Labs on behalf of Robin David O’Keeffe

13 February 2018

BL Labs 2017 Symposium: Samtla, Research Award Runner Up

Samtla (Search And Mining Tools for Labelling Archives) was developed to address a need in the humanities for research tools that help to search, browse, compare, and annotate documents stored in digital archives. The system was designed in collaboration with researchers at Southampton University, whose research involved locating shared vocabulary and phrases across an archive of Aramaic Magic Texts from Late Antiquity. The archive contained texts written in Aramaic, Mandaic, Syriac, and Hebrew languages. Due to the morphological complexity of these languages, where morphemes are attached to a root morpheme to mark gender and number, standard approaches and off-the-shelf software were not flexible enough for the task, as they tended to be designed to work with a specific archive or user group. 

Figure 1: Samtla supports tolerant search allowing queries to be matched exactly and approximately.

Samtla is designed to extract the same or similar information that may be expressed by authors in different ways, whether it is in the choice of vocabulary or the grammar. Traditionally search and text mining tools have been based on words, which limits their use to corpora containing languages where 'words' can be easily identified and extracted from text, e.g. languages with a whitespace character like English, French, German, etc. Word models tend to fail when the language is morphologically complex, like Aramaic and Hebrew. Samtla addresses these issues by adopting a character-level approach stored in a statistical language model. This means that rather than extracting words, we extract character-sequences representing the morphology of the language, which we then use to match the search terms of the query and rank the documents according to the statistics of the language. Character-based models are language independent as there is no need to preprocess the document, and we can locate words and phrases with a lot of flexibility. As a result Samtla compensates for the variability in language use, spelling errors made by users when they search, and errors in the document as a result of the digitisation process (e.g. OCR errors).
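To make the character-level idea concrete, here is a minimal sketch of tolerant matching with character n-grams. It is not Samtla's implementation: it swaps the statistical language model for a simple overlap score, and the function names, the choice of trigrams and the sample documents are all assumptions made for illustration.

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Count overlapping character n-grams; no word segmentation is needed."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def similarity(query, document, n=3):
    """Fraction of the query's n-grams that also occur in the document (0 to 1)."""
    q, d = char_ngrams(query, n), char_ngrams(document, n)
    total = sum(q.values())
    if total == 0:
        return 0.0
    shared = sum(min(count, d[gram]) for gram, count in q.items())
    return shared / total

def search(query, documents, n=3):
    """Rank document ids by descending similarity to the query."""
    scores = {doc_id: similarity(query, text, n)
              for doc_id, text in documents.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Made-up documents, for illustration only.
docs = {
    "a": "the manuscript was digitised last year",
    "b": "a recipe for bread and butter",
}
ranking = search("manuscrlpt", docs)  # query containing an OCR-style error
```

Because the misspelt query still shares most of its trigrams with "manuscript", document "a" ranks first even though no exact word match exists, which is the kind of tolerance to spelling and OCR errors described above.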

Figure 2: Samtla's document comparison tool displaying a semantically similar passage between two Bibles from different periods.

 The British Library have been very supportive of the work by openly providing access to their digital archives. The archives ranged in domain, topic, language, and scale, which enabled us to test Samtla’s flexibility to its limits. One of the biggest challenges we faced was indexing larger-scale archives of several gigabytes. Some archives also contained a scan of the original document together with metadata about the structure of the text. This provided a basis for developing new tools that brought researchers closer to the original object, which included highlighting the named entities over both the raw text, and the scanned image.

Currently we are focusing on developing approaches for leveraging the semantics underlying text data in order to help researchers find semantically related information. Semantic annotation is also useful for labelling text data with named entities, and sentiments. Our current aim is to develop approaches for annotating text data in any language or domain, which is challenging due to the fact that languages encode the semantics of a text in different ways.

As a first step we are offering labelled data to researchers, as part of a trial service, in order to help speed up the research process, or provide tagged data for machine learning approaches. If you are interested in participating in this trial, then more information can be found at www.samtla.com.

Figure 3: Samtla's annotation tools label the texts with named entities to provide faceted browsing and data layers over the original image.

 If this blog post has stimulated your interest in working with the British Library's digital collections, start a project and enter it for one of the BL Labs 2018 Awards! Join us on 12 November 2018 for the BL Labs annual Symposium at the British Library.


Posted by BL Labs on behalf of Dr Martyn Harris, Prof Dan Levene, Prof Mark Levene and Dr Dell Zhang.

05 February 2018

8th Century Arabic science meets today's computer science

Or, Announcing a Competition for the Automatic Transcription of Historical Arabic Scientific Manuscripts 

“An impartial view of Digital Humanities (DH) scholarship in the present day reveals a stark divide between ‘the West and the rest’…Far fewer large-scale DH initiatives have focused on Asia and the non-Western world than on Western Europe and the Americas…Digital databases and text corpora – the ‘raw material’ of text mining and computational text analysis – are far more abundant for English and other Latin alphabetic scripts than they are for Chinese, Japanese, Korean, Sanskrit, Hindi, Arabic and other non-Latin orthographies…Troves of unread primary sources lie dormant because no text mining technology exists to parse them.”

-Dr. Thomas Mullaney, Associate Professor of Chinese History at Stanford University

Supporting the use of Asian & African Collections in digital scholarship means shining a light on this stark divide and seeking ways to close the gap. In this spirit, we are excited to announce the ICFHR2018 Competition on Recognition of Historical Arabic Scientific Manuscripts.

Add MS 7474_0043.script

The Competition

Drawing together experts from the British Library, The Alan Turing Institute, Qatar Digital Library and PRImA Research Lab, our aim in launching this competition is to play an active role in advancing the state-of-the-art in handwritten text recognition technologies for Arabic. For our first challenge we are focussing on finding an optimal solution for accurately and automatically transcribing historical Arabic scientific handwritten manuscripts.

Though such technologies are still in their infancy, unlocking historical handwritten Arabic manuscripts for large-scale text analysis has the potential to truly transform research. In conjunction with the competition we hope to build and make freely open and available a substantial image and ground truth dataset to support continued efforts in this area. 

Enter the Competition

Organisers

• Apostolos Antonacopoulos, Professor of Pattern Recognition, University of Salford and Head of the Pattern Recognition and Image Analysis (PRImA) research lab
• Christian Clausner, Research Fellow at the PRImA research lab
• Nora McGregor, Digital Curator at British Library, Asian & African Collections
• Daniel Lowe, Curator at British Library, Arabic Collections
• Daniel Wilson-Nunn, PhD student at University of Warwick & Turing PhD Student based at Alan Turing Institute
• Bink Hallum, Arabic Scientific Manuscripts Curator at British Library/Qatar Foundation Partnership

Further reading

For more on recent Digital Research Team text recognition and transcription projects see:

 

This post is by Nora McGregor, Digital Curator, British Library. She is on Twitter as @ndalyrose.

Building a Handwritten Arabic Manuscript Ground Truth Dataset يد واحدة لا تصفـّق

Are you able to read handwritten Arabic from historical manuscripts such as these? Then we could use your help!

In conjunction with our ICFHR2018 Competition on Recognition of Historical Arabic Scientific Manuscripts it is our aim to build a substantial image and ground truth dataset that can be used as the basis for advancing research in historical handwritten Arabic text analysis. This data will be made freely available for anyone wishing to advance the state-of-the-art in optical character recognition technology. 

What is Ground Truth?

The Impact Centre of Competence in Digitisation explains:

In digital imaging and OCR, ground truth is the objective verification of the particular properties of a digital image, used to test the accuracy of automated image analysis processes. The ground truth of an image’s text content, for instance, is the complete and accurate record of every character and word in the image. This can be compared to the output of an OCR engine and used to assess the engine’s accuracy, and how important any deviation from ground truth is in that instance.
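One common way to compare an engine's output with ground truth is the character error rate: the edit distance between the two strings divided by the length of the ground truth. The sketch below is an illustration of that general measure, not the competition's evaluation method; the function names are hypothetical. Python compares strings by Unicode code point, so for Arabic any combining diacritics would count as separate characters here.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]

def character_error_rate(ground_truth, ocr_output):
    """Edit distance divided by the length of the ground truth."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return edit_distance(ground_truth, ocr_output) / len(ground_truth)

# One wrong character out of seven.
rate = character_error_rate("library", "1ibrary")
```

A rate of 0 means the transcription matches the ground truth exactly; higher values mean more corrections would be needed.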

The task of creating such a dataset is enormous, however, so we're looking to build a network of folks who might be interested in sparing some time to transcribe a page or two.

If you're interested in learning more, and possibly contributing, we would love to hear from you! Please send us your details and we'll be in touch about upcoming workshops and activities to be held both in London and remotely.

 

03 February 2018

Fashion Design Competition Winner Announced

The British Library has recently run a fashion design competition using our digital collections to inspire a new generation of fashion designers. In partnership with the British Fashion Council and fashion house Teatum Jones, students from universities across the UK were invited to create a fashion portfolio which tells an inspirational story drawing on the British Library’s Flickr Collection.

The Flickr Commons Collection contains over 1 million images ‘snipped’ from around 65,000 digitised books largely from the 19th Century. This collection is ideal for creative researchers and has already inspired artists and designers.


This competition challenged students’ creativity and story-telling through design and research skills. Working to a brief set by the Library and Teatum Jones, students were given free rein to find inspiration from across the Library’s collections, ranging from South American costumes to aviation.

To open the competition and welcome students to the Library we held a Fashion Research Masterclass as part of our postgraduate open day programme in October 2017. The day featured talks from experts across the Library and an inspirational Show and Tell focusing on a variety of collections that could inspire fashion research and design.


The competition judging took place in January 2018, with 8 finalists from across the UK presenting their portfolios to a panel of specialists from the fashion sector including Paul Baptiste (Accessories Buying Manager, Fenwick), Catherine Teatum and Rob Jones (Creative Directors, Teatum Jones), Judith Rosser-Davies (British Fashion Council) and Mahendra Mahey (British Library Labs). Unfortunately we could only have one winner and we are delighted to announce Alanna Hilton from Edinburgh College of Art as our winner.

Alanna’s collection ‘Unlabelled’ has designs “that reject labelling [and] where consumers of all ages, sizes, ethnicities and genders can find beautiful clothes”. For winning the competition, Alanna received a financial prize from the British Fashion Council and membership of the British Library membership scheme. All of the submissions were incredibly strong, and we were very pleased to see the students all taking such different design routes, inspired by our collections.

The Library will continue to explore how we facilitate the creative research process, as well as link it to business support provided by our Business and IP Centre. Our work with the fashion students highlighted the real variety in the ways researchers access our collections, find inspiration and use the library.

We are continuing to work with the British Fashion Council, exploring how to better reach this part of the research community and we will have an update from our winning designer later in the year. The project is also grateful to Innovate UK for their support.

This post is by Edmund Connolly from the British Library's Higher Education & Cultural Engagement department. He is on Twitter as @Ed_Connolly.