THE BRITISH LIBRARY

Digital scholarship blog

12 posts from February 2018

28 February 2018

Announcing the BL Labs roadshows locations and dates for 2018!

The @BL_Labs Roadshows: dates and locations for 2018

Do you want to learn more about the British Library’s digital collections? Are you interested in discovering how other researchers have used our digitised material in creative and innovative ways? Would you like to give us feedback on the kinds of services we are providing and would like to provide for digital scholars? Come and meet Library staff and gain an insight into some of the opportunities and challenges of working with our digital content. Get advice, pick up tips, and consider entering the digital project you have been working on for one of the BL Labs Awards (deadline Thursday 11th October 2018).

Our @BL_Labs Roadshows will be held at university departments across the UK between March and June 2018. Events will include presentations from the British Library and host institutions, practical hands-on workshops, a chance to explore and discuss what you might do with some of the Library's data, and an opportunity to speak to and get feedback from experts. We’re also keen to hear your views on some of the long-term services the British Library is hoping to develop for those who want to work with our digital collections and data.

Register for one of the roadshows! They are FREE to attend and OPEN TO ALL (unless otherwise stated). For further details about locations we are visiting this year, see below: 

BL Labs Roadshow locations for 2018

March

  • Monday 26 March 2018 (10:00 – 13:00) - BL Labs Roadshow at CityLIS (City University of London Department of Library and Information Science), London (internal event)

April

May

June

  • Tuesday 5 June 2018 (12:00 – 16:00) - BL Labs Roadshow at the University of Leeds, Leeds
  • Wednesday 27 June 2018 (09:00 – 13:00) - BL Labs Roadshow at the University of Birmingham, Birmingham

You will be able to view the full programme details for each of the roadshows and book your place via Eventbrite. Links will be live shortly; in the meantime, visit our events page.

For any further questions, please contact us at labs@bl.uk.

The British Library Labs project is funded by the Andrew W. Mellon Foundation and the British Library.

Posted by BL Labs

23 February 2018

The Cartographer's Confession

Last summer I posted about the Ambient Literature project, which is researching if and how digital media can create bridges between story and place. At the heart of this project, three authors, Kate Pullinger, James Attlee and Duncan Speakman, have each created new experimental works that respond to the presence of a reader, and these aim to show how we can redefine the rules of the reading experience through innovative use of technology.

I’m pleased to report that one of the Ambient Literature commissioned works, The Cartographer's Confession by James Attlee, is the winner of the 8th annual if:book award for New Media Writing, which was presented at Bournemouth University recently.

if:book award winner James Attlee, with (left to right) Chris Meade, Justine Solomons, Jim Pope, Andy Campbell, Stella Wisdom and Emma Whittaker

The Cartographer’s Confession is an immersive story based in London, where readers interact with the app on location, to discover the long-hidden secrets of ‘The Cartographer’.  Containing visual material, as well as having an original musical soundtrack, this is a ‘mixed reality’ experience. Accepting his award from if:book director Chris Meade, Attlee confessed that this blending of sound, video and story is something he had wanted to do previously alongside his print-based works, but he wasn’t able to make it happen until collaborating with digital producer Emma Whittaker.

I very much enjoyed this work, especially the music, and I also encourage you to try the experience. All you will need is a smartphone, a set of headphones, and the ability to visit a number of locations in London, through which the story unfolds (though there is also an ‘armchair mode’ if you are unable to get to London). You can download the app for free and it is available on iOS and Android.

Furthermore, as Ambient Literature is a research project, the team are very keen to speak with participants to learn about their reactions to the work. So if you have completed The Cartographer's Confession (or are close to finishing it) and are willing to be interviewed about your experience for about 15 minutes (either in person or over Skype, Facetime etc.), please fill out this form and one of the project researchers will be in touch via email to arrange a time to talk. If you have any questions about this process, please contact Dr Michael Marcinkowski.

This post is by Digital Curator Stella Wisdom, on twitter as @miss_wisdom and member of the Ambient Literature Advisory Board.

22 February 2018

BL Labs 2017 Symposium: Picturing Canada and Interactive Map (Staff Award Runner Up)

Putting collection metadata on the map: Picturing Canada

The Picturing Canada project began in 2012 as a British Library, Eccles Centre and Wikimedia UK collaboration to digitise a collection and experiment with releasing high quality reproductions of collection items into the public domain. At its heart the project sought to open up an under-used collection of photographs, connecting them with new audiences and uses outside of the walls of the British Library. It also provided a template for the Library’s subsequent public domain releases and has provided many with an insight into the depth of the Library’s Canadian collections.

Before the collection could be released it needed to be digitised and robust metadata created. Fortunately the Library had a good working batch of metadata, created off the back of work done by researchers from Dalhousie University in the 1980s. The initial use of this to the project was clear, but in digitising the images and putting them and the metadata online something became apparent: most images had some sort of information (be it a title or a photographer’s studio address) that could be used to determine a geographical location for the images.

At the time, this realisation was parked for future investigation but the 2015 exhibition, ‘Canada Through the Lens’, drawing off the same digitised collection, opened up an opportunity to try and use this information to map the collection and generate new insights into its contents. Much of the coordinate determination and mapping was done by Joan Francis, co-awardee of the BL Labs runner-up prize, who worked to find and add coordinates for the photographs. This was a relatively simple but time-consuming process involving finding locations in the metadata image title or, in the case of a photographer’s studio address, on the photograph itself. These text-based locations were then converted into co-ordinates compatible with Google Fusion Tables (there’s an excellent tutorial here) and added to records for each image.
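The matching step described above can be pictured with a small Python sketch. It is only an illustration under stated assumptions, not the workflow the project actually followed (which was largely manual): the tiny gazetteer, its approximate coordinates, the record field names and the function names are all invented here.

```python
import csv
import io

# Hypothetical mini-gazetteer: place name -> approximate (latitude, longitude).
GAZETTEER = {
    "Toronto": (43.65, -79.38),
    "Montreal": (45.50, -73.57),
    "Quebec": (46.81, -71.21),
}


def locate(record):
    """Return (place, lat, lon) for the first gazetteer place named in the
    record's title or studio address, or None if nothing matches."""
    text = " ".join([record.get("title", ""), record.get("studio_address", "")])
    for place, (lat, lon) in GAZETTEER.items():
        if place.lower() in text.lower():
            return place, lat, lon
    return None


def to_csv(records):
    """Write the records that could be located to CSV text, one row per image,
    with separate latitude and longitude columns for a mapping tool."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["title", "place", "latitude", "longitude"])
    for record in records:
        match = locate(record)
        if match:
            writer.writerow([record["title"], *match])
    return out.getvalue()
```

Records with no recognisable place are simply left out of the CSV, which mirrors the fact that only part of the collection has been mapped so far.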

 

The result of this is the map that you see above, a series of points which can be clicked on to see a partial metadata record for the item as well as a link to the photograph itself on Wikimedia Commons. As the work is time-consuming and fraught with potential error we have so far produced a robust mapping of only about four fifths of the collection, and this is the work you see here. Interestingly, the map is not just a useful finding aid – although it performs this function very well.

Mapping the collection also provides insight into the geography of photographic production in Canada during the period this collection was created (1895 – 1923). It is clear, for instance, how significant the eastern metropolitan areas of Toronto, Montreal and Quebec are to Canada’s photographic production in this period. Similarly, the corridors of production seen running close to the Canada-US border and occasionally spurring north also suggest the significance of the railroad to Canada’s photographic economy. So the map helps users to find images but also offers more questions; an exciting prospect for continued work.

Posted by BL Labs on behalf of Philip Hatfield and Joan Francis

Submit a project for one of the BL Labs 2018 Awards! Join us on 12 November 2018 for the BL Labs annual Symposium at the British Library.

21 February 2018

BL Labs 2017 Symposium: Opening up the British Library’s Early Indian Printed Books Collection (Staff Award Winner)

Making the British Library’s valuable collection of early Bengali books more accessible to researchers and the general public around the world rests heavily on the collaborative work undertaken across different teams of the library and partners in the UK and abroad. The commitment and passion of the project team has relied on the contribution and expertise of collaborators, as well as the forward thinking vision of the library, partners and fundraisers.

Receiving the BL Labs Staff Award 2017 is a great opportunity to thank everyone involved. 

Members of the Two Centuries of Indian Print team receiving the British Library Labs award at the Symposium on 30th October 2017
 
Tom Derrick (Digital Curator) was in India at the same time the team received their Award.

The Two Centuries of Indian Print project is a partnership between the British Library, the School of Cultural Texts and Records (SCTR) at Jadavpur University, Srishti Institute of Art, Design and Technology, and the Library at SOAS University of London, among others. It has also involved collaborations with the National Library of India, and other institutions in India.

The AHRC Newton-Bhabha Fund and the Department for Business, Energy and Industrial Strategy have generously funded the work undertaken so far by the project, focusing on early printed Bengali books. Many are unavailable in other library collections or are extremely difficult to locate and access. The project has undertaken a variety of initiatives, from the digitisation of books and enhancement of the catalogue records in English and Bengali, to stimulating the use of digital humanities tools and techniques, running a programme of digital skills sharing and capacity building workshops, and hosting the South Asia Series seminars. All of these initiatives greatly contribute to the discovery and study of the collection.

The project is also conducting groundbreaking work in finding a solution to Optical Character Recognition (OCR) in Bangla script. OCR is not available for South Asian languages currently, and harnessing viable OCR technology would enable full text search of the books, paving the way for researchers to use natural language processing techniques to perform large-scale analysis across a large corpus of text covering a diverse range of topics relating to Indian society, religion, and politics, to name but a few. Doing so will increase the possibilities for new discoveries in this academic field.

However, despite its status as one of the most widely spoken languages in the world, Bangla script has been greatly underserved by providers of OCR solutions. This is due in part to the orthographical and typographical variances that have taken place in recent centuries that make building a dictionary and character ‘classifier’ more challenging. Due to the wide date range of the books we are digitising, these issues affect the quality of OCR. The physical condition of our historical books, including faded text, presents additional difficulties for creating machine readable versions of the books. 

To overcome these obstacles, the project team has been advancing the development of OCR for Bangla through the organisation of an international competition which reviewed the state-of-the-art in commercial and open source text recognition tools. The results of the competition will be announced at the ICDAR 2017 conference in Kyoto later this month. Watch this space! The competition dataset has been made openly available for download and reuse for any researchers or institutions who would like to experiment with OCR for Bengali.

A page from the Animal Biographies, VT 1712 showing its transcription produced for the ICDAR 2017 competition

The project has organised two Skills Exchange Programmes, hosting mid-career Library professionals from the National Library of India at the British Library for a week, providing a packed programme of tours and talks from all areas of the Library. The project has also conducted digital skills sharing and capacity building workshops for library professionals and archivists from cultural heritage institutions in India. The first workshop took place at Jadavpur University, Kolkata, in December 2016. Library and information professionals from cultural heritage institutions in Bengal took part in a one-day event to learn more about how information technology is transforming humanities research today and in turn Library services, as well as the methods for interrogating humanities-related datasets.

After the success of this first workshop another event was held in July 2017, at which more than 30 library professionals discussed OCR developments for Bangla, trying out different tools and discussing digital scholarship techniques and projects. Most recently, the project’s digital curator facilitated a workshop around Digitisation Standards at the International Conference of Asian Libraries in Delhi. The workshops continue in earnest in the new year with another digital humanities skills workshop planned for January 2018 to be held in partnership with the Srishti Institute of Art, Design, and Technology.

Attendees of the workshop held at Jadavpur University in December 2016 taking part in a group activity to discuss the application of digital humanities methods to library collections

The Project Team also held a two-day Academic Symposium on South Asian book history at Jadavpur University in the summer, with 17 speakers from India, wider South Asia, and the UK. Attendance was between 50 and 70 people a day and feedback was very good. We plan to have a publication arising from this Symposium, and to upload a video to our project webspace. The project also hosts a popular series of talks based around the Two Centuries of Indian Print project and the British Library’s South Asia collections. The seminars take place fortnightly at the British Library. So far we have hosted a range of academics and researchers, from PhD students to senior academics from the UK and abroad, who share cutting-edge research with discussion chaired by curators and specialists in the field. The seminars have been a great success, attracting large attendances and speakers from around the world. We also host a number of show and tells of our material to raise awareness of our collection and to engage in community outreach.

Everyone on the project is thrilled to have won this award and we will be working hard in 2018 to continue bringing the Two Centuries of Indian Print project to the attention and use of researchers and the general public.

Submit a project for one of the BL Labs 2018 Awards! Join us on 12 November 2018 for the BL Labs annual Symposium at the British Library.

Posted by BL Labs on behalf of The Two Centuries of Indian Print team.

15 February 2018

BL Labs 2017 Symposium: Git Lit, Learning & Teaching Award Runner Up

Applications of Distributed Version Control Technologies Toward the Creation of 50,000 Digital Scholarly Editions

The British Library maintains a collection of roughly 50,000 digital texts, scanned from public-domain books, most of which were originally published in the 19th century. As scanned books, their text format is Analyzed Layout and Text Object (ALTO) Extensible Markup Language (XML), a verbose markup format created by Optical Character Recognition (OCR) software, and one which is only marginally human-readable. Our project, Git-Lit, converts each text to the plain text format Markdown, creates version-controlled repositories for each using the distributed version control system Git, and posts the repositories to the project management platform GitHub, where they can be edited by anyone. Along the way, websites for each text, optimized for readability, are automatically generated via GitHub Pages. These websites integrate with the annotation platform Hypothes.is, enabling them to be annotated. In this way, Git-Lit aims to make this collection of British Library electronic texts discoverable, readable, editable, annotatable, and downloadable.
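To give a feel for the first step, here is a minimal Python sketch of pulling plain text out of an ALTO file. It is not the Git-Lit code itself; it assumes only the general shape of ALTO, in which words are stored in the CONTENT attribute of String elements nested inside TextLine and TextBlock elements, and the function names are invented for illustration.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip any '{namespace}' prefix, so the same code copes with
    different ALTO schema versions."""
    return tag.rsplit("}", 1)[-1]


def alto_to_text(alto_xml):
    """Return plain text from an ALTO document: one output line per TextLine,
    with a blank line between TextBlocks."""
    root = ET.fromstring(alto_xml)
    paragraphs = []
    for block in root.iter():
        if local_name(block.tag) != "TextBlock":
            continue
        lines = []
        for line in block.iter():
            if local_name(line.tag) != "TextLine":
                continue
            # Each word is a String element; its text is in the CONTENT attribute.
            words = [
                s.get("CONTENT", "")
                for s in line.iter()
                if local_name(s.tag) == "String"
            ]
            lines.append(" ".join(words))
        paragraphs.append("\n".join(lines))
    return "\n\n".join(paragraphs)
```

A real conversion would also need to rejoin words hyphenated across line breaks and decide how to represent headings and page breaks in Markdown; this sketch leaves all of that out.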

A Screenshot of the Website Automatically Generated from the British Library Electronic Text


The biggest advantage of using a distributed version control system like Git is that it leverages the kinds of decentralized collaboration workflows that have long been in use in software development. Open-source software and web development, for which Git and GitHub were originally designed, is a much-studied methodology, long proven to be more effective than closed-source methods. Rather than maintain a central silo for serving code and electronic texts, the decentralized approach ensures a plurality of textual versions. Since anyone may copy ("fork") a project, modify it, and create their own version, there is no one central, canonical text, but many. Each version may freely borrow ("pull") from others, request that others integrate their changes ("pull request"), and discuss potential changes ("issues") using the project management subsystems of GitHub. This workflow streamlines collaboration, and encourages external contributions. Furthermore, since each change ("commit") requires a description of the commit, and reasons for it, the Git platform enforces the kind of editorial documentation necessary for scholarly editing. We like to think of git-based editing, therefore, as scholarly editing, and GitHub-based collaboration as a democratization of scholarly editing.

Furthermore, since GitHub allows instant editing of texts in the web browser, it is a simple and intuitive method of crowdsourcing the text cleanup process. Since OCRd texts are often full of errors, GitHub allows any reader to correct an obvious OCR error she or he finds. The analogous process of reporting errors to centralized text repositories like Project Gutenberg has been known to take several years. On GitHub, however, it is instantaneous.

Not the least advantage of this setup is the automated creation of websites from the plain text sources. Not only does this transform the Markdown to a clean, readable edition of the text, but it provides integration with the annotation platform Hypothes.is. Hypothes.is allows for social annotation of a text, making it ideal for classroom use. Professors may assign a British Library text as a course reading, and may require their students to annotate it, an activity which can generate discussions in the limitless virtual margins of this electronic textual space.

The Git-Lit project has so far posted around 50 texts to GitHub, as prototypes, with the full corpus of roughly 50,000 texts soon to come. After the full corpus is processed in this way, we'll begin enhancing some of the metadata. So far, we have developed techniques for probabilistically inferring the language of each text, and using Ben Schmidt's document vectorization method, Stable Random Projection, we have been able to probabilistically infer Library of Congress classifications, as well. This enables the automatic generation of sub-corpora like PR (British Literature), or PZ (American Literature).
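The general idea behind hash-based document vectors can be sketched in a few lines of Python. This is a deliberately simplified illustration and not Ben Schmidt's implementation or the Git-Lit code: the dimension, the log weighting, the nearest-neighbour step and all function names are assumptions made for the example.

```python
import hashlib
import math
from collections import Counter

DIM = 256  # illustrative size for the projected vectors


def word_signs(word, dim=DIM):
    """Derive a fixed +1/-1 vector for a word from hashes of the word, so the
    same word always maps to the same vector without a stored vocabulary."""
    bits = []
    counter = 0
    while len(bits) < dim:
        digest = hashlib.sha1(f"{word}:{counter}".encode("utf-8")).digest()
        for byte in digest:
            for i in range(8):
                bits.append(1 if (byte >> i) & 1 else -1)
        counter += 1
    return bits[:dim]


def project(text, dim=DIM):
    """Project a text to a dense vector: the sum of its words' sign vectors,
    each weighted by the log of the word's count."""
    counts = Counter(text.lower().split())
    vec = [0.0] * dim
    for word, n in counts.items():
        weight = math.log(1 + n)
        for i, sign in enumerate(word_signs(word, dim)):
            vec[i] += weight * sign
    return vec


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def nearest_label(text, labelled):
    """Guess a label for text by copying it from the most similar labelled
    example; labelled is a list of (label, example_text) pairs."""
    vec = project(text)
    return max(labelled, key=lambda item: cosine(vec, project(item[1])))[0]
```

Because the vectors depend only on hashes of the words, texts processed at different times or on different machines end up in the same space, which is what makes it practical to compare an unclassified text against already classified ones.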

In the coming year, we hope to integrate the Git-Lit transformed British Library texts into a structured database, further enhancing the discoverability of its texts. We have just received a micro-grant from NYC-DH to help launch Corpus-DB, a project also aiming to produce textual corpora, and through Corpus-DB, we will soon create a SQL database containing the metadata, our enhanced and inferred metadata, and other aggregated book data gleaned from public APIs. This will soon give readers and computational text analysts the ability to download groups of British Library electronic texts. Users interested in downloading, say, all novels set in London, will be able to get a complete full-text dump of all public-domain novels in this category by visiting a URL such as api.corpus-db.org/novels/setting/London. We expect that this will greatly streamline the corpus creation process that takes up so much of the time in computational text analysis.
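The kind of query that such a URL might stand for can be sketched with Python's built-in sqlite3 module. The table layout, column names, identifiers and sample rows below are all invented for illustration; nothing here describes Corpus-DB's actual schema or API.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical 'texts' table and a
    few placeholder rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE texts (
               id TEXT PRIMARY KEY,
               title TEXT,
               genre TEXT,
               setting TEXT,
               lcc TEXT  -- inferred Library of Congress class, e.g. 'PR'
           )"""
    )
    conn.executemany(
        "INSERT INTO texts VALUES (?, ?, ?, ?, ?)",
        [
            ("demo-001", "Example Novel A", "novel", "London", "PR"),
            ("demo-002", "Example Travelogue", "travel", "London", "DA"),
            ("demo-003", "Example Novel B", "novel", "London", "PR"),
            ("demo-004", "Example Novel C", "novel", "New York", "PZ"),
        ],
    )
    return conn


def texts_by(conn, genre, setting):
    """Return (id, title) pairs for every text with the given genre and
    setting, using a parameterised query."""
    rows = conn.execute(
        "SELECT id, title FROM texts WHERE genre = ? AND setting = ? ORDER BY id",
        (genre, setting),
    )
    return rows.fetchall()
```

In this sketch a request for novels set in London would amount to calling texts_by(conn, "novel", "London") and then gathering the full text for each returned identifier.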

Both Git-Lit and Corpus-DB are open-source projects, open to contributions from anyone, regardless of skill. If you'd like to contribute to our project in some way, get in contact with us, and we'll tell you how you can help.

Jonathan Reeve

Jonathan Reeve is a third-year graduate student in the Department of English and Comparative Literature at Columbia University, where he specializes in computational literary analysis. Find his recent experiments at jonreeve.com.

If this blog post has stimulated your interest in working with the British Library's digital collections, start a project and enter it for one of the BL Labs 2018 Awards! Join us on 12 November 2018 for the BL Labs annual Symposium at the British Library to find out who wins.

Posted by BL Labs on behalf of Jonathan Reeve

14 February 2018

BL Labs 2017 Symposium: Movable Type, Commercial Award Winner

Movable Type is a tabletop word game, and something of a love letter to classic books and authors, made completely of custom playing cards. While the game’s appearance might remind you of Scrabble, it has some tricks up its sleeve to give it a much more modern and dynamic feel.

Figure 1: Movable Type Card Game


The initial idea for Movable Type was born around two years ago. I had been making games for some years and knew I wanted to do something with a word game. My main objective was to have a game that was very interactive, easy to grasp, and tactical, while also being very quick to play. As much as I love word games, some of them have a tendency to outstay their welcome. 

The central mechanism in Movable Type is called card-drafting. This method allows players to pick their letter cards each round – this does away with the large amounts of luck you find in many classic word games, and instead shifts attention onto the tactical decisions of the players. It also means that rules are kept very simple and that players can take their turns simultaneously, creating a much more dynamic play environment.

Movable Type - The Cards


The prototype for Movable Type was only a few weeks old when I settled on the art style I was going to use in the final product. I’ve been a long-time fan of the British Library Flickr account, which lets users browse through images from public domain books. Once I had spotted the large collection of initial capitals, I was sold!

Movable Type - Illustrated Letters


My wife, Tiffany Moon, is a graphic designer by trade. She helped clean up the images and present these beautiful pieces of art in a colourful new fashion, appropriate for a retail product. I also wanted portraits of some of my favourite authors to be in the game, so I commissioned Alisdair Wood to create woodcut-style images of ten classic, influential and diverse literary figures. Without those initial capital images taken from the British Library collection and used to direct the game’s overall style, Movable Type would likely not look half as impressive and definitely wouldn’t resonate with me and many players like the current style does.

Movable Type - Illustrated Famous People


I launched Movable Type on the crowdfunding platform Kickstarter last year. Upon its release, it won the Imirt Irish Game Award for Best Analog Game and was second runner-up for Game of the Year. It was well received at several public events and sold out of its initial print run, so I decided that a second edition of the game was in order. That bigger and better second edition is funding on Kickstarter and should be in some select retail stores by mid to late 2018 (fingers crossed!).

Receiving the Commercial category BL Labs Award was a huge boon for the reputation of the game and for me as a game designer. Furthermore, it was a genuine honour to be at the British Library for this event and to be able to share my product with the audience there.

If this blog post has stimulated your interest in working with the British Library's digital collections, start a project and enter it for one of the BL Labs 2018 Awards! Join us on 12 November 2018 for the BL Labs annual Symposium at the British Library.

Posted by BL Labs on behalf of Robin David O’Keeffe

13 February 2018

BL Labs 2017 Symposium: Samtla, Research Award Runner Up

Samtla (Search And Mining Tools for Labelling Archives) was developed to address a need in the humanities for research tools that help to search, browse, compare, and annotate documents stored in digital archives. The system was designed in collaboration with researchers at Southampton University, whose research involved locating shared vocabulary and phrases across an archive of Aramaic Magic Texts from Late Antiquity. The archive contained texts written in Aramaic, Mandaic, Syriac, and Hebrew languages. Due to the morphological complexity of these languages, where morphemes are attached to a root morpheme to mark gender and number, standard approaches and off-the-shelf software were not flexible enough for the task, as they tended to be designed to work with a specific archive or user group. 

Figure 1: Samtla supports tolerant search allowing queries to be matched exactly and approximately.

Samtla is designed to extract the same or similar information that may be expressed by authors in different ways, whether it is in the choice of vocabulary or the grammar. Traditionally search and text mining tools have been based on words, which limits their use to corpora containing languages where 'words' can be easily identified and extracted from text, e.g. languages with a whitespace character like English, French, German, etc. Word models tend to fail when the language is morphologically complex, like Aramaic and Hebrew. Samtla addresses these issues by adopting a character-level approach stored in a statistical language model. This means that rather than extracting words, we extract character sequences representing the morphology of the language, which we then use to match the search terms of the query and rank the documents according to the statistics of the language. Character-based models are language independent as there is no need to preprocess the document, and we can locate words and phrases with a lot of flexibility. As a result Samtla compensates for the variability in language use, spelling errors made by users when they search, and errors in the document as a result of the digitisation process (e.g. OCR errors).
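A much-simplified Python sketch shows why matching on character sequences tolerates spelling variation. It is not Samtla's implementation, which uses a statistical language model; here documents are simply ranked by the share of the query's character trigrams they contain, and the function names and the choice of trigrams are assumptions made for the example.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams after lower-casing and collapsing
    whitespace. No word segmentation is needed."""
    text = " ".join(text.lower().split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def score(query, document, n=3):
    """Fraction of the query's character n-grams that also occur in the
    document (1.0 for an exact substring match, lower for partial matches)."""
    q = char_ngrams(query, n)
    d = char_ngrams(document, n)
    total = sum(q.values())
    if total == 0:
        return 0.0
    shared = sum(min(count, d[gram]) for gram, count in q.items())
    return shared / total


def search(query, documents, n=3):
    """Rank documents (a dict of name -> text) by score, best match first,
    dropping documents that share nothing with the query."""
    scored = [(name, score(query, text, n)) for name, text in documents.items()]
    scored = [item for item in scored if item[1] > 0]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

With this scoring, a query spelled "digitization" still matches a document containing "digitisation", because most of the three-character sequences are shared even though the words differ.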

Figure 2: Samtla's document comparison tool displaying a semantically similar passage between two Bibles from different periods.

 The British Library have been very supportive of the work by openly providing access to their digital archives. The archives ranged in domain, topic, language, and scale, which enabled us to test Samtla’s flexibility to its limits. One of the biggest challenges we faced was indexing larger-scale archives of several gigabytes. Some archives also contained a scan of the original document together with metadata about the structure of the text. This provided a basis for developing new tools that brought researchers closer to the original object, which included highlighting the named entities over both the raw text, and the scanned image.

Currently we are focusing on developing approaches for leveraging the semantics underlying text data in order to help researchers find semantically related information. Semantic annotation is also useful for labelling text data with named entities, and sentiments. Our current aim is to develop approaches for annotating text data in any language or domain, which is challenging due to the fact that languages encode the semantics of a text in different ways.

As a first step we are offering labelled data to researchers, as part of a trial service, in order to help speed up the research process, or provide tagged data for machine learning approaches. If you are interested in participating in this trial, then more information can be found at www.samtla.com.

Figure 3: Samtla's annotation tools label the texts with named entities to provide faceted browsing and data layers over the original image.

 If this blog post has stimulated your interest in working with the British Library's digital collections, start a project and enter it for one of the BL Labs 2018 Awards! Join us on 12 November 2018 for the BL Labs annual Symposium at the British Library.


Posted by BL Labs on behalf of Dr Martyn Harris, Prof Dan Levene, Prof Mark Levene and Dr Dell Zhang.

05 February 2018

8th Century Arabic science meets today's computer science

Or, Announcing a Competition for the Automatic Transcription of Historical Arabic Scientific Manuscripts 

“An impartial view of Digital Humanities (DH) scholarship in the present day reveals a stark divide between ‘the West and the rest’…Far fewer large-scale DH initiatives have focused on Asia and the non-Western world than on Western Europe and the Americas…Digital databases and text corpora – the ‘raw material’ of text mining and computational text analysis – are far more abundant for English and other Latin alphabetic scripts than they are for Chinese, Japanese, Korean, Sanskrit, Hindi, Arabic and other non-Latin orthographies…Troves of unread primary sources lie dormant because no text mining technology exists to parse them.”

-Dr. Thomas Mullaney, Associate Professor of Chinese History at Stanford University

Supporting the use of Asian & African Collections in digital scholarship means shining a light on this stark divide and seeking ways to close the gap. In this spirit, we are excited to announce the ICFHR2018 Competition on Recognition of Historical Arabic Scientific Manuscripts.

Add MS 7474_0043.script

The Competition

Drawing together experts from the British Library, The Alan Turing Institute, Qatar Digital Library and PRImA Research Lab, our aim in launching this competition is to play an active role in advancing the state-of-the-art in handwritten text recognition technologies for Arabic. For our first challenge we are focussing on finding an optimal solution for accurately and automatically transcribing historical Arabic scientific handwritten manuscripts.

Though such technologies are still in their infancy, unlocking historical handwritten Arabic manuscripts for large-scale text analysis has the potential to truly transform research. In conjunction with the competition we hope to build and make freely and openly available a substantial image and ground truth dataset to support continued efforts in this area.

Enter the Competition

Organisers

  • Apostolos Antonacopoulos, Professor of Pattern Recognition, University of Salford and Head of the Pattern Recognition and Image Analysis (PRImA) research lab
  • Christian Clausner, Research Fellow at the Pattern Recognition and Image Analysis (PRImA) research lab
  • Nora McGregor, Digital Curator at British Library, Asian & African Collections
  • Daniel Lowe, Curator at British Library, Arabic Collections
  • Daniel Wilson-Nunn, PhD student at University of Warwick & Turing PhD Student based at Alan Turing Institute
  • Bink Hallum, Arabic Scientific Manuscripts Curator at British Library/Qatar Foundation Partnership

Further reading

For more on recent Digital Research Team text recognition and transcription projects see:

 

This post is by Nora McGregor, Digital Curator, British Library. She is on twitter as @ndalyrose