THE BRITISH LIBRARY

Digital scholarship blog

6 posts from January 2018

31 January 2018

Linking Privy Council Appeals Data

This post continues a series of blog posts relating to a PhD placement project that seeks to make data about appeals heard by the Judicial Committee of the Privy Council (JCPC) available in new formats, to enhance discoverability, and to increase the potential for new historical and socio-legal research questions. Previous posts looked at the historical context of the JCPC and related online resources, as well as the process of cleaning the data and producing some initial visualisations.

When looking at the metadata about JCPC judgments between 1860 and 1998, it became clear to me that what was in fact being represented here was a network of appeals, judgments, courts, people, organisations and places. Holding this information in a spreadsheet can be extremely useful, as demonstrated by the visualisations created previously; however, this format does not accurately capture the sometimes complex relationships underlying these cases. As such, I felt that a network might be a more representative way of structuring the data, based on a Linked Data model.

Linked Data was first introduced by Tim Berners-Lee in 2006. It comprises a set of tools and techniques for connecting datasets based on features they have in common in a format that can be understood by computers. Structuring data in this way can have huge benefits for Humanities research, and has already been used in many projects – examples include linking ancient and historical texts based on the places mentioned within them (Pelagios) and bringing together information about people’s experiences of listening to music (Listening Experience Database). I decided to convert the JCPC data to Linked Data to make relationships between the entities contained within the dataset more apparent, as well as link to external sources, where available, to provide additional context to the judgment documents.

The image below shows how the fields from the JCPC spreadsheet might relate to each other in a Linked Data structure.

Diagram of the proposed JCPC Linked Data model (JCPCDataModelHumanReadable_V1_20180104)

In this diagram:

  • Blue nodes represent distinct entities (specific instances of e.g. Judgment, Appellant, Location)
  • Purple nodes represent the classes that define these entities, i.e. what type of entity each blue node is (terms that represent the concepts of e.g. Judgment, Appellant, Location)
  • Green nodes represent properties that describe those entities (e.g. ‘is’, ‘has title’, ‘has date’)
  • Orange nodes represent the values of those properties (e.g. Appellant Name, Judgment Date, City)
  • Red nodes represent links to external sources that describe that entity

Using this network structure, I converted the JCPC data to Linked Data; the conversion process is outlined in detail in the next blog post in this series.
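To give a concrete (if simplified) sense of what this looks like in practice, the sketch below uses the Python rdflib library to express one hypothetical judgment as triples following the structure in the diagram. The namespace, class and property names, and example values are illustrative assumptions rather than the project's actual vocabulary.

    # A minimal sketch of expressing one JCPC spreadsheet row as RDF triples.
    # Namespace, class/property names and values are illustrative only.
    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import OWL, RDF, XSD

    JCPC = Namespace("http://example.org/jcpc/")  # hypothetical namespace

    g = Graph()
    g.bind("jcpc", JCPC)

    judgment = JCPC["judgment/1895-001"]    # a specific judgment (blue node)
    appellant = JCPC["appellant/smith"]     # a specific appellant (blue node)
    location = JCPC["location/bombay"]      # a specific location (blue node)

    # Each entity 'is' an instance of a class (purple nodes)
    g.add((judgment, RDF.type, JCPC.Judgment))
    g.add((appellant, RDF.type, JCPC.Appellant))
    g.add((location, RDF.type, JCPC.Location))

    # Properties (green nodes) and their values (orange nodes)
    g.add((judgment, JCPC.hasTitle, Literal("Smith v. Example")))
    g.add((judgment, JCPC.hasDate, Literal("1895-03-12", datatype=XSD.date)))
    g.add((judgment, JCPC.hasAppellant, appellant))
    g.add((judgment, JCPC.hasLocation, location))

    # A link to an external source describing the entity (red node);
    # in practice this might be a gazetteer or authority-file URI
    g.add((location, OWL.sameAs, URIRef := JCPC["external/bombay"]))

    print(g.serialize(format="turtle"))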

A major advantage of converting the JCPC data to Linked Data is the potential it provides for integration with other sources. This means that search queries can be conducted and visualisations can be produced that use the JCPC data in combination with one or more other datasets, such as those relating to a similar historical period, geographical area(s), or subject. Rather than these datasets existing in isolation from each other, connecting them could fill in gaps in the information and highlight new relationships involving appeals, judgments, locations or the parties involved. This could open up the possibilities for new research questions in legal history and beyond.

Linking the JCPC data will also allow new types of visualisation to be created, either by connecting it to other datasets, or on its own. One option is network visualisations, where the data is filtered based on various search criteria (e.g. by location, time period or names of people/organisations) and the results are displayed using the network structure shown above. Looking at the data as a network can demonstrate at a glance how the different components relate to each other, and could indicate interesting avenues for future research. In a later post in this series, I’ll look at some network visualisations created from the linked JCPC data, as well as what we can (and can’t) learn from them.
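As a simple illustration of the kind of network visualisation described above (using invented example data, not the JCPC dataset itself), the Python sketch below filters a handful of judgment, appellant and location relationships by location and draws the result with networkx and matplotlib:

    # Illustrative only: filter hypothetical JCPC-style relationships by location
    # and display them as a network using networkx and matplotlib.
    import matplotlib.pyplot as plt
    import networkx as nx

    # (judgment, related entity, relationship) -- invented example data
    edges = [
        ("Judgment 1895-001", "Smith", "hasAppellant"),
        ("Judgment 1895-001", "Bombay", "hasLocation"),
        ("Judgment 1902-044", "Patel", "hasAppellant"),
        ("Judgment 1902-044", "Bombay", "hasLocation"),
        ("Judgment 1910-013", "Jones", "hasAppellant"),
        ("Judgment 1910-013", "Sydney", "hasLocation"),
    ]

    # Keep only judgments connected to the chosen location
    chosen_location = "Bombay"
    keep = {judgment for judgment, target, rel in edges
            if rel == "hasLocation" and target == chosen_location}

    G = nx.Graph()
    for judgment, target, rel in edges:
        if judgment in keep:
            G.add_edge(judgment, target, label=rel)

    nx.draw_networkx(G, node_color="lightblue")
    plt.axis("off")
    plt.show()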

This post is by Sarah Middle, a PhD placement student at the British Library researching the appeal cases heard by the Judicial Committee of the Privy Council (JCPC). Sarah is on Twitter as @digitalshrew.

29 January 2018

BL Labs 2017 Symposium: Face Swap, Artistic Award Runner Up

Blog post by Tristan Roddis, Director of web development at Cogapp.

The genesis of this entry to the BL Labs Awards 2017 (Artistic Award Runner-up) can be traced back to an internal Cogapp hackathon in July. There I paired up with my colleague Jon White to create a system that was to be known as “the eyes have it”: the plan was to show the user's webcam feed with two boxes for eyes overlaid; they would have to move their face into position, whereupon the whole picture would morph into a portrait painting with its eyes in the same locations.

So we set to work using OpenCV and Python to detect faces and eyes in both live video and a library of portraits from the National Portrait Gallery.

We quickly realised that this wasn’t going to work:

Green rectangles are what OpenCV thinks are eyes. I have too many.

It turned out that eye detection is a bit too erratic, so we changed tack and considered only the whole face instead. I created a Python script to strip out the coordinates of faces from the portraits we had to hand, and another that would do the same for an individual frame of webcam video sent from the browser to Python using websockets. Once we had both sets of coordinates, the Python script sent the data back to the web front end, where Jon used the HTML <canvas> element to overlay the cropped portrait face exactly over the detected webcam face. As soon as we saw this in action, we realised we’d made something interesting and amusing!
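In outline, the server-side detection step looked something like the sketch below (a simplified illustration rather than our actual code): decode a frame sent from the browser, find face bounding boxes with one of OpenCV's bundled Haar cascades, and return the coordinates as JSON for the front end to use.

    # Simplified sketch of per-frame face detection for the webcam feed.
    import base64
    import json

    import cv2
    import numpy as np

    face_cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
    )

    def faces_in_frame(jpeg_base64):
        """Return JSON face rectangles [x, y, w, h] for one webcam frame."""
        jpeg_bytes = base64.b64decode(jpeg_base64)
        frame = cv2.imdecode(np.frombuffer(jpeg_bytes, dtype=np.uint8), cv2.IMREAD_COLOR)
        grey = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = face_cascade.detectMultiScale(grey, scaleFactor=1.1, minNeighbors=5)
        return json.dumps([[int(x), int(y), int(w), int(h)] for (x, y, w, h) in faces])

A websocket handler would call a function like this for each incoming frame and send the result back to the browser for the <canvas> overlay.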


And that was it for the first round of development. By the end of the day we had a rudimentary system that could successfully overlay faces on video. You can read more about that project on the Cogapp blog, or see the final raw output in this video:

A couple of months later, we heard about the British Library Labs Awards, and thought we should re-purpose this fledgling system to create something worth entering.

The first task was to swap out the source images for some from the British Library. Fortunately, the million public domain images that they published on Flickr contain 6,948 that have been tagged as “people”. So it was a simple matter to use a Flickr module for Python to download a few hundred of these and extract the face coordinates as before.
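That extraction step might look roughly like the sketch below (illustrative only, and assuming the tagged images have already been downloaded to a local folder): run the same Haar cascade over each image and store the face coordinates for the front end.

    # Illustrative only: extract face coordinates from downloaded portrait images.
    import json
    from pathlib import Path

    import cv2

    face_cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
    )

    coordinates = {}
    for image_path in Path("flickr_people_images").glob("*.jpg"):  # hypothetical folder
        image = cv2.imread(str(image_path))
        if image is None:
            continue
        grey = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
        faces = face_cascade.detectMultiScale(grey, scaleFactor=1.1, minNeighbors=5)
        if len(faces) == 1:  # keep portraits with exactly one clearly detected face
            x, y, w, h = faces[0]
            coordinates[image_path.name] = [int(x), int(y), int(w), int(h)]

    Path("face_coordinates.json").write_text(json.dumps(coordinates, indent=2))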

Once that was done, I roped in another colleague, Neil Hawkins, to help me add some improvements to the front-end display. In particular:

  • Handling more than one face in shot
  • Displaying the title of the work
  • Displaying a thumbnail image of the full source image

And that was it! The final result can be seen in the video below of us testing it in and around our office. We also plugged in a laptop running the system to a large monitor in the BL conference centre so that BL Labs Symposium delegates could experience it first-hand.

If you want to know more about this, please get in touch! Tristan Roddis tristanr@cogapp.com

A clip of Tristan receiving the Award is below (starts at 8:42 and finishes at 14:10).


23 January 2018

Using Transkribus for handwritten text recognition with the India Office Records

In this post, Alex Hailey, Curator, Modern Archives and Manuscripts, describes the Library's work with handwritten text recognition.

National Handwriting Day seems like a good time to introduce the Library’s initial work with the Transkribus platform to produce automatic Handwritten Text Recognition models for use with the India Office Records.

Transkribus is produced and supported as part of the READ project, and provides a platform 'for the automated recognition, transcription and searching of historical documents'. Users upload images and then identify areas of writing (text regions) and lines within those regions. Once a page has been segmented in this way, users transcribe the text to produce a 'ground truth' transcription – an accurate representation of the text on the page. The ground truth texts and images are then used to train a recurrent neural network to produce a tool to transcribe texts from images: a Handwritten Text Recognition (HTR) model.

Page segmented using the automated line identification tool. The document structure tree can be seen in the left panel.

After hearing about the project at the Linnean Society’s From Cabinet to Internet conference in 2015, we decided to run a small pilot project using material digitised as part of the Botany in British India project.

Producing ground truth text and Handwritten Text Recognition (HTR) models

We created an initial set of ground truth training data for 200 images, produced by India Office curators with the help of a PhD student. This data was sent to the Transkribus team to produce our first HTR model. We also supplied material for the construction of a dictionary to be used alongside the HTR, based on the text from the botany chapter of Science and the Changing Environment in India 1780-1920 and contemporary botanical texts.

The accuracy of an HTR model can be determined by generating an automated transcription, correcting any errors, and then comparing the two versions. The Transkribus comparison tool calculates a Character Error Rate (CER) and a Word Error Rate (WER), and also provides a handy visualisation. With our first HTR model we saw an average CER of 30% and WER of 50%, which reflected the small size of the training set and the number of different hands across the collections.

(Transkribus recommends using collections with one or two consistent hands, but we thought we would push on regardless to get an idea of the challenges when using complex, multi-authored archives).

WER and CER are quite unforgiving measures of accuracy. The image above has 18.5% WER and 9.5% CER.
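For anyone curious how these figures are calculated, the sketch below (a simplified illustration, not the Transkribus comparison tool itself) computes CER and WER as the edit distance between the HTR output and the ground truth, divided by the length of the ground truth:

    # Illustrative CER/WER calculation: Levenshtein edit distance between the
    # HTR output and the ground truth, divided by the ground-truth length.
    def edit_distance(a, b):
        """Minimum number of insertions, deletions and substitutions to turn a into b."""
        previous = list(range(len(b) + 1))
        for i, item_a in enumerate(a, start=1):
            current = [i]
            for j, item_b in enumerate(b, start=1):
                cost = 0 if item_a == item_b else 1
                current.append(min(previous[j] + 1,          # deletion
                                   current[j - 1] + 1,       # insertion
                                   previous[j - 1] + cost))  # substitution
            previous = current
        return previous[-1]

    def cer(hypothesis, ground_truth):
        return edit_distance(hypothesis, ground_truth) / len(ground_truth)

    def wer(hypothesis, ground_truth):
        return edit_distance(hypothesis.split(), ground_truth.split()) / len(ground_truth.split())

    print(cer("Botamical speciwens", "Botanical specimens"))  # 2 character errors -> ~0.11
    print(wer("Botamical speciwens", "Botanical specimens"))  # both words wrong -> 1.0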

For our second model we created an additional 500 pages of ground truth text, resulting in a training set of 83,358 words over 14,599 lines. We saw a marked improvement in results with this second HTR model – an average WER of 30%, and CER of 15%.

Graph showing the learning curve for our second HTR model, measured in CER

Improvements in the automatic layout detection and the ability to run the HTR over images in batch mean that we can now generate ground truth more quickly by correcting computer-produced transcriptions than we could through a fully manual process. We have since generated and corrected an additional 200 pages of transcriptions, and have expanded the training dataset for our next HTR model.

Lessons learned and next steps

We have now produced over 800 pages of corrected transcriptions using Transkribus, and have a much better idea of the challenges that the India Office material poses for current HTR technologies. Pages with margins and inconsistent paragraph widths prove challenging for the automatic layout detection, although the line identification has improved significantly, and tends to require only minor corrections (if any). Faint text, numerals, and tabulated text appeared to pose problems for our HTR models, as did particularly elaborate or lengthy ascenders and descenders.

More positively, we have signed a Memorandum of Understanding with the READ project, and are now able to take part in the exciting conversations around the transcription and searching of digitised manuscript materials, which we can hopefully start to feed into developments at the Library. The presentations from the recent Transkribus Conference are a good place to start if you want to learn more.

The transcriptions will be made available to researchers via data.bl.uk, and we are also planning to use them to test the ingest and delivery of transcriptions for manuscript material via the Universal Viewer.

By Alex Hailey, Curator, Modern Archives and Manuscripts

If you liked this post, you might also be interested in The good, the bad, and the cross-hatched on the Untold Lives blog.

22 January 2018

BL Labs 2017 Symposium: Data Mining Verse in 18th Century Newspapers by Jennifer Batt

Dr Jennifer Batt, Senior Lecturer at the University of Bristol, reported on an investigation that used text- and data-mining methods to find verse in a collection of digitised eighteenth-century newspapers, the British Library’s Burney Collection, with the aim of recovering a complex, expansive, ephemeral poetic culture that has been lost to us for well over 250 years. The collection amounts to around 1 million pages, or roughly 700 bound volumes containing 1,271 titles of newspapers and news pamphlets published in London, along with some English provincial, Irish and Scottish papers and a few examples from the American colonies.

A video of her presentation is available below:

Jennifer's slides are available on SlideShare by clicking on the image below or following the link:

Datamining for verse in eighteenth-century newspapers

https://www.slideshare.net/labsbl/datamining-for-verse-in-eighteenthcentury-newsapers 


19 January 2018

BL Labs 2017 Symposium: Imaginary Cities by Michael Takeo Magruder - Artistic Award Winner

Artist Michael Takeo Magruder has been working with the British Library's digitised collections to produce stunning and thought-provoking artworks for his project, Imaginary Cities. This is an Arts-meets-Humanities research project exploring how large digital repositories of historical cultural materials can be used to create new born-digital artworks and real-time experiences which are relevant and exciting to 21st century audiences.

The project uses images - and the associated metadata - of pre-20th-century urban maps drawn from the British Library’s online 1 Million Images from Scanned Books collection on Flickr Commons, and transforms this material into provocative fictional cityscapes.

Michael was unable to attend the fifth annual British Library Labs Symposium in person, but gave a virtual presentation about his work, which you can see in this video:

Michael was also announced as the winner of the BL Labs Artistic Award 2017 and here is a short clip of him receiving his award via Skype:

(Michael's award is announced 14 minutes and 30 seconds into the video.)

If you are inspired to create something with the British Library's collections, find out more on the British Library Labs Awards pages. The deadline this year is midnight BST on 11 October 2018. The winners will be announced at our sixth BL Labs Symposium on Monday 12 November 2018.

Posted by BL Labs.


17 January 2018

BL Labs 2017 Symposium: Keynote Talk by Josie Fraser

The fifth annual British Library Labs Symposium kicked off with an inspiring keynote speech by Josie Fraser, entitled ‘Open, Digital, Inclusive: Unleashing Knowledge’.

As well as working as a senior technology adviser within the National Technology Team at the UK Government's Department for Digital, Culture, Media and Sport, Josie is currently the Chair of Wikimedia UK.

Josie discussed the impact of the open knowledge movement on education and learning. She looked at the powerful role that Wikimedia UK and Wikimedians have played in bringing UK cultural institutions and their digital collections to new and wider audiences. Her talk also explored how open knowledge partnerships are driving diversity and better representation for all online. At the end, she took questions from the audience and invited them to join her in exploring ideas and opportunities for the future.

You can see a video of the full talk, with an introduction by Dr Adam Farquhar, Head of Digital Scholarship at the British Library, here:

You can follow this link to see her slides:


 https://www.slideshare.net/labsbl/open-digital-inclusive-unleashing-knowledge

The sixth BL Labs Symposium will be on the 12th November 2018.

Posted by BL Labs