THE BRITISH LIBRARY

Digital scholarship blog

4 posts from December 2017

30 December 2017

The Flitch of Bacon: An Unexpected Journey Through the Collections of the British Library

Digital Curator Dr. Mia Ridge writes: we're excited to feature this guest post from an In the Spotlight participant. Edward Mills is a PhD student at the University of Exeter working on Anglo-Norman didactic literature. He also runs his own (somewhat sporadic) blog, ‘Anglo-Normantics’, and can be found Tweeting, rather more frequently, at @edward_mills.

Many readers of [Edward's] blog will doubtless be familiar with the work being done by the Digital Scholarship team, of which one particularly remarkable example is the ‘In the Spotlight’ project. The idea behind the project, for anyone who may have missed it, is absolutely fascinating: to create crowd-sourced transcriptions of part of the Library’s enormous collection of playbills. The part of the project that I’ve been most involved with so far concerns titles, and it’s a two-part process: first, the title is identified among the (numerous) lines of text on the page; once this identification has been verified by multiple volunteers, the title is fed back into the database as an item for transcription.

In the Spotlight interface

Often, though, the titles alone are more than sufficient to pique my interest. One such intriguing morsel came to light during a recent transcribing stint, when I found myself faced with a title that raised even more questions than Love, Law, & Physic:

Playbill for a performance of The Flitch of Bacon

In my day-job, I’m actually a medievalist, which meant that any play entitled The Flitch of Bacon was bound to catch my attention. The ‘flitch’ refers to an ancient – and certainly medieval – custom in Dunmow, Essex, wherein couples who could prove that they had never once regretted their marriage for a year and a day would be awarded a ‘flitch’ (side) of bacon in recognition of their fidelity. I first came across the custom of these ‘flitch trials’ while watching an episode of the excellent Citation Needed podcast, and was intrigued to learn from there that references to the trials existed as far back as Chaucer (more on which later). The trials have an unbroken tradition stretching back centuries, and videos from 1925, 1952 and 2012 go some way towards demonstrating their continuing popularity. What the British Library project revealed, however, was that the flitch also served as the driver for artistic creation in its own right. A little digging revealed that the libretto to the 1776 Flitch of Bacon farce has been digitised as part of the British Library’s own collections, and the lyrics are every bit as spectacular as one might expect them to be.

Rev. Henry Bate, The Flitch of Bacon: A Comic Opera in Two Acts (London: T. Evans, 1779), p. 24.

So far, so … unique. But, of course, the medievalist that dwells deep within me couldn’t resist digging into the history of the tradition, and once again the British Library’s collections came up trumps. The official website for the Dunmow Flitch Trials (because of course such a thing exists) proudly asserts that ‘a reference … can even be found within Chaucer’s 14th-century Canterbury Tales’, a claim which can easily be checked with a quick skim through the Library’s wonderful catalogue of digitised manuscripts. The Wife of Bath’s Prologue opens with the titular wife describing her attitude towards her first three husbands, whom she ‘hadde […] hoolly in myn honde’. She keeps them so busy that they soon come to regret their marriage to her, forfeiting their right to ‘the bacoun … that som men fecche in Essex an Donmowe’ in the process:

‘The bacoun was nought fet for hem I trowe / That som men fecche in Essex an Donmowe’. From the Wife of Bath’s Tale (British Library, MS Harley 7334, fol. 89r).

Chaucer’s reference to the flitch custom is frequently taken, along with William Langland’s allusion in Piers Plowman to couples who ‘do hem to Donemowe […] To folwe for the fliche’, to be the earliest reference to the tradition that can be found in English literature. Once again, though, the British Library’s collections can help us to put this particular statement to the test; as you’ve probably guessed by now, they show that there is indeed an earlier reference to the custom waiting to be found.


Our source for this precocious French-language reference is MS Harley 4657. Like many surviving medieval manuscripts, this codex is often described as a ‘miscellany’: that is, a collection of shorter works brought together into a single volume. In the case of Harley 4657, the book appears to have been designed as a coherent whole, with the texts copied together at around the same time and sharing quires with each other; this is perhaps explained by the fact that the texts contained within it are all devotional and didactic in nature. (Miscellanies that were, by contrast, put together at a later date are known as recueils factices – another useful term, along with the ‘flitch of bacon’, to slip into conversation with friends and family members.) The bulk of the book is taken up by the Manuel des pechez, a guide to confession that was later translated into English by Robert Manning as Handling Synne. It’s in this text that the flitch custom makes an appearance, as part of a description of how many couples do not deserve any recompense for loyalty on account of their mutual mistrust (fol. 21):

Detail of the flitch passage from the Manuel des pechez (British Library, MS Harley 4657, fol. 21).

22 December 2017

All I want for Christmas is... playbills!

Digital Curator Mia Ridge with an update on our playbills crowdsourcing project (with apologies to Mariah Carey for the dodgy headline)...

What do you do once you've eaten all the chocolates and cheese and watched all the Christmas movies? If you haven't had a go at transcribing historic playbills yet, the holidays are a great time to start.

Home, Sweet Home, from A collection of playbills from miscellaneous theatres: Nottingham - Oswestry 1755-1848 ([British Isles]: s.n., 1755-1848) <http://access.bl.uk/item/viewer/ark:/81055/vdc_100022589132.0x000002>

As 2017 turns into 2018, we thought it was time for an update on our progress with In the Spotlight. We've had over 20,000 contributions from over 2,000 visitors from 61 countries. Together, they've completed 21 sets of tasks on individual volumes - a wonderful result. We're still analysing the transcribed data, but it looks good so far. Our next step is agreeing the details of including the results in the Library's catalogue - once that's done, information from individual playbills will be searchable for the first time.

Since the project launched in early November we've had some fantastic feedback, questions and comments on our forum and on social media. For example, Sylvia Morris @sylvmorris1 has written two blog posts, International Migrants Day: Ira Aldridge and theatre and British Library project enlists public to transcribe historical playbills. Twitter users like @e_stanf shared fantastic images they'd discovered, and we even made The Stage and the Russian media! Look out for more updates and blog posts from project participants in the new year.

Questions from our participants include a request from a PhD student to collect references to plays set at fairs. A question about plays being 'for the benefit of' led to the Wikipedia entry for 'benefit performances' being updated with one of our images. Share your curiosities and questions on our forum or twitter - we love hearing from you!

We haven't forgotten about Convert-a-Card in the excitement of launching In the Spotlight. Since launch, this project for digitising information from old card catalogues has had over 33,000 contributions. Early in the new year, we'll be adding a thousand new records to the Library's catalogue. Our thanks to everyone who's made a contribution.

So if you're looking for entertainment these holidays, we invite you to step Into the Spotlight at http://playbills.libcrowds.com and discover how people entertained themselves before Netflix!

21 December 2017

Cleaning and Visualising Privy Council Appeals Data

This blog post continues a recent post on the Social Sciences blog about the historical context of the Judicial Committee of the Privy Council (JCPC), useful collections to support research and online resources that facilitate discovery of JCPC appeal cases.

I am currently undertaking a three-month PhD student placement at the British Library, which aims to enhance the discoverability of the JCPC collection of case papers and explore the potential of Digital Humanities methods for investigating questions about the court’s caseload and its actors. Two methods I’ll be using are creating visualisations to represent data about these judgments and converting this data to Linked Data. In today’s post, I’ll focus on the process of cleaning the data and creating some initial visualisations; information about Linked Data conversion will appear in a later post.

The data I’m using refers to appeal cases that took place between 1860 and 1998. When I received the data, it was held in a spreadsheet where information such as ‘Judgment No.’, ‘Appellant’, ‘Respondent’, ‘Country of Origin’, ‘Judgment Date’ had been input from Word documents containing judgment metadata. This had been enhanced by generating a ‘Unique Identifier’ for each case by combining the judgment year and number, adding the ‘Appeal No.’ and ‘Appeal Date’ (where available) by consulting the judgment documents, and finding the ‘Longitude’ and ‘Latitude’ for each ‘Country of Origin’. The first few rows looked like this:

The first few rows of the JCPC judgments spreadsheet.

Data cleaning with OpenRefine

Before visualising or converting the data, some data cleaning had to take place. Data cleaning involves ensuring that consistent formatting is used across the dataset, there are no errors, and that the correct data is in the correct fields. To make it easier to clean the JCPC data, visualise potential issues more immediately, and ensure that any changes I make are consistent across the dataset, I'm using OpenRefine. This is free software that works in your web browser (but doesn't require a connection to the internet), which allows you to filter and facet your data based on values in particular columns, and batch edit multiple cells. Although it can be less efficient for mathematical functions than spreadsheet software, it is definitely more powerful for cleaning large datasets that mostly consist of text fields, like the JCPC spreadsheet.
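For readers who prefer scripting, the same kinds of check can be sketched in a few lines of Python with pandas. This is an illustration rather than the project's actual workflow: the file name is a placeholder, and the column names are those listed above.

```python
# Minimal sketch (not the project's actual workflow) of the checks that
# OpenRefine's facets make easy. File name is a placeholder; column names
# are those described in the post.
import pandas as pd

df = pd.read_excel("jcpc_judgments.xlsx")  # placeholder file name

# A text facet is roughly a value count: it surfaces near-duplicate
# spellings and stray whitespace at a glance.
print(df["Country of Origin"].value_counts())

# A batch edit: trim whitespace across the whole column in one pass.
df["Country of Origin"] = df["Country of Origin"].str.strip()

# Flag rows with a missing judgment date for manual checking.
print(df[df["Judgment Date"].isna()][["Judgment No.", "Appellant", "Respondent"]])
```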

Geographic challenges

Before visualising judgments on a map, I first looked at the 'Country of Origin' column. This column should more accurately be referred to as 'Location', as many of the entries were actually regions, cities or courts, rather than countries. To make this information more meaningful, and to allow comparison across countries (e.g. where previously only the city was included), I created additional columns for 'Region', 'City' and 'Court', and populated the data accordingly:

The new 'Region', 'City' and 'Court' columns alongside the original 'Country of Origin'.

An important factor to bear in mind here is that place names relate to their judgment date, as well as geographical area. Many of the locations previously formed part of British colonies that have since become independent, with the result that names and boundaries have changed over time. Therefore, I had to be sensitive to each location's historical and political context and ensure that I was inputting e.g. the region and country that a city was in on each specific judgment date.
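The mechanical side of that work could be sketched as a lookup from each original location string to its constituent parts; the historically informed side (deciding which region and country were correct on a given judgment date) has no shortcut. The lookup entries below are invented purely for illustration.

```python
# Hypothetical sketch of expanding 'Country of Origin' into 'Court', 'City',
# 'Region' and 'Country' columns. The lookup values are invented examples;
# in practice each mapping was checked against the judgment date, because
# names and boundaries changed over the period covered by the data.
import pandas as pd

LOCATION_LOOKUP = {
    "High Court of Calcutta": {"Court": "High Court of Calcutta",
                               "City": "Calcutta",
                               "Region": "Bengal",
                               "Country": "India"},
    "Jamaica": {"Court": None, "City": None, "Region": None,
                "Country": "Jamaica"},
}

def expand_location(row):
    details = LOCATION_LOOKUP.get(row["Country of Origin"], {})
    for field in ("Court", "City", "Region", "Country"):
        row[field] = details.get(field)
    return row

df = pd.read_excel("jcpc_judgments.xlsx")  # placeholder file name
df = df.apply(expand_location, axis=1)
```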

In addition to the ‘Country of Origin’ field, the spreadsheet included latitude and longitude coordinates for each location. Following an excellent and very straightforward tutorial, I used these coordinates to create a map of all cases using Google Fusion Tables.

While this map shows the geographic distribution of JCPC cases, there are some issues. Firstly, multiple judgments (sometimes hundreds or thousands) originated from the same court, and therefore have the same latitude and longitude coordinates. This means that on the map they appear exactly on top of each other and it's only possible to view the details of the top 'pin', no matter how far you zoom in. As noted in a previous blog post, a map like this is already used by the Institute of Advanced Legal Studies (IALS); however, as it is being used here to display a curated subset of judgments, the issue of multiple judgments per location does not apply. Secondly, it only includes modern place names, which it does not seem to be possible to remove.
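One possible workaround for the stacked-pin problem (not something used here) is marker clustering, where co-located points are grouped and expand as you zoom in. Below is a rough sketch with the folium library, reusing the spreadsheet columns described above; the file name is a placeholder.

```python
# Rough sketch, not part of the original project: cluster co-located
# judgments so that points stacked on the same court expand on zoom.
import folium
from folium.plugins import MarkerCluster
import pandas as pd

df = pd.read_excel("jcpc_judgments.xlsx")  # placeholder file name

m = folium.Map(location=[20, 0], zoom_start=2)
cluster = MarkerCluster().add_to(m)

for _, row in df.dropna(subset=["Latitude", "Longitude"]).iterrows():
    folium.Marker(
        location=[row["Latitude"], row["Longitude"]],
        popup=f"{row['Appellant']} v {row['Respondent']} ({row['Judgment Date']})",
    ).add_to(cluster)

m.save("jcpc_map.html")  # open in a browser to explore the clusters
```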

I then tried using Tableau Public to see if it could be used to visualise the data in a more accurate way. After following a tutorial, I produced a map that used the updated ‘Country’ field (with the latitude and longitude detected by Tableau) to show each country where judgments originated. These are colour coded in a ‘heatmap’ style, where ‘hotter’ colours like red represent a higher number of cases than ‘colder’ colours such as blue.

This map is a good indicator of the relative number of judgments that originated in each country. However, Tableau (understandably and unsurprisingly) uses the modern coordinates for these countries, and therefore does not accurately reflect their geographical extent when the judgments took place (e.g. the geographical area represented by ‘India’ in much of the dataset was considerably larger than the geographical area we know as India today). Additionally, much of the nuance in the colour coding is lost because the number of judgments originating from India (3,604, or 41.4%) is far greater than that from any other country. This is illustrated by a pie chart created using Google Fusion Tables.

Using Tableau again, I thought it would also be helpful to go to the level of detail provided by the latitude and longitude already included in the dataset. This produced a map that is more attractive and informative than the Google Fusion Tables example, in terms of the number of judgments from each set of coordinates.

The main issue with this map is that it still doesn't provide a way in to the data. There are 'info boxes' that appear when you hover over a dot, but these can be misleading as they contain combined information from multiple cases, e.g. if one of the cases includes a court, this court is included in the info box as if it applies to all the cases at that point. Ideally what I'd like here would be for each info box to link to a list of cases that originated at the relevant location, including their judgment number and year, to facilitate ordering and retrieval of the physical copy at the British Library. Additionally, each judgment would link to the digitised documents for that case held by the British and Irish Legal Information Institute (BAILII). However, this is unlikely to be the kind of functionality Tableau was designed for - it seems to be more for overarching visualisations than to be used as a discovery tool.

The above maps are interesting and provide a strong visual overview that cannot be gained from looking at a spreadsheet. However, they would not assist users in accessing further information about the judgments, and do not accurately reflect the changing nature of the geography during this period.

Dealing with dates

Another potentially interesting aspect to visualise was case duration. It was already known prior to the start of the placement that some cases were disputed for years, or even decades; however, there was no information about how representative these cases were of the collection as a whole, or how duration might relate to other factors, such as location (e.g. is there a correlation between case duration and  distance from the JCPC headquarters in London? Might duration also correlate with the size and complexity of the printed record of proceedings contained in the volumes of case papers?).

The dataset includes a Judgment Date for each judgment, with some cases additionally including an Appeal Date (which started to be recorded consistently in the underlying spreadsheet from 1913). Although the Judgment Date shows the exact day of the judgment, the Appeal Date only gives the year of the appeal. This means that we can calculate the case duration to an approximate number of years by subtracting the year of appeal from the year of judgment.

Again, some data cleaning was required before making this calculation or visualising the information. Dates had previously been recorded in the spreadsheet in a variety of formats, and I used OpenRefine to ensure that all dates appeared in the form YYYY-MM-DD:

Judgment dates normalised to YYYY-MM-DD format.
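The same normalisation could be sketched in pandas; the `dayfirst` argument below reflects an assumption that ambiguous dates in the source were written day-first, and the file name is a placeholder.

```python
# Sketch of normalising mixed date formats to YYYY-MM-DD, mirroring the
# clean-up done in OpenRefine. Assumes ambiguous dates are day-first.
import pandas as pd

df = pd.read_excel("jcpc_judgments.xlsx")  # placeholder file name

parsed = pd.to_datetime(df["Judgment Date"], dayfirst=True, errors="coerce")

# Inspect anything that could not be parsed before overwriting it.
print(df.loc[parsed.isna(), ["Judgment No.", "Judgment Date"]])

df["Judgment Date"] = parsed.dt.strftime("%Y-%m-%d")
```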


It was then relatively easy to copy the year from each date to a new ‘Judgment Year’ column, and subtract the ‘Appeal Year’ to give the approximate case duration. Performing this calculation was quite helpful in itself, because it highlighted errors in some of the dates that were not found through format checking. Where the case duration seemed surprisingly long, or had a negative value, I looked up the original documents for the case and amended the date(s) accordingly.
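A hedged sketch of that calculation is below, including the kind of sanity check that surfaced the date errors; the 30-year threshold is an arbitrary choice for illustration, and 'Appeal Date' is assumed to hold just the year of appeal.

```python
# Sketch of the year-extraction and duration calculation described above.
import pandas as pd

df = pd.read_excel("jcpc_judgments.xlsx")  # placeholder file name

df["Judgment Year"] = pd.to_datetime(df["Judgment Date"]).dt.year
df["Appeal Year"] = pd.to_numeric(df["Appeal Date"], errors="coerce")
df["Duration (years)"] = df["Judgment Year"] - df["Appeal Year"]

# Negative or implausibly long durations point to a date worth checking
# against the original case papers (30 years is an arbitrary threshold).
suspect = (df["Duration (years)"] < 0) | (df["Duration (years)"] > 30)
print(df.loc[suspect, ["Judgment No.", "Appeal Year", "Judgment Year"]])
```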

Once the above tasks were complete, I created a bar chart in Google Fusion Tables to visualise case duration – the horizontal axis represents the approximate number of years between the appeal and judgment dates (e.g. if the value is 0, the appeal was decided in the same year that it was registered in the JCPC), and the vertical axis represents the number of cases:

Bar chart of approximate case duration (years) against number of cases.

This chart clearly shows that the vast majority of cases were up to two years in length, although this will also potentially include appeals of a short duration registered at the end of one year and concluded at the start of the next. A few took much longer, but are difficult to see due to the scale necessary to accommodate the longest bars. While this is a useful way to find particularly long cases, the information is incomplete and approximate, and so the maps would potentially be more helpful to a wider audience.
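One way to keep those few long cases visible alongside the much taller short-duration bars would be a logarithmic vertical axis. A rough matplotlib sketch, assuming the duration column computed above and a placeholder file name:

```python
# Rough sketch: plot case counts by approximate duration with a log-scaled
# y-axis, so the handful of very long cases remain visible.
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_excel("jcpc_judgments.xlsx")  # placeholder file name
counts = df["Duration (years)"].value_counts().sort_index()

plt.bar(counts.index, counts.values)
plt.yscale("log")
plt.xlabel("Approximate case duration (years)")
plt.ylabel("Number of cases (log scale)")
plt.title("JCPC appeals by approximate duration")
plt.savefig("case_duration.png", dpi=150)
```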

Experimenting with different visualisations and tools has given me a better understanding of what makes a visualisation helpful, as well as considerations that must be made when visualising the JCPC data. I hope to build on this work by trying out some more tools, such as the Google Maps API, but my next post will focus on another aspect of my placement – conversion of the JCPC data to Linked Data.

This post is by Sarah Middle, a PhD placement student at the British Library researching the appeal cases heard by the Judicial Committee of the Privy Council (JCPC).  Sarah is on twitter as @digitalshrew.    

18 December 2017

Workshop report: Identifiers for UK theses

Along with the Universities of Southampton and London South Bank, EThOS and DataCite UK have been investigating whether having persistent identifiers (PIDs) for both a thesis and its data would help to liberate data from the appendices of the PDF document. With some funding from Jisc in 2014, we ran a survey and some case studies looking at the state of linking to research data underlying theses, to see where improvements could be made. Since then, there has been some slow but steady progress towards realising the recommendations of that work. Identifiers are now visible in EThOS itself (see image below) and a small number of UK institutions are now assigning Digital Object Identifiers (DOIs) to their theses on a regular basis. Many more are implementing ORCID iDs for their postgraduate students. We wanted to reignite the conversation around unlocking thesis data and see what was needed to progress it further.

EThOS record from the University of Cambridge showing persistent identifiers.

On 4th December 2017, we ran a workshop to hear what progress is being made and what the remaining barriers are to applying persistent identifiers to theses and thesis data. We heard from the University of Cambridge and the London School of Hygiene and Tropical Medicine, both of which are assigning DOIs to published theses on a regular basis. They gave an outline of how they got to this point, including the case made within each university to ensure DOIs were available for theses.

As institutions start to identify their theses with DOIs, we need to ensure that these identifiers are picked up and usable in EThOS. Heather Rosie (EThOS Metadata Manager) explained how the lack of any consistent identifier for theses up to this point hinders disambiguation – due to errors in titles and different representations of author names, we simply do not know how many theses have been published in the UK. But Heather also highlighted what institutions can do to help ensure any available identifiers make their way into EThOS - by making sure they are available for harvest, especially via OAI-PMH.
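As an illustration of what 'available for harvest' can mean in practice, the sketch below requests Dublin Core records from a repository's OAI-PMH interface and prints any DOIs found in the identifier fields. The endpoint URL and set name are placeholders, not real services mentioned in this post.

```python
# Illustrative sketch only: harvest Dublin Core records over OAI-PMH and
# print any DOIs exposed in dc:identifier. Endpoint and set are placeholders.
import requests
import xml.etree.ElementTree as ET

OAI_ENDPOINT = "https://repository.example.ac.uk/oai"  # placeholder
NAMESPACES = {"dc": "http://purl.org/dc/elements/1.1/"}

response = requests.get(OAI_ENDPOINT, params={
    "verb": "ListRecords",
    "metadataPrefix": "oai_dc",
    "set": "theses",  # placeholder set spec
})
root = ET.fromstring(response.content)

for identifier in root.findall(".//dc:identifier", NAMESPACES):
    if identifier.text and "doi.org" in identifier.text:
        print(identifier.text)
```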

Based on the morning’s presentations there was broad discussion around the remaining issues that institutions still have in applying their DOIs or ORCIDs to their published theses. These included barriers such as:

  • Low priority due to a lack of buy-in or interest from both researchers and institutional decision-makers. Interest could be increased by improving understanding of what PIDs are and what they can do, particularly the tangible benefits they provide
  • A single institution may use multiple systems to manage different pieces of information about its researchers and their outputs. This creates internally competing systems that overlap, uneven resourcing, and a lack of clarity about what details go where
  • Further technical barriers include having to rely on the suppliers of non-open-source systems to make the appropriate changes. Where plug-ins for even open-source systems are developed at a single institution, the associated workflow might not be appropriate for other users. Finally, technical support teams tend to be removed from library staff
  • Sustainability of using the identifiers, especially in terms of cost.

The second half of the workshop looked towards both the future and the past: whether the British Library digitising its large collection of legacy theses held on microfilm might be a way not only to make them available to users, but also to ensure they are digitally preserved and assigned persistent identifiers. Paul Joseph from the University of British Columbia (UBC) gave us a great example to consider here: they have digitised 32,000 theses (both doctoral and masters level) and made them openly available through their repository, assigning DOIs as they did so. A major concern for UK universities undertaking a similar endeavour is the inability to confirm that third-party rights have been cleared in the thesis. It was therefore interesting to hear that, under their clear take-down policy, UBC receive only 2-3 take-down notices per year.

The final discussions of the day covered community needs for the future. This included two topics carried over from the morning’s session: how we make the case for wider application of identifiers to theses to researchers and senior management, and what can be done to make technical integration and workflow changes possible or easier. We also dug down into the other persistent identifiers related to theses that would support the needs of the UK community (such as organisation identifiers and funding identifiers), the potential for the Library to mass-digitise theses and assign DOIs to them, and the other steps that can be taken to break data out of the thesis.

Through these discussions we got a strong steer as to what we at the British Library need to do to help support the community in using persistent identifiers as a way of encouraging greater availability of doctoral research. These include providing:

  • more advocacy for PIDs – for example to students & research managers. We heard that a message from BL goes a long way – ‘we have to ask you to claim an ORCID iD because the British Library says so’, or ‘DOIs are needed because national thesis policy says so’
  • metadata guidance for libraries. What we already provide is great but we could do more of it, e.g. best practice examples, support desk, engage with system suppliers on behalf of institutions
  • preservation of digital theses. This is urgently needed
  • a big piece of IPR work to give institutions the confidence to make legacy theses open access without express permission, including a press campaign to drive interest & support.

But it is not only the Library that attendees thought may influence developments. There was also a clear appetite for stronger mandates from funders to support the deposit of open theses and reduction of embargo periods. There was also interest in national-level activities such as a national strategy for UK theses or a Scholarly Communication Licence for theses.

It’s clear there’s still a lot to be done before we’re at a stage where we can rely on persistent identifiers to help us jail-break research data out of thesis appendices. But we’ll continue to work with the community on this through EThOS and DataCite UK. We hope to hold a webinar in 2018 to talk more about the outcomes of this workshop, but in the meantime you can direct any questions on this work to datasets@bl.uk.

This post is by Rachael Kotarski, the British Library's Data Services Lead, on twitter as @RachPK.