21 December 2017
Cleaning and Visualising Privy Council Appeals Data
This blog post continues a recent post on the Social Sciences blog about the historical context of the Judicial Committee of the Privy Council (JCPC), useful collections to support research and online resources that facilitate discovery of JCPC appeal cases.
I am currently undertaking a three-month PhD student placement at the British Library, which aims to enhance the discoverability of the JCPC collection of case papers and explore the potential of Digital Humanities methods for investigating questions about the court’s caseload and its actors. Two of the methods I’ll be using are creating visualisations to represent data about these judgments and converting this data to Linked Data. In today’s post, I’ll focus on the process of cleaning the data and creating some initial visualisations; information about Linked Data conversion will appear in a later post.
The data I’m using refers to appeal cases that took place between 1860 and 1998. When I received the data, it was held in a spreadsheet where information such as ‘Judgment No.’, ‘Appellant’, ‘Respondent’, ‘Country of Origin’, ‘Judgment Date’ had been input from Word documents containing judgment metadata. This had been enhanced by generating a ‘Unique Identifier’ for each case by combining the judgment year and number, adding the ‘Appeal No.’ and ‘Appeal Date’ (where available) by consulting the judgment documents, and finding the ‘Longitude’ and ‘Latitude’ for each ‘Country of Origin’. The first few rows looked like this:
Data cleaning with OpenRefine
Before visualising or converting the data, some data cleaning had to take place. Data cleaning involves ensuring that consistent formatting is used across the dataset, that there are no errors, and that the correct data is in the correct fields. To make it easier to clean the JCPC data, spot potential issues more quickly, and ensure that any changes I make are consistent across the dataset, I'm using OpenRefine. This is free software that works in your web browser (but doesn't require a connection to the internet), which allows you to filter and facet your data based on values in particular columns, and batch edit multiple cells. Although it can be less efficient for mathematical functions than spreadsheet software, it is definitely more powerful for cleaning large datasets that mostly consist of text fields, like the JCPC spreadsheet.
Before visualising judgments on a map, I first looked at the 'Country of Origin' column. This column should more accurately be referred to as 'Location', as many of the entries were actually regions, cities or courts, instead of countries. To make this information more meaningful, and to allow comparison across countries e.g. where previously only the city was included, I created additional columns for 'Region', 'City' and 'Court', and populated the data accordingly:
An important factor to bear in mind here is that place names relate to their judgment date, as well as geographical area. Many of the locations previously formed part of British colonies that have since become independent, with the result that names and boundaries have changed over time. Therefore, I had to be sensitive to each location's historical and political context and ensure that I was inputting e.g. the region and country that a city was in on each specific judgment date.
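As a rough illustration of this date-dependent lookup (not part of the OpenRefine workflow described here), the idea can be sketched in Python. The table `HISTORICAL_COUNTRY` and the function `country_on` are hypothetical names, and the single Lahore entry is a simplified example with an approximate cut-off date rather than data taken from the JCPC spreadsheet.

```python
from datetime import date

# Hypothetical lookup table. Each tuple is (valid_from, valid_to, country);
# None means the range is open-ended. The Lahore entry is a simplification:
# the city was in British India before Pakistan's independence in August 1947.
HISTORICAL_COUNTRY = {
    "Lahore": [
        (None, date(1947, 8, 13), "India"),
        (date(1947, 8, 14), None, "Pakistan"),
    ],
}

def country_on(city, judgment_date):
    """Return the country a city belonged to on the given judgment date."""
    for start, end, country in HISTORICAL_COUNTRY.get(city, []):
        after_start = start is None or start <= judgment_date
        before_end = end is None or judgment_date <= end
        if after_start and before_end:
            return country
    return None  # unknown city, or no range covers the date
```

A real table would need one set of date ranges per location, checked against historical sources, which is why this step calls for manual judgement rather than a purely automatic rule.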
In addition to the ‘Country of Origin’ field, the spreadsheet included latitude and longitude coordinates for each location. Following an excellent and very straightforward tutorial, I used these coordinates to create a map of all cases using Google Fusion Tables:
While this map shows the geographic distribution of JCPC cases, there are some issues. Firstly, multiple judgments (sometimes hundreds or thousands) originated from the same court, and therefore have the same latitude and longitude coordinates. This means that on the map they appear exactly on top of each other and it's only possible to view the details of the top 'pin', no matter how far you zoom in. As noted in a previous blog post, a map like this is already used by the Institute of Advanced Legal Studies (IALS); however, as it is being used here to display a curated subset of judgments, the issue of multiple judgments per location does not apply. Secondly, it only includes modern place names, which it does not seem to be possible to remove.
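The overlapping-pin problem can be quantified by counting how many judgments share each pair of coordinates. The following is a minimal sketch in Python, assuming rows of the form (identifier, latitude, longitude); the identifiers and coordinates are invented for illustration and do not come from the JCPC data.

```python
from collections import Counter

# Hypothetical rows: (unique identifier, latitude, longitude).
rows = [
    ("1913_1", 22.57, 88.36),
    ("1913_2", 22.57, 88.36),
    ("1914_5", 22.57, 88.36),
    ("1920_7", -33.87, 151.21),
]

def judgments_per_point(rows):
    """Count how many judgments share each (latitude, longitude) pair."""
    return Counter((lat, lon) for _id, lat, lon in rows)

counts = judgments_per_point(rows)

# Any point with more than one judgment will show only its top pin on the map.
stacked = {point: n for point, n in counts.items() if n > 1}
```

A count like this is also what a map needs if each location is to be drawn once, with the number of judgments shown by size or colour instead of stacked pins.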
I then tried using Tableau Public to see if it could be used to visualise the data in a more accurate way. After following a tutorial, I produced a map that used the updated ‘Country’ field (with the latitude and longitude detected by Tableau) to show each country where judgments originated. These are colour coded in a ‘heatmap’ style, where ‘hotter’ colours like red represent a higher number of cases than ‘colder’ colours such as blue.
This map is a good indicator of the relative number of judgments that originated in each country. However, Tableau (understandably and unsurprisingly) uses the modern coordinates for these countries, and therefore does not accurately reflect their geographical extent when the judgments took place (e.g. the geographical area represented by ‘India’ in much of the dataset was considerably larger than the geographical area we know as India today). Additionally, much of the nuance in the colour coding is lost because the number of judgments originating from India (3,604, or 41.4%) is far greater than that from any other country. This is illustrated by a pie chart created using Google Fusion Tables:
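The percentages behind a pie chart like this are simply each country's count divided by the total. Here is a minimal Python sketch of that arithmetic, using a small invented sample rather than the real figures; `country_shares` is a hypothetical helper name.

```python
from collections import Counter

def country_shares(countries):
    """Return {country: percentage of all judgments}, to one decimal place."""
    counts = Counter(countries)
    total = sum(counts.values())
    return {country: round(100 * n / total, 1) for country, n in counts.items()}

# Invented sample: one 'Country' value per judgment.
sample = ["India"] * 5 + ["Canada"] * 3 + ["Australia"] * 2
shares = country_shares(sample)
```

When one country dominates to this extent, a linear colour scale leaves every other country at the 'cold' end, which is the loss of nuance described above.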
Using Tableau again, I thought it would also be helpful to go to the level of detail provided by the latitude and longitude already included in the dataset. This produced a map that is more attractive and informative than the Google Fusion Tables example, in terms of the number of judgments from each set of coordinates.
The main issue with this map is that it still doesn't provide a way in to the data. There are 'info boxes' that appear when you hover over a dot, but these can be misleading as they contain combined information from multiple cases, e.g. if one of the cases includes a court, this court is included in the info box as if it applies to all the cases at that point. Ideally what I'd like here would be for each info box to link to a list of cases that originated at the relevant location, including their judgment number and year, to facilitate ordering and retrieval of the physical copy at the British Library. Additionally, each judgment would link to the digitised documents for that case held by the British and Irish Legal Information Institute (BAILII). However, this is unlikely to be the kind of functionality Tableau was designed for - it seems to be more for overarching visualisations than to be used as a discovery tool.
The above maps are interesting and provide a strong visual overview that cannot be gained from looking at a spreadsheet. However, they would not assist users in accessing further information about the judgments, and do not accurately reflect the changing nature of the geography during this period.
Dealing with dates
Another potentially interesting aspect to visualise was case duration. It was already known prior to the start of the placement that some cases were disputed for years, or even decades; however, there was no information about how representative these cases were of the collection as a whole, or how duration might relate to other factors, such as location (e.g. is there a correlation between case duration and distance from the JCPC headquarters in London? Might duration also correlate with the size and complexity of the printed record of proceedings contained in the volumes of case papers?).
The dataset includes a Judgment Date for each judgment, with some cases additionally including an Appeal Date (which started to be recorded consistently in the underlying spreadsheet from 1913). Although the Judgment Date shows the exact day of the judgment, the Appeal Date only gives the year of the appeal. This means that we can calculate the case duration to an approximate number of years by subtracting the year of appeal from the year of judgment.
Again, some data cleaning was required before making this calculation or visualising the information. Dates had previously been recorded in the spreadsheet in a variety of formats, and I used OpenRefine to ensure that all dates appeared in the form YYYY-MM-DD:
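The normalisation itself was done in OpenRefine. For readers who prefer a script, the same idea can be sketched in Python by trying a list of candidate formats in turn. The formats listed here are assumptions, since the post does not say which ones actually occurred, and `to_iso_date` is a hypothetical helper name.

```python
from datetime import datetime

# Assumed input formats, tried in order. Month names (%B, %b) are matched
# using the current locale, so this expects English month names.
CANDIDATE_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%d %B %Y", "%d %b %Y"]

def to_iso_date(text):
    """Return the date as YYYY-MM-DD, or None if no assumed format matches."""
    cleaned = text.strip()
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date().isoformat()
        except ValueError:
            continue  # try the next candidate format
    return None
```

Returning `None` for anything unrecognised, rather than guessing, leaves a clear list of values that still need checking by hand.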
It was then relatively easy to copy the year from each date to a new ‘Judgment Year’ column, and subtract the ‘Appeal Year’ to give the approximate case duration. Performing this calculation was quite helpful in itself, because it highlighted errors in some of the dates that were not found through format checking. Where the case duration seemed surprisingly long, or had a negative value, I looked up the original documents for the case and amended the date(s) accordingly.
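The calculation and the sanity check can be sketched in a few lines of Python. This assumes the judgment date is already in YYYY-MM-DD form and the appeal year is an integer; the function names and the `max_years` threshold are arbitrary choices for this sketch, not values used in the placement.

```python
def case_duration(judgment_date, appeal_year):
    """Approximate duration in years: judgment year minus appeal year."""
    judgment_year = int(judgment_date[:4])  # expects a YYYY-MM-DD string
    return judgment_year - appeal_year

def needs_checking(duration, max_years=10):
    """Flag negative or unusually long durations for manual review.

    max_years is an arbitrary threshold chosen for this sketch.
    """
    return duration < 0 or duration > max_years
```

For example, `case_duration("1925-07-14", 1923)` gives 2, while a result of -1 or 25 would be flagged so that the original documents can be consulted.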
Once the above tasks were complete, I created a bar chart in Google Fusion Tables to visualise case duration – the horizontal axis represents the approximate number of years between the appeal and judgment dates (e.g. if the value is 0, the appeal was decided in the same year that it was registered in the JCPC), and the vertical axis represents the number of cases:
This chart clearly shows that the vast majority of cases were up to two years in length, although this will also potentially include appeals of a short duration registered at the end of one year and concluded at the start of the next. A few took much longer, but are difficult to see due to the scale necessary to accommodate the longest bars. While this is a useful way to find particularly long cases, the information is incomplete and approximate, and so the maps would potentially be more helpful to a wider audience.
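The tallies behind a bar chart like this can be reproduced outside Google Fusion Tables. The sketch below, in Python, counts cases per duration and prints a plain-text bar for each; the sample durations are invented for illustration and `duration_tally` is a hypothetical helper name.

```python
from collections import Counter

def duration_tally(durations):
    """Return (years, number_of_cases) pairs sorted by duration."""
    return sorted(Counter(durations).items())

# Invented durations in years, one per case.
sample = [0, 1, 1, 2, 1, 0, 14]
tally = duration_tally(sample)

# Print one row per duration, with one '#' per case.
for years, cases in tally:
    print(f"{years:>3} | {'#' * cases}")
```

Listing the tallies as numbers alongside the chart is one way to keep the rare, very long cases visible when their bars are too short to see.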
Experimenting with different visualisations and tools has given me a better understanding of what makes a visualisation helpful, as well as considerations that must be made when visualising the JCPC data. I hope to build on this work by trying out some more tools, such as the Google Maps API, but my next post will focus on another aspect of my placement – conversion of the JCPC data to Linked Data.
This post is by Sarah Middle, a PhD placement student at the British Library researching the appeal cases heard by the Judicial Committee of the Privy Council (JCPC). Sarah is on Twitter as @digitalshrew.