02 February 2018
Converting Privy Council Appeals Metadata to Linked Data
To continue the series of posts on metadata about appeals to the Judicial Committee of the Privy Council, this post describes the process of converting this data to Linked Data. In the previous post, I briefly explained the concept of Linked Data and outlined the potential benefits of applying this approach to the JCPC dataset. An earlier post explained how cleaning the data enabled me to produce some initial visualisations; a post on the Social Science blog provides some historical context about the JCPC itself.
In my previous post, I included the following diagram to show how the Linked JCPC Data might be structured.
To convert the dataset to Linked Data using this model, each entity represented by a blue node, and each class and property represented by the purple and green nodes need a unique identifier known as a Uniform Resource Indicator (URI). For the entities, I generated these URIs myself based on guidelines provided by the British Library, using the following structure:
In the above URIs, the ‘...’ is replaced by a unique reference to a particular appeal, judgment, or location, e.g. a combination of the judgment number and year.
To ensure that the data can easily be understood by a computer and linked to other datasets, the classes and properties should be represented by existing URIs from established ontologies. An ontology is a controlled vocabulary (like a thesaurus) that not only defines terms relating to a subject area, but also defines the relationships between those terms. Generic properties and classes, such as titles, dates, names and locations, can be represented by established ontologies like Dublin Core, Friend of a Friend (FOAF) and vCard.
After considerable searching I was unable to find any online ontologies that precisely represent the legal concepts in the JCPC dataset. Instead, I decided to use relevant terms from Wikidata, where available, and to create terms in a new JCPC ontology for those entities and concepts not defined elsewhere. Taking this approach allowed me to concentrate my efforts on the process of conversion, but the possibility remains to align these terms with appropriate legal ontologies in future.
An updated version of the data model shows the ontology terms used for classes and properties (purple and green boxes):
Rather than include the full URI for each property or class, the first part of the URI is represented by a prefix, e.g. ‘foaf’, which is followed by the specific term, e.g. ‘name’, separated by a colon.
More Data Cleaning
The data model diagram also helped identify fields in the spreadsheet that required further cleaning before conversion could take place. This cleaning largely involved editing the Appellant and Respondent fields to separate multiple parties that originally appeared in the same cell and to move descriptive information to the Appellant/Respondent Description column. For those parties whose names were identical, I additionally checked the details of the case to determine whether they were in fact the same person appearing in multiple appeals/judgments.
Reconciliation is the process of aligning identifiers for entities in one dataset with the identifiers for those entities in another dataset. If these entities are connected using Linked Data, this process implicitly links all the information about the entity in one dataset to the entity in the other dataset. For example, one of the people in the JCPC dataset is H. G. Wells – if we link the JCPC instance of H. G. Wells to his Wikidata identifier, this will then facilitate access to further information about H. G. Wells from Wikidata:
Rather than look up each of these entities manually, I used a reconciliation service provided by OpenRefine, a piece of software I used previously for cleaning the JCPC data. The reconciliation service automatically looks up each value in a particular column from an external source (e.g. an authority file) specified by the user. For each value, it either provides a definite match or a selection of possible matches to choose from. Consultant and OpenRefine guru Owen Stephens has put together a couple of really helpful screencasts on reconciliation.
While reconciliation is very clever, it still requires some human intervention to ensure accuracy. The reconciliation service will match entities with similar names, but they might not necessarily refer to exactly the same thing. As we know, many people have the same name, and the same place names appear in multiple locations all over the world. I therefore had to check all matches that OpenRefine said were ‘definite’, and discard those that matched the name but referred to an incorrect entity.
I initially looked for a suitable gazetteer or authority file to which I could link the various case locations. My first port of call was Geonames, the standard authority file for linking location data. This was encouraging, as it does include alternative and historical place names for modern places. However, it doesn't contain any additional information about the dates for which each name was valid, or the geographical boundaries of the place at different times (the historical/political nature of the geography of this period was highlighted in a previous post). I additionally looked for openly-available digital gazetteers for the relevant historical period (1860-1998), but unfortunately none yet seem to exist. However, I have recently become aware of the University of Pittsburgh’s World Historical Gazetteer project, and will watch its progress with interest. For now, Geonames seems like the best option, while being aware of its limitations.
Although there have been attempts to create standard URIs for courts, there doesn’t yet seem to be a suitable authority file to which I could reconcile the JCPC data. Instead, I decided to use the Virtual International Authority File (VIAF), which combines authority files from libraries all over the world. Matches were found for most of the courts contained in the dataset.
For the parties involved in the cases, I initially also used VIAF, which resulted in few definite matches. I therefore additionally decided to reconcile Appellant, Respondent, Intervenant and Third Party data to Wikidata. This was far more successful than VIAF, resulting in a combined total of about 200 matches. As a result, I was able to identify cases involving H. G. Wells, Bob Marley, and Frederick Deeming, one of the prime suspects for the Jack the Ripper murders. Due to time constraints, I was only able to check those matches identified as ‘definite’; more could potentially be found by looking at each party individually and selecting any appropriate matches from the list of possible options.
Once the entities were separated from each other and reconciled to external sources (where possible), the data was ready to convert to Linked Data. I did this using LODRefine, a version of OpenRefine packaged with plugins for producing Linked Data. LODRefine converts an OpenRefine project to Linked Data based on an ‘RDF skeleton’ specified by the user. RDF stands for Resource Description Framework, and is the standard by which Linked Data is represented. It describes each relationship in the dataset as a triple, comprising a subject, predicate and object. The subject is the entity you’re describing, the object is either a piece of information about that entity or another entity, and the predicate is the relationship between the two. For example, in the data model diagram we have the following relationship:
This is a triple, where the URI for the Appeal is the subject, the URI dc:title (the property ‘title’ in the Dublin Core terms vocabulary) is the predicate, and the value of the Appeal Title column is the object. I expressed each of the relationships in the data model as a triple like this one in LODRefine’s RDF skeleton. Once this was complete, it was simply a case of clicking LODRefine’s ‘Export’ button and selecting one of the available RDF formats. Having previously spent considerable time writing code to convert data to RDF, I was surprised and delighted by how quick and simple this process was.
The Linked Data version of the JCPC dataset is not yet available online as we’re currently going through the process of ascertaining the appropriate licence to publish it under. Once this is confirmed, the dataset will be available to download from data.bl.uk in both RDF/XML and Turtle formats.
The next post in this series will look at what can be done with the JCPC data following its conversion to Linked Data.
This post is by Sarah Middle, a PhD placement student at the British Library researching the appeal cases heard by the Judicial Committee of the Privy Council (JCPC). Sarah is on twitter as @digitalshrew.