THE BRITISH LIBRARY

Digital scholarship blog

9 posts categorized "Law"

02 February 2018

Converting Privy Council Appeals Metadata to Linked Data


To continue the series of posts on metadata about appeals to the Judicial Committee of the Privy Council, this post describes the process of converting this data to Linked Data. In the previous post, I briefly explained the concept of Linked Data and outlined the potential benefits of applying this approach to the JCPC dataset. An earlier post explained how cleaning the data enabled me to produce some initial visualisations; a post on the Social Science blog provides some historical context about the JCPC itself.

Data Model

In my previous post, I included the following diagram to show how the Linked JCPC Data might be structured.

JCPCDataModelHumanReadable_V1_20180104

To convert the dataset to Linked Data using this model, each entity represented by a blue node, and each class and property represented by the purple and green nodes, needs a unique identifier known as a Uniform Resource Identifier (URI). For the entities, I generated these URIs myself based on guidelines provided by the British Library, using the following structure:

  • http://data.bl.uk/jcpc/id/appeal/...
  • http://data.bl.uk/jcpc/id/judgment/...
  • http://data.bl.uk/jcpc/id/location/...

In the above URIs, the ‘...’ is replaced by a unique reference to a particular appeal, judgment, or location, e.g. a combination of the judgment number and year.
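As a rough illustration of this URI pattern, the following Python sketch builds entity URIs from a type and a local reference. The base path and the three entity types come from the list above; the function names and the 'year_number' format of the local reference are assumptions made for the example, not the scheme actually used.

```python
from urllib.parse import quote

# Base path taken from the URI structure listed above.
BASE = "http://data.bl.uk/jcpc/id"
ENTITY_TYPES = {"appeal", "judgment", "location"}


def judgment_reference(year: int, number: int) -> str:
    """Hypothetical local reference: judgment year and number joined by '_'."""
    return f"{year}_{number}"


def mint_uri(entity_type: str, local_reference: str) -> str:
    """Return BASE/<entity type>/<local reference>, percent-encoding the reference."""
    if entity_type not in ENTITY_TYPES:
        raise ValueError(f"unknown entity type: {entity_type}")
    return f"{BASE}/{entity_type}/{quote(local_reference, safe='')}"


print(mint_uri("judgment", judgment_reference(1913, 42)))
```

Percent-encoding the local reference keeps the URI valid even if a reference contains spaces or punctuation, as a place name might.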

To ensure that the data can easily be understood by a computer and linked to other datasets, the classes and properties should be represented by existing URIs from established ontologies. An ontology is a controlled vocabulary (like a thesaurus) that not only defines terms relating to a subject area, but also defines the relationships between those terms. Generic properties and classes, such as titles, dates, names and locations, can be represented by established ontologies like Dublin Core, Friend of a Friend (FOAF) and vCard.

After considerable searching I was unable to find any online ontologies that precisely represent the legal concepts in the JCPC dataset. Instead, I decided to use relevant terms from Wikidata, where available, and to create terms in a new JCPC ontology for those entities and concepts not defined elsewhere. Taking this approach allowed me to concentrate my efforts on the process of conversion, but the possibility remains to align these terms with appropriate legal ontologies in future.

An updated version of the data model shows the ontology terms used for classes and properties (purple and green boxes):

JCPCDataModel_V9_20180104

Rather than include the full URI for each property or class, the first part of the URI is represented by a prefix, e.g. ‘foaf’, which is followed by the specific term, e.g. ‘name’, separated by a colon.
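To make the prefix notation concrete, here is a small Python sketch that expands a prefixed term into its full URI. The FOAF and Dublin Core terms namespaces are the published ones; binding the prefix 'dc' to the Dublin Core terms namespace is an assumption based on how dc:title is described later in this post, and the function name is purely illustrative.

```python
# Prefix-to-namespace bindings. 'dc' is assumed to point at Dublin Core terms.
PREFIXES = {
    "foaf": "http://xmlns.com/foaf/0.1/",
    "dc": "http://purl.org/dc/terms/",
}


def expand(prefixed_term: str) -> str:
    """Expand e.g. 'foaf:name' into the full URI for that term."""
    prefix, separator, term = prefixed_term.partition(":")
    if not separator or prefix not in PREFIXES:
        raise ValueError(f"cannot expand: {prefixed_term}")
    return PREFIXES[prefix] + term


print(expand("foaf:name"))
print(expand("dc:title"))
```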

More Data Cleaning

The data model diagram also helped identify fields in the spreadsheet that required further cleaning before conversion could take place. This cleaning largely involved editing the Appellant and Respondent fields to separate multiple parties that originally appeared in the same cell and to move descriptive information to the Appellant/Respondent Description column. For those parties whose names were identical, I additionally checked the details of the case to determine whether they were in fact the same person appearing in multiple appeals/judgments.

Reconciliation

Reconciliation is the process of aligning identifiers for entities in one dataset with the identifiers for those entities in another dataset. If these entities are connected using Linked Data, this process implicitly links all the information about the entity in one dataset to the entity in the other dataset. For example, one of the people in the JCPC dataset is H. G. Wells – if we link the JCPC instance of H. G. Wells to his Wikidata identifier, this will then facilitate access to further information about H. G. Wells from Wikidata:

ReconciliationExample_V1_20180115

Rather than look up each of these entities manually, I used a reconciliation service provided by OpenRefine, a piece of software I used previously for cleaning the JCPC data. The reconciliation service automatically looks up each value in a particular column from an external source (e.g. an authority file) specified by the user. For each value, it either provides a definite match or a selection of possible matches to choose from. Consultant and OpenRefine guru Owen Stephens has put together a couple of really helpful screencasts on reconciliation.

While reconciliation is very clever, it still requires some human intervention to ensure accuracy. The reconciliation service will match entities with similar names, but they might not necessarily refer to exactly the same thing. As we know, many people have the same name, and the same place names appear in multiple locations all over the world. I therefore had to check all matches that OpenRefine said were ‘definite’, and discard those that matched the name but referred to an incorrect entity.

Locations

I initially looked for a suitable gazetteer or authority file to which I could link the various case locations. My first port of call was Geonames, the standard authority file for linking location data. This was encouraging, as it does include alternative and historical place names for modern places. However, it doesn't contain any additional information about the dates for which each name was valid, or the geographical boundaries of the place at different times (the historical/political nature of the geography of this period was highlighted in a previous post). I additionally looked for openly-available digital gazetteers for the relevant historical period (1860-1998), but unfortunately none yet seem to exist. However, I have recently become aware of the University of Pittsburgh’s World Historical Gazetteer project, and will watch its progress with interest. For now, Geonames seems like the best option, while being aware of its limitations.

Courts

Although there have been attempts to create standard URIs for courts, there doesn’t yet seem to be a suitable authority file to which I could reconcile the JCPC data. Instead, I decided to use the Virtual International Authority File (VIAF), which combines authority files from libraries all over the world. Matches were found for most of the courts contained in the dataset.

Parties

For the parties involved in the cases, I initially also used VIAF, which resulted in few definite matches. I therefore additionally decided to reconcile Appellant, Respondent, Intervenant and Third Party data to Wikidata. This was far more successful than VIAF, resulting in a combined total of about 200 matches. As a result, I was able to identify cases involving H. G. Wells, Bob Marley, and Frederick Deeming, one of the prime suspects for the Jack the Ripper murders. Due to time constraints, I was only able to check those matches identified as ‘definite’; more could potentially be found by looking at each party individually and selecting any appropriate matches from the list of possible options.

Conversion

Once the entities were separated from each other and reconciled to external sources (where possible), the data was ready to convert to Linked Data. I did this using LODRefine, a version of OpenRefine packaged with plugins for producing Linked Data. LODRefine converts an OpenRefine project to Linked Data based on an ‘RDF skeleton’ specified by the user. RDF stands for Resource Description Framework, and is the standard by which Linked Data is represented. It describes each relationship in the dataset as a triple, comprising a subject, predicate and object. The subject is the entity you’re describing, the object is either a piece of information about that entity or another entity, and the predicate is the relationship between the two. For example, in the data model diagram we have the following relationship:

  AppealTitleTriple_V1_20180108

This is a triple, where the URI for the Appeal is the subject, the URI dc:title (the property ‘title’ in the Dublin Core terms vocabulary) is the predicate, and the value of the Appeal Title column is the object. I expressed each of the relationships in the data model as a triple like this one in LODRefine’s RDF skeleton. Once this was complete, it was simply a case of clicking LODRefine’s ‘Export’ button and selecting one of the available RDF formats. Having previously spent considerable time writing code to convert data to RDF, I was surprised and delighted by how quick and simple this process was.
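For readers who have not seen a triple written out, the sketch below serialises one in the N-Triples style (subject, predicate and object on a single line, ending with a full stop). It is not what LODRefine does internally; the appeal identifier 'example' and the title 'A v B' are made up for illustration, and dc:title is assumed to expand to the Dublin Core terms namespace.

```python
def literal(value: str) -> str:
    """Quote a plain string literal, escaping backslashes, quotes and newlines."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return f'"{escaped}"'


def triple(subject_uri: str, predicate_uri: str, obj: str, obj_is_uri: bool = False) -> str:
    """Return one N-Triples line; the object is either a URI or a string literal."""
    object_term = f"<{obj}>" if obj_is_uri else literal(obj)
    return f"<{subject_uri}> <{predicate_uri}> {object_term} ."


# Hypothetical appeal URI and title, for illustration only.
print(triple(
    "http://data.bl.uk/jcpc/id/appeal/example",
    "http://purl.org/dc/terms/title",
    "A v B",
))
```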

Publication

The Linked Data version of the JCPC dataset is not yet available online as we’re currently going through the process of ascertaining the appropriate licence to publish it under. Once this is confirmed, the dataset will be available to download from data.bl.uk in both RDF/XML and Turtle formats.

The next post in this series will look at what can be done with the JCPC data following its conversion to Linked Data.

This post is by Sarah Middle, a PhD placement student at the British Library researching the appeal cases heard by the Judicial Committee of the Privy Council (JCPC).  Sarah is on twitter as @digitalshrew.   

31 January 2018

Linking Privy Council Appeals Data


This post continues a series of blog posts relating to a PhD placement project that seeks to make data about appeals heard by the Judicial Committee of the Privy Council (JCPC) available in new formats, to enhance discoverability, and to increase the potential for new historical and socio-legal research questions. Previous posts looked at the historical context of the JCPC and related online resources, as well as the process of cleaning the data and producing some initial visualisations.

When looking at the metadata about JCPC judgments between 1860 and 1998, it became clear to me that what was in fact being represented here was a network of appeals, judgments, courts, people, organisations and places. Holding this information in a spreadsheet can be extremely useful, as demonstrated by the visualisations created previously; however, this format does not accurately capture the sometimes complex relationships underlying these cases. As such, I felt that a network might be a more representative way of structuring the data, based on a Linked Data model.

Linked Data was first introduced by Tim Berners-Lee in 2006. It comprises a set of tools and techniques for connecting datasets based on features they have in common in a format that can be understood by computers. Structuring data in this way can have huge benefits for Humanities research, and has already been used in many projects – examples include linking ancient and historical texts based on the places mentioned within them (Pelagios) and bringing together information about people’s experiences of listening to music (Listening Experience Database). I decided to convert the JCPC data to Linked Data to make relationships between the entities contained within the dataset more apparent, as well as link to external sources, where available, to provide additional context to the judgment documents.

The image below shows how the fields from the JCPC spreadsheet might relate to each other in a Linked Data structure.

JCPCDataModelHumanReadable_V1_20180104

In this diagram:

  • Blue nodes represent distinct entities (specific instances of e.g. Judgment, Appellant, Location)
  • Purple nodes represent the classes that define these entities, i.e. what type of entity each blue node is (terms that represent the concepts of e.g. Judgment, Appellant, Location)
  • Green nodes represent properties that describe those entities (e.g. ‘is’, ‘has title’, ‘has date’)
  • Orange nodes represent the values of those properties (e.g. Appellant Name, Judgment Date, City)
  • Red nodes represent links to external sources that describe that entity

Using this network structure, I converted the JCPC data to Linked Data; the conversion process is outlined in detail in the next blog post in this series.

A major advantage of converting the JCPC data to Linked Data is the potential it provides for integration with other sources. This means that search queries can be conducted and visualisations can be produced that use the JCPC data in combination with one or more other datasets, such as those relating to a similar historical period, geographical area(s), or subject. Rather than these datasets existing in isolation from each other, connecting them could fill in gaps in the information and highlight new relationships involving appeals, judgments, locations or the parties involved. This could open up the possibilities for new research questions in legal history and beyond.

Linking the JCPC data will also allow new types of visualisation to be created, either by connecting it to other datasets, or on its own. One option is network visualisations, where the data is filtered based on various search criteria (e.g. by location, time period or names of people/organisations) and the results are displayed using the network structure shown above. Looking at the data as a network can demonstrate at a glance how the different components relate to each other, and could indicate interesting avenues for future research. In a later post in this series, I’ll look at some network visualisations created from the linked JCPC data, as well as what we can (and can’t) learn from them.

This post is by Sarah Middle, a PhD placement student at the British Library researching the appeal cases heard by the Judicial Committee of the Privy Council (JCPC).  Sarah is on twitter as @digitalshrew.    

21 December 2017

Cleaning and Visualising Privy Council Appeals Data


This blog post continues a recent post on the Social Sciences blog about the historical context of the Judicial Committee of the Privy Council (JCPC), useful collections to support research and online resources that facilitate discovery of JCPC appeal cases.

I am currently undertaking a three-month PhD student placement at the British Library, which aims to enhance the discoverability of the JCPC collection of case papers and explore the potential of Digital Humanities methods for investigating questions about the court’s caseload and its actors. The two methods that I’ll be using are creating visualisations to represent data about these judgments and converting this data to Linked Data. In today’s post, I’ll focus on the process of cleaning the data and creating some initial visualisations; information about Linked Data conversion will appear in a later post.

The data I’m using refers to appeal cases that took place between 1860 and 1998. When I received the data, it was held in a spreadsheet where information such as ‘Judgment No.’, ‘Appellant’, ‘Respondent’, ‘Country of Origin’, ‘Judgment Date’ had been input from Word documents containing judgment metadata. This had been enhanced by generating a ‘Unique Identifier’ for each case by combining the judgment year and number, adding the ‘Appeal No.’ and ‘Appeal Date’ (where available) by consulting the judgment documents, and finding the ‘Longitude’ and ‘Latitude’ for each ‘Country of Origin’. The first few rows looked like this:

Spreadsheet

Data cleaning with OpenRefine

Before visualising or converting the data, some data cleaning had to take place. Data cleaning involves ensuring that consistent formatting is used across the dataset, there are no errors, and that the correct data is in the correct fields. To make it easier to clean the JCPC data, visualise potential issues more immediately, and ensure that any changes I make are consistent across the dataset, I'm using OpenRefine. This is free software that works in your web browser (but doesn't require a connection to the internet), which allows you to filter and facet your data based on values in particular columns, and batch edit multiple cells. Although it can be less efficient for mathematical functions than spreadsheet software, it is definitely more powerful for cleaning large datasets that mostly consist of text fields, like the JCPC spreadsheet.

Geographic challenges

Before visualising judgments on a map, I first looked at the 'Country of Origin' column. This column should more accurately be referred to as 'Location', as many of the entries were actually regions, cities or courts, instead of countries. To make this information more meaningful, and to allow comparison across countries e.g. where previously only the city was included, I created additional columns for 'Region', 'City' and 'Court', and populated the data accordingly:

Country

An important factor to bear in mind here is that place names relate to their judgment date, as well as geographical area. Many of the locations previously formed part of British colonies that have since become independent, with the result that names and boundaries have changed over time. Therefore, I had to be sensitive to each location's historical and political context and ensure that I was inputting e.g. the region and country that a city was in on each specific judgment date.

In addition to the ‘Country of Origin’ field, the spreadsheet included latitude and longitude coordinates for each location. Following an excellent and very straightforward tutorial, I used these coordinates to create a map of all cases using Google Fusion Tables:

While this map shows the geographic distribution of JCPC cases, there are some issues. Firstly, multiple judgments (sometimes hundreds or thousands) originated from the same court, and therefore have the same latitude and longitude coordinates. This means that on the map they appear exactly on top of each other and it's only possible to view the details of the top 'pin', no matter how far you zoom in. As noted in a previous blog post, a map like this is already used by the Institute of Advanced Legal Studies (IALS); however, as it is being used there to display a curated subset of judgments, the issue of multiple judgments per location does not apply. Secondly, the map only shows modern place names, and there does not seem to be a way to remove them.

I then tried using Tableau Public to see if it could be used to visualise the data in a more accurate way. After following a tutorial, I produced a map that used the updated ‘Country’ field (with the latitude and longitude detected by Tableau) to show each country where judgments originated. These are colour coded in a ‘heatmap’ style, where ‘hotter’ colours like red represent a higher number of cases than ‘colder’ colours such as blue.

This map is a good indicator of the relative number of judgments that originated in each country. However, Tableau (understandably and unsurprisingly) uses the modern coordinates for these countries, and therefore does not accurately reflect their geographical extent when the judgments took place (e.g. the geographical area represented by ‘India’ in much of the dataset was considerably larger than the geographical area we know as India today). Additionally, much of the nuance in the colour coding is lost because the number of judgments originating from India (3,604, or 41.4%) is far greater than that from any other country. This is illustrated by a pie chart created using Google Fusion Tables:

Using Tableau again, I thought it would also be helpful to go to the level of detail provided by the latitude and longitude already included in the dataset. This produced a map that is more attractive and informative than the Google Fusion Tables example, in terms of the number of judgments from each set of coordinates.

The main issue with this map is that it still doesn't provide a way in to the data. There are 'info boxes' that appear when you hover over a dot, but these can be misleading as they contain combined information from multiple cases, e.g. if one of the cases includes a court, this court is included in the info box as if it applies to all the cases at that point. Ideally what I'd like here would be for each info box to link to a list of cases that originated at the relevant location, including their judgment number and year, to facilitate ordering and retrieval of the physical copy at the British Library. Additionally, each judgment would link to the digitised documents for that case held by the British and Irish Legal Information Institute (BAILII). However, this is unlikely to be the kind of functionality Tableau was designed for - it seems to be more for overarching visualisations than to be used as a discovery tool.

The above maps are interesting and provide a strong visual overview that cannot be gained from looking at a spreadsheet. However, they would not assist users in accessing further information about the judgments, and do not accurately reflect the changing nature of the geography during this period.

Dealing with dates

Another potentially interesting aspect to visualise was case duration. It was already known prior to the start of the placement that some cases were disputed for years, or even decades; however, there was no information about how representative these cases were of the collection as a whole, or how duration might relate to other factors, such as location (e.g. is there a correlation between case duration and  distance from the JCPC headquarters in London? Might duration also correlate with the size and complexity of the printed record of proceedings contained in the volumes of case papers?).

The dataset includes a Judgment Date for each judgment, with some cases additionally including an Appeal Date (which started to be recorded consistently in the underlying spreadsheet from 1913). Although the Judgment Date shows the exact day of the judgment, the Appeal Date only gives the year of the appeal. This means that we can calculate the case duration to an approximate number of years by subtracting the year of appeal from the year of judgment.

Again, some data cleaning was required before making this calculation or visualising the information. Dates had previously been recorded in the spreadsheet in a variety of formats, and I used OpenRefine to ensure that all dates appeared in the form YYYY-MM-DD:

Date
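The date clean-up described above was done in OpenRefine, but the same idea can be sketched in Python: try a list of known formats in turn and write out whichever one parses as YYYY-MM-DD. The list of input formats here is an assumption for illustration, as the post does not say which formats appeared in the spreadsheet.

```python
from datetime import datetime
from typing import Optional

# Hypothetical input formats; the real spreadsheet may have used others.
CANDIDATE_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%d %B %Y", "%d %b %Y"]


def to_iso(raw: str) -> Optional[str]:
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    text = raw.strip()
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue  # try the next format
    return None


for value in ["02/02/2018", "2 February 2018", "1913-07-21", "unknown"]:
    print(value, "->", to_iso(value))
```

Returning None rather than guessing leaves unparseable values visible, so they can be checked by hand against the original documents.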


It was then relatively easy to copy the year from each date to a new ‘Judgment Year’ column, and subtract the ‘Appeal Year’ to give the approximate case duration. Performing this calculation was quite helpful in itself, because it highlighted errors in some of the dates that were not found through format checking. Where the case duration seemed surprisingly long, or had a negative value, I looked up the original documents for the case and amended the date(s) accordingly.
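The calculation and the sanity check described above can be sketched as follows. The function names and the ten-year threshold for a 'surprisingly long' case are assumptions chosen for the example; the post does not state a specific cut-off.

```python
def case_duration(judgment_date: str, appeal_year: int) -> int:
    """Approximate duration in years: judgment year minus appeal year.

    judgment_date is expected in YYYY-MM-DD form, so the year is the first
    four characters.
    """
    judgment_year = int(judgment_date[:4])
    return judgment_year - appeal_year


def needs_review(duration: int, max_expected: int = 10) -> bool:
    """Flag negative or unusually long durations for a manual check.

    max_expected is a hypothetical threshold, not one taken from the project.
    """
    return duration < 0 or duration > max_expected


duration = case_duration("1925-03-17", 1923)
print(duration, needs_review(duration))
```

Because only the appeal year is known, a result of 0 or 1 can describe cases of quite different lengths, which is why the figure is only approximate.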

Once the above tasks were complete, I created a bar chart in Google Fusion Tables to visualise case duration – the horizontal axis represents the approximate number of years between the appeal and judgment dates (e.g. if the value is 0, the appeal was decided in the same year that it was registered in the JCPC), and the vertical axis represents the number of cases:

 

This chart clearly shows that the vast majority of cases were up to two years in length, although this will also potentially include appeals of a short duration registered at the end of one year and concluded at the start of the next. A few took much longer, but are difficult to see due to the scale necessary to accommodate the longest bars. While this is a useful way to find particularly long cases, the information is incomplete and approximate, and so the maps would potentially be more helpful to a wider audience.

Experimenting with different visualisations and tools has given me a better understanding of what makes a visualisation helpful, as well as considerations that must be made when visualising the JCPC data. I hope to build on this work by trying out some more tools, such as the Google Maps API, but my next post will focus on another aspect of my placement – conversion of the JCPC data to Linked Data.

This post is by Sarah Middle, a PhD placement student at the British Library researching the appeal cases heard by the Judicial Committee of the Privy Council (JCPC).  Sarah is on twitter as @digitalshrew.    

03 November 2016

SherlockNet update - 10s of millions more tags and thousands of captions added to the BL Flickr Images!


SherlockNet are Brian Do, Karen Wang and Luda Zhao, finalists for the Labs Competition 2016.

We have some exciting updates regarding SherlockNet, our ongoing effort to use machine learning techniques to radically improve the discoverability of the British Library Flickr Commons image dataset.

Tagging

Over the past two months we’ve been working on expanding and refining the set of tags assigned to each image. Initially, we set out simply to assign the images to one of 11 categories, which worked surprisingly well with less than a 20% error rate. But we realised that people usually search from a much larger set of words, and we spent a lot of time thinking about how we would assign more descriptive tags to each image.

Eventually, we settled on a Google Images style approach, where we parse the text surrounding each image and use it to get a relevant set of tags. Luckily, the British Library digitised the text around all 1 million images back in 2007-8 using Optical Character Recognition (OCR), so we were able to grab this data. We explored computational tools such as Term Frequency – Inverse Document Frequency (Tf-idf) and Latent Dirichlet allocation (LDA), which try to assign the most “informative” words to each image, but found that images aren’t always associated with the words on the page.

To solve this problem, we decided to use a 'voting' system where we find the 20 images most similar to our image of interest, and have all images vote on the nouns that appear most commonly in their surrounding text. The most commonly appearing words will be the tags we assign to the image. Despite some computational hurdles selecting the 20 most similar images from a set of 1 million, we were able to achieve this goal. Along the way, we encountered several interesting problems.
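A minimal sketch of the voting step, assuming the similar images have already been found and the nouns already extracted from each one's surrounding text. Giving each image one vote per distinct noun and breaking ties alphabetically are assumptions made for this example; the actual weighting used by SherlockNet is not described here.

```python
from collections import Counter
from typing import List


def vote_tags(neighbour_nouns: List[List[str]], top_n: int = 10) -> List[str]:
    """Return the top_n nouns by number of neighbouring images that contain them."""
    votes: Counter = Counter()
    for nouns in neighbour_nouns:
        votes.update(set(nouns))  # one vote per image per distinct noun
    # Sort by vote count (descending), then alphabetically for stable output.
    ranked = sorted(votes.items(), key=lambda item: (-item[1], item[0]))
    return [noun for noun, _ in ranked[:top_n]]


# Toy data standing in for the nouns around three similar images.
neighbours = [["bird", "tree"], ["bird", "river"], ["bird", "tree", "tree"]]
print(vote_tags(neighbours, top_n=2))
```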

Similar images
For all images, similar images are displayed
  1. Spelling was a particularly difficult issue. The OCR algorithms that were state of the art back in 2007-2008 are now obsolete, so a sizable portion of our digitised text was misspelled / transcribed incorrectly. We used a pretty complicated decision tree to fix misspelled words. In a nutshell, it amounted to finding the word that a) is most common across British English literature and b) has the smallest edit distance relative to our misspelled word. Edit distance is the fewest number of edits (additions, deletions, substitutions) needed to transform one word into another.
  2. Words come in various forms (e.g. ‘interest’, ‘interested’, ‘interestingly’) and these forms have to be resolved into one “stem” (in this case, ‘interest’). Luckily, natural language toolkits have stemmers that do this for us. It doesn’t work all the time (e.g. ‘United States’ becomes ‘United St’ because ‘ates’ is a common suffix) but we can use various modes of spell-check trickery to fix these induced misspellings.
  3. About 5% of our books are in French, German, or Spanish. In this first iteration of the project we wanted to stick to English tags, so how do we detect if a word is English or not? We found that checking each misspelled (in English) word against all 3 foreign dictionaries would be extremely computationally intensive, so we decided to throw out all misspelled words for which the edit distance to the closest English word was greater than three. In other words, foreign words are very different from real English words, unlike misspelled words which are much closer.
  4. Several words appear very frequently in all 11 categories of images. These words were ‘great’, ‘time’, ‘large’, ‘part’, ‘good’, ‘small’, ‘long’, and ‘present’. We removed these words as they would be uninformative tags.
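The edit distance idea in points 1 and 3 can be sketched as below. This is a much simplified stand-in for the decision tree we describe: it picks the dictionary word with the smallest edit distance, uses word frequency only to break ties, and discards anything more than three edits from every known word. The tiny frequency table is invented for the example.

```python
from typing import Dict, Optional


def edit_distance(a: str, b: str) -> int:
    """Fewest additions, deletions and substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # addition
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def correct(word: str, frequencies: Dict[str, int], max_distance: int = 3) -> Optional[str]:
    """Return the closest known word, or None if every word is too far away."""
    if word in frequencies:
        return word
    if not frequencies:
        return None
    best = min(frequencies, key=lambda c: (edit_distance(word, c), -frequencies[c]))
    return best if edit_distance(word, best) <= max_distance else None


# Invented frequencies, standing in for counts from British English literature.
frequencies = {"interest": 900, "internet": 50, "house": 700}
print(correct("intrest", frequencies))
print(correct("schmetterling", frequencies))
```

Words that come back as None are treated as probably foreign and dropped, which avoids checking them against separate French, German and Spanish dictionaries.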

We ended up with between 10 and 20 tags for each image. We estimate that between 30% and 50% of the tags convey some information about the image, and the others are circumstantial. Even at this stage, the tags have been immensely helpful in some of the searches we’ve done already (check out “bird”, “dog”, “mine”, “circle”, and “arch” as examples). We are actively looking for suggestions to improve our tagging accuracy. Nevertheless, we’re extremely excited that images now have useful annotations attached to them!

SherlockNet Interface

Sherlocknet-interface
SherlockNet Interface

For the past few weeks we’ve been working on the incorporation of ~20 million tags and related images and uploading them onto our website. Luckily, Amazon Web Services provides comprehensive computing resources to take care of storing and transferring our data into databases to be queried by the front-end.

In order to make searching easier we’ve also added functionality to automatically include synonyms in your search. For example, you can type in “lady”, click on Synonym Search, and it adds “gentlewoman”, “ma'am”, “madam”, “noblewoman”, and “peeress” to your search as well. This is particularly useful in a tag-based indexing approach such as ours.

As our data gets uploaded over the coming days, you should begin to see our generated tags and related images show up on the Flickr website. You can click on each image to view it in more detail, or on each tag to re-query the website for that particular tag. This way users can easily browse relevant images or tags to find what they are interested in.

Each image is currently captioned with a default description containing information on which source the image came from. As Luda finishes up his captioning, we will begin uploading his captions as well.

We will also be working on adding more advanced search capabilities via wrapper calls to the Flickr API. Proposed functionality will include logical AND and NOT operators, as well as better filtering by machine tags.

Captioning

As mentioned in our previous post, we have been experimenting with techniques to automatically caption images with relevant natural language captions. Since an Artificial Intelligence (AI) is responsible for recognising, understanding, and learning proper language models for captions, we expected the task to be far harder than that of tagging. Although the final results we obtained may not be ready for production-level archival purposes, we hope our work can help spark further research in this field.

Our last post left off with our use of a pre-trained Convolutional Neural Network - Recurrent Neural Network (CNN-RNN) architecture to caption images. We showed that we were able to produce some interesting captions, albeit at low accuracy. The problem we pinpointed was in the training set of the model, which was derived from the Microsoft COCO dataset. COCO consists of photographs of modern-day scenes, which differ significantly from the images in the BL Flickr dataset.

Through collaboration with BL Labs, we were able to locate a dataset that was potentially better for our purposes: the British Museum prints and drawings online collection, consisting of over 200,000 prints, drawings and illustrations, along with handwritten captions describing each image, which the British Museum has generously given us permission to use in this context. However, since the dataset is obtained directly from the public SPARQL endpoints, we needed to run some pre-processing to make it usable. We cropped the images to a standard 225 x 225 size and converted them to grayscale. For the captions, pre-processing ranged from simple exclusion of dates and author information to more sophisticated “normalization” procedures, aimed at reducing the size of the total vocabulary of the captions. Words that are exceedingly rare (<8 occurrences) were replaced with <UNK> (unknown) symbols denoting their rarity. We used the same neuraltalk architecture, using the features from a Very Deep Convolutional Network for Large-Scale Visual Recognition (VGGNet) as intermediate input into the language model. As it turns out, even with aggressive filtering of words, the distribution of vocabulary in this dataset was still too diverse for the model. Despite our best efforts to tune hyperparameters, the model we trained was consistently over-sensitive to key phrases in the dataset, which resulted in the model converging on local minima where the captions would stay the same and not show any variation. This seems to be a hard barrier to learning from this dataset. We will be publishing our code in the future, and we welcome anyone with any insight to continue this research.
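The rare-word step of the caption pre-processing can be sketched as below. The threshold of 8 occurrences comes from the description above; splitting captions on whitespace and the function name are simplifying assumptions, and this is not the code used in the project.

```python
from collections import Counter
from typing import List


def replace_rare_words(captions: List[str], min_count: int = 8, unk: str = "<UNK>") -> List[str]:
    """Replace every word seen fewer than min_count times across all captions."""
    counts = Counter(word for caption in captions for word in caption.split())
    return [
        " ".join(word if counts[word] >= min_count else unk for word in caption.split())
        for caption in captions
    ]


# Toy captions with a lower threshold so the effect is visible on a tiny sample.
sample = ["a portrait of a man", "a landscape"]
print(replace_rare_words(sample, min_count=2))
```

Collapsing rare words into a single symbol shrinks the vocabulary the language model has to learn, at the cost of losing exactly the distinctive words that make a caption specific.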

Captions
Although there were occasional images with delightfully detailed captions (left), our models couldn’t quite capture useful information for the vast majority of the images (right). More work is definitely needed in this area!

The British Museum dataset (Prints and Drawings from the 19th Century), however, does contain valuable contextual data, and given our difficulty in using it to caption images directly, we decided to use it in other ways. By parsing each caption and performing Part-Of-Speech (POS) tagging, we were able to extract nouns and proper nouns from each caption. We then compiled common nouns from all the images and selected the most common (appearing in >=500 images) as tags, resulting in over 1100 different tags. This essentially converts the British Museum dataset into a rich dataset of diverse tags, which we would be able to apply to our earlier work with tag classification. We trained a few models with some “fun” tags, such as “Napoleon”, “parrots” and “angels”, and we were able to get decent testing accuracies of over 75% on binary labels. We will be uploading a subset of these tags under the “sherlocknet:tags” prefix to the Flickr image set, as well as the previous COCO captions for a small subset of images (~100K).
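The tag selection described above can be sketched roughly as follows. This is an illustration under stated assumptions rather than our published code: it assumes the nouns have already been extracted from each caption by a POS tagger, and the function names and the dictionary layout (image id mapped to a list of nouns) are hypothetical. Only the threshold of 500 images and the use of binary labels come from the text.

```python
from collections import Counter


def select_tags(nouns_per_image, min_images=500):
    """Keep nouns that appear in the captions of at least `min_images` images.

    `nouns_per_image` maps an image id to the nouns found in its caption
    (assumed to come from an earlier POS-tagging step).
    """
    image_counts = Counter()
    for nouns in nouns_per_image.values():
        # Count each noun once per image, however often it is repeated.
        image_counts.update(set(nouns))
    return {noun for noun, n in image_counts.items() if n >= min_images}


def binary_labels(nouns_per_image, tag):
    """Label each image 1 if its caption contains `tag`, otherwise 0."""
    return {
        image_id: int(tag in nouns)
        for image_id, nouns in nouns_per_image.items()
    }
```

Labels produced this way could then serve as training targets for one binary classifier per tag, in the spirit of the “Napoleon”, “parrots” and “angels” models mentioned above.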

You can access our interface here: bit.ly/sherlocknet, or look for 'sherlocknet:tag=' and 'sherlocknet:category=' tags on the British Library Flickr Commons site. Here is an example; see also the image below:

Sherlocknet tags
Example Tags on a Flickr Image generated by SherlockNet

Please check it out and let us know if you have any feedback!

We are really excited that we will be in London in a few days' time to present our findings. Why don't you come and join us at the British Library Labs Symposium, between 0930 - 1730 on Monday 7th of November, 2016?

28 January 2016

Book Now! Nottingham @BL_Labs Roadshow event - Wed 3 Feb (12.30pm-4pm)

Do you live in or near Nottingham, and are you available on Wednesday 3 Feb between 1230 - 1600? Come along to the FREE UK @BL_Labs Roadshow event at GameCity and The National Video Game Arcade, Nottingham (we have some places left and booking is essential for anyone interested) and:

 

BL Labs Roadshow at GameCity and The National Video Game Arcade, Nottingham, hosted by the Digital Humanities and Arts (DHA) Praxis project based at the University of Nottingham, Wed 3 Feb (1230 - 1600)
  • Discover the digital collections the British Library has, understand some of the challenges of using them and even take some away with you.
  • Learn how researchers found and revived forgotten Victorian jokes and Political meetings from our digital archives.
  • Understand how special games and computer code have been developed to help tag un-described images and make new art.
  • Find out about a tool that links digitised handwritten manuscripts to transcribed texts and one that creates statistically representative samples from the British Library’s book collections.
  • Consider how the intuitions of a DJ could be used to mix and perform the Library's digital collections.
  • Talk to Library staff about how you might use some of the Library's digital content innovatively.
  • Get advice, pick up tips and feedback on your ideas and projects for the 2016 BL Labs Competition (deadline 11 April) and Awards (deadline 5 September). 

Our hosts are the Digital Humanities and Arts (DHA) Praxis project at the University of Nottingham who are kindly providing food and refreshments and will be talking about two amazing projects they have been involved in:

ArtMaps: Putting the Tate Collection on the map

Dr Laura Carletti will be talking about the ArtMaps project which is getting the public to accurately tag the locations of the Tate's 70,000 artworks.

The 'Wander Anywhere' free mobile app developed by Dr Benjamin Bedwell.

Dr Benjamin Bedwell, Research Fellow at the University of Nottingham, will talk about the free mobile app he developed called 'Wander Anywhere'. The mobile software offers users new ways to experience art, culture and history by guiding them to locations where it downloads to their mobile device stories relevant to where they are, intersecting art, local history, architecture and anecdotes.

For more information, a detailed programme and to book your place, visit the Labs and Digital Humanities and Arts Praxis Workshop event page.

Posted by Mahendra Mahey, Manager of BL Labs.

The BL Labs project is funded by the Andrew W. Mellon Foundation.

27 January 2016

Come to our first @BL_Labs Roadshow event at #citylis London Mon 1 Feb (5pm-7.30pm)

Labs Roadshow at #citylis London, Mon 1 Feb (5pm-7.30pm)

Do you live in or near North-East London, and are you available on Monday 1 Feb between 1700 - 1930? Come along to the first FREE UK Labs Roadshow event of 2016 (we have a few places left and booking is essential for anyone interested) and:

#citylis at the Department for Information Science, City University London,
the first BL Labs Roadshow event Mon 1 Feb (1700 - 1930)
  • Discover the digital collections the British Library has, understand some of the challenges of using them and even take some away with you.
  • Learn how researchers found and revived forgotten Victorian jokes and Political meetings from our digital archives.
  • Understand how special games and computer code have been developed to help tag un-described images and make new art.
  • Talk to Library staff about how you might use some of the Library's digital content innovatively.
  • Get advice, pick up tips and feedback on your ideas and projects for the 2016 BL Labs Competition (deadline 11 April) and Awards (deadline 5 September). 

Our first hosts are the Department for Information Science (#citylis) at City University London. #citylis have kindly organised some refreshments, nibbles and also an exciting student discussion panel about their experiences of working on digital projects at the British Library. The panellists are:

#citylis student panel.
Top-left, Ludi Price 
Top-right, Dimitra Charalampidou
Bottom-left, Alison Pope
Bottom-right, Daniel van Strien

For more information, a detailed programme and to book your place (essential), visit the BL Labs Workshop at #citylis event page.

Posted by Mahendra Mahey, Manager of BL Labs.

The BL Labs project is funded by the Andrew W. Mellon Foundation.

22 January 2016

BL Labs Competition and Awards for 2016

Today the Labs team is launching the fourth annual Competition and Awards for 2016. Please help us spread the word by tweeting, re-blogging and telling anyone who might be interested about it!

British Library Labs Competition 2016

The annual Competition is looking for transformative project ideas which use the British Library’s digital collections and data in new and exciting ways. Two Labs Competition finalists will be selected to work 'in residence' with the BL Labs team between May and early November 2016, where they will get expert help, access to the Library’s resources and financial support to realise their projects.

Winners will receive a first prize of £3000 and runners up £1000 courtesy of the Andrew W. Mellon Foundation at the Labs Symposium on 7th November 2016 at the British Library in London where they will showcase their work.

The deadline for entering is midnight British Summer Time (BST) on 11th April 2016.

Labs Competition winners from previous years have produced an amazing range of creative and innovative projects. For example:

(Top-left) Adam Crymble's Crowdsource Arcade and some specially developed games to help with tagging images
(Bottom-left) Katrina Navickas' Political Meetings Mapper and a photo from a Chartist re-enactment 
(Right) Bob Nicholson's Mechanical Comedian

A further range of inspiring and creative ideas have been submitted in previous years and some have been developed further.

British Library Labs Awards 2016

The annual Awards, introduced in 2015, formally recognise outstanding and innovative work that has been carried out using the British Library’s digital collections and data. This year, they will be commending work in four key areas:

  • Research - A project or activity which shows the development of new knowledge, research methods, or tools.
  • Commercial - An activity that delivers or develops commercial value in the context of new products, tools, or services that build on, incorporate, or enhance the Library's digital content.
  • Artistic - An artistic or creative endeavour which inspires, stimulates, amazes and provokes.
  • Teaching / Learning - Quality learning experiences created for learners of any age and ability that use the Library's digital content.

A prize of £500 will be awarded to the winner and £100 for the runner up for each category at the Labs Symposium on 7th November 2016 at the British Library in London, again courtesy of the Andrew W. Mellon Foundation.

The deadline for entering is midnight BST on 5th September 2016.

The Awards winners for 2015 produced a remarkable and varied collection of innovative projects in the Research, Creative/Artistic and Entrepreneurship categories, along with a special Jury's prize:

(Top-left) Spatial Humanities research group at Lancaster University plotting mentions of disease in Victorian newspapers on a map,
(Top-right) A computer generated work of art, part of 'The Order of Things' by Mario Klingemann,
(Bottom-left) A bow tie made by Dina Malkova inspired by a digitised original manuscript of Alice in Wonderland
(Bottom-right) Work on Geo-referencing maps discovered from a collection of digitised books at the British Library that James Heald is still involved in.
  • Research: “Representation of disease in 19th century newspapers” by the Spatial Humanities research group at Lancaster University analysed the British Library's digitised London-based newspaper, The Era, through an innovative and varied selection of qualitative and quantitative methods in order to determine how, when and where the Victorian era discussed disease.
  • Creative / Artistic:  “The Order of Things” by Mario Klingemann involved the use of semi-automated image classification and machine learning techniques in order to add meaningful tags to the British Library’s one million Flickr Commons images, creating thematic collections as well as new works of art.
  • Entrepreneurship: “Redesigning Alice” by Dina Malkova produced a range of bow ties and other gift products inspired by the incredible illustrations from a digitised British Library original manuscript of Alice's Adventures Under Ground by Lewis Carroll and sold them through the Etsy platform and in the Alice Pop up shop at the British Library in London.
  • Jury's Special Mention: Indexing the BL 1 million and Mapping the Maps by volunteer James Heald describes both the work he has led and his collaboration with others to produce an index of 1 million 'Mechanical Curator collection' images on Wikimedia Commons from the British Library Flickr Commons images. This led to the discovery of 50,000 maps within this collection, partially through a map-tag-a-thon; these maps are now being geo-referenced.

A further range of inspiring work has been carried out with the British Library's digital content and collections.

If you are thinking of entering, please make sure you visit our Competition and Awards pages for further details.

Finally, if you have a specific question that can't be answered through these pages, feel free to contact us at labs@bl.uk, or why not come to one of the 'BL Labs Roadshow 2016' UK events we have scheduled between February and April 2016 to learn more about our digital collections and discuss your ideas?

We really look forward to reading your entries!

Posted by Mahendra Mahey, Manager of British Library Labs.

The British Library Labs project is funded by the Andrew W. Mellon Foundation.

 

12 November 2015

The third annual British Library Labs Symposium (2015)

The third annual BL Labs Symposium took place on Monday 2nd November and the event was a great success!

The Labs Symposiums showcase innovative projects which use the British Library's digital content and provide a platform for development, networking and debate in the Digital Scholarship field.

The videos for the event are available here.

This year’s Symposium commenced with a keynote from Professor David De Roure, entitled “Intersection, Scale and Social Machines: The Humanities in the digital world”, which addressed current activity in digital scholarship within multidisciplinary and interdisciplinary frameworks.

 Professor David De Roure giving the Symposium keynote speech

Caroline Brazier, the Chief Librarian of the British Library, then presented awards to the two winners of the British Library Labs Competition (2015) – Dr Adam Crymble and Dr Katrina Navickas, both lecturers of Digital History at the University of Hertfordshire.  

(L-R): Caroline Brazier, Chief Librarian; Competition winners Katrina Navickas and Adam Crymble; Dr Adam Farquhar, Head of Digital Scholarship 

After receiving their awards, it was time for Adam and Katrina to showcase their winning projects.

Adam’s project, entitled “Crowdsourcing Arcade: Repurposing the 1980s arcade console for scholarly image classification”, takes the crowdsourcing experience off the web and establishes it in a 1980s-style arcade game.

Presentation by Dr Adam Crymble, BL Labs Competition (2015)  winner 

Katrina’s project, “Political Meetings Mapper: Bringing the British Library maps to life with the history of popular protest”, has developed a tool which extracts notices of meetings from historical newspapers and plots them on layers of historical maps from the British Library's collections.

Presentation by Dr Katrina Navickas, BL Labs Competition (2015)  winner 

After lunch, the Symposium continued with the Alice's Adventures Off the Map 2015 competition, produced and presented by Stella Wisdom, Digital Curator at the British Library. Each year, Off the Map challenges budding designers to use British Library digital collections as inspiration to create exciting interactive digital media.

The winning entry was "The Wondering Lands of Alice", created by Off Our Rockers, a team of six students from De Montfort University in Leicester: Dan Bullock, Freddy Canton, Luke Day, Denzil Forde, Amber Jamieson and Braden May.

 

Video: Alice's Adventures Off the Map 2015 competition winner 'The Wondering Lands of Alice'

This was followed by the presentations of the British Library Labs Awards (2015), a session celebrating BL Labs’ collaborations with researchers, artists and entrepreneurs from around the world in the innovative use of the British Library's digital collections.

The winners were: 

BL Labs Research Award (2015) – “Combining Text Analysis and Geographic Information Systems to investigate the representation of disease in nineteenth-century newspapers”, by The Spatial Humanities project at Lancaster University: Paul Atkinson, Ian Gregory, Andrew Hardie, Amelia Joulain-Jay, Daniel Kershaw, Cat Porter and Paul Rayson.  

The award was presented to one of the project collaborators, Ian Gregory, Professor of Digital Humanities at Lancaster University.

Professor Ian Gregory receiving the BL Labs Research Award (2015), on behalf of the Spatial Humanities project, from Dr Aquiles Alencar-Brayner

 

BL Labs Creative/Artistic Award (2015) – “The Order of Things” by Mario Klingemann, New Media Artist.

Mario Klingemann receiving the BL Labs Creative/Artistic Award (2015) from Nora McGregor

  

BL Labs Entrepreneurial Award (2015) – “Redesigning Alice: Etsy and the British Library joint project” by Dina Malkova, designer and entrepreneur.

Dina Malkova receiving the BL Labs Entrepreneurial Award (2015) from Dr Rossitza Atanassova

 

Jury’s Special Mention Award – “Indexing the BL 1 million and Mapping the Maps” by James Heald, Wikipedia contributor.

James Heald receiving the Jury's Special Mention Award (2015) from Dr Mia Ridge

The Symposium concluded with a thought-provoking panel session, “The Ups and Downs of Open”, chaired by George Oates, Director of Good, Form & Spectacle Ltd. George was joined by panellists Dr Mia Ridge, Digital Curator at the British Library, Jenn Phillips-Bacher, Web Manager at the Wellcome Library, and Paul Downey, Technical Architect at the Government Digital Service (GDS). The session discussed the issues, challenges and value of memory organisations opening up their digital content for use by others.

Panel session (L-R): George Oates; Jenn Phillips-Bacher; Paul Downey; Mia Ridge

The BL Labs team would like to thank everyone who attended and participated in this year’s Symposium, making the event the most successful one to date – and we look forward to seeing you all at next year’s BL Labs Symposium on Monday 7th of November 2016!

Posted by Mahendra Mahey, Manager of British Library Labs.

The British Library Labs Project is funded by the Andrew W. Mellon Foundation.